Saturday, 14 December 2013

Bioinformatics useful bits

A) Converting phred64 to phred33 fastq reads using trimmomatic:


java -jar trimmomatic-0.30.jar SE -threads 40 -phred64 /home/data/wheat_ems_mutation/rawReads/D3_R1.fastq.gz /home/data/wheat_ems_mutation/rawReads/D3_R1P33.fastq TOPHRED33

java -jar trimmomatic-0.30.jar SE -threads 40 -phred64 /home/data/wheat_ems_mutation/rawReads/D3_R2.fastq.gz /home/data/wheat_ems_mutation/rawReads/D3_R2P33.fastq TOPHRED33


B) Extracting reads from SAM file using list of read names:
Put read names into file "Reads.txt"
dos2unix Reads.txt
LC_ALL=C grep -w -F -f Reads.txt  < in.sam > subset.sam


C) Reducing a bam file using a list of contigs from an VCF excel file:

Reduce the ids by removing duplicates
copy and paste the resulting ids to a file called "scaffoldstokeep_single.txt"
convert the file to linux
dos2unix 

Using this file template copy the scaffolds just before the ">" character leaving a space. and call it "reduceSam.sh" as shown below

samtools view -F 4 C6_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup.bam IWGSC_CSS_5AL_scaff_2425045 (more ids??)  > C6_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced.sam

Remove return characters:



awk '{ sub("\r$", " "); print }' reduceSam.sh > reduceSam2.sh
cat reduceSam2.sh | tr -d "\n" > reduceSam3.sh

Run the script and change for each bam file you have.
sh reduceSam3.sh

Produce a scaffold id list to strip the header you need to replace.
perl -ne 'chop; print "SN:$_\n"' scaffoldstokeep_single.txt > scaffold_ids.txt
grep -f scaffold_ids.txt D3_R12P33_IWGSP1.21_novo_H15_t60.sam > headers.txt

Add the header to the reduced file and sort index etc.


cat headers.txt D3_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced.sam > D3_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced_h.sam

samtools view -bS D3_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced_h.sam > D3_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced_h.bam

samtools sort D3_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced_h.bam D3_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced_h.sort

samtools index D3_R12P33_IWGSP1.21_novo_H15_t60.sort.rmdup_reduced_h.sort.bam



D) unpack a SRA downloaded file using fastq-dump.2.3.4 when illumina paired files:


./fastq-dump.2.3.4 --split-files SRR1027330.sra



E) Get mRNA transcripts from a GFF file. Need the fasta file too. Use the tool provided by cufflinks like this:
gffread -w transcripts.fa -g p3_p13839_Fus_grami_v32_chromosomal.fasta p3_p13839_Fus_grami_v32_chromosomal.gff3


F) Get mRNA, Fasta file, GFF3 file from a combined Fasta/GFF3 file:
Get fasta file
sed -n '/>/,$p' BrachV2013_10.gff3 > BrachV2013_10.fasta 
count lines:
wc -l BrachV2013_10.gff3
3832477 BrachV2013_10.gff3
3399164 BrachV2013_10.fasta
Get GFF file:
cat BrachV2013_10.gff3 | head -n 433313 > BrachV2013_10_2.gff3 
Get transcripts:
gffread -w transcripts.fa -g BrachV2013_10.fasta BrachV2013_10_2.gff3 





G) Adding Read groups for gatk tools to bam files when you forget during the mapping:

java -Xmx4g -Djava.io.tmpdir=/tmp -jar /home/data/wheat_ems_mutation/tools/picard-tools-1.105/AddOrReplaceReadGroups.jar INPUT=/home/data/wheat_ems_mutation/Final_results/References/A6_P33_Triticum_aestivum.IWGSP1.21.dna_RM.genome_novo_H15_t60.sort.rmdup_reduced.bam OUTPUT=/home/data/wheat_ems_mutation/Final_results/References/A6_P33_Triticum_aestivum.IWGSP1.21.dna_RM.genome_novo_H15_t60.sort.rmdup_reduced_GATK.bam RGID=1 RGLB=illumina RGPL=illumina RGPU=1 RGSM=A6 SORT_ORDER=coordinate
Need to index again too:
samtools index /home/data/wheat_ems_mutation/Final_results/References/A6_P33_Triticum_aestivum.IWGSP1.21.dna_RM.genome_novo_H15_t60.sort.rmdup_reduced_GATK.bam
H) Get Exon; length, GC content, read depth:
java -Xmx50g -jar GenomeAnalysisTK.jar --minMappingQuality 20 -R /home/data/bioinf_resources/prebuild_index/Athaliana/genome/bwa/Triticum_aestivum.IWGSP1.21.dna_rm.genome.fa -T DepthOfCoverage -o Triticum_aestivum_MinQual20 -I ../input_bams.list -L /home/data/rking/Exon.bed
java -Xmx50g -jar GenomeAnalysisTK.jar -R /home/data/bioinf_resources/prebuild_index/Athaliana/genome/bwa/Triticum_aestivum.IWGSP1.21.dna_rm.genome.fa -T DepthOfCoverage -o Triticum_aestivum_MinQual0 -I ../input_bams.list -L /home/data/rking/Exon.bed
java -Xmx2g -jar GenomeAnalysisTK.jar -T GCContentByInterval -R /home/data/bioinf_resources/prebuild_index/Athaliana/genome/bwa/Triticum_aestivum.IWGSP1.21.dna.genome.fa -o Triticum_GC_nonM.txt -L /home/data/rking/Exon.bed


I) Get GC content and number of bases with a given coverage from a bam file:
samtools mpileup -IBA -q 20 A6_P33_IWGSC_CSS_ALL_scaff_novo_H15_t60.sort.rmdup_Gen.bam -f /home/data/wheat_ems_mutation/reference/iwgsc_css_assembly/IWGSC_CSS_ALL_scaff.fa > A6.mpileup
awk '$4 >= 8' A6.mpileup > A6.covfilter.mpilup
perl -lane'print if $F[2] eq A' A6.covfilter.mpilup > A6_A.covfilter.mpilup
perl -lane'print if $F[2] eq C' A6.covfilter.mpilup > A6_C.covfilter.mpilup
perl -lane'print if $F[2] eq G' A6.covfilter.mpilup > A6_G.covfilter.mpilup
perl -lane'print if $F[2] eq T' A6.covfilter.mpilup > A6_T.covfilter.mpilup
wc -l A6_C.covfilter.mpilup
1715059
wc -l A6_T.covfilter.mpilup
1962907
wc -l A6_G.covfilter.mpilup
1717822
wc -l A6_A.covfilter.mpilup
1953160
C+G=3432881
A+T=3916067
total= 7348948
=46.71% GC content of sample A6 for reads 8 or above coverage
Count total coverage of which can be divided by total number of bases to get average.
cat D3.mpileup | awk '{ sum+=$4} END {print sum}'

Mapping stats:

#Number of mapped reads (count multi-mapped reads only once)

samtools view accepted_hits.bam | cut -f1 | sort -u | wc -l

16591458



#Unique mapped reads

samtools view accepted_hits.bam | fgrep -cw NH:i:1

14597784



#Multi mapped reads

samtools view accepted_hits.bam | fgrep -vw NH:i:1 | cut -f1 | sort -u | wc -l

1993674

Z) Eval script use:
evaluate_gtf.pl -g -q ../MIPS.gtf ../sd.gtf > comparison.txt

Monday, 22 July 2013

Bioinformatics Open Learning Using Community Tutorials

Next Generation Sequencing (NGS) is a rapidly expanding and evolving field with periodical software developments. The lack of up-to-date comprehensive NGS training and consequently individuals with the skills to analyse the information generated is a major concern. A website with a set of core practical tutorials using case study examples to provide comprehensive training for NGS, which never becomes dated by the contribution of tutorials using the bioinformatics community, would be advantageous to correct these concerns. RNA-Seq, ChIP-Seq, SNP calling, and de novo assembly tutorials using leading software, selected through a literature search are placed on a “dynamic” website that can have new tutorials uploaded, grouped, and reviewed by users, see http://elvis.ccc.cranfield.ac.uk/CUBELP/faces/login-page.xhtml. Note: these pdf tutorials have been removed temporarily while updated, see link at end of post for alternative.

Possible future developments of this learning platform are to use it to form the foundation of a global independent bioinformatics training and learning community of which universities can contribute and use, to establish distance learning modules. The tutorials generated provide a comprehensive training in leading techniques by the use of screenshots of output and detailed commands used to reproduce them. Future developments could be to expand the range of tutorials to demonstrate: different quality data sets because different actions are recommended, data produced from different technologies because different options are provided by software, other topics within NGS such as genome wide association studies, topics other than NGS but in the bioinformatics umbrella such as machine learning, and educational games/videos to demonstrate/train bioinformatics techniques to provide interactive training suites. Overall this is a representation of the continual movement of adult learning from traditional University campus to online student led learning and community driven teaching.

I have a static version of the tutorials available here:
http://elvis.ccc.cranfield.ac.uk/CUBELP2/

Thursday, 10 May 2012

When to say I love you and love poetry

The use of the phrase "I love you" can be used freely and often but do we always mean it? Does it really matter? Can we show that we love someone without saying it? And how does this compare with saying "I love you" to someone?

Women expect the phrase I love you after a suitable number of dates and when she starts to have feelings of affection. Quite often it is not really "I love you" which is being said but the expression of I really like you and want to be with you, which is what the woman really hears in her head. This is because it is routine for men to say it without really meaning it as a consequence of the man knowing the woman wants and expects it to be said, so to accommodate, the man say's it to proceed with getting closer to the woman. This initial I love you, is not as significant as that later on which is unsolicited (not expected). When a man says it and she is not expecting it after the initial trial dating phase that’s when he is no longer saying yes I could have feelings for you too but I adore you and think of you all the time.

If a person is in love, it is much easier and effective to rely upon their actions to show this, as actions are a lot more difficult to fake than just saying three words. If a man loves a woman then it will be expressed in the things he does, like walking on the side closest to the road, preparing surprises to ensure that they spend time together in different ways and thus keep the woman's interest, subtle body language which is often affectionate and comforting. It’s not just a matter of words, as words are cheap and are of no great effort but to express yourself, with what we call romance then that shows intent of feeling.

One form that a man can use to express these feelings is the old tradition of poetry although popularly thought of as cheesy, if a woman already is interested and wants conformation of a man’s intentions then it is an excellent expression to use. If one does not know the woman enough or is unsure of her character then you wouldn't be in love with them and there would not be the need to express it using poetry. If you have strong feelings towards someone then why not try it, you don't have to be Shakespeare but follow the English tradition set down by many of our great men of history. I enclose a poem I once wrote for someone so that others that although cannot find the words would like to express themselves to another. Note: If you get the desired result after using this poem, you owe me a comment and a click on Google plus one icon, a donation wouldn't hurt either :)



My love for you,

Is like the sun,

Hot and strong,

Shining all day long,



It does not cease,

Even out of sight,

It burns bright,

Right through the night,



I hold you,

Often as I do,

I think of love,

And I think of you,



To live my life,

Without your smile,

Would be to hold my breath,

All the while,



For all I am,

Though just a man,

I give to you,

All that makes me,

Now belongs to you.