===================================== Example: assemble phage genome ===================================== After installing Sprai, assemble a small genome. For example, we showed how to assemble the phage genome. Prepare data ======================= Go to `pacbiotoca wiki `_ and download phage data ( \http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz ). :: mkdir tmp cd tmp wget http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz tar xvzf sampleData.tar.gz Convert fasta to fastq. You can use a fa2fq.pl script in Sprai. (Sprai version 0.9.9.13 or newer can be fed both fastq and fasta format. So you can skip converting fasta to fastq.) :: fa2fq.pl sampleData/pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fq Prepare Sprai ========================================== :: mkdir tmp2 cd tmp2 ln -s ../pacbio.filtered_subreads.fq all.fq cp /your/sprai/directory/asm.spec . cp /your/sprai/directory/ec.spec . Open ec.spec and change values. :: #>- params -<# input_fastq all.fq estimated_genome_size 50000 estimated_depth 100 partition 12 evalue 1e-50 trim 42 ca_path /path/to/your/wgs/Linux-amd64/bin/ word_size 18 #>- params -<# *input_fastq* is your input file name. *estimated_genome_size* is the number of nucleotides of your target. If you do not know it, set large number. For example, set 1e+12. *estimated_depth* is the depth of coverage of input_fastq of your target. If you do not know it, set 0. *partition* is the number of processors Sprai uses. *evalue* is used by blastn. *trim* is the number of nucleotides Sprai cut from both sides of alignments. *ca_path* is the path to your wgs-assembler (Celera Assembler) installed. *word_size* is used by blastn. Correct errors & assemble ============================== :: ezez_vx1.pl ec.spec pbasm.spec > log.txt 2>&1 & ezez_vx1.pl outputs into a *result_yyyymmdd_hhmmss* directory. The improved reads will be in a *c01.fin.idfq.gz* file and contigs will be in a *CA/9-terminator/asm.ctg.fasta* file. We explain temporary files in a *tmp* directory. Sprai gives ID to each read in *all.fq* and outputs *c00.idfq.gz*. Sprai partitions *all.fq* to *c00_00xx.fa*. *c00.nin*, *c00.nhr* and *c00.nsq* are output by makeblastdb. Sprai aligns *c00_00xx.fa* to *c00.nxx* database by using blastn, correct errors and output *c00_00xx.dfq.gz*. dfq format contains the IDs of reads, bases, aligned depths and quality values. (Current Sprai outputs dummy quality values.) Sprai cuts low depth regions of each read, deletes'-' characters to convert to FASTQ format and outputs corrected reads in *c01.fin.idfq.gz*. Sprai extracts longest 20X reads of the *estimated_genome_size* from *c01.fin.idfq.gz*. And feed them to Celera Assembler. Celera Assembler outputs files into *CA* directory. If you only correct errors and don't assemble, do :: ezez_vx1.pl ec.spec -ec_only > log.txt 2>&1 & or :: ezez_vx1.pl ec.spec > log.txt 2>&1 & After error correction, if you want to assemble corrected reads using Celera Assembler, do :: ca_ikki_v5.pl pbasm.spec estimated_genome_size \ -d directory in which fin.idfq.gzs exist \ -ca_path /path/to/your/wgs/Linux-amd64/bin \ -sprai_path the path to get_top_20x_fa.pl installed Find contigs =================== You will find contigs in a *CA/9-terminator/asm.ctg.fasta* file. Files in *CA* are produced by the wgs-assembler (Celera Assembler). Read `the wgs-assembler document `_ for details. Get N50 value ================ If you install Statistics::Descriptive module to a directory Celera Assembler can see, Celera Assembler outputs N50 value and so on in a CA/do_*_c01.fin.top20x.log file. You will find lines in the file like below: :: [Contigs] TotalContigsInScaffolds 1 TotalBasesInScaffolds 49848 TotalVarRecords 0 MeanContigLength 49848 MinContigLength 49848 MaxContigLength 49848 N25ContigBases 49848 N50ContigBases 49848 N75ContigBases 49848 These are the statistics of contigs of the assembly. Notes ========================================== If you would like to use more than about 1000 processors, we recommend to use a *pre_partition* parameter in the ec.spec file like: :: # ec.spec pre_partition 4 partition 300 And :: ezez4qsub_vx1.pl ec.spec pbasm.spec > log 2>&1 & In this example, Sprai will use 4*300 = 1200 processors.