Example: assemble phage genome

After installing Sprai, assemble a small genome.

For example, we showed how to assemble the phage genome.

Prepare data

Go to pacbiotoca wiki and download phage data ( http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz ).

mkdir tmp
cd tmp
wget http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz
tar xvzf sampleData.tar.gz

Convert fasta to fastq. You can use a fa2fq.pl script in Sprai.

(Sprai version 0.9.9.13 or newer can be fed both fastq and fasta format. So you can skip converting fasta to fastq.)

fa2fq.pl sampleData/pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fq

Prepare Sprai

mkdir tmp2
cd tmp2
ln -s ../pacbio.filtered_subreads.fq all.fq
cp /your/sprai/directory/asm.spec .
cp /your/sprai/directory/ec.spec .

Open ec.spec and change values.

#>- params -<#
input_fastq all.fq
estimated_genome_size 50000
estimated_depth 100
partition 12
evalue 1e-50
trim 42
ca_path /path/to/your/wgs/Linux-amd64/bin/
word_size 18
#>- params -<#

input_fastq is your input file name.

estimated_genome_size is the number of nucleotides of your target. If you do not know it, set large number. For example, set 1e+12.

estimated_depth is the depth of coverage of input_fastq of your target. If you do not know it, set 0.

partition is the number of processors Sprai uses.

evalue is used by blastn.

trim is the number of nucleotides Sprai cut from both sides of alignments.

ca_path is the path to your wgs-assembler (Celera Assembler) installed.

word_size is used by blastn.

Correct errors & assemble

ezez_vx1.pl ec.spec pbasm.spec > log.txt 2>&1 &

ezez_vx1.pl outputs into a result_yyyymmdd_hhmmss directory. The improved reads will be in a c01.fin.idfq.gz file and contigs will be in a CA/9-terminator/asm.ctg.fasta file.

We explain temporary files in a tmp directory. Sprai gives ID to each read in all.fq and outputs c00.idfq.gz. Sprai partitions all.fq to c00_00xx.fa. c00.nin, c00.nhr and c00.nsq are output by makeblastdb. Sprai aligns c00_00xx.fa to c00.nxx database by using blastn, correct errors and output c00_00xx.dfq.gz. dfq format contains the IDs of reads, bases, aligned depths and quality values. (Current Sprai outputs dummy quality values.) Sprai cuts low depth regions of each read, deletes’-’ characters to convert to FASTQ format and outputs corrected reads in c01.fin.idfq.gz. Sprai extracts longest 20X reads of the estimated_genome_size from c01.fin.idfq.gz. And feed them to Celera Assembler. Celera Assembler outputs files into CA directory.

If you only correct errors and don’t assemble, do

ezez_vx1.pl ec.spec -ec_only > log.txt 2>&1 &

or

ezez_vx1.pl ec.spec > log.txt 2>&1 &

After error correction, if you want to assemble corrected reads using Celera Assembler, do

ca_ikki_v5.pl pbasm.spec estimated_genome_size \
  -d directory in which fin.idfq.gzs exist \
  -ca_path /path/to/your/wgs/Linux-amd64/bin \
  -sprai_path the path to get_top_20x_fa.pl installed

Find contigs

You will find contigs in a CA/9-terminator/asm.ctg.fasta file. Files in CA are produced by the wgs-assembler (Celera Assembler). Read the wgs-assembler document for details.

Get N50 value

If you install Statistics::Descriptive module to a directory Celera Assembler can see, Celera Assembler outputs N50 value and so on in a CA/do_*_c01.fin.top20x.log file. You will find lines in the file like below:

[Contigs]
TotalContigsInScaffolds         1
TotalBasesInScaffolds           49848
TotalVarRecords                 0
MeanContigLength                49848
MinContigLength                 49848
MaxContigLength                 49848
N25ContigBases                  49848
N50ContigBases                  49848
N75ContigBases                  49848

These are the statistics of contigs of the assembly.

Notes

If you would like to use more than about 1000 processors, we recommend to use a pre_partition parameter in the ec.spec file like:

# ec.spec
pre_partition 4
partition 300

And

ezez4qsub_vx1.pl ec.spec pbasm.spec > log 2>&1 &

In this example, Sprai will use 4*300 = 1200 processors.