========
README
========

Sprai (single-pass read accuracy improver) is a tool to correct sequencing
errors in single-pass reads for de novo assembly. It was originally designed
for correcting sequencing errors in single-molecule DNA sequencing reads,
especially in Continuous Long Reads (CLRs) generated by PacBio RS sequencers.
The goal of Sprai is not to maximize the accuracy of error-corrected reads;
instead, Sprai aims to maximize the continuity (i.e., N50 contig length) of
the contigs assembled after error correction.

Introduction
=============

Error correction of sequencing reads is critical for de novo assembly. Most
error correction tools were designed for reads from second-generation DNA
sequencers and therefore expect only 1-5 % sequencing errors in reads.
However, PacBio RS, a third-generation DNA sequencer, yields reads with ~15 %
errors (at the nucleotide level), so several error correction algorithms
specific to PacBio long reads have been developed.

Second-generation DNA sequencers such as Illumina HiSeq yield reads of very
good quality, but they all have systematic sequencing errors. In other words,
even infinite sequencing depth would not lead to a perfect consensus sequence
of a target genome. PacBio RS is the first sequencer deemed free from
systematic sequencing errors; all sequencing errors (at the nucleotide level)
seem to occur randomly and independently. Given that all sequencing errors
occur randomly and independently, the consensus sequence of a multiple
alignment of sequencing reads should perfectly reconstruct the original DNA
sequence. HGAP, published in 2013, is an error-correction and assembly tool
based on this idea, and it proved that de novo assembly of genomes is
possible with 15%-error long reads alone. Sprai follows the same idea, but
uses different algorithms for multiple alignment and consensus generation.
Similar to HGAP, Sprai takes long reads generated by PacBio RS and corrects
sequencing errors in them. Sprai has a pipeline to assemble the corrected
reads using Celera Assembler, so it can automatically assemble the corrected
reads as HGAP does. The major difference between HGAP and Sprai is that Sprai
usually outputs longer contigs (in terms of N50 contig length) than HGAP at
modest sequencing depth (< 50x). The number of misassemblies is usually
smaller than with HGAP, and the nucleotide-level accuracy of the assembled
contigs is also higher. With 100x or more sequencing depth, HGAP and Sprai
perform almost equally in terms of continuity, but the nucleotide-level
accuracy of HGAP is higher, presumably because the Quiver module (in HGAP)
takes more information (e.g., three types of Quality Values) into account and
because it uses an HMM instead of trusting the input multiple alignment. To
always give the best assembly result (in terms of continuity and accuracy),
Sprai has a module that polishes the consensus sequences using Quiver for the
best nucleotide quality. Run Sprai for error correction; it then passes the
error-corrected reads to Celera Assembler. The assembled contigs can be
polished by Quiver if you wish. You will then get the best (as of this
writing, at least) contigs among assembler pipelines. Happy genome
sequencing!

Install
========

Here we describe how to install Sprai without any additional packaging tool.
If you would like to use LPM, see the next section. Sprai requires the
following packages:

* Python 2.6 or newer
* `NCBI BLAST+ ver. 2.2.27 or newer `_
* `Celera Assembler ver. 8.1 or newer `_ (if you assemble reads after
  error correction)

Note that the legacy version of BLAST (without the "plus") is not compatible
with Sprai.
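Before building, it may help to confirm that the prerequisite commands are
reachable from your shell. The helper below is only a sketch; the function
name ``check_tools`` is ours, not part of Sprai:

```shell
# Hypothetical helper (not part of Sprai): report which prerequisite
# commands are missing from PATH, returning non-zero if any are absent.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "MISSING: $tool" >&2
      missing=1
    fi
  done
  return $missing
}

# Example (uncomment to run):
# check_tools python blastn makeblastdb || echo "install the missing tools first"
```

``blastn`` and ``makeblastdb`` ship with NCBI BLAST+, so their presence is a
reasonable proxy for a working BLAST+ installation.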
After installing the prerequisites, extract the Sprai tarball and do the
following::

  cd sprai-some-version
  ./waf configure [--prefix=/some/dir/]
  ./waf build
  ./waf install

If you specify ``--prefix=/some/dir/``, Sprai will be installed under the
``/some/dir/bin/`` directory. Otherwise, it will be installed in
``/usr/local/bin``. If you do not have root access, you probably want to add
``--prefix=$HOME/local`` or similar. If you are creating an RPM package, you
would add ``--prefix=/usr``.

Install (with LPM)
===================

Alternatively, you can use `Local Package Manager (LPM) `_ to automate the
whole installation process. LPM is a software management tool for non-root
users. It can download/build/install packages like
yum/apt-get/zypper/MacPorts/Homebrew, but does not require root privileges.
LPM installs packages under a user's home directory. It also takes care of
setting environment variables. Another feature is that it can switch between
sets of installed software; for example, if you have software A and B but
they conflict with each other, you can exclusively enable either of them.
With LPM, you can just type as follows (assuming that your Linux distribution
is Red Hat Enterprise Linux or a compatible distribution such as CentOS or
Scientific Linux)::

  $ lpm install blast+
  $ lpm install smrtanalysis-centos
  $ lpm install sprai

For Ubuntu users, the second command will not work, so you have to install
the SMRT Analysis pipeline yourself from PacBio DevNet. Normally zlib is
installed on your system by default, but in case it is not, do the following
before the above three commands::

  $ lpm install compiler-envs
  $ lpm install zlib

The first command might have been executed already if you have installed some
other libraries before. In such a case, you can skip it.
Uninstall
================

If you installed Sprai manually, you can uninstall it with the following
commands::

  cd sprai-some-version
  ./waf uninstall

If you installed Sprai with LPM, uninstalling is even easier::

  lpm uninstall sprai

Run Sprai
================

First, create a working directory. Here we assume that we use ``tmp``::

  mkdir tmp
  cd tmp

Next, we prepare input *subreads* in FASTQ format. Note that Sprai takes
*subreads* as input, not *reads*. The difference between *reads* and
*subreads* is that reads (of PacBio) may contain adaptor sequences, while
subreads do not. If you only have FASTQ files and do not know which of the
two they contain, proceed as follows. If the file name is
``filtered_subreads.fastq``, the file most likely contains subreads.
Otherwise, it is safer to start from .bas.h5 files, which contain raw reads.

There are two ways to convert a .bas.h5 file into subreads in FASTQ format.
The first is to use SMRT Pipe. If you have a PacBio RS, you usually have it
installed on your system, so this might be the easiest choice. Run P_Filter
(or S_Filter) with a ReadScore threshold of 0.75 (or 0.80 if the sequencing
depth is higher than 120-150x) and a MinReadLen threshold of 500 bp. The
resulting ``filtered_subreads.fastq`` contains subreads, which are compatible
with Sprai. The installation manual of SMRT Analysis (including SMRT Pipe)
says that it requires a number of daemons including MySQL, but when you use
it only for Sprai, you can just download and extract it. Command line
utilities, including SMRT Pipe, work without any daemons. So download it,
extract it, set some environment variables in a startup script, and you are
done. Downloading the SMRT Analysis package may take a while, but the other
steps take less than 10 minutes.

The second way is to use ``bash5tools.py``.
It is a standalone command line utility that works in the UNIX way; it does
not take an XML configuration file as SMRT Pipe does, and instead everything
can be controlled by command line switches. ``bash5tools.py`` is therefore
very handy when you create a pipeline of your own. To use ``bash5tools.py``,
you have to install it from `PacBio GitHub (pbh5tools) `_. Honestly speaking,
it is not well packaged (yet?), so you may encounter a Python error even if
you follow the installation instructions there exactly. The problem we had
was that a newly installed pbcore went to a different directory than the old
installation; the old one came first in the Python module search path, so the
newly installed one was hidden. We had to remove the old pbcore manually.
Please ask people at PacBio how to install pbh5tools (and pbcore), because
things change so quickly that we may not know the latest problems.

Once you have installed ``bash5tools.py``, you can convert a .bas.h5 file
into FASTQ format with the following command::

  bash5tools.py --outFilePrefix example_output --readType subreads --outType fastq --minReadScore 0.75 example.bas.h5

You will get ``example_output.fastq`` as output.

Once we have all subreads, we combine all FASTQ files into one::

  cat a000.fastq a001.fastq ... > all.fq

We also need parameter files, with which we specify various parameters for
error correction and sequence assembly (by Celera Assembler)::

  cp /path/to/sprai/pbasm.spec .
  cp /path/to/sprai/ec.spec .

Then we modify the parameters in both template files. *pbasm.spec* is a
parameter file for Celera Assembler; see the documentation of Celera
Assembler for details. If you only want error correction and do not assemble
the error-corrected reads, you do not need this file. This file controls how
much memory and how many CPU cores you will use, so it is very likely that
you will have to understand the various parameters. *ec.spec* is a parameter
file for Sprai. The most important parameter in this file is
*estimated_genome_size*.
If you have not estimated the length of your target genome, give a large
number (e.g., 1e+12). After the first assembly, you can calculate the depth
distribution of the reads to estimate the genome size, after which you can
try a second run, which might give a better result. However, in our
experience the result is not very sensitive to this parameter. Modify the
other parameters in ec.spec as well, following the instructions in the file.

Single Node Mode (1)
---------------------

Sprai has several execution modes. The first mode we describe is single node
mode, which uses only a single machine. If you have more than one server,
please see the next section. You can still use multiple CPU cores in single
node mode, as long as the cores are on the same machine.

Edit *ec.spec* and set the *ca_path* parameter, which is the directory (full
path) in which you put the wgs-assembler binaries.

.. Then,
.. ::
..   fs2ctg_v4.pl ec.spec asm.spec -n
.. You can confirm what will happen by using fs2ctg_v4.pl with the '-n'
   option.

Then, type the following command::

  ezez_vx1.pl ec.spec pbasm.spec > log 2>&1 &

This will correct sequencing errors, and contigs will be created. Note that
the parameter files for Sprai and Celera Assembler are independent; you can
run Celera Assembler on multiple nodes (machines) even in Sprai single node
mode.

If you only need error-corrected reads and do not want Sprai (Celera
Assembler) to assemble them, do as follows
.. fs2ctg_v4.pl ec.spec asm.spec -n -ec_only

::

  ezez_vx1.pl ec.spec /dev/null -ec_only > log 2>&1 &

or ::

  ezez_vx1.pl ec.spec -ec_only > log 2>&1 &

or ::

  ezez_vx1.pl ec.spec > log 2>&1 &

After error correction, if you want to assemble the corrected reads using
Celera Assembler, do::

  ca_ikki_v5.pl pbasm.spec estimated_genome_size \
    -d <directory in which the fin.idfq.gz files exist> \
    -ca_path /path/to/your/wgs/Linux-amd64/bin \
    -sprai_path <the path where get_top_20x_fa.pl is installed>

Single Node Mode (make mode)
-----------------------------

This mode can detect unfinished jobs and restart at the appropriate stage
using GNU make. ::

  ezez4makefile_v4.pl ec.spec pbasm.spec

After this command, a 'Makefile' will be created. Then::

  make -j -f Makefile > log 2>&1 &

If you want error correction only, do as below::

  ezez4makefile_v4.pl ec.spec
  make -j -f Makefile ec_only > log 2>&1 &

The file result/c_l${libi}.1.fin.v${valid_voters}.idfq.gz (e.g.,
c_l0.1.fin.v30.idfq.gz) is the corrected read file. After error correction
using ezez4makefile_v4.pl, if you want to assemble the corrected reads, do as
below::

  ezez4makefile_v4.pl ec.spec pbasm.spec
  make -j -f Makefile > log 2>&1 &

Error correction will be skipped and only assembling will start.

Multi-node mode 1 (qsub mode)
------------------------------

There are two multi-node execution modes in Sprai. The first one is qsub
mode: a single master process submits child jobs by qsub.

.. This mode runs faster and more reliably than the second mode. However,
   there is a drawback: there is no way of restarting the process once a
   child process fails.
.. Anyway, this mode is the most extensively tested, so you should use it if
   your target genome is small enough to be processed with a small number of
   nodes and thus with little chance of failure.

Currently, Sprai supports Sun Grid Engine (SGE) or equivalents (e.g., N1GE,
UGE).
To correct sequencing errors in PacBio Continuous Long Reads and also
assemble them, specify *blast_path* and *sprai_path* in ec.spec and do as
follows::

  ezez4qsub_vx1.pl ec.spec pbasm.spec > log 2>&1 &

.. or
.. ::
..   ezez4makefile.pl ec.spec asm.spec > ezez4makefile.log 2>&1 && make &

If you only want error-corrected reads and do not want Sprai (Celera
Assembler) to assemble them, do as follows::

  ezez4qsub_vx1.pl ec.spec /dev/null -ec_only > log 2>&1 &

or ::

  ezez4qsub_vx1.pl ec.spec -ec_only > log 2>&1 &

or ::

  ezez4qsub_vx1.pl ec.spec > log 2>&1 &

.. or
.. ::
..   ezez4makefile.pl ec.spec asm.spec > ezez4makefile.log 2>&1 && make ec_only &

The file data_yyyymmdd_hhmmss/c01.fin.idfq.gz is the corrected read file.

If some child processes fail, do as follows::

  ezez4qsub_vx1.pl ec.spec pbasm.spec -now yyyymmdd_hhmmss

Here, yyyymmdd_hhmmss is the suffix of the data directory that
ezez4qsub_vx1.pl created. This command will detect unfinished jobs and
restart at the appropriate stage.

Multi-node mode 2 (make mode)
------------------------------

This mode can detect unfinished jobs and restart at the appropriate stage
using GNU make. Currently, Sprai supports Sun Grid Engine (SGE) or
equivalents (e.g., N1GE, UGE). ::

  ezez4makefile_v4.pl ec.spec pbasm.spec -submit

The command above will create a Makefile, job files and log files, and submit
the job files using qsub.

.. The second mode works with `TGEW `_, which is a wrapper script of qsub.
   tge_make in the TGEW package interprets the Makefile and submits jobs by
   qsub.
.. ::
..   ezez4makefile_v3.pl ec.spec pbasm.spec > log 2>&1
..   tge_make -bg > tge_make.log 2>&1 &
.. ezez4makefile_v3.pl creates a Makefile, and tge_make processes it. In the
   case of failure, you can just re-execute tge_make to restart. Like the
   make utility, tge_make compares the timestamps of files to see if any
   updates are needed.
.. You can type the following command to see what would be re-executed::
..   tge_make -n
Since this mode submits a significant number of jobs to SGE at once, you may
have to limit the number of partitions so as not to exceed the job number
limit.

.. You might add a make target to limit the number of jobs being submitted
   simultaneously to SGE.
.. For example, if you want only error correction, you can specify the
   ec_only target::
..   tge_make -bg ec_only
.. tge_make actually calls GNU make to analyse dependencies between files,
   so you can give any valid target for GNU make.
.. Before Sprai ver 0.9.6.1.4, Multi-node mode 2 considered only
   'pre_partition'. Since Sprai ver 0.9.6.1.4, the number of jobs submitted
   to SGE has been 'partition' * 'pre_partition' in Multi-node mode 2.

Postprocessing
===============

Once you get contigs from an external de novo assembler (here we assume
Celera Assembler), you might want to polish them, because there are still a
number of ways to improve the assembly.

Re-consensus calling
--------------------

Although Sprai can remove most sequencing errors, some sequencing errors
often remain in a systematic way. For example, two copies of a repetitive
element differing by 0.1 % single nucleotide variants might have been
collapsed into the same sequence during the error correction process. Even in
such a case, you can often reconstruct the exact copies of the two repetitive
elements by exploiting long reads that span the entire element. To this end,
we can use Quiver, a basecalling program developed by Pacific Biosciences.
Since you must have installed the prerequisites for Sprai, you should already
have Quiver on your system.

You can manually prepare the files necessary for running Quiver, but here we
introduce an easy way using the pbalign script, which is again developed by
PacBio. Here are links to the software:

* `pbalign `_
* `blasr `_ (pbalign uses new options of BLASR)

See the documentation on GitHub for installation.
You might also want to see `the installation log by a user `_ (in Japanese,
but the command lines will help you even if you cannot read the language).
Anyway, we assume that both ``pbalign.py`` and blasr work well now. Then you
can type as follows to align the raw reads against the assembly::

  pbalign.py --nproc 8 --forQuiver all.fofn result_here_comes_date_and_time/CA/9-terminator/asm.scf.fasta all_mapped_against_celera.cmp.h5

The ``--nproc`` option specifies the degree of parallelism, so you can change
the number according to the number of CPUs you have. The next argument,
all.fofn, is a file in which the input bax.h5 files (raw data produced by the
PacBio primary pipeline) are listed line by line. The third argument is a
FASTA file that contains the assembly, and the last one is the output file of
``pbalign.py``. The output file is a kind of "PacBio variant of a BAM file";
it basically stores the alignment information as BAM files do. The difference
between cmp.h5 files and BAM files is that a cmp.h5 file also stores
PacBio-specific raw data, which is useful for more accurate consensus
calling.

After creating all_mapped_against_celera.cmp.h5, run Quiver::

  quiver -j 8 all_mapped_against_celera.cmp.h5 -r result_here_comes_date_and_time/CA/9-terminator/asm.scf.fasta -o consensus.fa -o consensus.fq -o variants.gff

The first option specifies the degree of parallelism, so you might want to
change it. The next argument specifies the path of the reference genome
against which the raw reads are aligned. The ``-o`` options give output file
names; note that the file format is determined by the suffix of the output
file. ``consensus.fa`` will be in FASTA format, while ``consensus.fq`` will
be in FASTQ format. The GFF file contains only the differences from the
reference assembly. You usually need a FASTA (or FASTQ) file for the final
result, so ``-o consensus.fa`` alone might be sufficient.
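The ``all.fofn`` file used above is just a plain list of raw .bax.h5 paths,
one per line. A minimal sketch for creating it follows; the ``rawdata/``
location is an assumption about where your primary-pipeline output lives:

```shell
# Hypothetical: collect absolute paths of the raw .bax.h5 files into
# all.fofn, one per line. Point the glob at wherever your PacBio
# primary-pipeline output actually sits.
: > all.fofn
for f in rawdata/*.bax.h5; do
  [ -e "$f" ] || continue        # glob matched nothing; leave all.fofn empty
  echo "$PWD/$f" >> all.fofn
done
```

Absolute paths are used so that the fofn keeps working even if pbalign is
later run from a different directory.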
Circularization
----------------

If your target genome does not contain circular chromosomes, skip this
subsection.

Bacterial genomes often have circular chromosomes, but most de novo
assemblers do not consider circular chromosomes. Since assembling shotgun
sequences into circular chromosomes was just a dream a decade ago, no one
felt that circular chromosomes had to be considered. Assemblers may break a
chromosome somewhere and output a linear contig. Now that PacBio long reads
at 80x coverage can usually be assembled into circular chromosomes, we must
take care of them. The best thing we could do is obviously to develop a new
de novo assembler that takes circular chromosomes seriously, but that takes
time. So here we introduce an ad-hoc solution until we see the best solution
in public.

The Sprai package contains a Perl script, ``check_circularity.pl``, which
checks whether a contig is likely to be circular. Here is how to run it::

  check_circularity.pl consensus.fa tmpdir

The first argument is usually an output from Quiver (or an output from a de
novo assembler if you do not use Quiver). ``check_circularity.pl`` takes
every contig in the input file and checks whether the head and the tail of
the contig overlap each other, using BLAST. If that is the case,
check_circularity.pl cuts the redundant part of the contig and outputs the
result to tmpdir. Output files with ``.cut`` in their names are (quite
likely) circular chromosomes.

Removing redundant contigs
---------------------------

The combination of Sprai + Celera Assembler often yields a lot of small,
fragmented contigs that are likely to be redundant. Those contigs come off
the "real" long contigs for some reason, so they are often artificial and,
therefore, redundant. ``check_redundancy.pl``, in the Sprai package, is a
tool to find such redundant contigs. It uses BLAST to find potentially
redundant contigs, and writes the result to standard output.
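Since the removal is manual, a sketch of one possible workflow: after
reviewing ``check_redundancy.pl``'s report, put the names of the contigs you
decided to drop into a file, one per line, ready for ``fatt``'s ``--file``
option. The contig names below are placeholders, not output from any real
run:

```shell
# Hypothetical: record the contigs chosen (by manual review) for removal.
# These names are placeholders; use the names from your own report.
cat > contig_names.txt <<'EOF'
ctg000000000201
ctg000000000305
EOF
```

Keeping this file around documents exactly which contigs were removed and
makes the removal step reproducible.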
There might be contigs that look redundant but are real, so the elimination
of seemingly redundant contigs is not done automatically. You might find
`fatt `_ useful for contig removal::

  fatt extract --reverse --seq contig_to_remove1 --seq contig_to_remove2 ... consensus.fa > consensus_with_some_removed.fa

If there are too many contigs to put on a command line, just put them in a
file::

  fatt extract --reverse --file contig_names.txt consensus.fa > consensus_with_some_removed.fa

Note that small, circular contigs might be plasmids that share sequence with
the genome.

Renaming contigs
-----------------

After all the postprocessing, you might want to rename the circularized
contigs to human-friendly names. Create an edit script like this::

  loadseq consensus_with_some_removed.fa
  rename "ctg000000000132" "ChromosomeI"
  rename "ctg000000000076" "ChromosomeII"
  saveseq final.fa

and run fatt (explained in the previous section) as follows::

  fatt edit editscript.txt

You finally get final.fa, in which you see the two chromosomes at production
quality.
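The edit script above is a plain text file. One way to write it, shown as a
sketch (the contig names are the examples from the text and will differ in
your assembly):

```shell
# Write the fatt edit script from the text above into editscript.txt.
# The contig names are examples; substitute the names from your assembly.
cat > editscript.txt <<'EOF'
loadseq consensus_with_some_removed.fa
rename "ctg000000000132" "ChromosomeI"
rename "ctg000000000076" "ChromosomeII"
saveseq final.fa
EOF
```

A quoted here-document (``<<'EOF'``) is used so that the script text is
written verbatim, without any shell expansion.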