Saturday, July 6, 2013

Tools for NGS analysis - fastq file processing

As I have recently been working hands-on (although very slowly) with the output of the HiSeq1500 in our facility, I needed to find, test and validate some tools for handling fastq files for various purposes. Below I list some of them for those who are starting, or will start, this sort of work.

For various kinds of filtering/trimming

seqtk - a fixed version that retains full sequence names (or 'comments') is here [see a post at BioStar]

fastx-tools (of many tools therein, I am using fastx_trimmer and fastq_quality_filter)

prinseq (of many options, I use 'trim_left/right' and 'derep')

condetri - ... I could not get this working in the way I wanted
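
To make concrete what fixed-position trimming and quality filtering do to each read, here is a minimal Python sketch. It is not the implementation of any of the tools above; the function names, the default thresholds, and the Phred+33 quality offset are my own assumptions for illustration, and it assumes plain 4-line fastq records.

```python
import io
from typing import Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle) -> Iterator[Record]:
    """Yield records from a fastq stream with exactly 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim(seq: str, qual: str, left: int = 0, right: int = 0) -> Tuple[str, str]:
    """Remove a fixed number of bases from the 5' (left) and 3' (right) ends."""
    end = len(seq) - right
    return seq[left:end], qual[left:end]


def passes_quality(qual: str, min_q: int = 20, min_fraction: float = 0.8,
                   offset: int = 33) -> bool:
    """True if at least min_fraction of the bases have quality >= min_q.

    The offset of 33 (Phred+33) is an assumption; older Illumina output used 64.
    """
    if not qual:
        return False
    good = sum(1 for c in qual if ord(c) - offset >= min_q)
    return good / len(qual) >= min_fraction


if __name__ == "__main__":
    demo = io.StringIO("@r1 comment\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\n!!!!\n")
    for header, seq, qual in read_fastq(demo):
        seq, qual = trim(seq, qual, left=1, right=1)
        if passes_quality(qual):
            print(header, seq, qual, sep="\n")
```

Note that the header line is passed through untouched, including anything after the first space, which matters later when mates have to be matched up again.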

For merging overlapping paired-end reads


See this external blog post (from Nov. 2012) for more info

For removing adaptor sequences etc.


cutadapt - Can't this tool accept multiple adapter sequences in a multifasta file?

For retrieving paired reads after read filtering
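
The idea behind this step is simple: once each of the two fastq files has been filtered independently, some reads have lost their mate, so the files must be brought back into sync. A minimal Python sketch of that idea follows. It is my own illustration, not the code of any particular tool; `read_id` and `repair` are hypothetical names, and it assumes the read ID is the first whitespace-delimited token of the header, optionally ending in /1 or /2.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_id(header: str) -> str:
    """Return the part of the header shared by both mates of a pair."""
    name = header[1:].split()[0]  # drop '@' and any comment after the space
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]  # old-style mate suffix
    return name


def repair(fwd: List[Record], rev: List[Record]):
    """Split filtered reads into intact pairs and orphaned singletons."""
    rev_by_id = {read_id(r[0]): r for r in rev}
    pairs, singles_fwd, paired_ids = [], [], set()
    for rec in fwd:
        rid = read_id(rec[0])
        mate = rev_by_id.get(rid)
        if mate is None:
            singles_fwd.append(rec)  # mate was filtered out
        else:
            pairs.append((rec, mate))
            paired_ids.add(rid)
    singles_rev = [r for r in rev if read_id(r[0]) not in paired_ids]
    return pairs, singles_fwd, singles_rev


if __name__ == "__main__":
    fwd = [("@r1/1", "ACGT", "IIII"), ("@r2/1", "GGCC", "IIII")]
    rev = [("@r2/2", "TTAA", "IIII"), ("@r3/2", "CCGG", "IIII")]
    pairs, only_fwd, only_rev = repair(fwd, rev)
    print(len(pairs), len(only_fwd), len(only_rev))
```

This keeps the whole reverse file in memory, which is fine for a test but probably not for a full HiSeq lane.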


For removing 'duplicates'

filterPCRdupl (I will not use this any more because 'prinseq -derep 4' does exactly the same thing much faster)
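
For readers new to the concept, sequence-based duplicate removal can be sketched in a few lines of Python. This is only an illustration of the general idea (keep the first read seen for each distinct sequence, optionally treating a reverse complement as the same sequence); it makes no claim to reproduce what filterPCRdupl or prinseq actually do, and the names `dereplicate` and `include_revcomp` are hypothetical.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)

# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dereplicate(records: Iterable[Record],
                include_revcomp: bool = False) -> List[Record]:
    """Keep the first record for each distinct sequence.

    With include_revcomp=True, a read whose reverse complement has already
    been seen is also treated as a duplicate.
    """
    seen = set()
    kept = []
    for header, seq, qual in records:
        key = seq.upper()
        if key in seen:
            continue
        if include_revcomp and reverse_complement(key) in seen:
            continue
        seen.add(key)
        kept.append((header, seq, qual))
    return kept


if __name__ == "__main__":
    reads = [("@a", "AACC", "IIII"), ("@b", "AACC", "IIII"),
             ("@c", "GGTT", "IIII"), ("@d", "ACGA", "IIII")]
    print([h for h, _, _ in dereplicate(reads)])
    print([h for h, _, _ in dereplicate(reads, include_revcomp=True)])
```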

For validating the tools' functions

fastqc [a tutorial video is also available on YouTube]


prinseq -stats_all

There are surely more useful tools that I have not listed here. Please first search Google with some keywords and look into the 'Bioinformatics' forum at SEQanswers to get the latest information. The SEQanswers Wiki also provides a list of tools.