Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation
BMC Genomics 2015, 16:977. doi:10.1186/s12864-015-2007-1
We showed that the length distribution of prepared library inserts differs from that of the fragments actually sequenced, and we provide some tips for matching the insert length to the read length. For example, we prepared libraries with RNA fragmentation of 2 and 4 minutes (instead of the 8 minutes specified in the TruSeq RNA library preparation protocol), to be sequenced with 2 x 171 cycles in HiSeq Rapid Mode (also see this article for the reason why we use 171 cycles).
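To make the insert-versus-read-length relationship concrete, here is a minimal sketch of the arithmetic for a single read pair. It is not taken from the paper: the function name `paired_read_geometry` and the example insert lengths are hypothetical, and only the read length of 171 comes from the text above. It assumes both reads have the same length and ignores quality trimming.

```python
def paired_read_geometry(insert_len, read_len=171):
    """Describe how one read pair covers an insert (lengths in bases).

    Illustrative helper, not part of any published pipeline.
    """
    if insert_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    # Bases at the end of each read that run past the insert into the adapter.
    adapter_readthrough = max(0, read_len - insert_len)
    # Insert bases covered by both reads of the pair.
    overlap = max(0, min(insert_len, 2 * read_len - insert_len))
    # Insert bases in the middle that neither read reaches.
    unsequenced_gap = max(0, insert_len - 2 * read_len)
    return {
        "adapter_readthrough": adapter_readthrough,
        "overlap": overlap,
        "unsequenced_gap": unsequenced_gap,
    }


if __name__ == "__main__":
    # Hypothetical insert lengths: shorter than one read, between one and
    # two reads, and longer than two reads (2 x 171 = 342 bases).
    for insert_len in (150, 300, 500):
        print(insert_len, paired_read_geometry(insert_len))
```

With 2 x 171 cycles, an insert shorter than 171 bases yields reads that run into the adapter, while an insert longer than 342 bases leaves its middle unsequenced. The fragmentation time is one of the knobs that shifts the insert length distribution relative to these two thresholds.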
The paper also deals with post-sequencing steps. You may find the de novo assembly part insufficiently explored, and we admit that there are more programs and settings that could be tested. The strength of the paper lies rather in the assessment of assembly results. As background, the long-standing solution for assessing assembly completeness was the CEGMA pipeline, developed by the Korf Lab. In May 2015 it was announced that CEGMA is no longer supported and that its role is taken over by BUSCO. In our paper, we derived a reference gene set of 233 genes conserved throughout vertebrates, including sea lamprey and elephant shark (or ghost shark), which can be fed into both CEGMA and BUSCO. This new gene set, CVG (core vertebrate genes), enables more accurate completeness assessment and, especially when used with BUSCO, saves a lot of time. In fact, in the course of our benchmarks, we noticed some shortcomings of BUSCO, one of which is that its original reference gene set, 'vBUSCO', which supposedly targets vertebrates, was built without cyclostomes and cartilaginous fishes.
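As a rough illustration of what a completeness score against a fixed reference gene set means, the sketch below turns per-gene detection calls into percentages. It does not reproduce the output format or scoring rules of CEGMA or BUSCO; the function name, the status labels ("complete", "partial", "missing") and the input dictionary are assumptions made for this example, and only the gene set size of 233 comes from the text above.

```python
from collections import Counter

ALLOWED_STATUSES = {"complete", "partial", "missing"}  # hypothetical labels


def summarize_completeness(calls, n_reference=233):
    """Summarize per-gene calls as completeness percentages.

    calls maps a reference gene ID to one of ALLOWED_STATUSES.
    Reference genes absent from calls are counted as missing.
    """
    if len(calls) > n_reference:
        raise ValueError("more calls than reference genes")
    counts = Counter(calls.values())
    unknown = set(counts) - ALLOWED_STATUSES
    if unknown:
        raise ValueError(f"unknown status labels: {sorted(unknown)}")
    complete = counts["complete"]
    partial = counts["partial"]
    missing = n_reference - complete - partial
    return {
        "complete_pct": 100.0 * complete / n_reference,
        "complete_or_partial_pct": 100.0 * (complete + partial) / n_reference,
        "missing": missing,
    }


if __name__ == "__main__":
    # Made-up example: 200 genes found in full, 20 found in part.
    example = {f"gene{i}": "complete" for i in range(200)}
    example.update({f"gene{i}": "partial" for i in range(200, 220)})
    print(summarize_completeness(example))
```

The point of the sketch is that the score depends directly on which genes are in the reference set: a gene that is absent from a whole lineage would be counted as missing no matter how good the assembly is.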