Newbler (version 2.0) was used to assemble Lanier.454 with parameters set at 100 bp for overlap length and 95% for nucleotide identity. We assessed homopolymer error rate in metagenomic data using two different strategies. Although the use of the TIGR reference assembly resulted in a slightly higher number of sequence errors for both Illumina and Roche 454 data, Illumina consistently showed a smaller number of sequencing errors and the relative error rate between the two platforms was similar to that based on the JGI genome data alone, independent of the reference genome used (Fig. correction. (C) Assemblies were obtained from 502 Mbp of Roche 454 and 2,460 Mbp of Illumina data using established protocols. 2). In addition, given the monetary savings (e.g., we obtained the Illumina data for about one fourth of the cost of the Roche 454 data), Illumina, and short-read sequencing in general, may be a more appropriate method for metagenomic studies. Newbler was used to assemble Roche 454 replicate datasets (about 20× coverage on average), using 50 bp minimal alignment length and 95% alignment identity. Finally, we calculated the average single-base call error rate and gap opening error rate of individual reads of each dataset as follows: raw reads were trimmed using the same standards as described above and subsequently mapped onto the corresponding reference assembly from RefSeq. Copyright: © 2012 Luo et al. Velvet was used to assemble each of these Illumina datasets with K-mer set at 31. School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America. Moreover, Illumina yielded longer and more accurate contigs (e.g., fewer truncated genes due to frameshifts) despite the substantially shorter read length relative to Roche 454 and the comparable average sequencing error in the raw reads of the two platforms (0.5% per base in our hands; Fig. 
It is possible that the remaining 10% of the contig sequences might have been different because of imperfect or uneven splitting of the original DNA sample into the two aliquots sequenced and the fact that the diversity in the sample was not saturated by sequencing (estimates based on rarefaction curves using raw reads indicated that we sampled about 80–85% of the total diversity in the Illumina data). For instance, searching all genes shared between the two assemblies against NCBI's Non Redundant (NR) protein database (Blastx) returned more complete matches with the Lanier.Illumina than the Lanier.454 data, regardless of the identity and e-value threshold used (14% more on average; Fig. Nevertheless, about 1% of the total genes recovered in the Illumina assembly contained homopolymer-associated sequencing errors and this number increased to about 3% when non-homopolymer-associated errors were also taken into account (for contigs showing 10× coverage, on average). Most importantly, different tiles of the sequencing plate tend to produce reads of different quality [14], the 3′ ends of sequences tend to have higher sequencing error rates compared to the 5′ ends [15], and increased single-base errors have been observed in association with GGC motifs [16]. We found that homopolymer errors affected 2.13–2.78% and 0.32–1.02% of the total genes evaluated for the Lanier.454 and Lanier.Illumina data, respectively (dividing by the average gene length, 950 bp, provided the per base error rate; range was estimated from 100 replicates using Jackknife resampling), despite the fact that sequencing error in the raw reads of the two platforms was comparable (0.5% per base, in our hands). Although our metagenomic analysis is based on a single community sample, we believe it is robust and informative. 
https://doi.org/10.1371/annotation/64ba358f-a483-46c2-b224-eaa5b9a33939 Noticeably, due to the inherent biases of the Roche 454 sequencing approach to produce more frameshifts in A and T rich DNA (Fig. Thus, the results reported for Illumina based on the metagenome of Lake Lanier (47 G+C%) should be also applicable to metagenomes with different G+C% contents. The results presented here revealed the errors and limitations as well as the strengths in current metagenomics practice, and should constitute useful guidelines for experimental design and analysis. This resulted in a set of 500 bp long sequence fragments, which were subsequently mapped onto the reference assembly using Blastn. e30087. To estimate the previously described errors associated with GGC motifs in Illumina reads [29], we selected the Roche 454 reads that were covered by at least 10 Illumina reads per base, on average, as reference sequences in Bowtie mapping (86.6 Mbp of reads in total). Lastly, our preliminary evaluation indicates that the latest Illumina sequencer (Hi-Seq 2000) performs similar to Illumina GA-II in terms of read length and quality; hence, our results should be applicable to this sequencer as well. 3), low G+C% genomes sequenced with this platform may have 20% or more genes with frameshift errors whereas the Illumina platform is not affected as much by the G+C% of the sequenced DNA (Fig. 2B, inset). 5), which was consistent with our observations on the assembly N50 values of the metagenomes (Fig. Single-base sequencing errors increased by an average of 2% when non-homopolymer-associated errors were also taken into account for both platforms. https://doi.org/10.1371/journal.pone.0030087.g003. For example, the high coverage of indigenous communities provided by NGS has made it possible to quantitatively assess the impact of diet on human gut microbiota [8] and the diversity of metabolic pathways within marine planktonic communities [9]. 
7); thus, the assembly step did not substantially affect downstream analyses and our conclusions. 1A). The slightly higher single-base accuracy of Roche 454 metagenomic reads relative to that of the isolate genome reads is presumably due to the use of the latest, optimized Roche 454 protocol in the former and slight differences in the performance of the sequencers used. PCC6803 (Cyanobacteria). It is critical to assess the quality of the derived assemblies; to this end, several studies have recently attempted to evaluate the sequencing errors and artifacts specific to each NGS platform. We did not observe a significant difference in error frequency in contigs with higher than 20× coverage (standards on length and coverage for identifying error-prone Illumina contigs are defined in our previous study [18]). Protein-coding genes encoded in the assembled contigs were identified by the MetaGene pipeline [26]. For each genome, a 2D-grid assembly was performed, varying the size of input sequences (20, 30, 40, …, 130) and the K-mer (21, 23, 25, …, 37) of each of the assemblers used (SOAPdenovo and Velvet). The protein-coding sequences of these genomes were compared against their homologs from the two assemblies to determine homopolymer errors, as described above for direct comparisons between the two assemblies. Red bars represent the median, the upper and lower box boundaries represent the upper and lower quartiles, and the upper and lower whiskers represent the largest and smallest observations. 4, which is based on isolate genome data). (B) Error rate (as a percentage of the total genes evaluated, y-axis) increases as homopolymer length increases (x-axis). The genomes were: Candidatus Pelagibacter ubique HTCC1062 (α-Proteobacteria), Opitutus terrae PB901 (Verrucomicrobia), Polaromonas sp. 
Illumina GA II sequencing quality is evaluated in panels E and F, which show: (E) base call error rate of individual reads plotted against the G+C% of the genome; and (F) gap opening error rate of individual reads plotted against the G+C% of the genome. (B) Protein sequences annotated on raw (not assembled) reads matched genes in the reference assembly more frequently for the Roche 454 than the Illumina data. All 2D plots (panels B, D, E, and F) represent the arithmetic average of the medians of each dataset for the same genome; Illumina medians were identical among replicate datasets; therefore, only one value is shown in panel E. The results show that Illumina sequence quality was affected less than that of Roche 454 by the G+C% content of the sequenced DNA (note the lower r-squared value and the slope in E). We performed six independent assemblies, using K=21, 25, 29 for the three SOAPdenovo runs and K=23, 27, 31 for the three Velvet runs. Wrote the paper: CL KTK. Due to frameshifts caused primarily by homopolymer-associated errors in the derived consensus sequence of the contigs, genes from Roche 454 assembly had fewer complete matches in the NR database relative to their Illumina counterparts (inset; results are based on a total of 72,709 gene sequences annotated on contigs that were shared between the two assemblies and were longer than 500 bp). Assembly parameters (primary and secondary x-axes) were evaluated for low (Arcobacter nitrofigilis, 28%; left), medium (Fibrobacter succinogenes, 48%; middle), and high (Cellulomonas flavigena, 74%; right) G+C% genomes. 
Although low coverage contigs (e.g., 1× to 5×) are likely to contain a higher fraction of chimeric sequences than 0.2% according to our previous study [18], such contigs were rare in the results reported here, which included only contigs longer than 500 bp with average coverage 10× or higher (only about 3% of the contigs showed less than 5× coverage; Fig. Finally, gene calling on individual reads (as opposed to assembled contigs) was found to be less error prone in Lanier.454 reads than in Lanier.Illumina reads, mainly due to the longer read length. Finally, our evaluations showed that the choices of parameters and amount of input sequence of the assembly did not have any dramatic effect on the quality of the resulting contigs for both Illumina and Roche 454 assemblies (Fig. The amount of Illumina and Roche 454 input sequence data was chosen so that the ratio of the two was similar to the ratio in the metagenomic analysis (2.5 Gb Illumina reads versus 500 Mbp Roche 454 reads, or 5:1). The matching gene of the assembly from the protein search using BLAT was compared to the gene matched by the raw read using Bowtie and instances of agreements (matched genes), disagreements (mismatched genes) and no match found (BLAT search did not match a gene while Bowtie mapping did) were counted and reported in Fig. Panels A and C represent the variation observed in reads from different (replicate) datasets of the same genome; red bars represent the median, the upper and lower box boundaries represent the upper and lower quartiles, and the upper and lower whiskers represent the largest and smallest observations. Graphs show the calculated base call error rate (A) and gap open error rate (B) for each comparison (figure key). Roche 454 recovered 14% fewer complete genes than Illumina (Fig. 
Consistent with the results from assembled contigs, we obtained 90% of overlapping sequences (80% when the overlapping sequences were expressed as a fraction of the total Illumina dataset) between the two datasets when we performed a similar analysis using all raw (not assembled) reads (Fig. One aliquot was sequenced with the Roche 454 FLX Titanium sequencer (average read length, 450 bp) and the other one with the Illumina GA II (100×100 bp paired-end reads) at Emory University Genomics Facility. The frequency of single-base errors decreased with higher coverage of the corresponding contigs, i.e., the frequency dropped by about ten fold in contigs with 20× coverage relative to contigs with 2× coverage, reaching a plateau at about 20× coverage. Collectively, our results should serve as a useful practical guide for choosing proper sampling strategies and data processing protocols for future metagenomic studies. 4). Department of Energy (DOE) Joint Genome Institute, Walnut Creek, California, United States of America. (A) A's and T's contribute significantly more homopolymer errors than C's and G's. RCC307 (Cyanobacteria), and Synechococcus sp. 2B). 2) should be independent of the NGS platform considered and broadly applicable to short-read sequencing. We identified 0.4 million homopolymers (three identical consecutive nucleotide bases or more), of which 14 thousand (3.3% of the total) disagreed on length between the two assemblies, resulting in alternative amino acid sequences for about 7% of the total 72,709 gene sequences evaluated. 1B. Gene sequences from assembled contigs were extracted and ClustalW2 [31] was used to align the sequences against their orthologs from the reference assembly. To select appropriate genomes, we first identified the putative phylogenetic affiliation of each assembled contig (genus level) in the Lanier.454 and Lanier.Illumina datasets and ranked genera in terms of their abundance. 
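The homopolymer definition used above (three or more identical consecutive nucleotides) can be illustrated with a minimal sketch. The function name `find_homopolymers` and the regular-expression approach are assumptions made for illustration only; they are not taken from the pipeline used in the study, which compared homopolymer lengths between aligned gene sequences of the two assemblies.

```python
import re

# Runs of three or more identical nucleotides, following the definition in the text.
# Restricting the alphabet to A, C, G, T (so that runs of N are ignored) is an assumption.
HOMOPOLYMER = re.compile(r"A{3,}|C{3,}|G{3,}|T{3,}")


def find_homopolymers(seq):
    """Return a list of (start, base, length) for each homopolymer run in seq."""
    runs = []
    for match in HOMOPOLYMER.finditer(seq.upper()):
        run = match.group()
        runs.append((match.start(), run[0], len(run)))
    return runs


if __name__ == "__main__":
    # Toy sequence with a 4-base T run and a 3-base A run.
    for start, base, length in find_homopolymers("ACGTTTTAAAC"):
        print(start, base, length)
```

Comparing the run lengths reported for two aligned copies of the same gene would then flag the length disagreements discussed in the text; the alignment step itself is outside the scope of this sketch.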
As evidence of this, analysis of the assemblies of isolate genomes that were sequenced using both platforms (see below) revealed that the extent of chimeric contigs, i.e., contigs that contained contaminating or in vitro generated sequences, in the Illumina and Roche 454 assemblies was, on average, less than 0.2% of the total length of the assembled contigs. Funding: This research was supported, in part, by the U.S. Department of Energy (award DE-SC0004601). We also measured the percent of the reference genome recovered in each assembly and the degree of chimerism of contigs as follows: A 500 bp window was used to slide through all assembled contig sequences longer than 500 bp with a step of 100 bp. PLoS ONE 7(2): These findings suggest that both NGS technologies are reliable for quantitatively assessing genetic diversity within natural communities. (A) Venn diagram showing the extent of overlapping and platform-specific raw reads between the Lanier.454 and Lanier.Illumina datasets (without assembly). To validate our findings from metagenomics, we performed similar comparative analyses based on eighteen isolate genomes that were sequenced by both Illumina and Roche 454 and showed a range of genome sizes and G+C% content (Table 1). 4). Consistent with the metagenomic observations, we found that Roche 454 assemblies from genome data contained a significantly higher portion of frameshift errors compared to Illumina assemblies from the same genome, when the assemblies were built with 5 times more Illumina data than the Roche 454 data, matching the relative ratio of the metagenomic data reported above. https://doi.org/10.1371/journal.pone.0030087.g007. 6). A similar strategy based on reference genome sequences was used to identify and count non-homopolymer-related, single-base errors. 
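The sliding-window fragmentation described above (a 500 bp window moved in 100 bp steps over contigs longer than 500 bp) can be sketched as follows. The function name `sliding_fragments` is hypothetical, and the handling of the contig tail (a final partial window is simply dropped) is an assumption, since the text does not specify it.

```python
def sliding_fragments(contig, window=500, step=100):
    """Yield (start, fragment) pairs of fixed-size windows along a contig.

    Contigs that are not longer than the window are skipped, mirroring the
    "longer than 500 bp" criterion in the text. Trailing bases that do not
    fill a whole window are ignored (an assumption of this sketch).
    """
    if len(contig) <= window:
        return
    for start in range(0, len(contig) - window + 1, step):
        yield start, contig[start:start + window]


if __name__ == "__main__":
    toy_contig = "ACGT" * 200  # 800 bp toy sequence
    for start, fragment in sliding_fragments(toy_contig):
        print(start, len(fragment))
```

Each fragment could then be written to a FASTA file and searched against the reference assembly; that search step is not reproduced here.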
These results were attributable to a higher number of (artificial) frameshifts, caused by homopolymer-associated base call errors, present in the Lanier.454 versus the Lanier.Illumina assembled sequences. 1D). Roche 454 sequencing quality is evaluated in panels A through D, which show: (A) base call error rate of individual reads (x-axis) for each genome evaluated (y-axis); (B) base call error rate (y-axis) plotted against the G+C% of the genome; (C) gap opening error rate of individual reads (x-axis) for each genome evaluated (y-axis); (D) gap opening error rate (y-axis) plotted against the G+C% of the genome. Hence, the majority of non-homopolymer-associated errors remain challenging to model and thus, to correct. Performed the experiments: CL DT. Contributed reagents/materials/analysis tools: NK TR. These errors were not observed in the Illumina data, presumably due to both the high sequence coverage that greatly facilitated the resolution of homopolymer ambiguities and the less pronounced sequencing biases of Illumina (Fig. 29 Mar 2012: We extracted the predicted gene sequences from the reads and the corresponding amino acid sequences were searched against the genes of the reference assembly of the same dataset using BLAT [28]. Individual reads were mapped against the assembled contigs using Bowtie [25] with default settings to calculate average contig coverage. Lanier.454 and Lanier.Illumina reads were trimmed at both the 5′ and 3′ ends using a Phred quality score cutoff of 20. We compared the reads from the Lanier.Illumina dataset against the Lanier.454 dataset to identify the fraction of reads shared between the two datasets. (D) Number of Roche 454 (x-axis) and Illumina (y-axis) reads mapping on the same contig shared between the two assemblies. 2). 
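The end-trimming step mentioned above (both read ends, Phred quality cutoff of 20) can be sketched as below. The text does not state the exact trimming algorithm or the quality encoding, so this sketch assumes the simplest variant: bases are removed from each end until the first base with quality at or above the cutoff is reached, and quality strings use the Phred+33 ASCII encoding. The function name `trim_read` is hypothetical.

```python
def trim_read(seq, qual, cutoff=20, offset=33):
    """Trim low-quality bases from both ends of a read.

    seq    -- nucleotide string
    qual   -- FASTQ quality string of the same length (Phred+33 assumed)
    cutoff -- minimum Phred score a terminal base must have to be kept
    """
    scores = [ord(ch) - offset for ch in qual]

    # Advance from the 5' end past bases below the cutoff.
    start = 0
    while start < len(scores) and scores[start] < cutoff:
        start += 1

    # Retreat from the 3' end past bases below the cutoff.
    end = len(scores)
    while end > start and scores[end - 1] < cutoff:
        end -= 1

    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    # '#' encodes Q2, '5' encodes Q20, 'I' encodes Q40 under Phred+33.
    print(trim_read("ACGTAC", "#II5I#"))
```

Internal low-quality bases are left untouched in this variant; window-based trimmers behave differently and may be closer to what a given pipeline does.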
We found that about 90% of the Roche 454 unique contig sequences overlapped with Illumina contig sequences (Fig. https://doi.org/10.1371/journal.pone.0030087.g006. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Analyzed the data: CL. For this, Blastn [30] was employed to search all gene sequences annotated in the Lanier.454 assembly against those in the Lanier.Illumina assembly. We used the isolate genome data to evaluate the effect of the parameters of the assembly on the quality of the contigs as follows: a series of assemblies were obtained for genomes of low (Arcobacter nitrofigilis, 28%), medium (Fibrobacter succinogenes, 48%), and high (Cellulomonas flavigena, 74%) G+C% content. Algorithms that detect and correct these errors are being developed and incorporated into existing data processing pipelines. JS666 (β-Proteobacteria), Polynucleobacter necessarius STIR1 (β-Proteobacteria), Synechococcus sp. 2A, inset; and in [18]). Therefore, the two platforms provided comparable in situ abundances for the same genes or genomes. Next generation sequencing (NGS) technologies, such as the Roche 454, Illumina/Solexa, and, to a lesser extent, ABI SOLiD, have been cornerstones in this revolution [5], [6], [7]. Nine Illumina and eight Roche 454 assemblies from independent replicate datasets of the Fibrobacter succinogenes subsp. NGS platforms continue to improve, while new major advancements in sequencing chemistries are on the horizon [22], creating a lot of excitement among microbial ecologists and engineers. Assemblies were obtained for each possible combination and the base call error and gap opening error of the resulting assemblies were determined as described for individual reads above. To compare the quality of Illumina vs. 
Roche 454 contigs assembled from isolate genome data the following approach was followed: Illumina data for each genome was randomly sampled to form several technical replicate datasets, each of which provided about 100× coverage of the reference assembly, on average. For instance, protein sequences called on Lanier.454 reads had 10% more Blastp matches to reference genes from the Lanier.454 assembly than did protein sequences from Lanier.Illumina reads against the Lanier.Illumina reference assembly (Fig. We also estimated the abundance of each contig shared between the two assemblies by counting the number of reads composing the contig, which can be taken as a proxy of the abundance of the corresponding DNA sequence in the sample [19]. PLOS ONE 7(3): 10.1371/annotation/64ba358f-a483-46c2-b224-eaa5b9a33939. 1B). First, we examined disagreements in gene sequences annotated on contigs larger than 500 bp and shared between the Lanier.454 and Lanier.Illumina assemblies. We compared the two most frequently used platforms, the Roche 454 FLX Titanium and the Illumina Genome Analyzer (GA) II, on the same DNA sample obtained from a complex freshwater planktonic community. We found a strong linear correlation (r² > 0.99) between the Roche 454 and Illumina data in this respect (Fig. The quality of the resulting contigs was examined in terms of base call error (C) and gap opening error (D), which revealed that the combination of the parameters of the assembly did not have a dramatic effect on the quality of the contigs except in the extreme values of the minimal aligned length (see projected contours on x-z and y-z space), which were avoided in our direct comparisons of Illumina versus Roche 454 assemblies. The higher sequence error rate observed for the TIGR reference genome might be due to the different strain of F. 
succinogenes sequenced or differences in the sequencing platforms or the assembly protocols used by JGI and TIGR. To eliminate the possibility that our results were biased by the selection of reference genomes, we used the reference assembly of Fibrobacter succinogenes subsp. Samples were collected from Lake Lanier, Atlanta, GA, below the Browns Bridge in August 2009 and community DNA was extracted as described previously [17].