Yang, Jihoon (Department of Civil and Environmental Engineering, Yonsei University),
Howe, Adina (Department of Agricultural and Biosystems Engineering, Iowa State University),
Lee, Jaejin (Department of Agricultural and Biosystems Engineering, Iowa State University),
Yoo, Keunje (Department of Environmental Engineering, Korea Maritime and Ocean University),
Park, Joonhong (Department of Civil and Environmental Engineering, Yonsei University)
The identification of human bacterial pathogens is critical for environmental microbial risk assessment. However, current methods for identifying pathogens in environmental samples are limited in their ability to detect highly diverse bacterial communities and to accurately differentiate pathogens from commensal bacteria. In the present study, we suggest an improved approach that combines identification results obtained from multiple databases, including the multilocus sequence typing (MLST) database, the virulence factor database (VFDB), and the pathosystems resource integration center (PATRIC) database, to resolve current challenges. By integrating the identification results from multiple databases, potential bacterial pathogens in metagenomes were identified and classified into eight different groups. Based on the distribution of genes in each group, we proposed an equation to calculate the metagenomic pathogen identification index (MPII) of each metagenome from the weighted abundance of identified sequences in each database. We found that the accuracy of pathogen identification was improved by using combinations of multiple databases compared to that of individual databases. When the approach was applied to environmental metagenomes, metagenomes associated with activated sludge had higher estimated MPII values than those from other environments (i.e., drinking water, ocean water, ocean sediment, and freshwater sediment). The calculated MPII values were statistically distinguishable among different environments (p < 0.05). These results demonstrate that the suggested approach allows for more accurate identification of the pathogens associated with metagenomes.
Hypothesis
Given the limitations of each individual database for annotating pathogens, we hypothesize that utilizing a combination of identification results of all three databases would offset the drawbacks of each individual database and perform more accurate pathogen annotation. In this study, we used artificial metagenomes to compare the accuracy of pathogen identification between single databases (i.e., the MLST database, VFDB, and PATRIC database) and our suggested approach. In addition, a quantitative index, which summarizes qualitative information obtained from multiple databases, can be helpful to convey a comprehensive understanding of the complex pathogen identification results [32].
Proposed method
0). An E-value threshold, representing alignment similarity and alignment length, was set to 10⁻³ to minimize misannotations [50, 52], and the aligned match with the highest similarity (e.g., highest bit-score) was chosen for further analysis. For pathogen identification using the MLST database, sequences associated with pathogens in metagenome datasets were considered a positive alignment only if all the housekeeping genes corresponding to a gene profile were also found in the metagenome, as previously described [15].
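The two filtering rules described above can be illustrated with a minimal sketch. The tuple layout of the alignment hits, the inclusive treatment of the cutoff, and the profile dictionary are assumptions made for illustration; they are not taken from the study's pipeline.

```python
E_VALUE_MAX = 1e-3  # threshold stated in the text; treating it as inclusive is an assumption


def best_hits(hits):
    """Keep, for each query, the passing hit with the highest bit score.

    `hits` is an iterable of (query_id, subject_id, e_value, bit_score)
    tuples. This layout is hypothetical and chosen only for the sketch.
    """
    best = {}
    for query, subject, e_value, bit_score in hits:
        if e_value > E_VALUE_MAX:
            continue  # discard alignments above the E-value threshold
        if query not in best or bit_score > best[query][2]:
            best[query] = (subject, e_value, bit_score)
    return best


def mlst_positive_profiles(matched_genes, profiles):
    """Return the profiles whose housekeeping genes were all detected.

    `matched_genes` is the set of housekeeping genes found in a metagenome;
    `profiles` maps a (hypothetical) profile name to its required gene set.
    """
    return [name for name, genes in profiles.items() if genes <= matched_genes]
```

For example, a profile that requires `adk` and `gyrB` counts as positive only when both genes appear in `matched_genes`; a metagenome containing just one of them would not be reported.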
This result is associated with the observation that some nonpathogens are phylogenetically similar to known pathogens and were intentionally included in the artificial metagenomes. Further, the microbial diversity of an environmental sample is much higher than that of the artificial metagenomes, which results in a lower pathogenicity estimate.
In this study, an improved approach for metagenomic pathogen identification was provided by utilizing the currently available pathogen sequence databases more effectively. It was also confirmed that the abundance of pathogen sequences correlated with MPII values of tested metagenomes.
Pathogenic sequences within the artificial metagenomes were annotated against each of the three customized databases, and the identification results were integrated for further analysis (Fig. S3). Using the MLST database alone, 37 out of 50 pathogens were correctly identified.
Importantly, the MPII values can be used for sample-to-sample comparison to get a brief but comprehensive understanding of how many verified or suspected pathogenic sequences exist in metagenomes, but they do not directly indicate the magnitude of the risk. Rather, the methodology suggested in this study was focused more on minimizing false negatives, enhancing the accuracy of identification, and providing an index that can be compared across metagenomes. As an early screening tool, the suggested approach can contribute to improving the ability to identify pathogens in the environment and complementing culture-based screening of indicator pathogens and existing molecular biological tools.
We estimated the specificity (the true negative rate) and the sensitivity (the true positive rate) for each approach. Sensitivity and specificity were calculated as TP/(TP+FN) and TN/(TN+FP), respectively. The ROC curve was drawn with 1-specificity on the x axis and sensitivity on the y axis [42]. The accuracy of each approach was assessed by calculating the area under the ROC curve (AUC). A classification model can be considered as effective when the AUC value is 0.
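As a worked illustration of these definitions, the sketch below computes sensitivity, specificity, and a trapezoidal AUC from (1-specificity, sensitivity) points. The function names are hypothetical, and the trapezoidal rule is one common way to estimate the AUC; the text does not state which method was used.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def roc_auc(points):
    """Trapezoidal area under a ROC curve.

    `points` holds (1 - specificity, sensitivity) pairs; the end points
    (0, 0) and (1, 1) are added so that the curve spans the whole x axis.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

If, for instance, 37 of 50 pathogens are detected and the 13 missed ones are all counted as false negatives, `sensitivity(37, 13)` gives 0.74.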
The identification of pathogens in environmental metagenomes using our approach showed that each pathogen annotation group in our cumulative database had distinctive pathogens that affected the estimated MPII values. This result indicates that the suggested approach could be used not only to calculate the MPII values of metagenomes but also to identify the types of pathogens within a sample. The Pseudomonas genus (P.
Target data
To examine the applicability of the suggested approach, publicly available metagenomes from diverse environments were collected and tested. A total of 70 environmental metagenomes were collected from MG-RAST (https://www.mg-rast.org/) and the NCBI sequence read archive (SRA; https://www.ncbi.nlm.nih.gov/sra) (Table S2). The environments of the collected metagenomes can be classified into (i) wastewater-treatment activated sludge (32 metagenomes, W1-W32, [45]); (ii) drinking water (29 metagenomes, D1-D29, [46, 47]); (iii) sediments (5 metagenomes, S1-S5, [48]); and (iv) ocean water (4 metagenomes, O1-O4, [49]).
These artificial metagenomes were used to assess the ability to identify pathogens using a single database and our integrated approach (Table S1). The 50 pathogens were selected from verified pathogens in the MLST database, VFDB, PATRIC database, and the National Institute of Allergy and Infectious Diseases (NIAID, https://www.niaid.nih.gov/research/emerging-infectious-diseases-pathogens). To investigate the effect of the phylogenetic closeness on the accuracy of identification, the genomes of nonpathogenic bacteria were also included [33-35].
The MLST database (https://pubmlst.org/data/, accessed on 02/15/2019) was downloaded; it contained 251,429 sequences of 132 pathogenic species. The VFDB included 32,522 sequences of virulence genes that originated from 262 pathogenic species (http://www.
Theory/Model
A phylogenetic tree to describe membership in artificial metagenome datasets and to compare the phylogenetic relationship among pathogens and nonpathogens was constructed using BioEdit version 7.2.5 [36] and MEGA X [37] with the following parameters: neighbor-joining method, the bootstrap method with 1,000 replications, and maximum composite likelihood substitution (Fig. S1).
The least squares method was used for the derivation of coefficients. Variables with strong collinearity were excluded from the analysis, and a stepwise regression procedure was employed to select the independent variables that would result in the optimal equation. For training the model, 70% of the entire dataset was used, and the remaining 30% was used to validate MLR model performance [54].
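A least-squares fit with a 70/30 train/validation split can be sketched as follows. This is a plain-Python illustration under stated assumptions: the inclusion of an intercept, the random split with a fixed seed, and all function names are hypothetical, and the collinearity screening and stepwise variable selection mentioned above are not reproduced here.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (intercept assumed)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)  # [intercept, coef_1, coef_2, ...]


def predict(coefs, row):
    """Evaluate the fitted linear model on one row of predictors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def train_validation_split(n, train_fraction=0.7, seed=0):
    """Shuffle indices 0..n-1 and split them into training and validation sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]
```

In use, the coefficients would be fitted on the rows selected by the training indices, and `predict` would then be compared against the held-out targets to judge model performance.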
If a sequence was detected by multiple pathogen associated databases, higher weighting coefficients were assigned to the sequence based on its classified group. To derive the weighting coefficient of each group, a multiple linear regression (MLR) model was applied [53]. The proportion of associated pathogen sequences (i.
Performance/Effect
The one nonpathogenic Bacillus, phylogenetically close to other pathogenic Bacillus species, was annotated as a pathogen. Compared to the single database approach, the suggested approach significantly improved sensitivity (0.96) and showed the highest accuracy (0.97). These results also demonstrated that the suggested approach resulted in the fewest false-negative identifications of pathogens.
For example, the nonpathogenic sequences originating from, or phylogenetically close to, Bacillus were classified as pathogens by the VFDB and PATRIC database. In the studied metagenomes, the MLST database and the VFDB could annotate only up to 0.15% and 0.42% of total metagenomic sequences, respectively. In contrast, the PATRIC database was capable of annotating as many sequences as our suggested approach combining all three databases (Table 3).
01). Overall, the average MPII value of activated sludge metagenomes was estimated to be 3.87, with a range of -0.65 to 10.73. The higher MPII values of the activated sludge metagenomes were mainly due to the detection of Clostridium perfringens and Campylobacter jejuni in Groups VP, V, and P (Fig.
A receiver operating characteristic (ROC) analysis [41] was conducted to compare pathogen annotation by the single database approach with that of the suggested approach. Results of pathogen detection can be classified into one of four cases: true positive (TP), false positive (FP, i.e., a nonpathogenic bacterium misannotated as a pathogen), true negative (TN), and false negative (FN, i.e., a pathogen misannotated as a nonpathogen).
The method proposed in this study differs from a single database-based pathogen identification method not only because this approach is capable of using as many sequences as possible but also because we did not assign equal weights to the pathogen-associated databases. For example, virulence factors are correlated with the expression of pathogenicity [58] and may be a stronger indication of pathogenicity than an associated housekeeping gene.
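The idea of unequal weighting can be sketched as a weighted sum over annotation groups. The letter codes (M, V, P), the grouping by database combination, the weight values, and the intercept are all assumptions for illustration; the actual MPII equation and its coefficients are not given in this excerpt.

```python
from collections import Counter

# M = MLST, V = VFDB, P = PATRIC. These letter codes are assumed from
# group names such as "VP" that appear in the text.
DB_ORDER = "MVP"


def group_label(databases_hit):
    """Build a label such as 'VP' from the databases that annotated a sequence."""
    return "".join(db for db in DB_ORDER if db in databases_hit)


def group_proportions(hits_per_sequence, total_sequences):
    """Fraction of all metagenome sequences that fall into each group."""
    counts = Counter(
        group_label(dbs) for dbs in hits_per_sequence.values() if dbs
    )
    return {group: n / total_sequences for group, n in counts.items()}


def weighted_index(proportions, weights, intercept=0.0):
    """Hypothetical index: intercept plus weight times proportion per group."""
    return intercept + sum(
        weights.get(group, 0.0) * p for group, p in proportions.items()
    )
```

With illustrative weights such as `{"MVP": 3.0, "VP": 2.0, "P": 1.0}`, a sequence found by several databases contributes more to the index than one found by a single database, which is the behaviour the text describes.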
References (58)
1 Furuse Y 2019 Analysis of research intensity on infectious disease by disease burden reveals which infectious diseases are neglected by researchers Proc. Natl. Acad. Sci. USA 116 478 483 10.1073/pnas.1814484116 30598444
2 Hay SI Abajobir AA Abate KH 2017 Global, regional, and national disability-adjusted life-years (DALYs) for 333 diseases and injuries and healthy life expectancy (HALE) for 195 countries and territories, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016 Lancet 390 1260 1344 10.1016/S0140-6736(17)32130-X 28919118
4 Pérez-Losada M Cabezas P Castro-Nallar E Crandall KA 2013 Pathogen typing in the genomics era: MLST and the future of molecular epidemiology Infect. Genet. Evol. 16 38 53 10.1016/j.meegid.2013.01.009 23357583
5 Roche A Hammerl JA Appel B Dieckmann R Dahouk SA 2015 FISHing for bacteria in food - A promising tool for the reliable detection of pathogenic bacteria? Food Microbiol. 46 395 407 10.1016/j.fm.2014.09.002 25475309
6 Li L Mendis N Trigui H Oliver JD Faucher SP 2014 The importance of the viable but non-culturable state in human bacterial pathogens Front. Microbiol. 5 258 10.3389/fmicb.2014.00258 24917854
8 Panicker G Call DR Krug MJ Bej AK 2004 Detection of pathogenic Vibrio spp. in shellfish by using multiplex PCR and DNA microarrays Appl. Environ. Microbiol. 70 7436 7444 10.1128/AEM.70.12.7436-7444.2004 15574946
9 Vora GJ Meador CE Bird MM Bopp CA Andreadis JD Stenger DA 2005 Microarray-based detection of genetic heterogeneity, antimicrobial resistance, and the viable but nonculturable state in human pathogenic Vibrio spp Proc. Natl. Acad. Sci. USA 102 19109 19114 10.1073/pnas.0505033102 16354840
10 Chapela MJ Garrido-Maestu A Cabado AG 2015 Detection of foodborne pathogens by qPCR: a practical approach for food industry applications Cogent. Food Agric. 1 1 19 10.1080/23311932.2015.1013771
11 Yang X Noyes NR Doster E Martin JN Linke LM Magnuson RJ 2016 Use of metagenomic shotgun sequencing technology to detect foodborne pathogens within the microbiome of the beef production chain Appl. Environ. Microbiol. 82 2433 2443 10.1128/AEM.00078-16 26873315
12 Mohiuddin MM Salama Y Schellhorn HE Golding GB 2017 Shotgun metagenomic sequencing reveals freshwater beach sands as reservoir of bacterial pathogens Water Res. 115 360 369 10.1016/j.watres.2017.02.057 28340372
13 Iseki H Alhassan A Ohta N Thekisoe OMM Yokoyama N 2007 Development of a multiplex loop-mediated isothermal amplification (mLAMP) method for the simultaneous detection of bovine Babesia parasites J. Microbiol. Methods 71 281 287 10.1016/j.mimet.2007.09.019 18029039
14 Wylezich C Papa A Beer M 2018 A versatile sample processing workflow for metagenomic pathogen detection Sci. Rep. 8 13108 10.1038/s41598-018-31496-1 30166611
15 Zolfo M Tett A Jousson O Donati C Segata N 2017 MetaMLST: multi-locus strain-level bacterial typing from metagenomic samples Nucleic. Acids Res. 45 e7 10.1093/nar/gkw837 27651451
16 Wattam AR Abraham D Dalay O Disz TL Driscoll T Gabbard JL 2014 PATRIC, the bacterial bioinformatics database and analysis resource Nucleic Acids Res. 42 D581 D591 10.1093/nar/gkt1099 24225323
17 Chen L Yang J Yu J Yao Z Sun L Shen Y Jin Q 2005 VFDB: a reference database for bacterial virulence factors Nucleic Acids Res. 33 D325 D328 10.1093/nar/gki008 15608208
18 Chan MS Maiden MCJ Spratt BG 2001 Database-driven Multi Locus Sequence Typing (MLST) of bacterial pathogens Bioinformatics 17 1077 1083 10.1093/bioinformatics/17.11.1077 11724739
19 Larsen MV Cosentino S Rasmussen S Friis C Hasman H Marvig RL 2012 Multilocus sequence typing of total-genome-sequenced bacteria J. Clin. Microbiol. 50 1355 1361 10.1128/JCM.06094-11 22238442
20 Cai L Zhang T 2013 Detecting human bacterial pathogens in wastewater treatment plants by a high-throughput shotgun sequencing technique Environ. Sci. Technol. 47 5433 5441 10.1021/es400275r 23594284
21 Waller AS Yamada T Kristensen DM Kultima JR Sunagawa S Koonin E V 2014 Classification and quantification of bacteriophage taxa in human gut metagenomes ISME J. 8 1391 1402 10.1038/ismej.2014.30 24621522
22 Gillespie JJ Wattam AR Cammer SA Gabbard JL Shukla MP Dalay O 2011 PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species Infect. Immun. 79 4286 4298 10.1128/IAI.00207-11 21896772
23 Comas I Homolka S Niemann S Gagneux S 2009 Genotyping of genetically monomorphic bacteria: DNA sequencing in Mycobacterium tuberculosis highlights the limitations of current methodologies PLoS One 4 e7815 10.1371/journal.pone.0007815 19915672
24 Jolley KA Maiden MC 2013 Automated extraction of typing information for bacterial pathogens from whole genome sequence data: Neisseria meningitidis as an exemplar Euro Surveill. 18 20379 10.2807/ese.18.04.20379-en 23369391
25 Jordan K McAuliffe O 2018 Chapter Seven - Listeria monocytogenes in foods Adv. Food. Nutr. Res. 86 181 213 10.1016/bs.afnr.2018.02.006 30077222
26 Zheng LL Li YX Ding J Guo XK Feng KY Wang YJ 2012 A comparison of computational methods for identifying virulence factors PLoS One 7 e42517 10.1371/journal.pone.0042517 22880014
27 Niu C Yu D Wang Y Ren H Jin Y Zhou W 2013 Common and pathogen-specific virulence factors are different in function and structure Virulence 4 473 482 10.4161/viru.25730 23863604
28 Yang X Noyes NR Doster E Martin JN Linke LM Magnuson RJ 2016 Use of metagenomic shotgun sequencing technology to detect foodborne pathogens within the microbiome of the beef production Chain Appl. Environ. Microbiol. 82 2433 2443 10.1128/AEM.00078-16 26873315
29 Schnoes AM Brown SD Dodevski I Babbitt PC 2009 Annotation error in public databases: misannotation of molecular function in enzyme superfamilies PLoS Comput. Biol. 5 e1000605 10.1371/journal.pcbi.1000605 20011109
31 Brettin T Davis JJ Disz T Edwards RA Gerdes S Olsen GJ 2015 RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes Sci. Rep. 5 8365 10.1038/srep08365 25666585
32 Styles D O'Brien P O'Boyle S Cunningham P Donlon B Jones MB 2009 Measuring the environmental performance of IPPC industry: I. Devising a quantitative science-based and policy-weighted Environmental Emissions Index Environ. Sci. Policy 12 226 10.1016/j.envsci.2009.02.003
33 Behnken S Hertweck C 2012 Cryptic polyketide synthase genes in non-pathogenic Clostridium spp. PLoS One 7 e29609 10.1371/journal.pone.0029609 22235310
34 Thiel T Pratte BS Zhong J Goodwin L Copeland A Lucas S 2013 Complete genome sequence of Anabaena variabilis ATCC 29413 Stand. Genomic. Sci. 9 562 573 10.4056/sigs.3899418 25197444
35 Turroni F Bottacini F Foroni E Mulder I Kim JH Zomer A Sánchez B Bidossi A Ferrarini A Giubellini V 2010 Genome analysis of Bifidobacterium bifidum PRL2010 reveals metabolic pathways for host-derived glycan foraging Proc. Natl. Acad. Sci. USA 107 19514 19519 10.1073/pnas.1011100107 20974960
36 Hall TA 1999 BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT Nucleic Acids Symp. 41 95 98
37 Kumar S Stecher G Li M Knyaz C Tamura K 2018 MEGA X: molecular evolutionary genetics analysis across computing platforms Mol. Biol. Evol. 35 1547 1549 10.1093/molbev/msy096 29722887
38 Richter DC Ott F Auch AF Schmid R Huson DH 2008 MetaSim-A sequencing simulator for genomics and metagenomics PLoS One 3 e3373 10.1371/journal.pone.0003373 18841204
39 Li B Ju F Cai L Zhang T 2015 Profile and fate of bacterial pathogens in sewage treatment plants revealed by high-throughput metagenomic approach Environ. Sci. Technol. 49 10492 10502 10.1021/acs.est.5b02345 26252189
40 Tang J Bu Y Zhang XX Huang K He X Ye L 2016 Metagenomic analysis of bacterial community composition and antibiotic resistance genes in a wastewater treatment plant and its receiving surface water Ecotoxicol. Environ. Saf. 132 260 269 10.1016/j.ecoenv.2016.06.016 27340885
42 Florkowski CM 2008 Sensitivity, specificity, Receiver-Operating Characteristic (ROC) curves and likelihood ratios: communicating the performance of diagnostic tests Clin. Biochem. Rev. 29 S83 S87 18852864
43 Harvey R McBean E Hipel K Fang L Cullmann J Bristow M 2015 A Data Mining Tool for Planning Sanitary Sewer Condition Inspection Conflict Resolution in Water Resources and Environmental Management Springer Cham 181 199 10.1007/978-3-319-14215-9_10
44 Youngstrom EA 2014 A primer on receiver operating characteristic analysis and diagnostic efficiency statistics for pediatric psychology: we are ready to ROC J. Pediatr. Psychol. 39 204 221 10.1093/jpepsy/jst062 23965298
45 Ibarbalz FM Orellana E Figuerola ELM Erijman L 2016 Shotgun metagenomic profiles have a high capacity to discriminate samples of activated sludge according to wastewater type Appl. Environ. Microbiol. 82 5186 5196 10.1128/AEM.00916-16 27316957
46 Ma L Li B Jiang XT Wang YL Xia Y Li AD 2017 Catalogue of antibiotic resistome and host-tracking in drinking water deciphered by a large scale survey Microbiome. 5 154 10.1186/s40168-017-0369-0 29179769
47 Pinto AJ Marcus DN Ijaz UZ Bautista-de los Santos QM Dick GJ Raskin L 2016 Metagenomic evidence for the presence of comammox Nitrospira-like bacteria in a drinking water system mSphere. 1 e00054 15 10.1128/mSphere.00054-15 27303675
48 Ma L Li B Zhang T 2014 Abundant rifampin resistance genes and significant correlations of antibiotic resistance genes and plasmids in various environments revealed by metagenomic analysis Appl. Microbiol. Biotechnol. 98 5195 5204 10.1007/s00253-014-5511-3 24615381
49 Kopf A Bicak M Kottmann R Schnetzer J Kostadinov I Lehmann K 2015 The ocean sampling day consortium Gigascience 4 27 10.1186/s13742-015-0066-5 26097697
51 Nayfach S Bradley PH Wyman SK Laurent TJ Williams A Eisen JA 2015 Automated and accurate estimation of gene family abundance from shotgun metagenomics PLoS Comput. Bio. 11 e1004573 10.1371/journal.pcbi.1004573 26565399
52 Bibby K Viau E Peccia J 2011 Viral metagenome analysis to guide human pathogen monitoring in environmental samples Lett. Appl. Microbiol. 52 386 392 10.1111/j.1472-765X.2011.03014.x 21272046
53 Shapiro-Ilan DI Fuxa JR Lacey LA Onstad DW Kaya HK 2005 Definitions of pathogenicity and virulence in invertebrate pathology J. Invertebr. Pathol. 88 1 7 10.1016/j.jip.2004.10.003 15707863
54 Yoo K Yoo H Lee JM Shukla SK Park J 2018 Classification and regression tree approach for prediction of potential hazards of urban airborne bacteria during Asian dust events Sci. Rep. 8 11823 10.1038/s41598-018-29796-7 30087362
55 Aertsen W Kint V Van Orshoven J Özkan K Muys B 2010 Comparison and ranking of different modelling techniques for prediction of site index in Mediterranean mountain forests Ecol Model. 221 1119 1130 10.1016/j.ecolmodel.2010.01.007
56 Ul-Saufie AZ Yahya AS Ramli NA Hamid HA 2011 Comparison between multiple linear regression and feed forward back propagation neural network models for predicting PM10 concentration level based on gaseous and meteorological parameters Int. J. Res. Appl. Sci. Eng. Technol. 1 42 49
57 Roy K Ambure P 2016 The "double cross-validation" software tool for MLR QSAR model development Chemom. Intell. Lab. Syst. 159 108 126 10.1016/j.chemolab.2016.10.009
58 Jarraud S Mougel C Thioulouse J Lina G Meugnier H Forey F 2002 Relationships between Staphylococcus aureus genetic background, virulence factors, agr groups (alleles), and human disease Infect. Immun. 70 631 641 10.1128/IAI.70.2.631-641.2002 11796592