[Report] Development of a machine-learning-based cancer prediction method using human-derived peptides

Balachandran Manavalan

Machine-learning-based method for the prediction of cancer associated peptides from human endogenous peptides

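Report information
Lead research institution	Ajou University
Principal investigator	Balachandran Manavalan
Report type	Final report
Country of publication	Republic of Korea
Language	Republic of Korea
Publication date	2021-06
Project start year	2021
Supervising ministry	Ministry of Education
Registration number	TRKO202100010710
Project ID	1345335390
Program name	Individual Basic Research (Ministry of Education) (R&D)
Database build date	2022-04-16
Keywords	Cancer; Endogenous peptide; Biomarker; Machine learning; Feature encoding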

Abstract

□ Research and development goals and content
○ Final goal
The main objective of the project was to investigate and develop a novel computational method to predict cancer-associated peptides (CAPs) and their types (i.e., cancer biomarkers) from human endogenous peptides.

○ Overall content
Cancer is the second leading cause of death worldwide, and the lack of early-stage diagnosis is one of the major obstacles in treating cancer patients (J Clin Oncol. 2016). Owing to advances in genomics and proteomics, the probability of detecting cancer at an early stage has improved significantly through peptide-based biomarkers (Stat Biosci. 2016). In this project, I planned to use the available experimental data to develop a computational method for predicting cancer biomarkers from human endogenous peptides. My development approach involved four stages: (i) construction of a non-redundant (nr) dataset; (ii) extraction of features and generation of feature sets; (iii) exploration of six different machine learning (ML)-based methods, namely support vector machine (SVM), random forest (RF), deep learning (DL), extremely randomized tree (ERT), artificial neural network (ANN), and k-nearest neighbors (k-NN), for prediction model development; and (iv) selection of the final model and construction of a two-layer prediction framework.

○ Stage 1
● Goal
Previously, we proposed using peptides from non-human sources as negative samples. Since we are handling human peptides, we changed our original plan and decided to use human peptides as negative samples.
● Content
▷ Initially, we planned to use CAPs as positive samples and peptides from another source as negative samples in the first-layer prediction. To develop a reliable prediction model, however, the negative dataset should be extracted from human samples rather than from other species. Therefore, we modified our original first-layer prediction protocol. Currently, we are working on it.
▷ In the second layer, positive samples belonging to one class can be treated as positives, and the positive samples belonging to the other classes as negatives. As stated in our plan, the criteria for positive samples remain the same. However, we slightly changed the criteria for negative samples: instead of using other CAP classes as negatives, we used human peptides expressed in the particular tissue. For instance, when dealing with ovarian cancer-associated peptides, peptides expressed in the ovary that are not among the positives are considered negative samples. We collected these data through an extensive literature search.
▷ As mentioned in our proposal, we have been constructing the CAP dataset. To generate a high-quality dataset, CD-HIT with a threshold of 0.9 was applied to the original positive and negative samples, and the redundant sequences in each set were excluded. Notably, a large number of negative samples was obtained for each cancer type. To avoid bias during prediction model development, we randomly extracted negative samples numbering 10 times the positive samples. From the resulting data, we randomly selected 80% for prediction model development and the remaining 20% for independent evaluation. It should be noted that the positive and negative samples are quite imbalanced; hence, developing a prediction model is not straightforward. Therefore, we slightly changed our original plan for model development.
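
The negative subsampling (10 negatives per positive) and the 80/20 split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the fixed random seed, the placeholder peptide strings, and the choice to split each class separately (so that both subsets keep the 1:10 ratio) are all assumptions, since the report only states that the selection was random.

```python
import random

def subsample_negatives(positives, negatives, ratio=10, seed=0):
    """Randomly keep at most ratio * len(positives) negative peptides."""
    rng = random.Random(seed)
    n_keep = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, n_keep)

def stratified_split(positives, negatives, train_fraction=0.8, seed=0):
    """Split each class separately so both subsets keep the class ratio.

    Assumption: the report says only that 80% was selected at random;
    per-class splitting is one reasonable way to do that.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, peptides in ((1, positives), (0, negatives)):
        shuffled = list(peptides)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train += [(p, label) for p in shuffled[:cut]]
        test += [(p, label) for p in shuffled[cut:]]
    return train, test

# Placeholder identifiers, not real peptide sequences from the report.
positives = [f"POS{i}" for i in range(20)]
negatives = [f"NEG{i}" for i in range(1000)]

sampled = subsample_negatives(positives, negatives)
train, test = stratified_split(positives, sampled)
print(len(sampled), len(train), len(test))  # 200 176 44
```

Even after subsampling, the classes remain imbalanced at 1:10, which is the difficulty the last bullet above points out.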

○ Stage 2
● Goal
In the original plan, we proposed a two-layer framework model. However, developing the second-layer model was not possible for the following reasons: (i) the small number of bladder and melanoma samples; and (ii) the three cancer-associated peptide classes are not related. For instance, ovarian cancer is specific to females, whereas melanoma and bladder cancer occur in both males and females. Therefore, we excluded the second-layer prediction and focused on developing a prediction model that accurately predicts ovarian cancer-associated peptide biomarkers.
● Content
▷ Accurate dataset collection is the most important step in developing a reliable ML-based predictor. We manually collected the positive and negative samples through an extensive literature search.
▷ As mentioned in our original plan, we constructed the set of ovarian cancer-associated urinary peptides (positive samples) from CancerPDF and our extensive literature search. Urinary peptides from healthy females are considered negative samples. Since no database containing negative samples is available, we constructed one from scratch by searching the literature (~3500 papers from NCBI). In the end, we obtained 4368 positive samples and 4203 negative samples. Two datasets were generated, namely Ovarian_99 and Ovarian_90. In Ovarian_99, identical peptides were eliminated and the remaining samples were kept, whereas in Ovarian_90, CD-HIT with a threshold of 0.9 was applied and the redundant sequences were excluded. From each dataset, we randomly selected 80% of the data for prediction model development and the remaining 20% for independent evaluation.
▷ The training dataset is used to develop the prediction model, and the independent test set is used to validate the model's robustness.
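
The difference between the two redundancy levels can be illustrated with a small greedy filter. This sketch is not CD-HIT and does not reproduce its algorithm or its definition of sequence identity: it swaps in Python's `difflib.SequenceMatcher` ratio as a rough similarity measure. The function names and the toy sequences are assumptions made for illustration. A threshold of 1.0 removes only identical peptides (as described for Ovarian_99), while 0.9 also removes near-duplicates (in the spirit of Ovarian_90).

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1]; NOT CD-HIT's identity definition."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def remove_redundant(peptides, threshold):
    """Greedy filter: visit unique peptides longest first and keep one only
    if its similarity to every peptide already kept is below `threshold`."""
    kept = []
    for pep in sorted(set(peptides), key=lambda s: (-len(s), s)):
        if all(similarity(pep, rep) < threshold for rep in kept):
            kept.append(pep)
    return kept

# Toy sequences (hypothetical, not taken from the report's datasets).
toy = [
    "ACDEFGHIKLMNPQRSTVWY",
    "ACDEFGHIKLMNPQRSTVWY",  # exact duplicate
    "ACDEFGHIKLMNPQRSTVWA",  # one substitution, similarity 0.95
    "GGGGSSSSKKKK",
]

print(len(remove_redundant(toy, 1.0)))  # 3: only the exact duplicate is dropped
print(len(remove_redundant(toy, 0.9)))  # 2: the near-duplicate is dropped too
```

Comparing every candidate against every kept peptide is quadratic in the number of sequences, which is acceptable for a toy example but is one reason dedicated tools are used on real datasets.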

○ Stage 3
● Goal
Construction of a robust model and completion of the manuscript.
● Content
▷ We constructed a robust predictor by exploiting different computational frameworks, including a stacking framework, iterative feature representation, and a meta-predictor.
▷ We selected the stacking-based predictor, which showed transferability when evaluated on an independent dataset.
▷ We built a web server, named OvPBpred, based on the robust predictor.
▷ We completed the manuscript, which is ready for submission.
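
The general idea of a stacking framework can be sketched as below. This is a toy illustration under stated assumptions, not the OvPBpred implementation: the report does not specify its base or meta learners here, so a nearest-centroid scorer stands in for the base classifiers, a small logistic regression stands in for the meta-learner, and the data are synthetic Gaussian features playing the role of two different encodings. The key step is that the meta-learner is trained on out-of-fold base-model scores, so it never sees a score produced by a model that was trained on the same sample.

```python
import math
import random

def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_base(X, y):
    """Toy base learner (stand-in for RF, SVM, ...): one centroid per class."""
    pos = centroid([x for x, t in zip(X, y) if t == 1])
    neg = centroid([x for x, t in zip(X, y) if t == 0])
    return pos, neg

def base_score(model, x):
    """Positive when x lies closer to the positive-class centroid."""
    pos, neg = model
    return math.dist(x, neg) - math.dist(x, pos)

def out_of_fold_scores(X, y, k=5):
    """Score each sample with a base model that never saw it in training."""
    scores = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        model = fit_base([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            scores[i] = base_score(model, X[i])
    return scores

def sigmoid(v):
    """Logistic function with the input clipped to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, v))))

def fit_meta(Z, y, lr=0.05, epochs=200):
    """Meta-learner: logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            w = [wi - lr * err * zi for wi, zi in zip(w, z)]
            b -= lr * err
    return w, b

def predict_meta(meta, z):
    """Probability of the positive class from the base-model scores z."""
    w, b = meta
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)

# Synthetic data: two feature "encodings" (views) of the same 200 samples.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
views = [
    [[rng.gauss(1.0 * t, 1.0) for _ in range(3)] for t in y],
    [[rng.gauss(0.5 * t, 1.0) for _ in range(4)] for t in y],
]

# Level 0: out-of-fold scores from one base model per encoding.
Z = list(zip(*(out_of_fold_scores(X, y) for X in views)))
# Level 1: the meta-learner combines the base-model scores.
meta = fit_meta(Z, y)
accuracy = sum((predict_meta(meta, z) >= 0.5) == (t == 1) for z, t in zip(Z, y)) / len(y)
print(f"stacked training accuracy: {accuracy:.2f}")
```

The printed accuracy is measured on the same samples the meta-learner was fitted on, so it is optimistic; transferability of the kind mentioned above would have to be checked on a held-out independent set.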

□ Research and development outcomes
▷ On the bladder dataset, we explored nine different encodings and six different classifiers to develop a prediction model. The results showed that four classifiers (RF, ERT, SVM, and GB) performed consistently better than the remaining two (AB and KNN); hence, we excluded those two classifiers. Instead of selecting the single best model, we worked on integrating these four classifiers, as in our previous studies, to improve performance.
▷ On the ovarian dataset, we explored nine different encodings and six different classifiers to develop a prediction model. The results showed that five classifiers (RF, ERT, SVM, AB, and GB) performed consistently better than the KNN classifier; hence, we excluded it. Instead of selecting the single best model, we are currently working on integrating these five classifiers into a single model, as in our previous studies.
▷ In this study, we employed eleven different encodings (amino acid composition (AAC), Geary autocorrelation (GA), dipeptide deviation from the expected mean (DDE), Triad (KSC), amino acid index (AAI), grouped dipeptide composition (GDPC), grouped tripeptide composition (GTPC), binary profile (BPF), quasi-sequence order (QSO), enhanced amino acid composition (EAAC), and sequence order coupling number (SOCN)) and assessed their capability to discriminate OCAPs (ovarian cancer-associated peptides) from non-OCAPs using six different classifiers (random forest (RF), extremely randomized tree (ERT), gradient boosting (GB), AdaBoost (AB), extreme gradient boosting (XGB), and support vector machine (SVM)), independently for the Ovarian_99 and Ovarian_90 datasets.
▷ Except for SVM, the performance trends of the decision tree-based classifiers were found to be similar. We observed that five encodings (KSC, GDPC, QSO, SOCN, and GTPC) achieved similar performance relative to the remaining six encodings, regardless of the classifier. Notably, we made a similar observation on the Ovarian_90 dataset, indicating that the performance on the Ovarian_99 dataset is not an overestimate.
▷ Among the eleven encodings, DDE performed the worst; hence, we excluded it from the subsequent analysis.
▷ To develop a reliable and robust model, instead of selecting the single best model, we used all the models and explored different computational frameworks (not mentioned in the proposal), including feature representation learning, a stacking framework, iterative feature representation, and adaptive feature representation learning.
▷ Subsequently, we identified the appropriate computational framework that can accurately identify CAPs from given female urine peptide samples.
▷ Currently, we are preparing the manuscript and will submit it soon.
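
Two of the simpler encodings named above, AAC and GDPC, can be sketched as follows to show how a peptide becomes a fixed-length feature vector. This is an illustrative sketch, not the project's feature-extraction code: the five-class physicochemical grouping used for GDPC is an assumption (a grouping commonly used for this descriptor), as the report does not list its groups, and the function names and the toy peptide are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed five-class grouping; the report does not state which groups it used.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}
GROUP_PAIRS = list(product(GROUPS, repeat=2))  # 25 ordered group pairs

def aac(peptide):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]

def gdpc(peptide):
    """Grouped dipeptide composition: fraction of adjacent residue pairs that
    fall into each ordered group pair. Raises KeyError on non-standard residues."""
    counts = dict.fromkeys(GROUP_PAIRS, 0)
    for first, second in zip(peptide, peptide[1:]):
        counts[(RESIDUE_TO_GROUP[first], RESIDUE_TO_GROUP[second])] += 1
    total = max(len(peptide) - 1, 1)  # guard against single-residue input
    return [counts[pair] / total for pair in GROUP_PAIRS]

peptide = "GAVKDE"  # toy sequence, not from the report's datasets
features = aac(peptide) + gdpc(peptide)
print(len(features))  # 45 = 20 (AAC) + 25 (GDPC)
```

Because both vectors have a fixed length regardless of peptide length, they can be concatenated or fed separately to any of the classifiers listed above.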

□ Utilization plan and expected effects of the research and development outcomes
The outcome of this project will benefit experimentalists, cancer patients, and computational biologists as follows:
▷ The predicted CAPs screened from endogenous peptides will help experimentalists with further validation. My method thus complements experimental work, which will be helpful for the early treatment of cancer patients. Furthermore, the method can be integrated as a step in a drug discovery pipeline.
▷ The CAP database will be a valuable resource and serve as a basis for future peptide-based cancer biomarker prediction.
▷ Biomarkers are not limited to cancer; they also exist for heart disease, multiple sclerosis, and many other diseases. Therefore, the current approach can be extended to biomarker prediction for other diseases.
▷ The web server will be made publicly available to serve the scientific community.

(Source: Summary, p. 3)

Contents

Cover
Summary
1. Overview of the R&D project
2. Process and content of the R&D project
3. Results of the R&D project and degree of goal achievement
1) Research results
4. Contribution of the R&D outcomes to related fields
5. Plan for management and utilization of the R&D outcomes
6. References

Tables/Figures (9)

Table: Inter- and intra-tumor heterogeneity
Table: Schematic illustration from the experimental approach to the computational approach
Table: Progress of the methodology development and their stages
Table: A summary of each cancer type dataset for the construction and evaluation of prediction model is as follows
Table: A summary of ovarian cancer dataset is as follows
Table: Performances of various methods and encodings on bladder training dataset
Table: Performances of various methods and encodings on ovarian training dataset
Table: Performances of 12 different encodings and six different classifiers on Ovarian_99 training dataset
Table: Performances of 12 different encodings and six different classifiers on Ovarian_90 training dataset
