Report information
Lead research institution | Ajou University
Principal investigator | Balachandran Manavalan
Report type | Final report
Country of publication | Republic of Korea
Language | Republic of Korea
Publication date | 2021-06
Project start year | 2021
Supervising ministry | Ministry of Education
Registration number | TRKO202100010710
Project number | 1345335390
Program name | Individual Basic Research (Ministry of Education) (R&D)
DB build date | 2022-04-16
Keywords | Cancer; Endogenous peptide; Biomarker; Machine learning; Feature encoding
Abstract
□ Research and development objectives and content
○ Final objective
The main objective of the project was to investigate and develop a novel computational method to predict cancer-associated peptides (CAPs) and their types (i.e., cancer biomarkers) from human endogenous peptides.
○ Overall content
Cancer is the second leading cause of death worldwide, and the lack of early-stage diagnosis is one of the major obstacles in treating cancer patients (J Clin Oncol. 2016). Owing to advances in genomics and proteomics, the probability of detecting cancer at an early stage has improved significantly with peptide-based biomarkers (Stat Biosci. 2016). In this project, I planned to use the available experimental data to develop a computational method for predicting cancer biomarkers from human endogenous peptides. My development approach involved four stages: (i) construction of a non-redundant (nr) dataset; (ii) extraction of features and generation of feature sets; (iii) exploration of six machine learning (ML)-based methods, namely support vector machine (SVM), random forest (RF), deep learning (DL), extremely randomized tree (ERT), artificial neural network (ANN), and k-nearest neighbors (k-NN), for prediction model development; and (iv) selection of the final model and construction of a two-layer prediction framework.
○ Stage 1
● Objective
We originally proposed using peptides from non-human sources as negative samples. Because we are handling human peptides, we changed the original plan and decided to use human peptides as negative samples.
● Content
▷ Initially, we planned to use CAPs as positives and peptides from another source as negatives in the first-layer prediction. To develop a reliable prediction model, however, the negative dataset should be extracted from human samples rather than from other species. We therefore modified our original first-layer prediction protocol and are currently working on it.
▷ In the second layer, positive samples belonging to one class are treated as positives, and the positive samples belonging to the other classes are treated as negatives. As stated in our plan, the criteria for positive samples remain the same, but we slightly changed the criteria for negative samples. Instead of using the other CAP classes as negatives, we used human peptides expressed in the particular tissue. For instance, when dealing with ovarian cancer-associated peptides, peptides expressed in the ovary that are not associated with the positives are treated as negative samples. We collected these data through an extensive literature search.
▷ As mentioned in our proposal, we have been constructing the CAP dataset. To generate a high-quality dataset, CD-HIT at a 0.9 cutoff was applied to the original positive and negative samples, and the respective redundant sequences were excluded. Notably, a large number of negative samples was obtained for each cancer type. To avoid bias during prediction model development, we randomly extracted negative samples amounting to 10 times the number of positive samples. From the resulting data, we randomly selected 80% for prediction model development and the remaining 20% for independent evaluation. The positive and negative samples remain highly imbalanced, so developing a prediction model is not straightforward. We therefore slightly changed our original plan for model development.
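As an illustration only, the 1:10 negative subsampling and the 80/20 split described above could be sketched in Python as follows. The function names, the seeds, and the toy sequences are assumptions made for this sketch and are not taken from the project code; redundancy removal with CD-HIT is assumed to have been run beforehand.

```python
import random

def subsample_negatives(positives, negatives, ratio=10, seed=0):
    """Randomly keep at most ratio * len(positives) negative peptides."""
    rng = random.Random(seed)
    n_keep = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, n_keep)

def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into training and independent sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy, made-up sequences used only to exercise the functions (not project data).
positives = ["ACDEFGHIK", "LMNPQRSTV", "WYACDEFGH"]
toy_rng = random.Random(42)
negatives = ["".join(toy_rng.choices("ACDEFGHIKLMNPQRSTVWY", k=9)) for _ in range(100)]

neg_sub = subsample_negatives(positives, negatives)   # 10 negatives per positive
pos_train, pos_test = split_train_test(positives)     # split each class separately
neg_train, neg_test = split_train_test(neg_sub)
print(len(neg_sub), len(neg_train), len(neg_test))
```

Splitting each class separately keeps the positive-to-negative ratio roughly the same in the training and independent sets, which is one possible way to handle the imbalance noted above.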
○ Stage 2
● Objective
In the original plan, we proposed a two-layer framework model. However, developing the second-layer model was not possible for the following reasons: (i) the small number of bladder and melanoma samples; and (ii) the three cancer-associated peptide classes are not related. For instance, ovarian cancer is specific to females, whereas melanoma and bladder cancer occur in both males and females. We therefore excluded the second-layer prediction and focused on developing a prediction model that accurately predicts ovarian cancer-associated peptide biomarkers.
● Content
▷ Accurate dataset collection is the most important step in developing a reliable ML-based predictor. We manually collected the positive and negative samples through an extensive literature search.
▷ As mentioned in our original plan, we assembled the ovarian cancer-associated urinary peptides (positive samples) from CancerPDF and from our extensive literature search. Urinary peptides from healthy females were taken as negative samples. Because no database of negative samples is available, we constructed one from scratch by searching the literature (~3500 papers from NCBI). In total, we obtained 4368 positive samples and 4203 negative samples. Two datasets were generated, Ovarian_99 and Ovarian_90. In Ovarian_99, identical peptides were eliminated and the remaining samples were kept, whereas in Ovarian_90, CD-HIT at a 0.9 cutoff was applied and the redundant sequences were excluded. From each dataset, we randomly selected 80% of the data for prediction model development and the remaining 20% for independent evaluation.
▷ The training dataset is used to develop the prediction model, and the independent test set is used to validate the robustness of the model.
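The removal of identical peptides described for Ovarian_99 can be illustrated with a minimal sketch. The function name, the normalization to upper case, and the toy sequences are assumptions for illustration; the CD-HIT clustering used for Ovarian_90 relies on an external tool and is not reproduced here.

```python
def remove_identical(peptides):
    """Drop exact duplicate sequences while preserving first-seen order."""
    seen = set()
    unique = []
    for seq in peptides:
        key = seq.strip().upper()   # assumption: case and padding are not meaningful
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# Toy, made-up sequences (not project data).
raw = ["ACDEFGHIK", "acdefghik", "LMNPQRSTV", " ACDEFGHIK ", "WYACDEFGH"]
unique = remove_identical(raw)
print(unique)
```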
○ Stage 3
● Objective
Construction of a robust model and completion of the manuscript.
● Content
▷ We constructed a robust predictor by exploiting different computational frameworks, including a stacking framework, iterative feature representation, and a meta-predictor.
▷ We selected the stacking-based predictor, which showed transferability when evaluated on an independent dataset.
▷ We built a web server named OvPBpred based on the robust predictor.
▷ We completed the manuscript, which is ready for submission.
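The general idea of a stacking framework can be sketched as below, under strong simplifying assumptions. The base learners here are toy nearest-centroid scorers applied to different feature subsets (standing in for models built on different encodings), and the meta-learner is a small logistic regression; none of this is the project's actual implementation, and all names and the synthetic data are hypothetical.

```python
import math
import random

def centroid(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_centroid_scorer(X, y, cols):
    """Toy base learner; assumes both classes occur in the training data."""
    pos = centroid([[x[c] for c in cols] for x, t in zip(X, y) if t == 1])
    neg = centroid([[x[c] for c in cols] for x, t in zip(X, y) if t == 0])

    def score(x):
        v = [x[c] for c in cols]
        d_pos, d_neg = math.dist(v, pos), math.dist(v, neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total  # closer to positives -> nearer 1

    return score

def out_of_fold_scores(X, y, column_sets, k=5, seed=0):
    """Meta-features: each sample is scored only by models that never saw it."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    meta = [[0.0] * len(column_sets) for _ in X]
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, cols in enumerate(column_sets):
            scorer = fit_centroid_scorer(X_tr, y_tr, cols)
            for i in fold:
                meta[i][j] = scorer(X[i])
    return meta

def logistic(w, b, z):
    return 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))

def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        errors = [logistic(w, b, z) - t for z, t in zip(Z, y)]
        grad_w = [sum(e * z[j] for e, z in zip(errors, Z)) / n for j in range(len(w))]
        grad_b = sum(errors) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return lambda z: logistic(w, b, z)

# Synthetic, well-separated toy data (not project data): 20 positives, 20 negatives.
rng = random.Random(1)
X = [[rng.gauss(2.0, 1.0) for _ in range(4)] for _ in range(20)]
X += [[rng.gauss(-2.0, 1.0) for _ in range(4)] for _ in range(20)]
y = [1] * 20 + [0] * 20
column_sets = [(0, 1), (2, 3)]   # each subset stands in for one encoding

meta_features = out_of_fold_scores(X, y, column_sets)
meta_model = fit_logistic(meta_features, y)
# Accuracy of the meta-model on its own training meta-features (optimistic by design).
accuracy = sum((meta_model(z) >= 0.5) == (t == 1) for z, t in zip(meta_features, y)) / len(y)
print(round(accuracy, 3))
```

Using out-of-fold scores as meta-features, rather than scores from models trained on the same samples, is the usual way to keep the meta-learner from simply memorizing overfitted base-model outputs. A real evaluation would still need the held-out independent set.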
□ Research and development outcomes
▷ For the bladder dataset, we explored nine different encodings and six different classifiers to develop a prediction model. Four classifiers (RF, ERT, SVM, and GB) performed consistently better than the remaining two (AB and KNN), so we excluded those two. Instead of selecting the single best model, we worked on integrating these four classifiers, as in our previous studies, to improve performance.
▷ For the ovarian dataset, we explored nine different encodings and six different classifiers to develop a prediction model. Five classifiers (RF, ERT, SVM, AB, and GB) performed consistently better than the KNN classifier, so we excluded KNN. Instead of selecting the single best model, we are currently working on integrating these five classifiers into a single model, as in our previous studies.
▷ In this study, we employed eleven different encodings (amino acid composition (AAC), Geary autocorrelation (GA), dipeptide deviation from the expected mean (DDE), Triad (KSC), amino acid index (AAI), grouped dipeptide composition (GDPC), grouped tripeptide composition (GTPC), binary profile (BPF), quasi-sequence order (QSO), enhanced amino acid composition (EAAC), and sequence order coupling number (SOCN)) and assessed their capability to discriminate OCAPs (ovarian cancer-associated peptides) from non-OCAPs using six different classifiers (random forest (RF), extremely randomized tree (ERT), gradient boosting (GB), AdaBoost (AB), extreme gradient boosting (XGB), and support vector machine (SVM)), independently for the Ovarian_99 and Ovarian_90 datasets.
▷ The decision tree-based classifiers (all except SVM) showed similar performance patterns. Five encodings (KSC, GDPC, QSO, SOCN, and GTPC) achieved similar performance when compared with the remaining six encodings, regardless of the classifier. Notably, we made a similar observation on the Ovarian_90 dataset, indicating that the performance on the Ovarian_99 dataset is not an overestimate.
▷ Among the eleven encodings, DDE performed worst, so we excluded it from the subsequent analysis.
▷ To develop a reliable and robust model, instead of selecting the single best model, we used all the models and explored different computational frameworks (not mentioned in the proposal), including feature representation learning, a stacking framework, iterative feature representation, and adaptive feature representation learning.
▷ Subsequently, we identified the appropriate computational framework that can accurately identify CAPs from given female urine peptide samples.
▷ We are currently preparing the manuscript and will submit it soon.
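To make two of the listed encodings concrete, the sketch below computes amino acid composition (AAC) and grouped dipeptide composition (GDPC) for a single peptide. The five-group residue scheme is an assumption (a commonly used physicochemical grouping); the report does not state which grouping was used, and the function names and example sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed five-group scheme; the grouping used in the project is not stated.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}
GROUP_PAIRS = list(product(GROUPS, repeat=2))   # 25 ordered group pairs

def aac(seq):
    """AAC: fraction of each of the 20 standard residues (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def gdpc(seq):
    """GDPC: fraction of adjacent residue pairs falling in each group pair (25 values).

    Raises KeyError for non-standard residues and ValueError for sequences
    shorter than two residues.
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = [(RESIDUE_TO_GROUP[a], RESIDUE_TO_GROUP[b]) for a, b in zip(seq, seq[1:])]
    return [pairs.count(pair) / len(pairs) for pair in GROUP_PAIRS]

# Toy, made-up peptide (not project data).
peptide = "ACDKKW"
aac_vector = aac(peptide)
gdpc_vector = gdpc(peptide)
print(len(aac_vector), len(gdpc_vector))
```

Both encodings map a peptide of any length to a fixed-length vector, which is what allows the same classifier to be trained on peptides of different lengths.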
□ Plan for use of the research and development outcomes and expected effects
The outcome of this project will benefit experimentalists, cancer patients, and computational biologists in the following ways:
▷ The predicted CAPs screened from the endogenous peptides will help experimentalists with further validation. My method is thus complementary to experimental work and will support the early treatment of cancer patients. Furthermore, the method can be integrated as one step in a drug discovery pipeline.
▷ The CAP database will be a valuable resource and will serve as a basis for future prediction of peptide-based cancer biomarkers.
▷ Biomarkers are not limited to cancer; they also exist for heart disease, multiple sclerosis, and many other diseases. The current approach can therefore be extended to biomarker prediction for other diseases.
▷ The web server will be made publicly available to serve the scientific community.
(Source: Summary, p. 3)
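Contents
- COVER
- Summary
- 1. Overview of the research and development project
- 2. Process and content of the research and development project
- 3. Results of the research and development project and degree of goal achievement
- 1) Research results
- 4. Contribution of the research and development outcomes to related fields
- 5. Plan for management and use of the research and development outcomes
- 6. References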