[Report] Building a large-scale computing based AI pipeline and visualization for healthcare data analysis

강병철

헬스케어 데이터 분석을 위한 시각화 및 대규모 컴퓨팅 기반 인공지능 파이프라인 구축
Building a large-scale computing based AI pipeline and visualization

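Report information
Lead research institution	주식회사 디이프
Principal investigator	강병철
Report type	Final report
Country of publication	Republic of Korea
Language	Korean
Publication date	2022-12
Project start year	2022
Supervising ministry	Korea Disease Control and Prevention Agency (KDCA)
Registration number	TRKO202300028476
Project ID	1776000203
Program name	Healthcare heterogeneous data utilization system and artificial intelligence development (R&D)
Database entry date	2023-11-15
Keywords	Healthcare. Multi-Omics. Artificial intelligence. Machine learning. Pipeline.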

Abstract (Korean)

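As interest in healthcare big data and its potential value grow, the collection of related data is increasing. However, healthcare big data comes in many types and is high-dimensional and highly complex, and the foundation for using it systematically is still lacking. Raising the value of the data requires precise data analysis and the development of artificial intelligence models, and this process must be managed efficiently.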

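In this study, we organized by type and refined the healthcare data held by the Korea Disease Control and Prevention Agency, including WGS, RNA-seq, Methyl-seq, microbiome (saliva and feces), lifelog, and pathology images. We combined the refined data with clinical information to build seven standard data sets for artificial intelligence research. We also devised a number of dimension reduction and visualization methods, and studied artificial intelligence models that perform classification and regression using 15 machine learning algorithms and deep learning architectures such as DNN, LSTM, and CNN. Furthermore, so that complex functions can be used simply, we modularized the implemented analysis methods into a Python package called “aipipe”.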
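As a rough illustration of what a dimension reduction step does to a samples-by-features matrix, the sketch below finds the first principal component by power iteration in plain Python. The figure captions in this record mention PCA, but this is not code from “aipipe”; the function names and the toy matrix are hypothetical and chosen only for illustration.

```python
def center(rows):
    """Subtract each feature's mean so the data is centered at zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (features x features) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            raise ValueError("covariance matrix is zero; no direction to find")
        v = [x / norm for x in w]
    return v


# Hypothetical toy matrix: 5 samples x 3 features (not data from the report).
samples = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 6.0, 0.6],
    [4.0, 8.2, 0.5],
    [5.0, 9.9, 0.5],
]
centered = center(samples)
pc1 = first_component(covariance(centered))

# Project every sample onto the first component: 3 features become 1 score.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
print([round(x, 3) for x in pc1])
print([round(s, 2) for s in scores])
```

In practice a library routine would replace the hand-written loop; the point here is only that each sample ends up described by far fewer numbers, which is what makes the later plots and models tractable.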

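To operate and manage the various kinds of pipelines effectively, the artificial intelligence pipeline was built on the Kubeflow Pipelines platform, which supports MLOps (machine learning operations). Taking as input the seven standard data sets and a multi data set that merges heterogeneous standard data sets, we built a total of 18 pipelines that perform preprocessing, dimension reduction, visualization, and machine learning/deep learning. Data visualization in Kubeflow provides a web-based interactive environment, configured so that researchers can explore the data conveniently.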
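The stage order described above (preprocessing, dimension reduction, visualization, model training) can be sketched as a chain of plain Python functions that pass a state dictionary along. This is only a conceptual sketch: on Kubeflow Pipelines each stage would be a separate component, and every function name, the variance-based feature filter, and the nearest-centroid model are assumptions made for illustration, not the report's implementation.

```python
from statistics import mean, pvariance


def preprocess(state):
    """Drop samples that contain a missing value (None)."""
    kept = [(x, y) for x, y in zip(state["X"], state["y"]) if None not in x]
    return {**state, "X": [x for x, _ in kept], "y": [y for _, y in kept]}


def reduce_dimensions(state, k=2):
    """Keep the k highest-variance features (a simple stand-in for PCA)."""
    columns = list(zip(*state["X"]))
    ranked = sorted(range(len(columns)),
                    key=lambda j: pvariance(columns[j]), reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in state["X"]]
    return {**state, "X": reduced, "features": keep}


def visualize(state):
    """Collect (x, y, label) points that a plotting front end could draw."""
    points = [(row[0], row[1], label)
              for row, label in zip(state["X"], state["y"])]
    return {**state, "plot_points": points}


def train_model(state):
    """Fit a nearest-centroid classifier and record its training accuracy."""
    groups = {}
    for row, label in zip(state["X"], state["y"]):
        groups.setdefault(label, []).append(row)
    centroids = {label: [mean(col) for col in zip(*rows)]
                 for label, rows in groups.items()}

    def predict(row):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[c])))

    hits = sum(predict(row) == label
               for row, label in zip(state["X"], state["y"]))
    return {**state, "accuracy": hits / len(state["y"])}


def run_pipeline(stages, state):
    """Run the stages in order, feeding each stage the previous output."""
    for stage in stages:
        state = stage(state)
    return state


# Hypothetical toy input: 5 samples x 3 features, one sample incomplete.
data = {
    "X": [[1.0, 0.1, 5.0], [1.2, 0.1, 5.1], [None, 0.1, 5.0],
          [3.0, 0.1, 9.0], [3.1, 0.1, 9.2]],
    "y": ["control", "control", "control", "case", "case"],
}
result = run_pipeline(
    [preprocess, reduce_dimensions, visualize, train_model], data)
print(result["features"], result["accuracy"])
```

Keeping each stage as a function from state to state is one way to make stages swappable, which mirrors why a component-based platform is attractive when many data types share the same overall flow.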

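The Kubeflow platform and the pipelines were deployed to three servers, including the healthcare big data research server of the Korea Disease Control and Prevention Agency, and are expected to help uncover the latent value of healthcare data and inform the design of follow-up projects.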

(Source: Korean summary, p. 6)

Abstract (English)

As interest in healthcare big data and its potential value grow, the collection of related data is increasing. However, because healthcare big data is diverse and high-dimensional, the foundation for applying it systematically is insufficient. Improving data utilization requires precise data analysis and the development of artificial intelligence models, and this process must be managed effectively.

In this study, healthcare data (WGS, RNA-seq, Methyl-seq, microbiome-feces, microbiome-saliva, lifelog, and pathology images) were organized and refined. Seven data sets were built for AI research by combining the refined data with clinical information. We applied various dimension reduction and visualization methods. We implemented artificial intelligence models that perform classification and regression using 15 kinds of machine learning algorithms and deep learning architectures such as DNN, LSTM, and CNN. Furthermore, the analysis methods were packaged as “aipipe”, a Python package that makes these complex functions simple to use.
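Classification and regression models like those described above are usually compared with a small set of summary metrics. The sketch below computes a few common ones in plain Python; the captions in this record mention classification and regression metrics without listing them, so the specific choice here (accuracy, precision, recall, F1, MAE, RMSE, R²) is an assumption, and the labels and values are hypothetical.

```python
from math import sqrt


def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """Mean absolute error, root mean squared error and R squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot if ss_tot else 0.0
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical labels and predictions, for illustration only.
clf = classification_metrics(
    ["case", "case", "control", "control", "case"],
    ["case", "control", "control", "case", "case"],
    positive="case",
)
# Hypothetical regression target (for example, age in years).
reg = regression_metrics([50.0, 60.0, 70.0], [52.0, 58.0, 71.0])
print(clf)
print(reg)
```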

The AI pipeline was built on the Kubeflow Pipelines platform, which supports MLOps (machine learning operations), to operate and manage various types of pipelines. A total of 18 pipelines were built on the 7 standard data sets by connecting components for preprocessing, dimension reduction, visualization, and machine learning/deep learning. In particular, data visualization in Kubeflow provides a web-based interactive environment, making it convenient for researchers to explore data.

The Kubeflow platform and the pipelines have been deployed to three servers, including the Korea Disease Control and Prevention Agency's Healthcare Big Data Research Server, and are expected to help uncover the potential value of healthcare data and inform the design of follow-up projects.

(Source: Summary, p. 7)

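Contents

Cover ... 1
Letter of submission ... 2
Table of contents ... 3
Summary (Korean) ... 6
Summary ... 7
Research results of the academic R&D service project ... 8
Chapter 1. Final objectives ... 8
1.1. Objectives ... 8
1.2. Degree of goal achievement and contribution to related fields ... 19
Chapter 2. Domestic and international technology status ... 21
2.1. Status of domestic and international healthcare big data AI technology ... 21
2.2. Status of visualization technology ... 23
2.3. Status of domestic visualization technology ... 25
2.4. Position of the research results within the domestic and international technology landscape ... 26
Chapter 3. Final research content and methods ... 27
3.1. Construction of standard data sets ... 27
3.2. Development of the AI pipeline ... 36
Chapter 4. Final research results ... 43
4.1. Raw data acquisition status ... 43
4.2. Raw data refinement and preprocessing ... 45
4.3. Construction of standard data sets ... 50
4.4. Unit data description and preprocessing results ... 52
4.5. Study of dimension reduction and visualization techniques ... 57
4.6. Research and implementation of deep learning centered AI models ... 82
4.7. Development of the MLOps-based AI pipeline ... 102
4.8. Deployment and operation of the MLOps-based AI pipeline ... 126
Chapter 5. Discussion of research results and conclusions ... 132
5.1. Construction of standard data sets refined from healthcare multi-omics raw data ... 132
5.2. Dimension reduction and visualization analysis of healthcare data ... 132
5.3. Implementation of the MLOps-based AI pipeline ... 132
5.4. Plans for using the research results ... 133
5.5. Directions for follow-up research ... 133
Chapter 6. Research outcomes and utilization plan ... 134
Chapter 7. Overseas science and technology information collected during the project ... 138
7.1. Overseas science and technology information ... 138
Chapter 8. Other significant changes ... 140
8.1. Research and development period ... 140
8.2. Changes to the list of acquired data ... 140
8.3. Additional study on quality checks of raw omics data ... 140
Chapter 9. Research expenditure and researcher assignments ... 141
9.1. Research expenditure ... 141
9.2. Researcher status ... 142
9.3. Division of research tasks ... 142
Chapter 10. References ... 143
10.1. References ... 143
Chapter 11. Attachments ... 145
End page ... 146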

Tables/Figures (180)

Table Data to be used in this project
Table Example of data to be used
Table Available data on the CODA system
Table Data status in National Biobank of Korea
Table Example of Exploratory Data Analysis
Table The process of data refinement
Table Dimension reduction and visualization for healthcare data
Table Example of data visualization tool
Table Example of AI model and tasks by data types
Table Example of analytic pipelines by datasets
Table Example of analytical pipelines for healthcare image data
Table Example of pipeline schema of MLOps
Table Example of user manual
Table Example of document and program code
Table Major foreign healthcare big data artificial intelligence companies and technologies (Source: 인공지능(AI) 헬스케어산업 현황 및 동향 [Status and Trends of the Artificial Intelligence (AI) Healthcare Industry]. 융합연구정책센터 (Convergence Research Policy Center). 2019.)
Table Major domestic healthcare big data artificial intelligence companies and technologies (Source: 인공지능(AI) 헬스케어산업 현황 및 동향 [Status and Trends of the Artificial Intelligence (AI) Healthcare Industry]. 융합연구정책센터 (Convergence Research Policy Center). 2019.)
Table Lifelog visualization of Fitbit
Table Image visualization of The Cancer Imaging Archive
Table Omics visualization of cBioPortal
Table Lifelog visualization of Samsung Electronics
Table Image visualization of Infinite Healthcare
Table Target data in the proposal
Table Final list of data types to be used
Table FastQC criteria
Table Status of basic clinical information and its missing values
Table Pre-process of VCFs
Table Pre-process of RNA-seq TPMs
Table Comparison of log transformation
Table Comparison of filter out (0, 0.2)
Table Example of methylation call report
Table Pre-process of microbiome
Table Pre-process of Whole Slide Image
Table AI algorithms to be used
Table Classification metrics
Table Regression metrics
Table Raw data acquisition status
Table Healthcare Big Data Showcase data subject status (Removal of duplicate subjects)
Table Summary of quality control result of the omics data
Table Example of duplicate lifelog activity data
Table Batch effect of healthy and chronic patients in lifelog activity data
Table Pre-process and refinement of lifelog data
Table List of acquired basic clinical information data
Table Visualization of missing data
Table Unit data information
Table Standard datasets information
Table Part of final matrix, HBS-patient-info
Table Part of final matrix, WGS - All SNVs
Table Part of final matrix, RNA-seq
Table Part of final matrix, Methyl-seq
Table Part of final matrix, microbiome
Table Example of raw data, WSI
Table Example of patch, WSI
Table Unit test for data pre-processing
Table Visualization of patient basic information for nominal features
Table Visualization of patient basic information for continuous features
Table Relationship visualization of patient basic information
Table Data frequency by age
Table Methods of dimension reduction for WGS
Table Part of matrix after dimension reduction, SNV-BIN
Table Part of matrix after dimension reduction, MUT-WGS
Table Relationship visualization between SNVs total count and nominal features
Table Relationship visualization between SNVs total count and continuous features
Table PCA of Mutation distribution (SNV-BIN)
Table PCA of Mutational type (MUT-WGS)
Table Part of counts matrix for tri-nucleotide mutation
Table Part of frequencies matrix for tri-nucleotide mutation
Table Visualization of distribution by mutation type
Table Reconstruction errors by r for optimization
Table Part of W matrix after NMF
Table Distribution of signatures by all patients
Table Heatmap of correlation between all signatures and clinical features
Table Part of M matrix after NMF
Table Distribution of signatures by matrix H
Table Distribution of mutation type by signature (Sig1~3)
Table Distribution of mutation type by signature (Sig4~6)
Table Distribution of mutation type by signature (Sig7~8)
Table PCA of RNA-seq
Table Organization list by patient group
Table Heatmap of correlation between PCA components of RNA expression and clinical information
Table Clustermaps of normal and other groups
Table Boxplots of top-8 ranked expressed genes by group
Table Extraction of CpG Island BED from UCSC Table Browser
Table Relation visualization between mean of methylation level and nominal features
Table Relation visualization between mean of methylation level and continuous features
Table PCA of methylation level
Table Data shape of microbiome grouped by level of Taxon
Table Comparison between microbiome feces data and research reported
Table PCA of microbiome feces
Table Heatmap of correlation between microbiome feces and clinical information
Table Relationship visualization of microbiome feces by patient group
Table Comparison between microbiome saliva data and research reported
Table PCA of microbiome saliva
Table Heatmap of correlation between microbiome saliva and clinical information
Table Relationship visualization of microbiome saliva by patient group
Table Heatmap of correlation between average amount of activity and clinical information
Table Clustermap of average amount of activity by weekday
Table Research scopes for AI modeling for WGS
Table Comparison of number of samples after oversampling for balanced data
Table Evaluation of multi-class classification of WGS SNV-BIN, MUT-WGS by ML models
Table Evaluation of binary classification of WGS SNV-BIN, MUT-WGS by ML models
Table Evaluation of multi-class classification of MUT-SIG by ML models
Table Evaluation of binary classification of MUT-SIG by ML models
Table KCDA-SHOW-22-WGS-CLF performance table
Table Evaluation of regression of SNV-BIN, MUT-WGS by ML models
Table KCDA-SHOW-22-WGS-REG performance table
Table Architecture of DL model for WGS, learning curve and confusion matrix
Table Accuracies and confusion matrix by mean of k-folds
Table Important features for multi-class classification of WGS by Linear SVM
Table Distribution of accuracies by number of features
Table Confusion matrix when 500 features
Table RNA-seq DNN structure and params
Table KCDA-SHOW-22-RNA-CLF performance table
Table KCDA-SHOW-22-RNA-REG performance table
Table Methyl-seq DNN structure
Table KCDA-SHOW-22-METH-CLF performance table
Table KCDA-SHOW-22-METH-REG performance table
Table Microbiome DNN structure and params
Table KCDA-SHOW-22-MIB-FECES-CLF model performance table
Table KCDA-SHOW-22-MIB-FECES-REG performance table
Table KCDA-SHOW-22-MIB-SALIVA performance table
Table KCDA-SHOW-22-MIB-SALIVA-REG performance table
Table LSTM structure & bidirectional LSTM structure
Table KCDA-SHOW-22-LIFELOG-CLF performance table
Table KCDA-SHOW-22-LIFELOG-REG performance table
Table Example of WSI, non-tumor (left), tumor (right)
Table Structure of DenseNet121, DenseNet201
Table KCDA-OPEN-22-WSI-Liver performance table
Table Structure of multiomics Autoencoder models
Table Structure of multiomics classifier
Table KCDA-OPEN-22-MULTI performance table
Table Composition of Python package “aipipe” and Kubeflow pipeline
Table Plot list by datasets
Table Example pipeline with visualization (classify_mib_by_dl)
Table Example of common visualization (scatter plot, box plot)
Table Example of common visualization (PCA)
Table Example of individual visualization (bar plot, pie plot)
Table Pipeline and run for classify_wgs_by_ml
Table Form of classify_wgs_by_ml
Table Result and visualization of classify_wgs_by_ml
Table Pipeline and form for classify_wgs_by_dl
Table Pipeline and run for regress_wgs_by_ml
Table Form of regress_wgs_by_ml
Table Result and visualization of regress_wgs_by_ml
Table Pipeline and form for regress_wgs_by_dl
Table Pipeline and run for classify_rna_by_ml
Table Form and result of classify_rna_by_ml
Table Pipeline and run for classify_rna_by_dl
Table Form and result of classify_rna_by_dl
Table Pipeline and run for regress_rna_by_ml
Table Form and result of regress_rna_by_ml
Table Pipeline and run for classify_meth_by_dl
Table Form and result of classify_meth_by_dl
Table Pipeline and run for regress_meth_by_dl
Table Result and visualization of regress_meth_by_dl
Table Pipeline and run for classify_mib_by_ml
Table Form and result of classify_mib_by_ml
Table Pipeline and run for classify_mib_by_dl
Table Form and result of classify_mib_by_dl
Table Pipeline and run for regress_mib_by_ml
Table Form and result of regress_mib_by_ml
Table Pipeline and run for regress_mib_by_dl
Table Form and result of regress_mib_by_dl
Table Pipeline and run for classify_lifelog_by_dl
Table Visualization of classify_lifelog_by_dl
Table Pipeline and run for regress_lifelog_by_dl
Table Pipeline and run for classify_wsi_by_dl
Table Form and result of classify_wsi_by_dl
Table Pipeline for classify_multi_by_dl
Table Run for classify_multi_by_dl
Table Three server systems for Kubeflow pipeline deployment
Table Kubeflow pipeline deployment status
Table Kubeflow login screen
Table Kubeflow dashboard screen
Table Kubeflow Pipeline list screen
Table Kubeflow Pipeline Example of Naming Rules
Table Kubeflow Pipeline detail screen
Table Kubeflow pipeline Run list screen
Table Kubeflow pipeline Run detail screen
Table Kubeflow pipeline Experiment list screen
Table Kubeflow pipeline Experiment detail screen
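
The pipeline names in the captions above appear to follow a task_data_by_method pattern (for example classify_wgs_by_ml). The sketch below parses that pattern and tallies the names by data type and method. The regular expression is an assumption inferred from the captions, not a documented naming rule, and several names are read from captions with broken spacing or characters. The captions name 17 pipelines while the abstract reports 18; the remaining one is not guessed here.

```python
import re
from collections import Counter

# Assumed pattern: <task>_<data>_by_<method>, inferred from the captions.
PIPELINE_NAME = re.compile(r"^(classify|regress)_([a-z]+)_by_(ml|dl)$")


def parse_pipeline_name(name):
    """Split a pipeline name into its task, data type and method parts."""
    match = PIPELINE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected pipeline name: {name!r}")
    task, data, method = match.groups()
    return {"task": task, "data": data, "method": method}


# Names as they appear in the figure captions of this record.
names = [
    "classify_wgs_by_ml", "classify_wgs_by_dl",
    "regress_wgs_by_ml", "regress_wgs_by_dl",
    "classify_rna_by_ml", "classify_rna_by_dl", "regress_rna_by_ml",
    "classify_meth_by_dl", "regress_meth_by_dl",
    "classify_mib_by_ml", "classify_mib_by_dl",
    "regress_mib_by_ml", "regress_mib_by_dl",
    "classify_lifelog_by_dl", "regress_lifelog_by_dl",
    "classify_wsi_by_dl", "classify_multi_by_dl",
]

parsed = [parse_pipeline_name(name) for name in names]
by_data = Counter(item["data"] for item in parsed)
by_method = Counter(item["method"] for item in parsed)
print(dict(by_data))
print(dict(by_method))
```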
