[Paper] A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective

Roh, Yuji; Heo, Geon; Whang, Steven Euijong

doi:10.1109/tkde.2019.2946162

IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 4, 2021, pp. 1328-1347

Roh, Yuji (Korea Advanced Institute of Science and Technology, School of Electrical Engineering, Daejeon, Korea); Heo, Geon (Korea Advanced Institute of Science and Technology, School of Electrical Engineering, Daejeon, Korea); Whang, Steven Euijong (Korea Advanced Institute of Science and Technology, School of Electrical Engineering, Daejeon, Korea)

Abstract

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big Data and Artificial Intelligence (AI) integration and opens many opportunities for new research.

