[Paper] k-NN Join Based on LSH in Big Data Environment

Ji, Jiaqi; Chung, Yeongjee

doi:10.6109/jicce.2018.16.2.99

Journal of Information and Communication Convergence Engineering, vol. 16, no. 2, pp. 99-105, 2018

Ji, Jiaqi (Department of Information Center, Hebei Normal University for Nationalities) , Chung, Yeongjee (Department of Computer Engineering, Wonkwang University)

Abstract

k-Nearest neighbor join (k-NN Join) is a computationally intensive algorithm that is designed to find k-nearest neighbors from a dataset S for every object in another dataset R. Most related studies on k-NN Join are based on single-computer operations. As the data dimensions and data volume increase, running the k-NN Join algorithm on a single computer cannot generate results quickly. To solve this scalability problem, we introduce the locality-sensitive hashing (LSH) k-NN Join algorithm implemented in Spark, an approach for high-dimensional big data. LSH is used to map similar data onto the same bucket, which can reduce the data search scope. In order to achieve parallel implementation of the algorithm on multiple computers, the Spark framework is used to accelerate the computation of distances between objects in a cluster. Results show that our proposed approach is fast and accurate for high-dimensional and big data.

AI summary of the main text

* The sentences below were identified automatically by AI, and some may not be suitable; please use them with care.

Hypothesis setting

(3) By using the aforementioned hash function, similar vectors in the signature matrix are mapped onto the same bucket with higher probability.
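
The bucketing step can be illustrated with a minimal sketch. The hash function referred to above is not reproduced in this record, so the code assumes a common choice for Jaccard-style similarity: MinHash signatures split into bands, where objects with an identical band land in the same bucket. The names `minhash_signature`, `lsh_buckets`, `num_hashes`, and `bands` are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_hashes=32):
    """Return one minimum per seeded hash function over a non-empty set.

    Assumption: seeded BLAKE2b stands in for the paper's hash family.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def lsh_buckets(signatures, bands=8):
    """Group object keys whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        rows = len(sig) // bands
        if rows == 0:
            raise ValueError("bands must not exceed the signature length")
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            # The band index is part of the key so that different bands
            # never share a bucket by accident.
            buckets[(b, band)].add(key)
    return buckets
```

With this layout, more bands of fewer rows make collisions (and therefore candidate pairs) more likely, while fewer bands of more rows make the buckets stricter.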
[Figure: Effect of node number and data size. (a) The size of S is fixed, and the size of R is changed. (b) The size of R is fixed, and the size of S is changed.]

Proposed method

d(r, s) denotes the distance between objects r and s. This study uses the Jaccard distance.
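
For two sets r and s, the Jaccard distance is d(r, s) = 1 - |r ∩ s| / |r ∪ s|. The sketch below computes it and uses it in an exact, brute-force k-NN join that serves as a baseline for comparison. It assumes each object is represented as a set of features; the function names and the dictionary-based layout of R and S are illustrative choices, not the paper's.

```python
import heapq


def jaccard_distance(r, s):
    """Jaccard distance between two collections treated as sets."""
    r, s = set(r), set(s)
    union = r | s
    if not union:
        # Convention assumed here: two empty sets are identical.
        return 0.0
    return 1.0 - len(r & s) / len(union)


def knn_join(R, S, k):
    """For every object in R, return its k nearest objects from S.

    R and S map object ids to feature sets. Every pair is compared,
    so the cost grows with |R| * |S|.
    """
    return {
        rid: heapq.nsmallest(
            k, ((jaccard_distance(r, s), sid) for sid, s in S.items())
        )
        for rid, r in R.items()
    }
```

For example, `jaccard_distance({1, 2, 3}, {2, 3, 4})` is 0.5, because the sets share two of four distinct elements.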
In order to address standard issues in distributed systems, such as scalability and fault tolerance, we implement our algorithm on Spark. The key idea of the algorithm is to use LSH to map the data. The theoretical analysis and results show that the algorithm proposed in this study for processing high-dimensional big data is fast and very effective.

Dataset

The binary labels reflect whether or not the content owner approves of the ad. The dataset has 54,877 dimensions and 4,143 records.

Theory/Model

In this paper, we introduce the LSHS k-NN Join algorithm for handling high-dimensional big data. In order to address standard issues in distributed systems, such as scalability and fault tolerance, we implement our algorithm on Spark.

Performance/Effect

The main mechanism of LSHS k-NN is that the objects of dataset S are first hashed into buckets by LSH; the distances are then calculated on Spark. The experimental results indicate that the proposed algorithm is efficient and accurate for high-dimensional big data.
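
The two-stage mechanism can be sketched on a single machine as follows. This is not the authors' Spark implementation: it assumes MinHash banding as the LSH scheme and non-empty feature sets as input, and the names `band_keys` and `lsh_knn_join` are hypothetical. The per-object loop in step 2 marks the part that a Spark job would distribute across the cluster.

```python
import hashlib
import heapq
from collections import defaultdict


def _h(seed, x):
    """Seeded 64-bit hash (assumed stand-in for the paper's hash family)."""
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def band_keys(items, bands=16, rows=2):
    """MinHash a non-empty set, then return one bucket key per band."""
    sig = [min(_h(seed, x) for x in items) for seed in range(bands * rows)]
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]


def jaccard_distance(r, s):
    return 1.0 - len(r & s) / len(r | s)


def lsh_knn_join(R, S, k, bands=16, rows=2):
    # Step 1: hash every object of S into buckets.
    buckets = defaultdict(set)
    for sid, s in S.items():
        for key in band_keys(s, bands, rows):
            buckets[key].add(sid)

    # Step 2: for each r, compute distances only to the objects of S
    # that share at least one bucket with it.
    result = {}
    for rid, r in R.items():
        candidates = set()
        for key in band_keys(r, bands, rows):
            candidates |= buckets.get(key, set())
        result[rid] = heapq.nsmallest(
            k, ((jaccard_distance(r, S[sid]), sid) for sid in candidates)
        )
    return result
```

Because only bucket-mates are compared, the result is approximate: a true neighbor that shares no bucket with r is missed, and an object with fewer than k candidates gets a shorter list.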

