Deep neural network based methods have made remarkable progress on natural language processing (NLP) tasks. Recently, convolutional neural network (CNN) based approaches have achieved strong performance not only in image classification but also in document classification. However, CNN-based document classification has so far been applied only to short sentences of about 16 words on average, and it has difficulty handling documents whose syntactic and semantic relationships span the whole text. In this paper, we propose a new approach based on a recurrent neural network (RNN) with word2vec embeddings to overcome these limitations and to achieve higher document classification accuracy. By using long short-term memory (LSTM), which mitigates the long-term dependency problem, the model classifies documents effectively even for long input sequences. To validate the proposed method on diverse data, we evaluated it not only on English datasets but also on a Korean movie review dataset. The proposed method achieved document classification accuracies of 87% on English newspaper articles containing long texts, 90% on English movie reviews composed of short sentences, and 88% on Korean movie reviews.
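The pipeline described above (word2vec embeddings fed through an LSTM, with a classifier on the final hidden state) can be sketched as follows. This is a minimal illustrative forward pass in plain numpy, not the paper's implementation: the random embedding matrix is a stand-in for pretrained word2vec vectors, the weights are untrained, and all dimensions and token ids are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in the paper the embeddings come from word2vec;
# here a small random matrix serves for illustration only.
vocab_size, embed_dim, hidden_dim = 50, 8, 16
embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

# LSTM parameters for the four gates (input, forget, candidate, output),
# stacked along the first axis for compactness.
W = rng.normal(scale=0.1, size=(4 * hidden_dim, embed_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(token_ids):
    """Run a single-layer LSTM over a token sequence; return the final hidden state."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for t in token_ids:
        x = embedding[t]
        z = W @ np.concatenate([x, h]) + b
        i = sigmoid(z[:hidden_dim])                    # input gate
        f = sigmoid(z[hidden_dim:2 * hidden_dim])      # forget gate
        g = np.tanh(z[2 * hidden_dim:3 * hidden_dim])  # candidate cell state
        o = sigmoid(z[3 * hidden_dim:])                # output gate
        c = f * c + i * g   # cell state carries long-range information forward
        h = o * np.tanh(c)
    return h

# Binary document classifier on top of the last hidden state.
w_out = rng.normal(scale=0.1, size=hidden_dim)

def classify(token_ids):
    return sigmoid(w_out @ lstm_forward(token_ids))

p = classify([3, 17, 42, 8, 25])  # a toy "document" as a list of token ids
```

Because the cell state `c` is updated additively (gated by `f` and `i`) rather than repeatedly squashed, gradients can flow across many time steps, which is why LSTM handles long sequences better than a vanilla RNN.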