[Paper] Sentiment Analysis Using Parallel Tri-LSTM Sentence Embedding Robust to Out-of-Vocabulary Words

Hyun-Young Lee; Seung-Shik Kang

doi:10.30693/smj.2021.10.1.16

Sentiment Analysis using Robust Parallel Tri-LSTM Sentence Embedding in Out-of-Vocabulary Word

Smart Media Journal, v.10 no.1, 2021, pp. 16-24

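Hyun-Young Lee (Graduate School, Department of Computer Science, Kookmin University), Seung-Shik Kang (Department of Computer Science, Kookmin University)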

Abstract

Existing word embedding methods such as word2vec represent only the words that occur in the raw training corpus as fixed-length vectors in a continuous vector space. In morphologically rich languages, the out-of-vocabulary (OOV) problem therefore arises frequently whenever a vector is needed for a word that did not occur in the corpus. The same holds for sentence embedding, which composes a fixed-length sentence vector from the vectors of the words in the sentence: when the sentence contains OOV words, it is difficult to build a sentence vector that represents its meaning well. In particular, because Korean is an agglutinative language in which lexical morphemes and grammatical morphemes combine, handling OOV words is an important factor in improving performance. In this paper, we extend the approach of using the morphological information of words to the sentence level and propose a parallel Tri-LSTM sentence embedding that is robust to the OOV problem. In experiments on a Korean sentiment analysis corpus, the character unit performed better than the morpheme unit as the embedding unit for Korean sentence embedding, and the parallel bidirectional Tri-LSTM sentence encoder achieved 86.17% accuracy on the sentiment analysis task.
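The OOV effect described in the abstract can be illustrated with a small sketch. This is not the paper's model or data: the two training sentences, the test sentence, and the helper names (`build_vocab`, `oov_rate`, `word_units`, `char_units`) are all hypothetical, and "character unit" is assumed here to mean one Hangul syllable block, which may differ from the unit used in the paper. The sketch only shows why a smaller embedding unit tends to leave fewer unseen units when Korean words are inflected.

```python
# Toy illustration (hypothetical data and helper names, not the paper's setup):
# compare the OOV rate of whitespace-separated word units with character units.

def word_units(sentence):
    # Split on whitespace; each Korean eojeol (space-delimited word) is one unit.
    return sentence.split()

def char_units(sentence):
    # Assumption: one Hangul syllable block counts as one character unit.
    return [ch for ch in sentence if not ch.isspace()]

def build_vocab(sentences, tokenize):
    # Collect every unit that occurs in the training sentences.
    vocab = set()
    for sentence in sentences:
        vocab.update(tokenize(sentence))
    return vocab

def oov_rate(sentence, vocab, tokenize):
    # Fraction of units in the sentence that are missing from the vocabulary.
    units = tokenize(sentence)
    if not units:
        return 0.0
    missing = sum(1 for unit in units if unit not in vocab)
    return missing / len(units)

# Made-up training sentences and a test sentence with different endings.
train_sentences = ["영화가 재미있다", "배우가 멋지다"]
test_sentence = "영화는 재미있었다"

word_vocab = build_vocab(train_sentences, word_units)
char_vocab = build_vocab(train_sentences, char_units)

word_oov = oov_rate(test_sentence, word_vocab, word_units)
char_oov = oov_rate(test_sentence, char_vocab, char_units)

print(f"word-unit OOV rate: {word_oov:.2f}")
print(f"char-unit OOV rate: {char_oov:.2f}")
```

In this toy case both inflected words are unseen at the word level, while most of their syllables already occur in the training sentences. A morpheme-level comparison would additionally require a morphological analyzer and is left out of the sketch.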
