A Method for Twitter Spam Detection Using N-Gram Dictionary Under Limited Labeling

H. J. Choi (최혁준); C. H. Park (박정희)

doi:10.3745/ktsde.2017.6.9.445


KIPS Transactions on Software and Data Engineering, Vol.6, No.9, pp.445-456, 2017.

Abstract

This paper proposes a method for detecting spam tweets that contain unhealthy information by using an n-gram dictionary when labeled training data is limited. Such spam tweets tend to reuse similar words and sentences. Exploiting this property, we build n-gram dictionaries from spam tweets and from normal tweets and show that applying a Naive Bayes classifier to them detects spam tweets effectively. However, because large volumes of data flow into Twitter in real time, constructing an initial training set is very costly. A spam detection method is therefore needed that works when the initial training set is very small or nonexistent. To this end, we propose generating pseudo-labels with Twitter's retweet function and using them both to construct the initial training set and to update the n-gram dictionaries. Experiments on 1.3 million Korean tweets collected from December 1 to December 7, 2016 show that the proposed method outperforms the compared spam detection methods.
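As a rough illustration of the dictionary-plus-classifier idea in the abstract, the sketch below builds one n-gram count dictionary per class and scores a tweet with a Laplace-smoothed Naive Bayes rule. It is not the authors' implementation: the use of character bigrams, the smoothing scheme, the equal class prior, and all function names are assumptions made for the example, and the toy tweets are invented.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the character n-grams of a tweet (whitespace removed).
    Character-level n-grams are an assumption of this sketch."""
    s = "".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_dictionary(tweets, n=2):
    """Count how often each n-gram occurs across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(char_ngrams(tweet, n))
    return counts


def log_likelihood(grams, dictionary, vocab_size):
    """Sum of Laplace-smoothed log P(gram | class) over the tweet's n-grams."""
    total = sum(dictionary.values())
    return sum(
        math.log((dictionary[g] + 1) / (total + vocab_size)) for g in grams
    )


def classify(text, spam_dict, normal_dict, prior_spam=0.5, n=2):
    """Label a tweet 'spam' or 'normal' by comparing class log-scores."""
    grams = char_ngrams(text, n)
    # +1 reserves one slot for n-grams unseen in either dictionary.
    vocab_size = len(set(spam_dict) | set(normal_dict)) + 1
    spam_score = math.log(prior_spam) + log_likelihood(grams, spam_dict, vocab_size)
    normal_score = math.log(1 - prior_spam) + log_likelihood(grams, normal_dict, vocab_size)
    return "spam" if spam_score > normal_score else "normal"


# Invented toy data, only to exercise the functions.
spam_dict = build_dictionary(["free coupon click now", "click now free prize"])
normal_dict = build_dictionary(["lunch with friends today", "rainy weather today"])
label = classify("free prize click", spam_dict, normal_dict)
```

Updating a dictionary with newly pseudo-labeled tweets would, under these assumptions, amount to calling `Counter.update` on the matching class dictionary with the new tweets' n-grams.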

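Q&A

Keyword	Question	Answer extracted from the paper
	What roles do follow and retweet play?	The main factors behind Twitter's growth into a leading social media platform are its follow and retweet features. Both play a major role in spreading information on Twitter. The follow feature establishes a relationship with another user and lets a user receive, in real time, the tweets of the accounts they follow, while the retweet feature propagates an existing tweet to one's own followers. When a newsworthy event occurs, tweets about it surge and retweet counts rise sharply as well, so the number of users exposed to retweeted tweets grows rapidly.
	What is the Streaming API?	The Twitter data used in this paper was collected with Twitter's Streaming API [29]. The Streaming API provides a random 1% sample of the tweets generated on Twitter in real time. Tweets collected through the Streaming API are returned as JSON containing information about the tweet and the user who wrote it, from which only the needed fields can be parsed and used.
	What are the main factors behind Twitter's growth into a leading social media platform?	The main factors behind Twitter's growth into a leading social media platform are its follow and retweet features. Both play a major role in spreading information on Twitter.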
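The parsing step mentioned in the Streaming API answer could look roughly like the sketch below, which pulls a few fields out of one JSON-encoded tweet and flags whether it is a retweet. The field names (`id_str`, `text`, `user.screen_name`, `retweeted_status`) are assumed to follow the tweet object layout of the Twitter API of that period, the sample payload is invented, and `parse_tweet` is a hypothetical helper rather than anything defined in the paper.

```python
import json


def parse_tweet(raw_line):
    """Extract a handful of fields from one JSON-encoded tweet.
    Field names are assumptions about the payload layout."""
    tweet = json.loads(raw_line)
    original = tweet.get("retweeted_status")  # present only on retweets (assumed)
    return {
        "id": tweet.get("id_str"),
        "user": tweet.get("user", {}).get("screen_name"),
        "text": tweet.get("text", ""),
        "is_retweet": original is not None,
        "retweeted_id": original.get("id_str") if original else None,
    }


# Invented sample payload for illustration only.
sample = (
    '{"id_str": "2", "text": "RT @a: hello", '
    '"user": {"screen_name": "b"}, '
    '"retweeted_status": {"id_str": "1"}}'
)
parsed = parse_tweet(sample)
```

The `is_retweet` flag is the kind of signal a retweet-based pseudo-labeling step might consume, though the exact labeling rule is not given in this record.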

References (33)

Statista, Number of Monthly Active Twitter Users Worldwide from 1st quarter 2010 to 4th quarter 2016 (in millions) [Internet], https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/.
David Sayce, Number of tweets per day? [Internet], http://www.dsayce.com/social-media/tweets-day/.
L. M. Aiello et al., "Sensing Trending Topics in Twitter," IEEE Trans. Multimedia, Vol.15, No.6, pp.1268-1282, 2013.
T. Sakaki, M. Okazaki, and Y. Matsuo, "Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors," in Proc. 19th International Conference on World Wide Web, ACM, pp. 851-860, 2010.
A. I. Baqapuri, S. Saleh, M. U. Ilyas, "Sentiment Classification of Tweets using Hierarchical Classification," in Proc. IEEE International Conference on Communications, IEEE, 2016.
Neal Ungerleider, Almost 10% of Twitter Is Spam [Internet], https://www.fastcompany.com/3044485/almost-10-of-twitter-is-spam/.
Judy Mottl, Twitter acknowledges 23 million active users are actually bots [Internet], http://www.techtimes.com/articles/12840/20140812/twitter-acknowledges-14-percent-users-bots-5-percent-spam-bots.htm/.
C. Chen, J. Zhang, Y. Xiang, W. Zhou, and J. Oliver, "Spammers Are Becoming "Smarter" on Twitter," IEEE Trans. IT Professional., Vol.18, No.2, pp.66-70, 2016.
H. J. Choi and C. H. Park, "A Twitter Spam Detection Method based on n-gram Dictionary," in Proc. Korea Computer Congress, Jeju, pp.227-229, 2017.
K. Tao, F. Abel, C. Hauff, G. J. Houben, and U. Gadiraju, "Groundhog Day: Near-Duplicate Detection on Twitter," in Proc. 22nd International Conference on World Wide Web, ACM, pp.1273-1284, 2013.
K. M. Lee, J. Caverlee, and S. Webb, "Uncovering social spammers : social honeypots + machine learning," in Proc. 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp.435-442, 2010.
F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, "Detecting spammers on Twitter," Presented at the 7th annual Collaboration Electronic Messaging Anti-Abuse Spam Conference (CEAS), Vol.6, 2010.
A. H. Wang, "Don't follow me : spam detection in twitter," in Proc. International Conference on Security and Cryptography (SECRYPT), 2010.
S. Liu, J. Zhang, and Y. Xiang, "Statistical Detection of Online Drifting Twitter Spam," in Proc. 11th ACM on Asia Conference on Computer and Communications Security, ACM, pp.1-10, 2016.
C. Chen et al., "A Performance Evaluation of Machine Learning-Based Streaming Spam Tweet Detection," IEEE Trans. Computational Social Systems, Vol.2, No.3, pp.65-75, 2015.
C. Chen, J. Zhang, Y. Xiang, and W. Zhou, "Asymmetric Self-Learning for Tackling Twitter Spam Drift," in Proc. IEEE Conference on Computer Communications Workshops, IEEE, pp.208-213, 2015.
G. Stringhini, C. Kruegel, and G. Vigna, "Detecting spammers on social networks," in Proc. 26th Annual Computer Security Applications Conference, ACM, pp.1-9, 2010.
J. Song, S. Lee, and J. Kim, "Spam filtering in Twitter using sender-receiver relationship," in Proc. 14th International Conference on Recent Advances in Intrusion Detection, Springer Berlin/Heidelberg, pp.301-317, 2011.
C. Yang, R. Harkreader, and G. Gu, "Empirical evaluation and new design for fighting evolving twitter spammers," IEEE Trans. Information Forensics and Security, Vol.8, No.8, pp.1280-1293, 2013.
K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song, "Design and evaluation of a real-time URL spam filtering service," in Proc. IEEE Symposium on Security and Privacy, Washington, pp.447-462, 2011.
S. H. Lee and J. Kim, "Warningbird : A near real-time detection system for suspicious URLs in Twitter spammers," IEEE Trans. Information Forensics and Security, Vol.8, No.8, pp.1280-1293, 2013.
D. M. Freeman, "Using Naive Bayes to Detect Spammy Names in Social Networks," in Proc. the 2013 ACM Workshop on Artificial Intelligence and Security, ACM, pp.3-12, 2013.
A. Herdagdelen, "Twitter n-gram corpus with demographic metadata," Language Resources and Evaluation, Vol.47, No.4, pp.1127-1147, 2013.
S. J. Lee and D. J. Choi, "Personalized Mobile Junk Message Filtering System," The Journal of the Korea Contents Association, Vol.11, No.12, pp.122-135, 2010.
H. N. Lee, M. G. Song, and E. G. Im, "A Study on Structuring Spam Short Message Service (SMS) filter," in Proc. Symposium of the Korean Institute of Communications and Information Sciences, pp.1072-1073, 2011.
S. W. Lee, "Spam Filter by Using χ² Statistics and Support Vector Machines," KIPS Journal B (2001-2012), Vol.17B, No.3, pp.249-254, 2010.
I. W. Joe and H. T. Shim, "A SVM-based Spam Filtering System for Short Message Service (SMS)," The Journal of The Korean Institute of Communication Sciences, Vol.34, No.9, pp.908-913, 2009.
Y. H. Kim et al., "Spam Twit Filtering using Naive Bayesian Algorithm and URL Analysis," in Proc. Korean Institute of Information Scientists and Engineers, Vol.38, No.2B, pp.375-378, Nov., 2011.
Twitter, Inc., Streaming APIs [Internet], https://dev.twitter.com/streaming/overview.
Cyren, Q3 Trend Report Highlights Real-Time Malware Campaigns And Increase In Phishing [Internet], https://blog.cyren.com/articles/commtouch-internet-threats-trendreport-q3-2013.html.
V. Metsis, I. Androutsopoulos, and G. Paliouras, "Spam Filtering with Naive Bayes-Which Naive Bayes?," in Proc. the Third Conference on Email and Anti-Spam, pp.28-69, 2006.
J. Graovac, "Text Categorization Using n-Gram Based Language Independent Techniques," in Proc. 35th Anniversary of Computational Linguistics, pp.124-135, 2014.
Machine Learning Group at the University of Waikato, Weka3: Data Mining Software in Java [Internet], http://www.cs.waikato.ac.nz/ml/weka/.
