[Paper] PCA document reconstruction for email classification

Gomez, J.C.; Moens, M.F.

doi:10.1016/j.csda.2011.09.023

Computational Statistics & Data Analysis, vol. 56, no. 3, 2012, pp. 741-751

Gomez, J.C. (KULEUVEN, Computer Science Department, Celestijnenlaan 200A, B-3001 Heverlee, Belgium); Moens, M.F.

Abstract

This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR). The underlying idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that were used to compute the principal components (PCs), and that for other kinds of documents the compression will not perform well using only a few components. The classifier therefore computes the PCA separately for each document class. When a new instance arrives to be classified, it is projected onto the set of PCs computed for each class and then reconstructed using the same PCs. The reconstruction error is computed, and the classifier assigns the instance to the class with the smallest error, that is, the smallest divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g., spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
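
The per-class projection and reconstruction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `PCADRClassifier`, the `n_components` parameter, the Euclidean norm as the error measure, and the toy term-count vectors are all assumptions made for the example, and the paper's actual features and preprocessing are not reproduced here.

```python
import numpy as np


class PCADRClassifier:
    """Illustrative per-class PCA reconstruction classifier (hypothetical name)."""

    def __init__(self, n_components=1):
        # Number of PCs kept per class; a free parameter assumed for this sketch.
        self.n_components = n_components
        self.models_ = {}

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        for label in np.unique(y).tolist():
            Xc = X[y == label]
            mean = Xc.mean(axis=0)
            # Rows of Vt are the principal directions of the centered class data.
            _, _, Vt = np.linalg.svd(Xc - mean, full_matrices=False)
            self.models_[label] = (mean, Vt[: self.n_components])
        return self

    def reconstruction_errors(self, x):
        x = np.asarray(x, dtype=float)
        errors = {}
        for label, (mean, pcs) in self.models_.items():
            centered = x - mean
            coords = pcs @ centered      # project onto this class's PCs
            rebuilt = pcs.T @ coords     # reconstruct from the same PCs
            # Euclidean norm of the residual (an assumed error measure).
            errors[label] = float(np.linalg.norm(centered - rebuilt))
        return errors

    def predict(self, x):
        # Assign the class whose PCs reconstruct the instance with least error.
        errors = self.reconstruction_errors(x)
        return min(errors, key=errors.get)


# Toy term-count vectors (made up): "spam" uses terms 0-1, "ham" uses terms 2-3.
X_train = [
    [3, 1, 0, 0], [1, 3, 0, 0], [2, 2, 0, 0], [4, 1, 0, 0],
    [0, 0, 3, 1], [0, 0, 1, 3], [0, 0, 2, 2], [0, 0, 1, 4],
]
y_train = ["spam"] * 4 + ["ham"] * 4

clf = PCADRClassifier(n_components=1).fit(X_train, y_train)
print(clf.predict([3, 2, 0, 0]))
print(clf.reconstruction_errors([3, 2, 0, 0]))
```

Keeping only a few components per class is what makes the errors discriminative in this sketch: with enough components every class would reconstruct any instance well, and the error gap between classes would shrink.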

