Automatic charset detection using support vector machines with charset grouping
IPC classification information
Country / Type
United States(US) Patent
Granted
International Patent Classification (IPC 7th edition)
G06F-015/18
G06F-015/00
Application number
UP-0238478
(2005-09-28)
Registration number
US-7689531
(2010-04-23)
Inventor / Address
Diao, Lili
Cheng, Yun-chian
Applicant / Address
Trend Micro Incorporated
Agent / Address
IP Strategy Group, P.C.
Citation information
Times cited: 10
Patents cited: 5
Abstract
The invention relates, in an embodiment, to a computer-implemented method for automatic charset detection, which includes detecting an encoding scheme of a target document. The method includes training, using a plurality of text document samples, to obtain a set of machine learning models. Training includes using a SVM (Support Vector Machine) technique to generate the set of machine learning models from feature vectors obtained from the plurality of text document samples. The method also includes applying the set of machine learning models against a set of target document feature vectors converted from the target document to detect the encoding scheme.
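The train-then-apply flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes raw byte histograms as feature vectors, one linear SVM per charset trained by hinge-loss sub-gradient descent, and three Japanese encodings as the charset set. All function names (`byte_vector`, `train_linear_svm`, `train_models`, `detect_charset`) and the sample texts are hypothetical.

```python
# Hypothetical sketch: one-vs-rest linear SVMs over byte-frequency vectors.
# None of these names or parameter values come from the patent.

def byte_vector(data):
    """L2-normalised histogram of the 256 possible byte values."""
    counts = [0.0] * 256
    for b in data:
        counts[b] += 1.0
    norm = sum(c * c for c in counts) ** 0.5 or 1.0
    return [c / norm for c in counts]

def train_linear_svm(xs, ys, epochs=100, lr=0.1, lam=0.001):
    """Hinge-loss sub-gradient descent; ys hold +1 or -1."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]      # L2 shrinkage
            if margin < 1.0:                              # inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def train_models(samples):
    """samples: list of (raw_bytes, charset_label). Returns {label: (w, b)}."""
    xs = [byte_vector(raw) for raw, _ in samples]
    labels = sorted({label for _, label in samples})
    models = {}
    for label in labels:
        ys = [1 if other == label else -1 for _, other in samples]
        models[label] = train_linear_svm(xs, ys)
    return models

def detect_charset(models, raw):
    """Return the charset whose model gives the highest decision score."""
    x = byte_vector(raw)

    def score(label):
        w, b = models[label]
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    return max(models, key=score)

# Toy training set (assumed): the same texts encoded three ways.
TEXTS = ["こんにちは、世界。", "これはテストです。", "日本語の文章を書きます。"]
CHARSETS = ["utf-8", "shift_jis", "euc_jp"]
SAMPLES = [(t.encode(cs), cs) for t in TEXTS for cs in CHARSETS]
MODELS = train_models(SAMPLES)
```

Calling `detect_charset(MODELS, raw_bytes)` returns the label whose model scores highest, which only loosely mirrors the "most similar" designation in the claims. A realistic system would likely rely on a tuned SVM library and far more training data than this toy set.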
Representative claim
What is claimed is:
1. An article of manufacture comprising a program storage medium having computer readable code embodied therein, the computer readable code being configured for handling at least an email message received through a communication network, said email message including a target document, said target document involving an encoding scheme, the article of manufacture comprising: code for training, using a plurality of text document samples that have been encoded with different encoding schemes and selected for training purposes, said different encoding schemes pertaining to charset encoding for transmission over a network, to obtain a set of machine learning models, said training including using a SVM (Support Vector Machine) technique to generate said set of machine learning models from feature vectors converted from said plurality of text document samples, said feature vectors are grouped by charsets, wherein said training including generating fundamental units from said plurality of text document samples for charsets of said plurality of text document samples and extracting a subset of said fundamental units to form a set of feature lists, said feature vectors are converted from said set of feature lists and said plurality of text document samples, said extracting said subset of said fundamental units includes filtering said fundamental units to obtain fundamental units that are more discriminatory in describing differences among said different encoding schemes; code for applying said set of machine learning models against a set of target document feature vectors converted from said target document, said applying including analyzing said set of target document feature vectors using said set of machine learning models to compute similarity indicia between said set of target document feature vectors and said set of machine learning models associated with said different encoding schemes, wherein a first encoding scheme associated with said set of machine learning models is designated as said encoding scheme if characteristics of said first encoding scheme as represented by said set of machine learning models are computed to be most similar, relative to other encoding schemes of said different encoding schemes, to said set of target document feature vectors; code for decoding said target document to obtain decoded content of said target document based on at least said first encoding scheme; code for determining whether said email message is a spam message based on at least said decoded content of said document; and code for preventing said email message from reaching an email user if said email message is determined to be spam according to said determining.
2. The article of manufacture of claim 1 wherein said filtering employs cross-entropy.
3. The article of manufacture of claim 1 wherein said feature vectors are converted using a statistical representation technique.
4. The article of manufacture of claim 1 wherein said feature vectors are converted using a TF-IDF (Term Frequency-Inverse Document Frequency) technique.
5. The article of manufacture of claim 4 wherein said TF-IDF technique employs a VSM (Vector Space Model) representation approach.
6. The article of manufacture of claim 1 wherein target document contains text.
7. The article of manufacture of claim 1 wherein said applying including converting said target document to said set of target document feature vectors.
8. The article of manufacture of claim 7 wherein converting said target document to said set of target document feature vectors employs a TF-IDF (Term Frequency-Inverse Document Frequency) technique.
9. The article of manufacture of claim 8 wherein said TF-IDF technique employs a VSM (Vector Space Model) representation approach.
10. The article of manufacture of claim 1 wherein said target document represents said email message.
11. The article of manufacture of claim 1 wherein said target document represents an attachment to an email.
12. The article of manufacture of claim 1 wherein said target document represents at least a portion of a web page.
13. An article of manufacture comprising a program storage medium having computer readable code embodied therein, the computer readable code being configured for handling at least an email message received through a communication network, said email message including a received document, said received document involving an encoding scheme, the article of manufacture comprising: code for receiving a plurality of text document samples, said plurality of text document samples being encoded with different encoding schemes and selected for training purposes; said different encoding schemes pertaining to charset encoding for transmission over a network, and code for training, using said plurality of text document samples, to obtain a set of machine learning models, said code for training including code for generating fundamental units from said plurality of text document samples for charsets of said plurality of text document samples, code for extracting a subset of said fundamental units as feature lists, said extracting said subset of said fundamental units including filtering said fundamental units to obtain fundamental units that are more discriminatory in describing differences between said different encoding schemes, code for converting said feature lists into a set of feature vectors, said feature vectors are grouped by charsets, and code for generating said set of machine learning models from said set of feature vectors using a SVM (Support Vector Machine) technique; code for applying said set of machine learning models against a set of target document feature vectors converted from said received document, said applying including analyzing said set of received document feature vectors using said set of machine learning models to compute similarity indicia between said set of target document feature vectors and said set of machine learning models associated with said different encoding schemes, wherein a first encoding scheme associated with said set of machine learning models is designated as said encoding scheme if characteristics of said first encoding scheme as represented by said set of machine learning models are computed to be most similar, relative to other encoding schemes of said different encoding schemes, to said set of received document feature vectors; code for decoding said received document to obtain decoded content of said received document based on at least said first encoding scheme; code for determining whether said email message is a spam message based on at least said decoded content of said document; and code for preventing said email message from reaching an email user if said email message is determined to be spam according to said determining.
14. The article of manufacture of claim 13 wherein said filtering employs a feature selection technique.
15. The article of manufacture of claim 13 wherein said set of feature vectors are converted using a statistical representation technique.
16. The article of manufacture of claim 13 wherein said set of feature vectors are converted using a TF-IDF (Term Frequency-Inverse Document Frequency) technique.
17. The article of manufacture of claim 13 wherein said received document includes text.
18. The article of manufacture of claim 13 further comprising code for displaying said decoded content if said email message is determined to be not spam according to said determining.
19. The article of manufacture of claim 18 wherein said applying including converting said received document to said set of received document feature vectors.
20. The article of manufacture of claim 19 wherein said converting said received document to said set of received document feature vectors employs a TF-IDF (Term Frequency-Inverse Document Frequency) technique.
21. The article of manufacture of claim 13 wherein said received document represents an attachment to said email message.
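The feature-preparation steps named in the claims (generating fundamental units, filtering them with cross-entropy, and converting documents to TF-IDF vectors in a Vector Space Model) might look something like the sketch below. Several choices are assumptions rather than details from the claims: overlapping byte bigrams stand in for the fundamental units, the filter uses one common "expected cross-entropy" formulation from the text-categorisation literature, and the weighting is plain `tf * log(N / df)`. All function names and the toy texts are hypothetical.

```python
import math
from collections import Counter

def byte_ngrams(data, n=2):
    """Fundamental units: overlapping byte n-grams (an assumed choice)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def expected_cross_entropy(docs, labels):
    """Score each unit t by P(t) * sum_c P(c|t) * log(P(c|t) / P(c)).

    Probabilities are document-level: P(t) is the fraction of documents
    containing t. A higher score means t is more charset-specific.
    """
    n_docs = len(docs)
    class_count = Counter(labels)
    doc_freq = Counter()      # unit -> number of documents containing it
    joint = Counter()         # (unit, label) -> number of documents
    for units, label in zip(docs, labels):
        for t in set(units):
            doc_freq[t] += 1
            joint[(t, label)] += 1
    scores = {}
    for t, df in doc_freq.items():
        total = 0.0
        for c, nc in class_count.items():
            p_c_given_t = joint[(t, c)] / df
            if p_c_given_t > 0.0:
                total += p_c_given_t * math.log(p_c_given_t / (nc / n_docs))
        scores[t] = (df / n_docs) * total
    return scores, doc_freq

def select_features(scores, k):
    """Keep the k highest-scoring units; ties are broken by byte order."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def tfidf_vector(units, feature_list, doc_freq, n_docs):
    """Vector Space Model row: tf(t) * log(N / df(t)) per selected unit."""
    tf = Counter(units)
    return [tf[t] * math.log(n_docs / doc_freq[t]) for t in feature_list]

# Toy corpus (assumed): two texts, each encoded in two charsets.
TEXTS = ["abc こんにちは", "abc ありがとう"]
RAW = [(t.encode(cs), cs) for t in TEXTS for cs in ("utf-8", "shift_jis")]
DOCS = [byte_ngrams(raw) for raw, _ in RAW]
LABELS = [cs for _, cs in RAW]

SCORES, DOC_FREQ = expected_cross_entropy(DOCS, LABELS)
FEATURES = select_features(SCORES, 4)
VECTOR = tfidf_vector(DOCS[0], FEATURES, DOC_FREQ, len(DOCS))
```

In this toy corpus the ASCII bigram `b"ab"` occurs in every document regardless of charset, so its score is zero and the filter drops it, while bigrams that occur only under one encoding score highest. That is the sense in which the filter keeps units that are "more discriminatory" between encoding schemes.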
Patents cited by this patent (5)
Powell, Robert David, Identifying language and character set of data representing text.
Shanahan, James G.; Roma, Norbert; Evans, David A., Method and apparatus for adjusting the model threshold of a support vector machine for text classification and filtering.
Maine, Stephen Jared; Coulson, Michael J.; Vishwanath, Tirunelveli R.; Christensen, Erik B., Conversion of hierarchical infoset type data to binary data.