[Patent] Automatic charset detection using support vector machines with charset grouping

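IPC classification information
Country / Type	United States (US) Patent, granted
International Patent Classification (IPC 7th ed.)	G06F-015/18 G06F-015/00
Application number	UP-0238478 (2005-09-28)
Registration number	US-7689531 (2010-04-23)
Inventors / Address	Diao, Lili; Cheng, Yun-chian
Applicant / Address	Trend Micro Incorporated
Agent / Address	IP Strategy Group, P.C.
Citation information	Times cited: 10; Patents cited: 5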

Abstract

The invention relates, in an embodiment, to a computer-implemented method for automatic charset detection, which includes detecting an encoding scheme of a target document. The method includes training, using a plurality of text document samples, to obtain a set of machine learning models. Training includes using a SVM (Support Vector Machine) technique to generate the set of machine learning models from feature vectors obtained from the plurality of text document samples. The method also includes applying the set of machine learning models against a set of target document feature vectors converted from the target document to detect the encoding scheme.
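The train-then-apply flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes byte bigrams as the feature units, one linear SVM per charset (one-vs-rest), no bias term, and a Pegasos-style subgradient solver, none of which the abstract specifies. The names `bigrams`, `vectorize`, `train_svm`, `train`, and `detect`, as well as the sample words, are hypothetical.

```python
from collections import Counter

def bigrams(data: bytes) -> Counter:
    """Count overlapping 2-byte sequences (assumed feature unit)."""
    return Counter(data[i:i + 2] for i in range(len(data) - 1))

def vectorize(data: bytes, features: list) -> list:
    """Convert raw bytes to an L2-normalised count vector over `features`."""
    counts = bigrams(data)
    vec = [float(counts[f]) for f in features]
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec

def train_svm(X, y, lam=0.01, epochs=50):
    """Linear SVM without bias, trained by Pegasos-style subgradient steps
    on the L2-regularised hinge loss. Labels in y are +1 or -1."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            t += 1
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink step from the regulariser (learning rate 1 / (lam * t)).
            w = [wi * (1.0 - 1.0 / t) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: move towards the misclassified side.
                step = label / (lam * t)
                w = [wi + step * xi for wi, xi in zip(w, x)]
    return w

def train(samples):
    """samples: list of (raw_bytes, charset_name). Returns (features, models)."""
    features = sorted({g for data, _ in samples for g in bigrams(data)})
    X = [vectorize(data, features) for data, _ in samples]
    models = {}
    for charset in sorted({c for _, c in samples}):
        y = [1 if c == charset else -1 for _, c in samples]
        models[charset] = train_svm(X, y)
    return features, models

def detect(data: bytes, features, models) -> str:
    """Score the target against every per-charset model; highest score wins."""
    x = vectorize(data, features)
    scores = {c: sum(wi * xi for wi, xi in zip(w, x)) for c, w in models.items()}
    return max(scores, key=scores.get)

# Toy training set: the same words encoded with three charsets.
WORDS = ["молоко", "привет"]
CHARSETS = ["utf-8", "cp1251", "koi8_r"]
SAMPLES = [(w.encode(cs), cs) for cs in CHARSETS for w in WORDS]
FEATURES, MODELS = train(SAMPLES)
```

Calling `detect("колокол".encode("koi8_r"), FEATURES, MODELS)` scores an unseen word against the three models. A real system would need far more training text per charset and a tuned regularisation constant; the per-charset score here merely stands in for the "similarity" the abstract refers to.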

Representative claims

What is claimed is:

1. An article of manufacture comprising a program storage medium having computer readable code embodied therein, the computer readable code being configured for handling at least an email message received through a communication network, said email message including a target document, said target document involving an encoding scheme, the article of manufacture comprising: code for training, using a plurality of text document samples that have been encoded with different encoding schemes and selected for training purposes, said different encoding schemes pertaining to charset encoding for transmission over a network, to obtain a set of machine learning models, said training including using a SVM (Support Vector Machine) technique to generate said set of machine learning models from feature vectors converted from said plurality of text document samples, said feature vectors are grouped by charsets, wherein said training including generating fundamental units from said plurality of text document samples for charsets of said plurality of text document samples and extracting a subset of said fundamental units to form a set of feature lists, said feature vectors are converted from said set of feature lists and said plurality of text document samples, said extracting said subset of said fundamental units includes filtering said fundamental units to obtain fundamental units that are more discriminatory in describing differences among said different encoding schemes; code for applying said set of machine learning models against a set of target document feature vectors converted from said target document, said applying including analyzing said set of target document feature vectors using said set of machine learning models to compute similarity indicia between said set of target document feature vectors and said set of machine learning models associated with said different encoding schemes, wherein a first encoding scheme associated with said set of machine learning models is designated as said encoding scheme if characteristics of said first encoding scheme as represented by said set of machine learning models are computed to be most similar, relative to other encoding schemes of said different encoding schemes, to said set of target document feature vectors; code for decoding said target document to obtain decoded content of said target document based on at least said first encoding scheme; code for determining whether said email message is a spam message based on at least said decoded content of said document; and code for preventing said email message from reaching an email user if said email message is determined to be spam according to said determining.

2. The article of manufacture of claim 1 wherein said filtering employs cross-entropy.

3. The article of manufacture of claim 1 wherein said feature vectors are converted using a statistical representation technique.

4. The article of manufacture of claim 1 wherein said feature vectors are converted using a TF-IDF (Term Frequency-Inverse Document Frequency) technique.

5. The article of manufacture of claim 4 wherein said TF-IDF technique employs a VSM (Vector Space Model) representation approach.

6. The article of manufacture of claim 1 wherein target document contains text.

7. The article of manufacture of claim 1 wherein said applying including converting said target document to said set of target document feature vectors.

8. The article of manufacture of claim 7 wherein converting said target document to said set of target document feature vectors employs a TF-IDF (Term Frequency-Inverse Document Frequency) technique.

9. The article of manufacture of claim 8 wherein said TF-IDF technique employs a VSM (Vector Space Model) representation approach.

10. The article of manufacture of claim 1 wherein said target document represents said email message.

11. The article of manufacture of claim 1 wherein said target document represents an attachment to an email.

12. The article of manufacture of claim 1 wherein said target document represents at least a portion of a web page.

13. An article of manufacture comprising a program storage medium having computer readable code embodied therein, the computer readable code being configured for handling at least an email message received through a communication network, said email message including a received document, said received document involving an encoding scheme, the article of manufacture comprising: code for receiving a plurality of text document samples, said plurality of text document samples being encoded with different encoding schemes and selected for training purposes; said different encoding schemes pertaining to charset encoding for transmission over a network, and code for training, using said plurality of text document samples, to obtain a set of machine learning models, said code for training including code for generating fundamental units from said plurality of text document samples for charsets of said plurality of text document samples, code for extracting a subset of said fundamental units as feature lists, said extracting said subset of said fundamental units including filtering said fundamental units to obtain fundamental units that are more discriminatory in describing differences between said different encoding schemes, code for converting said feature lists into a set of feature vectors, said feature vectors are grouped by charsets, and code for generating said set of machine learning models from said set of feature vectors using a SVM (Support Vector Machine) technique; code for applying said set of machine learning models against a set of target document feature vectors converted from said received document, said applying including analyzing said set of received document feature vectors using said set of machine learning models to compute similarity indicia between said set of target document feature vectors and said set of machine learning models associated with said different encoding schemes, wherein a first encoding scheme associated with said set of machine learning models is designated as said encoding scheme if characteristics of said first encoding scheme as represented by said set of machine learning models are computed to be most similar, relative to other encoding schemes of said different encoding schemes, to said set of received document feature vectors; code for decoding said received document to obtain decoded content of said received document based on at least said first encoding scheme; code for determining whether said email message is a spam message based on at least said decoded content of said document; and code for preventing said email message from reaching an email user if said email message is determined to be spam according to said determining.

14. The article of manufacture of claim 13 wherein said filtering employs a feature selection technique.

15. The article of manufacture of claim 13 wherein said set of feature vectors are converted using a statistical representation technique.

16. The article of manufacture of claim 13 wherein said set of feature vectors are converted using a TF-IDF (Term Frequency-Inverse Document Frequency) technique.

17. The article of manufacture of claim 13 wherein said received document includes text.

18. The article of manufacture of claim 13 further comprising code for displaying said decoded content if said email message is determined to be not spam according to said determining.

19. The article of manufacture of claim 18 wherein said applying including converting said received document to said set of received document feature vectors.

20. The article of manufacture of claim 19 wherein said converting said received document to said set of received document feature vectors employs a TF-IDF (Term Frequency-Inverse Document Frequency) technique.

21. The article of manufacture of claim 13 wherein said received document represents an attachment to said email message.
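The feature-filtering and TF-IDF steps named in the claims (cross-entropy filtering in claim 2, TF-IDF conversion in claims 4, 8, 16 and 20) might look roughly like the sketch below. The claims do not give formulas, so everything here is an assumption: byte bigrams serve as the "fundamental units", the filter uses the expected cross-entropy score common in text categorisation, score(t) = P(t) * sum over c of P(c|t) * log(P(c|t) / P(c)), and the IDF term is the `log(N / df) + 1` variant. All function names and sample data are hypothetical.

```python
import math
from collections import Counter

def unit_counts(data: bytes) -> Counter:
    """Count overlapping 2-byte sequences (assumed 'fundamental units')."""
    return Counter(data[i:i + 2] for i in range(len(data) - 1))

def cross_entropy_scores(samples):
    """Score every unit by expected cross-entropy over document counts.
    samples: list of (raw_bytes, charset_name)."""
    n_docs = len(samples)
    docs_per_class = Counter(charset for _, charset in samples)
    doc_freq = Counter()   # number of documents containing each unit
    joint = Counter()      # number of documents per (unit, charset) pair
    for data, charset in samples:
        for unit in set(unit_counts(data)):
            doc_freq[unit] += 1
            joint[unit, charset] += 1
    scores = {}
    for unit, df in doc_freq.items():
        divergence = 0.0
        for charset, n_c in docs_per_class.items():
            p_c_given_t = joint[unit, charset] / df
            if p_c_given_t > 0.0:
                divergence += p_c_given_t * math.log(p_c_given_t / (n_c / n_docs))
        scores[unit] = (df / n_docs) * divergence
    return scores, doc_freq

def select_features(scores, k):
    """Keep the k highest-scoring units; ties are broken by byte order."""
    return sorted(scores, key=lambda u: (-scores[u], u))[:k]

def tfidf_vector(data, features, doc_freq, n_docs):
    """Relative term frequency times (log(N / df) + 1) for each feature."""
    counts = unit_counts(data)
    total = sum(counts.values()) or 1
    return [(counts[f] / total) * (math.log(n_docs / doc_freq[f]) + 1.0)
            for f in features]

# Toy data: an ASCII prefix shared by every sample, then charset-specific bytes.
WORDS = ["молоко", "привет"]
CHARSETS = ["utf-8", "cp1251", "koi8_r"]
SAMPLES = [(b"ab" + w.encode(cs), cs) for cs in CHARSETS for w in WORDS]
SCORES, DOC_FREQ = cross_entropy_scores(SAMPLES)
FEATURES = select_features(SCORES, k=10)
VECTOR = tfidf_vector(SAMPLES[2][0], FEATURES, DOC_FREQ, len(SAMPLES))
```

In this toy set the bigram `b"ab"` occurs in every sample of every charset, so its conditional class distribution equals the prior and its score is zero, which is why it is filtered out. Units that occur under only one charset score higher and survive into the feature list.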

Patents cited by this patent (5)

Powell Robert David, Identifying language and character set of data representing text.
Shanahan, James G.; Roma, Norbert; Evans, David A., Method and apparatus for adjusting the model threshold of a support vector machine for text classification and filtering.
Bokser Mindy R. (San Francisco CA), Pattern classification means for use in a pattern recognition system.
Pastor Jose (191 Wilton Rd. Westport CT 06880), Recognition method for character set.
Casey Richard G.; Takahashi Hiroyasu, JPX, Speed and recognition enhancement for OCR using normalized height/width position.

Patents citing this patent (10)

Petriuc, Mihai, Click distance determination.
Maine, Stephen Jared; Coulson, Michael J.; Vishwanath, Tirunelveli R.; Christensen, Erik B., Conversion of hierarchical infoset type data to binary data.
Buryak, Kirill; Lewis, Glenn M.; Benbarak, Nadav; Ben-Artzi, Aner; Peng, Jun, Customer support solution recommendation system.
Tankovich, Vladimir; Meyerzon, Dmitriy; Poznanski, Victor, Detection of junk in search result ranking.
Tankovich, Vladimir; Meyerzon, Dmitriy; Taylor, Michael James, Document length as a static relevance feature for ranking search results.
Jethanandani, Natasha H.; Maine, Stephen Jared; Osovetsky, Evgeny; Rangachari, Krishnan R.; Vishwanath, Tirunelveli R., Encoding/decoding while allowing varying message formats per message.
Meyerzon, Dmitriy; Shnitko, Yauhen; Burges, Chris J. C.; Taylor, Michael James, Enterprise relevancy ranking using a neural network.
Poznanski, Victor; Wang, Oivind; Holm, Fredrik; Bodd, Nicolai; Tankovich, Vladimir; Meyerzon, Dmitriy, Re-ranking search results.
Tankovich, Vladimir; Li, Hang; Meyerzon, Dmitriy; Xu, Jun, Search results ranking using editing distance and document information.
Merrigan, Chadd Creighton; Peltonen, Kyle G.; Meyerzon, Dmitriy; Lee, David J., System and method for scoping searches using index keys.
