Yadollahi, Mohammad Mehdi (Social Networks Lab., Faculty of Electrical and Computer Engineering, University of Tehran, Tehran, Iran)
Asadpour, Masoud (Social Networks Lab., Faculty of Electrical and Computer Engineering, University of Tehran, Tehran, Iran)
A webpage contains many blocks of data, which can be informative or non-informative. Content extraction methods distinguish informative data, such as the page title, headlines, news articles and post bodies, from non-informative data such as advertisements, sidebars and navigational menus. Content extraction is difficult because webpage structures vary widely. In this paper, we propose a content extraction method named Automatic Webpage Segmentation (AWS), which classifies the main content of a given webpage using a feature set consisting of structural and shallow text features. We use the DOM tree of each webpage for feature extraction. The results are promising: the proposed method is effective at classifying the individual text elements of a webpage. In addition, feature selection methods, both wrapper and filter, are used to improve the performance of AWS.
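
As a rough illustration of what DOM-derived structural and shallow text features can look like, the sketch below walks an HTML document with Python's standard-library html.parser and records a few simple features for each text element. It is not the AWS feature set, which the abstract does not enumerate. The feature names (depth, inside_link, word_count, avg_word_length), the helper extract_features, and the sample page are all hypothetical, and the classification and feature selection steps are omitted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class TextBlockFeatures(HTMLParser):
    """Collects one feature record per non-empty text node (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # one dict of features per text element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        words = text.split()
        self.blocks.append({
            "text": text,
            # Structural features (assumed, taken from the open-tag path).
            "parent_tag": self.stack[-1] if self.stack else "",
            "depth": len(self.stack),
            "inside_link": "a" in self.stack,
            # Shallow text features (assumed, computed from the raw string).
            "word_count": len(words),
            "avg_word_length": sum(len(w) for w in words) / len(words),
        })


def extract_features(html):
    """Return a list of feature dicts, one per text element in the page."""
    parser = TextBlockFeatures()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    sample = ("<html><body><div id='nav'><a href='/'>Home</a></div>"
              "<article><p>Content extraction separates the main text "
              "from menus.</p></article></body></html>")
    for block in extract_features(sample):
        print(block)
```

Records of this kind could then be labelled as informative or non-informative and passed to any standard classifier; which classifier and which features AWS actually uses is not stated in the abstract.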