Author:

Yan, Jianzhuo | Duan, Hexin | Fang, Liying | Ying, Wang

Indexed by:

EI Scopus

Abstract:

Many people use the web as their main source of information in daily life. However, most web pages contain non-informational components, such as sidebars, footers, and ads, which complicate extracting text from the original HTML documents. Although web text extraction techniques have been developed, they require substantial human intervention and yield low extraction quality, so their adoption and efficiency remain open problems. In this paper, we propose a maximum subsequence segmentation (MSS) algorithm and discuss its application to news websites. Unlike tree structure analysis and VIPS, the algorithm divides a web page into text segments and label segments. Experiments show that the MSS algorithm achieves 93.73% accuracy on 2000 news pages from 5 different news sites and runs much faster than DOM-based methods on the same dataset. © 2013 IEEE.
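
The abstract does not give the scoring scheme or segmentation details, so the following is only a minimal sketch of the general maximum-subsequence idea, not the authors' implementation. It assumes each text token scores positively and each HTML tag scores negatively, then uses Kadane's algorithm to find the contiguous token run with the highest total. The names `TEXT_SCORE`, `TAG_SCORE`, `tokenize`, `max_subsequence`, and `extract_text`, and the score values themselves, are hypothetical.

```python
import re

# Hypothetical scores (not from the paper): each text token counts +1,
# each HTML tag counts -3.25, so tag-dense regions are penalised.
TEXT_SCORE = 1.0
TAG_SCORE = -3.25

# A token is either a whole tag (<...>) or a run of non-space, non-'<' characters.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split HTML into tag tokens and whitespace-separated text tokens."""
    return TOKEN_RE.findall(html)


def max_subsequence(scores):
    """Kadane's algorithm.

    Returns (start, end, total) for the contiguous run with the largest
    sum; `end` is exclusive. Returns (0, 0, 0.0) if no run is positive.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix cannot help; restart the run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end, best_sum


def extract_text(html):
    """Return the text tokens inside the highest-scoring token run."""
    tokens = tokenize(html)
    scores = [TAG_SCORE if t.startswith("<") else TEXT_SCORE for t in tokens]
    start, end, _ = max_subsequence(scores)
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a><a href="/ads">Ads</a></div>'
        "<p>The quick brown fox jumps over the lazy dog near the river bank today.</p>"
        '<div><a href="/c">Contact</a></div>'
    )
    print(extract_text(page))
```

With these assumed scores, the link-heavy navigation and footer blocks sum to negative totals and fall outside the selected run, while the long paragraph is kept. A real system would need tuned or learned scores and a proper HTML parser instead of a regular expression.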

Keyword:

Websites; Satellites; Extraction; Text processing; Efficiency; Trees (mathematics)

Author Community:

  • [ 1 ] [Yan, Jianzhuo] College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China
  • [ 2 ] [Duan, Hexin] College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China
  • [ 3 ] [Fang, Liying] College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China
  • [ 4 ] [Ying, Wang] College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China

Year: 2013

Page: 619-622

Language: English

Cited Count:

WoS CC Cited Count: 0

SCOPUS Cited Count: 3

ESI Highly Cited Papers on the List: 0
