Indexed by:
Abstract:
There are huge amounts of information on Web pages, which includes content information and other useless information, such as navigation, advertisement and flash of animation etc. Reducing the toils of Web users, we estabished a thechnique to extract the content information from web page. Fristly, we analyzed the semantic of web documents by V8 engine of Google and parsed the web document into DOM tree. And then, traversed the DOM tree, pruned the DOM tree in the light of the characteristic of Web page's edit language. Finally, we extracted the content information from Web page. Theoretics and experiments showed that the technique could simplify the web page, present the content information to web users and supply clean data for applicable area, such as retrieval, KDD and DM from web. © 2011 Springer-Verlag Berlin Heidelberg.
Keyword:
Reprint Author's Address:
Email:
Source :
ISSN: 1865-0929
Year: 2011
Issue: PART 2
Volume: 144 CCIS
Page: 271-278
Language: English
Cited Count:
WoS CC Cited Count: 0
SCOPUS Cited Count: 1
ESI Highly Cited Papers on the List: 0 Unfold All
WanFang Cited Count:
Chinese Cited Count:
30 Days PV: 6
Affiliated Colleges: