The technology of extracting content information from web page based on DOM tree

Author：

Yuan, Dingrong | Mo, Zhuoying | Xie, Bing | Xie, Yangcai

Indexed by：

EI Scopus

Abstract：

Web pages carry huge amounts of information, comprising content information and other useless information such as navigation, advertisements and Flash animations. To reduce the toil of Web users, we established a technique to extract the content information from a web page. Firstly, we analyzed the semantics of web documents with Google's V8 engine and parsed each web document into a DOM tree. Then we traversed the DOM tree and pruned it in light of the characteristics of the language in which the web page is written. Finally, we extracted the content information from the web page. Theory and experiments showed that the technique could simplify the web page, present the content information to web users and supply clean data for application areas such as retrieval, knowledge discovery in databases (KDD) and data mining (DM) from the web. © 2011 Springer-Verlag Berlin Heidelberg.
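
The parse, traverse, prune and extract steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps the V8 engine for Python's standard `html.parser`, and the `NOISE_TAGS` set is an assumed pruning rule, since the abstract does not list the rules the paper actually uses. The names `Node`, `DomBuilder`, `prune`, `text_of` and `extract_content` are hypothetical.

```python
from html.parser import HTMLParser

# ASSUMPTION: tags treated as non-content. The abstract names navigation,
# advertisements and Flash as noise but does not give the paper's rules.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside",
              "iframe", "object", "embed", "form"}
# Void elements have no closing tag, so they never become the open element.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A DOM-like element whose children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simple tree from parser events (stand-in for a real DOM)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def prune(node):
    """Traverse the tree depth-first and drop subtrees rooted at noise tags."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag in NOISE_TAGS:
                continue
            prune(child)
        kept.append(child)
    node.children = kept


def text_of(node):
    """Collect the remaining text fragments in document order."""
    parts = []
    for child in node.children:
        parts.extend(text_of(child) if isinstance(child, Node) else [child])
    return parts


def extract_content(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return text_of(builder.root)


if __name__ == "__main__":
    sample = ("<html><body><nav><a href='/'>Home</a></nav>"
              "<div><h1>Title</h1><p>Body text.</p></div>"
              "<script>var x = 1;</script><footer>Ads</footer></body></html>")
    print(extract_content(sample))  # prints ['Title', 'Body text.']
```

A tag blacklist is the simplest possible pruning rule. A rule closer to the abstract's description would presumably also weigh structural features of the page, which this sketch does not attempt.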

Keyword：

Semantics | Information retrieval | XML | Websites

Author Community：

[ 1 ] [Yuan, Dingrong] College of Computer Science and Information Technology, Guangxi Normal University, Guilin 541004, China
[ 2 ] [Yuan, Dingrong] International WIC Institute, Beijing University of Technology, Beijing 100022, China
[ 3 ] [Mo, Zhuoying] College of Computer Science and Information Technology, Guangxi Normal University, Guilin 541004, China
[ 4 ] [Xie, Bing] College of Computer Science and Information Technology, Guangxi Normal University, Guilin 541004, China
[ 5 ] [Xie, Yangcai] College of Computer Science and Information Technology, Guangxi Normal University, Guilin 541004, China