SCAPE QA Tool: Technologies behind Pagelyzer – II Web Page Segmentation


Web pages are getting more complex than ever, so identifying the different elements of a page, such as main content, menus, user comments and advertising, becomes difficult. Web page segmentation refers to the process of dividing a web page into visually and semantically coherent segments called blocks. Detecting these blocks is a crucial step for many applications, for example content visualization on mobile devices, information retrieval, and change detection between versions in the web archive context.

Web Page Segmentation at a Glance

For a web page (W), the output of its segmentation is the semantic tree of the web page (W'). Each node represents a data region in the web page, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its child blocks. All leaf blocks are atomic units and together form a flat segmentation of the web page. Each block is identified by a block-id value (see Figure 1 for an example).
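The tree described above can be sketched in a few lines of Python. This is only an illustration, not Pagelyzer's or Block-o-Matic's actual data model: the `Block` class, the `leaves` method and the dotted block-id scheme are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """One node of the segmentation tree W' (hypothetical structure)."""
    block_id: str
    children: List["Block"] = field(default_factory=list)

    def leaves(self) -> List["Block"]:
        """Return the atomic blocks under this node, left to right."""
        if not self.children:
            return [self]
        result: List["Block"] = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# A toy page: the root block covers the whole page.
# The block ids are invented for illustration.
page = Block("B1", [
    Block("B1.1"),                                       # e.g. a header
    Block("B1.2", [Block("B1.2.1"), Block("B1.2.2")]),   # e.g. a content area
    Block("B1.3"),                                       # e.g. a footer
])

# The leaf blocks form the flat segmentation of the page.
flat_segmentation = [b.block_id for b in page.leaves()]
print(flat_segmentation)
```

Collecting the leaves gives the flat segmentation, while the inner blocks keep the information about how atomic blocks are grouped.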

Fig. 1

An efficient web page segmentation approach is important for several reasons:

  • Process different parts of a web page according to their type of content.

  • Assign more importance to one region of a web page than to the rest.

  • Understand the structure of a web page.

Pagelyzer is a tool containing a supervised framework that decides whether two web page versions are similar or not. Pagelyzer takes as input two URLs, two browser types (e.g. Firefox, Chrome) and one comparison type (image-based, hybrid or content-based). If the browser types are not set, it uses Firefox by default. SVM-based comparison is discussed in the post "SCAPE QA Tool: Technologies behind Pagelyzer – I Support Vector Machine". Based on the segmentation, hyperlinks are extracted from each block and the Jaccard distance between them is calculated.
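As a rough illustration of the link comparison, here is a minimal Jaccard distance over two sets of hyperlinks. How Pagelyzer pairs up blocks between the two versions is not covered here, so the sketch compares a single pair of blocks; the function name and the sample links are assumptions.

```python
from typing import Iterable


def jaccard_distance(links_a: Iterable[str], links_b: Iterable[str]) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(links_a), set(links_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty link sets are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Links found in the "same" block of two versions of a page (invented data).
block_v1 = ["/home", "/news", "/about"]
block_v2 = ["/home", "/news", "/contact"]

distance = jaccard_distance(block_v1, block_v2)
print(distance)
```

A distance of 0 means both blocks contain exactly the same links, and a distance of 1 means they share none.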
 

In this post, I will try to explain what web page segmentation does specifically for Pagelyzer: it provides information about the web page content.

Web page Segmentation Algorithm

We present here the details of the Block-o-Matic web page segmentation algorithm used by Pagelyzer to perform the segmentation. It is a hybrid between the visual-based approach and the document processing approach.

The segmentation process is divided into three phases: analysis, understanding and reconstruction. It comprises three tasks: filter, mapping and combine. It produces three structures: the DOM structure, the content structure and the logic structure. The main point of the whole process is to produce these structures, where the logic structure represents the final segmentation of the web page.

The DOM tree is obtained from the rendering of a web browser. The result of the analysis phase is the content structure (Wcont), built from the DOM tree with the d2c algorithm. Mapping the content structure into a logic structure (Wlog) is called document understanding. This mapping is performed by the c2l algorithm with a granularity parameter pG. Web page reconstruction gathers the three structures (Rec function):

 

W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG)).
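Read as function composition, the granularity parameter pG is passed to c2l, which works on the output of d2c. The sketch below shows only this data flow. The three stages are passed in as stand-in callables, because the real d2c, c2l and Rec algorithms are not reproduced here, and treating pG as a number is an assumption.

```python
from typing import Any, Callable


def segment(dom: Any,
            d2c: Callable[[Any], Any],
            c2l: Callable[[Any, float], Any],
            rec: Callable[[Any, Any, Any], Any],
            pG: float) -> Any:
    """Chain the three phases: analysis, understanding, reconstruction."""
    w_cont = d2c(dom)               # analysis: DOM -> content structure
    w_log = c2l(w_cont, pG)         # understanding: content -> logic structure
    return rec(dom, w_cont, w_log)  # reconstruction gathers all three structures


# Stand-in stages that only record what they were called with.
result = segment(
    "DOM",
    d2c=lambda dom: f"d2c({dom})",
    c2l=lambda cont, pg: f"c2l({cont}, {pg})",
    rec=lambda dom, cont, log: (dom, cont, log),
    pG=0.5,
)
print(result)
```

Passing the stages as parameters keeps the sketch honest about what it does not know: only the order of the calls and the place where pG enters come from the description above.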

 

To integrate the segmentation outcome into Pagelyzer, an XML representation called ViDIFF is used. It represents hierarchically the blocks, their geometric properties, and the links and text in each block.

Implementation

Block-o-matic algorithm is available:

References

Structural and Visual Comparisons for Web Page Archiving
M. T. Law, N. Thome, S. Gançarski, M. Cord
12th edition of the ACM Symposium on Document Engineering (DocEng) 2012
 
Structural and Visual Similarity Learning for Web Page Archiving
M. T. Law, C. Sureda Gutierrez, N. Thome, S. Gançarski, M. Cord
10th workshop on Content-Based Multimedia Indexing (CBMI) 2012
 
Block-o-Matic: A Web Page Segmentation Framework
A. Sanoja and S. Gançarski. 
Paper accepted for oral presentation at the International Conference on Multimedia Computing and Systems (ICMCS'14). Morocco, April 2014.
 
Block-o-Matic: a Web Page Segmentation Tool and its Evaluation
Sanoja A., Gançarski S.
BDA. Nantes, France. 2013. http://hal.archives-ouvertes.fr/hal-00881693/
 
Yet another Web Page Segmentation Tool
Sanoja A., Gançarski S.
Proceedings iPRES 2012. Toronto. Canada, 2012
 
Understanding Web Pages Changes.
Pehlivan Z., Saad M.B. , Gançarski S.
International Conference on Database and Expert Systems Applications DEXA (1) 2010: 1-15
