SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine

The Web evolves constantly: content such as text and images is updated frequently. One of the major problems archiving systems face is understanding what changed between two versions of a web page.

We want to underline that the aim is not to compare two different web pages (although the tool can also do that), but two versions of the same web page.

An efficient change detection approach is important for several reasons:

Crawler optimization
Discovering new crawl strategies, e.g. based on patterns
Quality assurance for crawlers, for example by comparing the live version of a page with the one just crawled
Detecting format obsolescence as technologies evolve: is a web page rendered visually identically by different versions of a browser, or by different browsers?
Archive maintenance: operations such as format migration can change how archived versions are rendered

Pagelyzer is a tool containing a supervised framework that decides whether two web page versions are similar or not. It takes as input two URLs, two browser types (e.g. Firefox, Chrome) and one comparison type (image-based, hybrid or content-based). If the browser types are not set, it uses Firefox by default.

It is based on two different technologies:

1 – Web page segmentation (let's keep the details for another blog post)

2 – Supervised learning with a Support Vector Machine (SVM).

In this post, I will try to explain simply (without any equations) what the SVM does specifically for Pagelyzer. You have two URLs, say url1 and url2, and you would like to know whether they are similar (1) or dissimilar (0).

You calculate the similarity (or distance) between the two versions as a vector whose content depends on the comparison type. If it is image-based, your vector will contain features related to image similarity (e.g. SIFT, HSV). If it is content-based, your vector will contain features for text similarity (e.g. Jaccard distance for links, images and words). To better explain how it works, let's assume that we have only two dimensions: SIFT similarity and HSV similarity.
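As a rough illustration of what a content-based vector could look like, here is a minimal sketch that computes one Jaccard distance each for links, images and words. The function names, the dictionary layout and the sample data are all made up for this example; this is not Pagelyzer's actual code or feature set.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


def content_feature_vector(version1, version2):
    """Build a 3-dimensional vector: one Jaccard distance per feature."""
    return [
        jaccard_distance(version1[key], version2[key])
        for key in ("links", "images", "words")
    ]


# Hypothetical, hand-made page versions (not real crawl data).
v1 = {
    "links": ["/home", "/news", "/about"],
    "images": ["logo.png", "banner.jpg"],
    "words": "the web is constantly evolving".split(),
}
v2 = {
    "links": ["/home", "/news", "/contact"],
    "images": ["logo.png", "banner.jpg"],
    "words": "the web is always evolving".split(),
}

vector = content_feature_vector(v1, v2)
print(vector)
```

Each component is 0 when the two versions share exactly the same items and approaches 1 as they share fewer and fewer of them.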

To make your system learn, you first need to give it annotated data. In our case, we need a list of URL pairs <url1,url2> manually annotated as similar or not similar. For Pagelyzer, this dataset is provided by the Internet Memory Foundation (IMF). You train your system with part of this dataset.

Let's start training:

First, you put all your vectors in the input space. As this data is annotated, you know which ones are similar (in green) and which ones are dissimilar (in red).

You find the optimal decision boundary (hyperplane) in the input space, that is, the one that separates the two classes with the largest margin. Anything above the decision boundary gets label 1 (similar); anything below it gets label 0 (dissimilar).
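For readers who like to see the training step in code, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss, in plain Python. It is a simplified stand-in for a real SVM solver, not what Pagelyzer runs; the toy points (SIFT similarity, HSV similarity) and the hyperparameters lam, epochs and lr are invented for illustration.

```python
def train_linear_svm(points, labels, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b on the regularised hinge loss.

    Labels follow the article's convention: 1 = similar, 0 = dissimilar.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(points, labels):
            y = 1.0 if label == 1 else -1.0  # SVMs work with +1 / -1 internally
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term always shrinks w a little.
            w = [wi - lr * lam * wi for wi in w]
            # Points inside the margin (or misclassified) push the boundary away.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(x, w, b):
    """Return 1 (similar) on the positive side of the boundary, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0


# Toy annotated pairs: (SIFT similarity, HSV similarity). Made-up values.
points = [
    (0.90, 0.80), (0.80, 0.90), (0.70, 0.85), (0.95, 0.70),  # similar (green)
    (0.20, 0.30), (0.30, 0.10), (0.10, 0.25), (0.35, 0.30),  # dissimilar (red)
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_linear_svm(points, labels)
print("weights:", w, "bias:", b)
print("training predictions:", [predict(x, w, b) for x in points])
```

The learned w and b define the line w[0] * SIFT + w[1] * HSV + b = 0, which plays the role of the decision boundary in the two-dimensional picture.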

Let's classify:

Your system is intelligent now! When you have a new pair of URLs without any annotation, you can say whether they are similar or not based on the decision boundary.

The pair of URLs in blue will be considered dissimilar and the one in black will be considered similar by Pagelyzer.
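In code, classifying a new pair boils down to checking on which side of the boundary its vector falls. The sketch below assumes a hypothetical boundary, SIFT + HSV = 1, and two invented points standing in for the blue and black pairs; the real boundary comes out of training.

```python
def classify(vector, weights, bias):
    """Return 1 (similar) if the vector lies on the positive side, else 0."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1 if score >= 0 else 0


# Hypothetical decision boundary: 1.0 * SIFT + 1.0 * HSV - 1.0 = 0
weights = [1.0, 1.0]
bias = -1.0

# Invented (SIFT similarity, HSV similarity) vectors for two new URL pairs.
blue_pair = [0.30, 0.40]
black_pair = [0.80, 0.70]

print(classify(blue_pair, weights, bias))   # 0 -> dissimilar
print(classify(black_pair, weights, bias))  # 1 -> similar
```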

When you choose a different comparison type, you choose different kinds of similarities and therefore different dimensions. The current version of Pagelyzer uses an SVM trained on 202 pairs of web pages provided by IMF, of which 147 are in the positive class and 55 in the negative class. As it is a supervised system, a larger training set generally tends to give better results, provided the new annotations are reliable and representative.

An image to show what happens when you have more than two dimensions:

From www.epicentersoftware.com

