Web-NER (named entity recognition on the web) aims to extract entities of interest from web pages. The scale, lack of structure, and diversity of the web make NER on web pages challenging. Traditionally, rule-based techniques such as wrapper induction systems have been used for this task, but these techniques are site-specific and not robust. We intend to use statistical learning-based approaches instead.
The rich HTML structure that encloses web content provides strong visual and spatial cues in addition to textual information. Further, entities on web pages often stand in spatial relationships to one another.
For instance, on web pages describing products, the product titles are almost always found above the product images. A web page is a 2D layout of irregularly placed blocks of varying sizes, and capturing contextual interactions (spatial dependencies) between blocks on such a layout is a challenging task.
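To make the notion of a spatial relationship between blocks concrete, here is a minimal sketch, not the framework described in this project. It assumes each block is an axis-aligned rectangle in rendered pixel coordinates with the origin at the top-left corner; the `Block` class, the `is_above` predicate, and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered page block (hypothetical representation).

    (x, y) is the top-left corner in pixels; y grows downward.
    """
    label: str
    x: float
    y: float
    width: float
    height: float


def horizontal_overlap(a: Block, b: Block) -> float:
    """Width in pixels of the horizontal span shared by both blocks."""
    left = max(a.x, b.x)
    right = min(a.x + a.width, b.x + b.width)
    return max(0.0, right - left)


def is_above(a: Block, b: Block) -> bool:
    """True if `a` ends at or before the top of `b` and they share some horizontal span."""
    return a.y + a.height <= b.y and horizontal_overlap(a, b) > 0


# Illustrative coordinates only; not taken from any real page.
title = Block("title", x=100, y=40, width=400, height=30)
image = Block("image", x=100, y=90, width=300, height=300)
sidebar = Block("sidebar", x=600, y=40, width=200, height=500)

print(is_above(title, image))    # title sits directly over the image
print(is_above(sidebar, image))  # sidebar is beside the image, not above it
```

A pairwise predicate like this only captures one relation between two blocks; modeling dependencies across the whole irregular layout is the harder problem described above.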
In this project, our aim is to build a framework that assists in entity extraction from web pages by exploiting textual, visual, and spatial properties. We concentrate mostly on entities composed of several sub-entities that are dispersed across a web page. In our initial attempts, we have used conditional random fields (CRFs) and support vector machines (SVMs) with simple textual, spatial, and visual features. In our experiments, SVMs performed better than CRFs.
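As an illustration of what "simple textual, spatial and visual features" could look like, the sketch below builds one numeric feature vector per block. The specific features (token count, digit fraction, font size, boldness, normalized position) and the function name `block_features` are assumptions chosen for the example, not the feature set used in our experiments.

```python
def block_features(text, font_size, is_bold, x, y, page_width, page_height):
    """Return a flat feature vector for one block (hypothetical feature set).

    Textual: token count, fraction of characters that are digits.
    Visual:  font size, bold flag.
    Spatial: top-left corner normalized by page dimensions.
    """
    tokens = text.split()
    n_chars = len(text)
    digit_fraction = sum(c.isdigit() for c in text) / n_chars if n_chars else 0.0
    return [
        float(len(tokens)),       # textual
        digit_fraction,           # textual
        float(font_size),         # visual
        1.0 if is_bold else 0.0,  # visual
        x / page_width,           # spatial
        y / page_height,          # spatial
    ]


# Illustrative values only.
features = block_features(
    "Acme Camera X100", font_size=24, is_bold=True,
    x=100, y=40, page_width=1000, page_height=2000,
)
print(features)
```

Vectors of this kind could be passed to an SVM one block at a time, or used as per-node observations in a CRF, which additionally models dependencies between neighboring blocks.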