Leonard, T. and Glaser, H. (2001) Large scale acquisition and maintenance from the web without source access. In: Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001. pp. 97-101.
Abstract
Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically.
Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback.
The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation.
In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
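The record-and-replay idea described in the abstract can be illustrated with a minimal sketch. It is not Dome's implementation or file format: the operation names (`select`, `extract`), the `RECORDED_STEPS` structure, the sample pages, and the functions `replay` and `replay_batch` are all hypothetical, and the pages are assumed to be well-formed XML so that Python's standard `xml.etree.ElementTree` can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical recorded sequence: each step is (operation, argument).
# The operation names are illustrative only, not Dome's actual format.
RECORDED_STEPS = [
    ("select", ".//table[@id='staff']/tr"),            # pick the repeating rows
    ("extract", {"name": "td[1]", "phone": "td[2]"}),  # pull fields from each row
]

# Two sample pages from the same (imaginary) site, sharing one layout.
PAGE_A = (
    "<html><body><table id='staff'>"
    "<tr><td>Ada</td><td>1234</td></tr>"
    "<tr><td>Bob</td><td>5678</td></tr>"
    "</table></body></html>"
)
PAGE_B = (
    "<html><body><table id='staff'>"
    "<tr><td>Cy</td><td>9999</td></tr>"
    "</table></body></html>"
)


def replay(steps, page_source):
    """Replay a recorded sequence against one well-formed XML/XHTML page."""
    root = ET.fromstring(page_source)
    nodes = [root]   # current selection, starting at the document root
    records = []
    for op, arg in steps:
        if op == "select":
            # Narrow the selection using ElementTree's limited XPath subset.
            nodes = [match for node in nodes for match in node.findall(arg)]
        elif op == "extract":
            # Build one record per selected node; missing fields become "".
            for node in nodes:
                records.append(
                    {field: (node.findtext(path) or "").strip()
                     for field, path in arg.items()}
                )
        else:
            raise ValueError(f"unknown operation: {op}")
    return records


def replay_batch(steps, pages):
    """Apply the same recorded sequence to a collection of similar pages."""
    return {name: replay(steps, source) for name, source in pages.items()}


if __name__ == "__main__":
    print(replay_batch(RECORDED_STEPS, {"a": PAGE_A, "b": PAGE_B}))
```

Because the steps are stored as data rather than as one-off edits, the same sequence can be rerun when a page changes or applied to every page that shares the layout, which is the property the abstract relies on.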
| Item Type: | Conference or Workshop Item |
|---|---|
| Research Group: | Current ECS Groups > IT Innovation Centre; Old ECS Groups > Dependable Systems and Software Engineering Research Group; Current ECS Groups > Web and Internet Science; Old ECS Groups > Intelligence, Agents, Multimedia |
| Date: | October 2001 |
| Performance Indicator: | EZ~02~02~05 |
| Citations: | Google Scholar: 31 |
| Downloads (2010): | 7 |
| ID Code: | 6185 |
| Last Modified: | 23 Sep 2011 10:28 |
| Deposited On: | 17 Dec 2001 by Leonard, Thomas |