Large scale acquisition and maintenance from the web without source access

Leonard, T. and Glaser, H. (2001) Large scale acquisition and maintenance from the web without source access. In: Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001. pp. 97-101.

Abstract

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically.

Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback.

The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation.

In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
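
The record-and-replay idea described above can be illustrated with a minimal sketch. This is not Dome's implementation or interface: the `Recorder` class, the `delete` and `rename` operations, the `replay` function, and the sample pages are all hypothetical, chosen only to show how a logged sequence of tree edits might be reapplied to other pages with the same layout. It assumes the pages are well-formed XML and uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def delete(root, path):
    """Remove every element matching `path` from the tree."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for node in root.findall(path):
        parents[node].remove(node)


def rename(root, path, new_tag):
    """Change the tag of every element matching `path`."""
    for node in root.findall(path):
        node.tag = new_tag


# Hypothetical operation table; the names are illustrative only.
OPS = {"delete": delete, "rename": rename}


class Recorder:
    """Applies edits to one tree and logs them for later playback."""

    def __init__(self, root):
        self.root = root
        self.log = []

    def apply(self, op, *args):
        OPS[op](self.root, *args)
        self.log.append((op, args))


def replay(log, root):
    """Reapply a recorded sequence of edits to another tree."""
    for op, args in log:
        OPS[op](root, *args)
    return root


# Record the edits once, on a sample page.
page_a = ET.fromstring(
    "<html><body><div class='nav'>menu</div>"
    "<h1>Alice</h1><p>Room 1</p></body></html>"
)
rec = Recorder(page_a)
rec.apply("delete", ".//div[@class='nav']")
rec.apply("rename", ".//h1", "name")
rec.apply("rename", ".//p", "room")

# Batch-process other pages that share the same layout.
pages = [
    ET.fromstring(
        "<html><body><div class='nav'>menu</div>"
        "<h1>Bob</h1><p>Room 2</p></body></html>"
    )
]
results = [ET.tostring(replay(rec.log, p), encoding="unicode") for p in pages]
for xml_text in results:
    print(xml_text)
```

Storing each step as an operation name plus arguments, rather than as the edited tree, is what allows the same log to be rerun when a source page changes or applied to a whole collection of similar pages. The sketch addresses elements by path, which is only one possible choice and would break if a site changed its layout.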

Item Type: Conference or Workshop Item
Creator/Authors: Thomas Leonard, Hugh Glaser
Editors: Siegfried Handschuh, Rose Dieng-Kuntz, Steffen Staab
Research Group:
  Current ECS Groups > IT Innovation Centre
  Old ECS Groups > Dependable Systems and Software Engineering Research Group
  Current ECS Groups > Web and Internet Science
  Old ECS Groups > Intelligence, Agents, Multimedia
Date: October 2001
Information about this record:
  Performance Indicator: EZ~02~02~05
  Citations (Google Scholar): 31
  Downloads (2010): 7
  ID Code: 6185
  Last Modified: 23 Sep 2011 10:28
  Deposited On: 17 Dec 2001 by Leonard, Thomas
