USENIX Technical Program - Abstract - USENIX 99
Lightweight Structured Text Processing
Robert C. Miller and Brad A. Myers, Carnegie Mellon University
Abstract
Text is a popular storage and distribution format for information,
partly due to generic text-processing tools like Unix grep
and sort. Unfortunately, existing generic tools make
assumptions about text format (e.g., each line is a record) that limit
their applicability. Custom-built tools are one alternative, but they
require substantial time investment and programming expertise. We
describe a new approach, lightweight structured text processing,
which overcomes these difficulties by enabling users to define text
structure interactively and manipulate the structure with generic
tools. Our prototype system, LAPIS, is a web browser that can
highlight, filter, and sort text regions described by the user. LAPIS
has several advantages over other systems: (1) the ability to define
custom structure with a simple, intuitive pattern language; (2)
interactive specification, showing pattern matches in context and
letting users choose the most convenient combination of manual
selection and pattern matching; and (3) external parsers for
standard text formats. The pattern language in LAPIS, text
constraints, describes text structure in high-level terms, with
region relationships like before, after, in, and
contains. We describe an implementation of text constraints
using a novel, compact representation of region sets as collections of
rectangles, or region intervals. We also illustrate some
examples of applying LAPIS to web pages, text files, and source code.
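
As a rough illustration of the region relationships named above, the following Python sketch models a region as a pair of character offsets and defines before, after, in, and contains as simple predicates. This is not the LAPIS implementation or its pattern syntax: the names Region, inside, and select are hypothetical, the precise semantics of each relation in LAPIS may differ from the ones assumed here, and the sketch uses plain lists of regions rather than the compact region-interval representation the paper describes.

```python
import re
from typing import Callable, List, NamedTuple


class Region(NamedTuple):
    """A contiguous span of text (hypothetical type, not from LAPIS)."""
    start: int  # offset of the first character
    end: int    # offset just past the last character


# Assumed semantics for the four relations; LAPIS may define them differently.
def before(a: Region, b: Region) -> bool:
    return a.end <= b.start


def after(a: Region, b: Region) -> bool:
    return a.start >= b.end


def inside(a: Region, b: Region) -> bool:
    # "in" is a Python keyword, so the relation is named inside here.
    return b.start <= a.start and a.end <= b.end


def contains(a: Region, b: Region) -> bool:
    return inside(b, a)


def select(candidates: List[Region],
           relation: Callable[[Region, Region], bool],
           others: List[Region]) -> List[Region]:
    """Keep each candidate that stands in `relation` to at least one region in `others`."""
    return [r for r in candidates if any(relation(r, o) for o in others)]


def find(pattern: str, text: str) -> List[Region]:
    """Build a region set from regular-expression matches."""
    return [Region(m.start(), m.end()) for m in re.finditer(pattern, text)]


text = "alpha beta\ngamma delta\n"
lines = find(r"[^\n]+", text)  # one region per line
words = find(r"\w+", text)     # one region per word

# Words in the second line, and the line that contains the word "beta".
words_in_line2 = select(words, inside, [lines[1]])
line_with_beta = select(lines, contains, [words[1]])
print([text[r.start:r.end] for r in words_in_line2])
print([text[r.start:r.end] for r in line_with_beta])
```

Because the structure here (lines, words) is defined by the caller rather than assumed by the tool, the same select function works whether a record is a line, a paragraph, or a function body, which is the limitation of line-oriented tools that the abstract points to.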