

Text Constraints

  Text constraints (TC) is a language for specifying text structure in terms of relationships among regions (substrings of the text). TC describes a substring by its start offset and end offset. Formally, a region is an interval [b,e] of inter-character positions in a string, where 0 <= b <= e <= n and n is the length of the string. A region [b,e] identifies the substring that starts at the bth cursor position (just before the bth character of the string, counting from 0) and ends at the eth cursor position (just before the eth character, or at the end of the string if e = n). Thus the length of a region is e-b, so a region with b = e has length zero. TC is essentially an algebra over sets of regions: operators take region sets as arguments and generate a region set as the result. TC permits an expression to match an arbitrary set of regions, unlike other structured text query languages, which constrain region sets to certain types: nonoverlapping (regular expressions), nonnesting (GC-lists [5]), or hierarchical (Proximal Nodes [19]).
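
The region definition can be illustrated with a small sketch. It is written in Python purely for illustration and is not taken from the TC implementation; the names Region, make_region, substring, length, overlaps and contains are hypothetical. The overlaps and contains predicates encode one plausible reading of "overlapping" and "nesting", which this section does not define formally.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """An interval [b, e] of inter-character (cursor) positions."""
    b: int  # start cursor position, just before character b (0-based)
    e: int  # end cursor position, just before character e, or end of string


def make_region(b: int, e: int, n: int) -> Region:
    """Build a region, enforcing 0 <= b <= e <= n for a string of length n."""
    if not (0 <= b <= e <= n):
        raise ValueError(f"invalid region [{b},{e}] for string of length {n}")
    return Region(b, e)


def substring(text: str, r: Region) -> str:
    """The substring a region identifies; Python slices use the same offsets."""
    return text[r.b:r.e]


def length(r: Region) -> int:
    """The length of a region is e - b."""
    return r.e - r.b


def overlaps(r: Region, s: Region) -> bool:
    """Assumed reading: the two regions share at least one character."""
    return r.b < s.e and s.b < r.e


def contains(r: Region, s: Region) -> bool:
    """Assumed reading of nesting: s lies entirely within r."""
    return r.b <= s.b and s.e <= r.e


text = "hello world"
n = len(text)

# A region set is modeled here as a plain Python set, so it may hold
# regions that overlap or nest, with no restriction on their arrangement.
region_set = {
    make_region(0, 5, n),    # "hello"
    make_region(3, 8, n),    # "lo wo", overlaps the region above
    make_region(0, 11, n),   # whole string, contains both regions above
    make_region(11, 11, n),  # zero-length region at the end of the string
}

for r in sorted(region_set):
    print(r, repr(substring(text, r)), length(r))
```

Modeling a region set as an unrestricted set of (b, e) pairs mirrors the point that TC places no type constraint on the sets an expression may match; how TC actually stores region sets is not described in this section.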





Robert C. Miller and Brad A. Myers
Mon Apr 26 11:34:19 EDT 1999