
Primitives

TC has three primitive expressions: literals, regular expressions, and identifiers. A literal is a string enclosed in single or double quotes, and it matches every occurrence of that string in the document. Thus "Gettysburg" finds all regions exactly matching the literal characters “Gettysburg”. The literal matcher can generate overlapping regions, so matching "aa" against the string “aaaaa” yields four regions, one starting at each of the first four characters.
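
The overlapping behavior of literal matching can be illustrated with a minimal Python sketch. This is not the LAPIS implementation; the function name find_literal_regions and the representation of a region as a (start, end) pair of character offsets are assumptions made for illustration only.

```python
def find_literal_regions(text, literal):
    """Return a (start, end) pair for every occurrence of literal in text.

    Occurrences may overlap: after each hit, the search resumes one
    character past the start of that hit rather than past its end.
    (Hypothetical helper; regions are half-open character offsets.)
    """
    if not literal:
        raise ValueError("literal must be non-empty")
    regions = []
    start = text.find(literal)
    while start != -1:
        regions.append((start, start + len(literal)))
        start = text.find(literal, start + 1)
    return regions


print(find_literal_regions("aaaaa", "aa"))
# [(0, 2), (1, 3), (2, 4), (3, 5)]
```

Resuming the search at start + 1 is what produces the four regions for "aa" in the five-character string; resuming at the end of each hit would give only two.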

A regular expression is indicated by /regexp/. Our regular expression matcher is based on the OROMatcher library for Java [20]. The library follows Perl 5 syntax and semantics [27], returning a set of nonoverlapping regions that are as long as possible.
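
For contrast with literal matching, the nonoverlapping behavior of regular expression matching can be sketched with Python's standard re module. Python's re is used here only as a stand-in for OROMatcher: its syntax is close to Perl 5 but not identical, and the helper name find_regex_regions is an assumption for illustration.

```python
import re


def find_regex_regions(text, pattern):
    """Return (start, end) pairs for the nonoverlapping matches of pattern.

    re.finditer scans left to right and resumes after the end of each
    match, so no two returned regions overlap.
    (Hypothetical helper; Python's re stands in for OROMatcher.)
    """
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


print(find_regex_regions("aaaaa", r"aa"))  # [(0, 2), (2, 4)]
print(find_regex_regions("aaaaa", r"a+"))  # [(0, 5)]
```

The pattern /aa/ yields two regions where the literal "aa" would yield four, and the greedy quantifier in /a+/ extends the single match across the whole string.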

An identifier is any whitespace-delimited token (except for words and punctuation reserved by TC operators). Identifiers refer to the named region sets generated by parsers. For example, after the HTML parser has run, Tag refers to the set of all HTML tags in the document. The LAPIS prototype provides only a single namespace, so the names generated by different parsers must be distinct. A future version of LAPIS is expected to support multiple independent namespaces.
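
The single-namespace design can be sketched as one shared table from identifier names to region sets. Everything in this sketch is hypothetical: the Namespace class, its define and lookup methods, and the toy_html_parser function (a crude regular expression, not a real HTML parser) are illustrative assumptions rather than LAPIS's actual interfaces.

```python
import re


class Namespace:
    """A single flat table mapping identifier names to region sets.

    (Hypothetical class; LAPIS's real data structures are not shown
    in the text.)
    """

    def __init__(self):
        self._region_sets = {}

    def define(self, name, regions):
        # With only one namespace, two parsers may not claim the same name.
        if name in self._region_sets:
            raise KeyError(f"identifier {name!r} is already defined")
        self._region_sets[name] = list(regions)

    def lookup(self, name):
        return self._region_sets[name]


def toy_html_parser(text, namespace):
    """Register every <...> span under the name Tag (crude stand-in parser)."""
    tags = [(m.start(), m.end()) for m in re.finditer(r"<[^>]+>", text)]
    namespace.define("Tag", tags)


ns = Namespace()
toy_html_parser("<b>bold</b> text", ns)
print(ns.lookup("Tag"))  # [(0, 3), (7, 11)]
```

Rejecting a duplicate name in define makes the uniqueness requirement explicit: a second parser that tried to register Tag would fail instead of silently replacing the first parser's regions.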



Robert C. Miller and Brad A. Myers
Mon Apr 26 11:34:19 EDT 1999