Region Interval Representation

Next: Region Space Up: Implementation Previous: Implementation

Region Interval Representation

The key ingredient to an implementation of text constraints is choice of representation: how shall region sets be represented? One alternative is a bitvector, with one bit for each possible region in lexicographic order. With a bitvector representation, every region set requires O(n²) space, where n is the length of the document. Considering that the region sets generated by matchers and parsers typically have only O(n) elements, the bitvector representation wastes space. Another alternative represents a region set as a list of explicit pairs [b,e], which is more appropriate for sparse sets. Unfortunately the region sets generated by relational operators are not sparse. To choose a pathological example, after [0,0] matches every region in the document. In general, for any region relation op and region set S, the set matching op S may have O(n²) elements.

Other systems have dealt with this problem by restricting region sets to nested sets [19] or overlapped sets [5], sacrificing expressiveness for linear storage and processing. Instead of restricting region sets, we compress dense region sets with a representation called region intervals. A region interval is a quadruple [b,c;d,e], representing the set of all regions [x,y] such that b <= x <= c and d <= y <= e. Essentially, a region interval is a set of regions whose starts and ends are given by intervals, rather than points. A region interval is depicted by extending the region notation for regions (|--|), replacing the vertical lines denoting the region's endpoints with boxes denoting intervals.

A few facts about region intervals follow immediately from the definition:

The set of all regions in a string of length n can be represented by the region interval [0,n;0,n].
The singleton region set {[b,e]} is represented by the region interval [b,b;e,e].
A region interval represents the empty set if b > c or d > e or b > e.
A region interval [b₁,c₁;d₁,e₁] is a subset of another region interval [b₂,c₂;d₂,e₂] if and only if b₂ <= b₁ <= c₁ <= c₂ and d₂ <= d₁ <= e₁ <= e₂.
The intersection of two intervals [b₁,c₁;d₁,e₁] and [b₂,c₂;d₂,e₂] is [max(b₁,b₂), min(c₁,c₂); max(d₁,d₂), min(e₁,e₂)] which may of course be the empty set.

Region intervals are particularly useful for representing the result of applying a region relation operator. Given any region X and a region relation op, the set of regions which are related to X by op can be represented by exactly one region interval, as shown in Figure 4.

Figure 4: Region intervals corresponding to the relational operators.

By extension, if a region relation operator is applied to a region set with m elements, then the result can be represented with m region intervals (possibly fewer, since some of the region intervals may be redundant).

This result extends to region intervals as well: applying a region relation operator to a region interval yields exactly one region interval. For example, the result of before [b,c;d,e] is the set of all regions which lie before some region in [b,c;d,e]. Assuming the region interval is nonempty, every region ending at or before c qualifies, so the result of this operator can be described by the region interval [0,c;0,c]. We can represent an arbitrary region set by a union of region intervals, which is simply a collection of tuples. The size of this collection may still be O(n²) in the worst case (consider, for example, the set of all regions [b,e] of even length, which must be represented by O(n²) singleton region intervals), but most collections are O(n) in practice.

To summarize, the union-of-region-intervals representation enables a straightforward implementation of the text constraints language:

Primitives: convert the set of regions generated by a literal, regular expression, or parser into a union of region intervals by replacing each region [b,e] with the singleton region interval [b,b;e,e].
Relational operators: for each region interval [b,c;d,e], compute a new region interval op [b,c;d,e].
Union: merge the two collections of region intervals, eliminating any region interval that is a subset of another.
Intersection: intersect every possible pair of region intervals (one from each collection) and collect the results.

Next: Region Space Up: Implementation Previous: Implementation

Robert C. Miller and Brad A. Myers
Mon Apr 26 11:34:19 EDT 1999