USENIX Technical Program - Paper - Proceedings of the Sixth Annual Tcl/Tk Workshop, 1998
DataMynah : A pseudo-NL interface to large multivariate datasets
In a paper presented at Tcl97 [Clarke] I discussed a "knowledge-based UI builder" called Dashboard. In this poster I present another in the family of applications built on a common database: DataMynah.
DataMynah uses the "data dictionary" or "Memes" portion of the knowledge base to initialize its internal context. After this initialization, the user can interact with DataMynah (DM) using English-like commands to retrieve data from tables whose definitions are known to the knowledge base.
The advantage of the Memes information infrastructure over a generic (data type and column name) database UI is that Mynah can define and explain items and show the relationships between them. Users can "feel" their way into the data by question and answer rather than having to know the schema and nomenclature precisely in advance.
Mynah is not a true NL tool. It contains no generative grammar rules or elaborate parsers. The parser uses a brute-force substitution pass (similar to sendmail rules) followed by an "inspired guesswork" pass. In other words, it "understands" English sentences like a very naive non-native speaker, by grasping the meaning of some key words and phrases and trying to interpret the rest of the input in light of those known words.
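The two-pass idea can be illustrated with a minimal sketch. It is written in Python rather than the Tcl that DataMynah itself uses, and it is not Mynah's actual parser: the rewrite rules, the verb and table vocabularies, and the function names `substitute`, `guess` and `parse` are all hypothetical, chosen only to show an ordered substitution pass followed by a keyword-spotting pass.

```python
import re

# Pass 1 rules (hypothetical): ordered pattern -> replacement pairs,
# applied one after another in the spirit of sendmail rewriting rules.
REWRITE_RULES = [
    (r"\bplease\b", ""),
    (r"\b(show me|give me|fetch|retrieve)\b", "get"),
    (r"\b(what is|what's|explain)\b", "define"),
    (r"\bthru\b", "through"),
]

# Pass 2 vocabulary (hypothetical): words the parser "knows".
KNOWN_VERBS = {"get", "define", "lookup", "chart"}
KNOWN_TABLES = {"hitmplog"}


def substitute(text):
    """Pass 1: lowercase the input and apply every rewrite rule in order."""
    text = text.lower()
    for pattern, replacement in REWRITE_RULES:
        text = re.sub(pattern, replacement, text)
    # Collapse any whitespace left behind by deleted words.
    return " ".join(text.split())


def guess(text):
    """Pass 2: pick out known words; keep the rest as unexplained hints."""
    words = re.findall(r"[a-z0-9_]+", text)
    verb = next((w for w in words if w in KNOWN_VERBS), None)
    tables = [w for w in words if w in KNOWN_TABLES]
    rest = [w for w in words if w != verb and w not in KNOWN_TABLES]
    return {"verb": verb, "tables": tables, "rest": rest}


def parse(sentence):
    """Run both passes on one line of user input."""
    return guess(substitute(sentence))
```

Under these assumed rules, `parse("Please show me the hitmplog data")` reduces the input to the verb `get` and the table `hitmplog`, leaving `the` and `data` as unrecognized words for later interpretation.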
Aside from the pseudo-NL interface, Mynah offers a hypertext-like results/session-log window in which text can be interactive. It is specifically optimized for time-series data, because the target real-world application is the analysis of large amounts of telemetry by hardware and software engineers. Hundreds of telemetry points are logged every N minutes or seconds; engineers review these during post mortems or other inquiries into instrument and telescope performance, and want to determine whether various states of the system correlate with its failure modes. DataMynah makes it easy to extract data in standard formats for input into other tools (and can launch xgobi [XGobi] directly).
After a shallow investigation of tools like Bayesian correlation engines [AutoClass-C], I decided that for simple time-series information the human visual cortex was the optimal processing hardware, and that stacked stripcharts were the optimal representation. BLT was used to make the "stack-o-strips" windows which show selected telemetry points to the user on a consistent time axis (see Figure 1). Each log table may contain only part of the telemetry, but plots can be cut and pasted from one stack into another to compare data from different sets.
This product is still in beta. Initial user reactions at a pre-release demo were guardedly positive. User requests were mostly for customization and short-cut features, and for more flexibility in the parser (to improve the Turing illusion level).
The end product, if successful, will combine a documentary function (answering questions about the meaning and function of keywords and data tables) with an analysis function (access to rich telemetry). It could be applied to databases other than our own, and foreign language support would not be difficult due to the simplicity (stupidity?) of the parsing method. The combined NL/GUI approach is, I think, better suited to a complicated RDBMS UI than a strictly graphical one, which tends to become cluttered and overcomplex as the datasets increase in size and depth.
DataMynah is a pure Tcl/Tk application, though it can launch visualization applications written in other languages. It was written using Tcl, Tk, TclX [TclX], BLT [BLT], TkTable [TkTable] and SybTcl [SybTcl].
Figure 1: Time Series Data Displayed by DataMynah
The data in Figure 1 are telemetry data from 1995, ingested into the online RDBMS during alpha test of DM. In this stack of time-series plots, the coincident spikes in several of the telemetry sources jump right out to the human eye. They look very much like noise bursts, and at first we thought we were seeing electrical switching noise. As it turns out, all the signals displaying the coincident spikes originate from the same ADC. The ADC is defective. This problem was not previously noticed because, despite the wealth of telemetry available, our methods of reducing and analyzing it were too laborious to be worthwhile except for postmortems of catastrophic failures.
The purpose of DataMynah is to make monitoring easy and simple, so that failures of this kind are detected early. The series of user commands which resulted in this plot was approximately "I'm interested in temperature data for HIRES", "when I say ht I mean hitmplog", "lookup ht", "get the ht from Dec 18 through Dec 26 1995", "chart d1". DataMynah has many other features which cannot be listed here; sample sessions and output will be available at the poster.
"DataMynah" is copyright the Regents of the University of California (1998).
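The user-defined shorthand in that session ("when I say ht I mean hitmplog") can be sketched as follows. This is an illustration in Python, not DataMynah's Tcl code, and it rests on assumptions: the `Session` class, its `handle` and `expand` methods, the `ALIAS_PATTERN` regular expression, and the "ok:" acknowledgement text are all hypothetical. How Mynah actually stores and applies such definitions is not described here.

```python
import re

# Hypothetical pattern for an alias definition such as
# "when I say ht I mean hitmplog" (a trailing comma or period is tolerated).
ALIAS_PATTERN = re.compile(r"^when i say (\w+),? i mean (\w+)[.,]?$", re.IGNORECASE)


class Session:
    """Holds the aliases a user has defined during one interactive session."""

    def __init__(self):
        self.aliases = {}

    def handle(self, line):
        """Record an alias definition, or expand aliases in any other command."""
        match = ALIAS_PATTERN.match(line.strip())
        if match:
            alias = match.group(1).lower()
            target = match.group(2).lower()
            self.aliases[alias] = target
            return f"ok: {alias} -> {target}"
        return self.expand(line)

    def expand(self, line):
        """Replace every word that is a known alias with its target name."""
        return re.sub(
            r"\w+",
            lambda m: self.aliases.get(m.group(0).lower(), m.group(0)),
            line,
        )
```

With this sketch, once the alias is recorded, a later command such as "lookup ht" is rewritten to "lookup hitmplog" before any further parsing takes place.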
De Clarke
Mon Jul 27 13:12:58 PDT 1998
This paper was originally published in the Sixth Annual Tcl/Tk Workshop, September 14-18, 1998, San Diego, California, USA
Last changed: 8 April 2002 ml