Design Science, Inc.
The rise of the World Wide Web has made information available to knowledge workers on an unparalleled scale. As the availability of information has increased, the challenge of finding relevant information has become more central and more difficult. The success of text-based Web search engines such as Google has dramatically illustrated the potential of effective information retrieval to impact the way people conduct research and think about information management. However, current text-based search paradigms have a number of limitations, particularly for scientific, mathematics, engineering and technical (SMET) literature.
One challenge is that the majority of this literature is held in collections where the dominant economic model for sustainability involves collecting fees for access. Cataloging information or federating searches involving these collections requires negotiating separately with each such organization. Another challenge is that a significant portion of the information in SMET documents is non-textual, consisting of figures, tables and especially mathematical notation. This problem takes an extreme form for the large and rapidly growing class of documents that were "born in print" and subsequently digitized, where even textual information must somehow be recovered from a scanned image.
At the same time, a window of opportunity for enhancing the searching of SMET literature is opening. For some years, the demand for cross-media publishing and the concomitant need for more effective information reuse and management in large collections of electronic documents has been moving the publishing industry toward XML-based workflows. In particular, SMET publishers have begun moving to XML-based workflows using MathML, a standard for representing mathematical expressions in XML. As a data format, XML offers enticing possibilities for enhanced searching. Since SMET workflows are still transitioning to XML, this is the ideal time to introduce best practices that will facilitate better searching in the future.
To explore the possibilities for enhancing searching of SMET literature in general, and mathematics in particular, a workshop was held in April of 2004 at the Institute for Mathematics and its Applications at the University of Minnesota. The workshop was funded by the National Science Foundation through a National Science Digital Library grant awarded to Design Science, Inc. The remainder of this document describes the issues and areas of consensus identified by that workshop, and lays out an agenda for action.
The workshop brought together a wide variety of SMET content providers, researchers, software vendors and library scientists. Several academic society publishers were represented, as well as commercial publishers, abstracting and indexing services, digitization and archiving projects, and an assortment of academic and educationally-focused collections. The level of interest and participation served to reinforce perceptions that questions of information management, workflow and added functionality such as searching are both timely and relevant to the community. However, it was also clear that enhanced searching of mathematics meant widely-divergent things to different organizations, and during the workshop, several distinct user groups began to emerge.
One natural grouping consisted of academic society publishers from highly mathematical fields. Representatives from SIAM, AMS, CMS, The Albert Einstein Institute (Max Planck Institute for Gravitational Physics) and the IEEE attended, as did representatives of the abstracting and indexing services Math Reviews and Zentralblatt MATH, which work closely with society publishers. They all publish documents containing many involved and complicated mathematical expressions. Their users are primarily academic researchers, and their users' interests are strongly biased toward the apparatus of scholarship: access to the peer-reviewed literature, citations, abstracts, and bibliographies. The material is very dense, and more suited to reading in print than on the screen, so PDF is widely used for electronic publication.
Another important characteristic of this community is that many of its authors use TeX and LaTeX. Dr. Michael Doob of the University of Manitoba reports that author submissions to the Canadian Math Society went from around 10% TeX in 1990, with a large proportion of submissions as manuscripts, to 100% LaTeX submissions today. Other publishers of research mathematics and theoretical physics show similar trends. Consequently, many of these publishers share the distinctive characteristic of having TeX-based production workflows. In other disciplines, most production workflows have been based on XML, SGML or proprietary page layout software such as Quark Xpress. For example, a survey of commercial SMET publishers in areas other than mathematics suggests that overall, perhaps 80% of submissions are in Word format, with equations rendered in Equation Editor or MathType™ format, and that the use of LaTeX is relatively rare.
Some math-intensive society publishers are planning to migrate or are in the process of migrating to XML-based workflows using MathML, a standard XML-based markup language used to encode mathematical expressions. However, even among those publishers that were not planning to change their production workflows in the immediate future, many are looking at XML+MathML for information management purposes at some point. MathML is a highly-structured, more information-rich representation of mathematics as compared to LaTeX. It contains vocabularies for describing both the visual presentation of expressions, and for indicating their semantic content. However, it is also a low-level representation, more akin to PostScript than TeX. Thus, MathML is almost always generated and processed via software, and is not suitable for hand authoring. For automatic processing, the regularity, structure and ability to represent both presentation and semantic aspects of expressions are highly appealing. Consequently, there was strong interest in LaTeX to XML+MathML conversion strategies at the workshop, as was evidenced by the attention given to the presentation on the Hermes project, a part of the larger MowGLI project funded by the European Union.
Another group that was well-represented, and which has become increasingly significant in recent years, consists of "retro-digitization" projects. Projects such as JSTOR, NUMDAM, ERAM and others have made tremendous strides toward scanning and digitizing research literature which originally appeared in print, stretching back into the 19th century in some cases. For these groups, creating and maintaining accurate cataloging and abstract information is important.
A number of projects also employ OCR and other techniques to extract text for full-text searching. Several are also looking into at least limited use of XML+MathML for some of their holdings, such as abstracts. However, the challenge of augmenting traditional OCR processes so that they can infer the information needed to mark up document structure, equations, tables and other non-textual data is substantial. A presentation by Dr. Masakazu Suzuki of Kyushu University on his INFTY system for mathematical formula recognition was therefore both significant and encouraging. The system, which is already commercially available, appears to yield impressive results with low error rates.
One final group worthy of note consisted of those content providers who publish to the web in HTML or XML as opposed to PDF. This includes many educational content providers and content providers targeting broader audiences, where graphic design, interactivity, and seamless integration with other Web content are important. For this group, XML+MathML is also an appealing format, though most organizations are planning to use XML+MathML as a source format and to generate HTML (or XHTML) for publication, for example using XSLT.
There was a strong consensus that XML and MathML will be important for SMET publishing, especially for information management purposes. The corollary is that effective conversion software is a prerequisite to migrating large bodies of existing material to that format. In fact, the process has begun. The Grainger Library at UIUC has converted around 200,000 pages of SMET research literature into XML+MathML under a Digital Library Initiative grant and over time, documents will increasingly be originally generated as XML+MathML. Some estimates suggest as many as 100,000 pages a year of research literature will be published, processed or archived in that format as early as 2005.
A major theme that emerged from the workshop was metadata and its management. Scholarly publishing has long relied on bibliographic cataloging information for accessing and referencing literature. In the electronic world, cataloging information such as author, title, and date of publication is regarded as metadata. While precise definitions of metadata are disputed by experts, broadly speaking, metadata consists of assertions made by third parties describing the content of an item of information. In practice, most metadata records apply at the document level, as opposed to describing items within a document, such as a picture, but such "fine-grained" metadata is not unheard of. Another useful distinction is that between objective and subjective metadata. Bibliographic cataloging information and intellectual property rights are examples of objective metadata, while grade-level, subject area, and overall quality are examples of subjective metadata.
Properly speaking, metadata records are not part of a document, but are instead stored, maintained and shared separately. For pragmatic reasons, certain kinds of objective metadata, such as the author, are often embedded within the documents to which they apply, usually in some sort of special header field. However, even in these cases, such metadata is often extracted into separate records during production. The point is that to be useful, cards belong in the card catalog and not in the books on the shelves, to use a pre-digital metaphor.
For digital libraries, the role of the card catalog is filled by databases of metadata records. Such databases have obvious utility for searching and information retrieval. One frequent kind of question knowledge workers ask is "what are all the articles by author X" and a metadata database is ideal for answering such questions. However, even this simple example illustrates two major challenges that must be addressed when using metadata for information retrieval. The first difficulty is that not all the articles by X may be in the same database. The second difficulty is that in practice, there is wide variation in how metadata is recorded. As an extreme example, there are dozens of spellings for the name of the Russian mathematician Chebychev, or Chebeshev, or Tschebyscheff, or ...
To attack the first issue, the Open Archives Initiative (OAI) has developed a protocol for sharing metadata so that unified metadata databases can be developed. When libraries were exclusively physical places, it was reasonable that their card catalogs reflected their own holdings. But in digital libraries, boundaries between collections are artificial, and from the point of view of the searcher, what is needed is a single unified metadata store. The basic idea of OAI is that content providers set up OAI servers which return XML metadata records in response to standardized queries. Aggregators then periodically "harvest" metadata from multiple collections in order to create unified databases, and offer search services built on top of them.
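The harvesting step described above can be sketched in a few lines of Python. This is only an illustration, assuming the OAI-PMH `ListRecords` verb with the `oai_dc` (unqualified Dublin Core) metadata prefix; the repository URL, the sample response and the function names are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses carrying Dublin Core records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a harvesting request URL (base_url is a hypothetical repository)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

def parse_records(xml_text):
    """Return (identifier, title, creators) for each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        creators = [c.text for c in rec.iterfind(".//dc:creator", NS)]
        records.append((ident, title, creators))
    return records

# Hypothetical, heavily abbreviated response from a content provider.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>On a Sample Theorem</dc:title>
          <dc:creator>Doe, J.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

harvested = parse_records(SAMPLE_RESPONSE)
```

An aggregator would repeat this against each provider's base URL and merge the results into one database; a real harvester would also need to handle resumption tokens and deleted records, which this sketch omits.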
Note that only metadata records are shared, so that content providers who restrict access to their holdings can be included in search services without losing control over their content. A searcher without proper credentials following a link to an item in a protected collection is typically redirected to a page with information about obtaining access to the collection. Of course, for some organizations such as indexing and abstracting services, or reviewing services such as Eisenhower National Clearinghouse, metadata is itself important intellectual property that carries value. However, even in these cases, it is typically possible to share some basic metadata, such as a title, merely to indicate that an item is in a collection, and that authorized users can obtain further information about it. This is precisely the strategy that most of these organizations have adopted to date.
OAI has gathered substantial momentum in the digital library community. The protocol specification has gone through several versions and is now relatively mature and stable. A variety of open source software is available for setting up an OAI server, and for small collections, a mechanism called the OAI Static Repository has been devised whereby metadata can be shared by placing an XML file on a web site and registering it with a gateway. At the workshop, most content providers had either already set up OAI servers or were considering doing so in the near future, and there was strong consensus that OAI would become increasingly dominant as a means of metadata sharing.
The second broad metadata problem, lack of uniformity, has more facets and is therefore less subject to system solutions. One such facet, touched on above, is the identification of equivalent variations in names, spellings, abbreviations of journal titles, and so on. This is an old problem in library science, and is merely exacerbated by the scale of metadata sharing in digital libraries. The principal line of attack is the compilation of name authority databases. For example, the US Library of Congress and the "Deutsche Bibliothek" maintain large name authority databases, and both institutions participate in the Virtual International Authority File (VIAF) project. The aim of this project will be to explore virtually combining the name authority files of both institutions into a single name authority service.
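As a rough illustration of what a name authority lookup does, the following Python sketch maps variant spellings to a single heading. The authority entry, the choice of canonical heading and the function names are all hypothetical; real authority files are far larger and carry much more context.

```python
import unicodedata

# Hypothetical authority file: canonical heading -> known variant spellings.
AUTHORITY = {
    "Chebyshev, P. L.": ["Chebyshev", "Chebychev", "Chebeshev", "Tschebyscheff"],
}

def _key(name):
    """Fold case and drop accents and punctuation so trivial variants collide."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()

# Reverse index from normalized variant to canonical heading.
VARIANT_INDEX = {
    _key(variant): heading
    for heading, variants in AUTHORITY.items()
    for variant in variants
}

def canonical_name(name):
    """Return the authority heading for a known variant, or None if unknown."""
    return VARIANT_INDEX.get(_key(name))
```

A lookup such as `canonical_name("tschebyscheff")` resolves to the shared heading, while a spelling that is not yet in the file returns `None`, which is exactly the gap that compiling authority databases is meant to close.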
A facet more specific to digital libraries is lack of uniformity in electronic formats for metadata. There have been a number of attempts to standardize metadata formats within different communities. Perhaps the most notable is the Dublin Core standard developed in 1995 under the auspices of NCSA and OCLC. Dublin Core defines a basic set of about a dozen metadata elements such as title, creator, publisher, description and so on. Dublin Core is by far the most widely used metadata standard in the publishing and digital library communities. However, as its name suggests, many if not most organizations using it have extended it in incompatible ways to record additional metadata specific to their collections. Extensions have proliferated to the extent that the European Center for Standardization has formed a working group that tries to keep these extensions to Dublin Core organized and has produced an agreement on the presentation of Dublin Core based application profiles. It is now working on a machine readable version of that agreement. Further, while Dublin Core specifies a set of metadata elements, it does not specify a particular electronic format for recording this data, and there are a number in use, including the W3C RDF format and a direct XML encoding. Consequently, creating "crosswalks" or translation guidelines among varying metadata formats and element sets used by various SMET document collections remains a major issue.
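A crosswalk of the kind mentioned above is, at its simplest, a field-by-field mapping. The Python sketch below assumes a hypothetical local record layout (the left-hand field names are invented for illustration) and maps it onto Dublin Core element names, setting aside anything that has no Dublin Core counterpart.

```python
# Hypothetical local field names mapped to Dublin Core element names.
CROSSWALK = {
    "article_title": "title",
    "authors": "creator",
    "journal_publisher": "publisher",
    "abstract": "description",
    "pub_date": "date",
}

def to_dublin_core(local_record):
    """Split a local record into Dublin Core elements and unmapped leftovers."""
    dc, unmapped = {}, {}
    for field, value in local_record.items():
        target = CROSSWALK.get(field)
        if target is None:
            unmapped[field] = value  # collection-specific extension
            continue
        values = value if isinstance(value, list) else [value]
        dc.setdefault(target, []).extend(values)  # DC elements are repeatable
    return dc, unmapped

dc_record, leftovers = to_dublin_core(
    {"article_title": "A Note on Primes", "authors": ["Doe, J.", "Roe, R."], "msc_code": "11A41"}
)
```

The leftovers are where interoperability gets difficult: each collection's extensions need their own agreed treatment, which is why crosswalks tend to be negotiated case by case.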
The topic of improving the quality, interoperability and management of metadata generated much discussion at the workshop. While there was broad consensus on the momentum and value of OAI, there was little sense that the problems of metadata interoperability had easy or short term solutions. Major content providers have been working toward convergence of metadata formats and crosswalks for interoperability, both on a case-by-case basis, and under the umbrella of various standards organizations. While progress has been slow and difficult, it has been relatively steady. For example, the major math abstracting services Math Reviews and Zentralblatt are nearing an agreement on a common format for their metadata records. Consequently, raising awareness of the issues and continued incremental improvement are likely the best that can be done, given the economic considerations that constrain content providers with very large collections stretching back for many years into the past. In areas such as mathematics where the useful lifespan of research literature routinely runs to decades, such economic constraints cannot be ignored.
Historically, the collection and management of metadata is expensive, as it generally involves a great deal of hand work on the part of subject experts. Consequently, another focus in the workshop discussions was techniques for lowering the cost of metadata without sacrificing quality. One approach is to enlist the aid of authors in identifying and/or checking at least objective bibliographic metadata, for example, by incorporating identification and verification into either the submission process or the proofing process associated with scholarly publication. IEEE and the arXiv both use electronic submission processes that take some steps in this direction. Another avenue is to try to incorporate metadata creation into authoring tools. However, this must be done with care. If metadata creation is purely optional, it will likely be ignored, while at the same time authors will resent intrusive techniques. Furthermore, many kinds of technical documents are edited by multiple people over time, which makes metadata management more complex. Nonetheless, the prevailing view was that there is a good deal of potential for facilitating easy identification of basic metadata such as author and title as part of the authoring process. For example, a standardized set of LaTeX macros for title, author, abstract, MSC classification, date, and so on might be used.
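To suggest how standardized macros could feed a metadata database, here is a minimal Python sketch that pulls basic fields out of LaTeX source. It assumes the standard `\title`, `\author` and `\date` macros plus a hypothetical `\msc` macro for the MSC classification, and it only handles arguments without nested braces; a production tool would need a TeX-aware parser.

```python
import re

# \msc is a hypothetical macro standing in for an agreed classification macro.
FIELDS = ("title", "author", "date", "msc")

def extract_metadata(tex_source):
    """Return a dict of the metadata macros found in the source."""
    meta = {}
    for name in FIELDS:
        # Matches \name{...} where the argument contains no nested braces.
        match = re.search(r"\\" + name + r"\{([^{}]*)\}", tex_source)
        if match:
            meta[name] = match.group(1).strip()
    return meta

SAMPLE_SOURCE = r"""
\title{A Note on Primes}
\author{J. Doe}
\msc{11A41}
\begin{document}
"""

sample_metadata = extract_metadata(SAMPLE_SOURCE)
```

Because the fields are ordinary macros the author already types, this kind of extraction adds no extra step to authoring, which speaks to the concern about intrusive techniques.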
Another area of active research is automatic metadata generation. Because human generated metadata is so expensive, even minimally adequate automatic generation of metadata can have an appealing cost-benefit profile. However, studies such as the "Breaking the Metadata Bottleneck" NSDL project conducted by Dr. Elizabeth Liddy of Syracuse University have shown that the quality of human generated metadata varies widely, and in the case of certain kinds of well-defined, objective metadata, automatic algorithms can actually perform better. Another reason automatic algorithms are interesting is that they might be able to make much finer-grained metadata economically viable, for example adding metadata at the individual equation level, as opposed to the document level. Whether or not such fine-grained metadata would be generally useful, however, is a question which leads us into the topic of the next section.
While it is relatively clear to everyone that mathematical notation is not accessible to traditional text-based search to any meaningful degree, there is much less agreement on what a math-aware search ought to look like. Discussions at the workshop covered ideas ranging from fine-grained per-equation metadata, to combined text and equation search, to data mining of specialized databases of mathematical objects. The discussion is further complicated by the paucity of functional, real-world examples of math-aware search. As a result, the need for test bed collections, development of use cases, and further research and usability testing of possible search techniques were the areas where strongest consensus emerged.
The most obvious notion of math-aware search is to simply extend the keyword search models used by virtually all popular Web search engines. That is, one would type in a collection of text keywords and mathematical expressions, and the search engine would return a list of documents in which they occur. This is the model used by a math-aware search engine developed by Dr. Abdou Youssef of George Washington University for the Digital Library of Mathematical Functions (DLMF). The DLMF documents are coded in LaTeX using special macro packages. For searching purposes, the mathematical expressions are first converted to structured text, e.g. "x begin_superscript 2 end_superscript," which is then indexed by a conventional text search engine. Queries are entered in a special linear query language similar to LaTeX. To perform the search, the query is also converted into text, and the resulting text search is performed against the index. While preliminary indications are that this yields surprisingly effective math searching, the DLMF search engine is not yet online, so there is not yet a substantial body of usage data.
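The conversion of expressions to structured text can be illustrated with a small Python sketch. This is not the DLMF implementation; it is a toy linearizer for a tiny LaTeX-like subset (identifiers, numbers, operators, `^` and `_` with single-token or braced arguments), written only to show how an expression becomes a sequence of words that an ordinary text indexer can handle. It assumes well-formed input.

```python
import re

# Control sequences, single letters, numbers, or any other non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S")
SCRIPTS = {"^": "superscript", "_": "subscript"}

def linearize(expr):
    """Convert a small LaTeX-like expression into space-separated index words."""
    tokens = TOKEN.findall(expr)
    pos = 0

    def group():
        # Consume one script argument: a braced group or a single token.
        nonlocal pos
        if tokens[pos] == "{":
            pos += 1
            inner = sequence(stop="}")
            pos += 1  # skip the closing brace
            return inner
        tok = tokens[pos]
        pos += 1
        return [tok]

    def sequence(stop=None):
        # Consume tokens until the stop token (or the end of input).
        nonlocal pos
        words = []
        while pos < len(tokens) and tokens[pos] != stop:
            tok = tokens[pos]
            pos += 1
            if tok in SCRIPTS:
                name = SCRIPTS[tok]
                words += ["begin_" + name] + group() + ["end_" + name]
            else:
                words.append(tok)
        return words

    return " ".join(sequence())
```

Applying the same function to both documents and queries means that a query for `x^2` becomes the phrase "x begin_superscript 2 end_superscript", which a conventional engine can match as an exact phrase.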
The DLMF technique of piggy-backing on a text-search engine has some interesting side effects. A common capability of text search engines is control of proximity searching, where search terms must appear within a certain distance of each other, for example within 5 words. Using this capability, it is possible to form queries such as "find all expressions of the form (x^2 + ?)" where ? represents any single term. On the other hand, a query such as "find all expressions matching $a^3 + $a^2" is not possible, where both $a's denote the same variable name. Of course, these examples merely scratch the surface of how one might specify abstractions and constraints in a math search query. Setting aside the non-trivial questions of devising an effective query language or graphical query editor, a more basic question is how sophisticated queries need to be in order to be effective. Research indicates that most users of search engines don't bother to inform themselves about the subtleties of the query language, nor do they use even basic features very often if they can't be learned quickly and remembered easily. Consequently, real world usability data is clearly important in determining the success of any math search paradigm that targets a general audience.
Another aspect unique to the DLMF model is that it takes advantage of exceptionally regular notation across the collection and specialized source code macros. In general, however, TeX and LaTeX, with their powerful macro mechanisms and ability to redefine basic language constructs on the fly, are notorious for lack of regularity in source code. Consequently, a natural question to ask is whether the analogous strategy of piggy-backing on an XML-based XQuery search engine indexing XML+MathML source would perform better. In particular, MathML enforces a certain regularity of structure, and admits a fair degree of normalization. Also, the XQuery language is more powerful than text-based query languages in its ability to define constraints and relationships between terms.
In both these models, there is substantial appeal in leveraging existing technology which incorporates high quality text search. However, another intriguing math-search model takes a different approach entirely. This model is exemplified by the Online Encyclopedia of Integer Sequences and Series, created by Neil Sloane of AT&T Research. Using a web interface, a user enters a sequence of integers. If the sequence matches an entry in the database, a page is returned detailing the basic properties of the sequence, known relations to other sequences, and references to instances of the sequence in the literature. In a talk at the Future of Mathematical Communication II conference at MSRI, mathematician Rob Corless told of solving a problem in dynamical systems using results from an article on combinatorics which he located using the Online Encyclopedia. The point here is that Corless thought it very unlikely that he would have found the information using text-based keyword searching; he wouldn't have known any of the appropriate keywords, since he was unfamiliar with the specific area of combinatorics, and unaware of the connection.
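The difference between the two kinds of query can be made concrete with a small Python sketch. The pattern syntax here is hypothetical: `?` matches any single token, and a name beginning with `$` must match the same token everywhere it occurs. The first feature is within reach of plain text matching, while the second requires the matcher to carry variable bindings, which a proximity search does not do.

```python
def match_at(pattern, tokens, start):
    """Match pattern against tokens at start; return variable bindings or None."""
    bindings = {}
    for offset, p in enumerate(pattern):
        t = tokens[start + offset]
        if p == "?":
            continue  # wildcard: any single token is acceptable
        if p.startswith("$"):
            # First occurrence binds the variable; later ones must agree.
            if bindings.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return bindings

def find(pattern, tokens):
    """Return (position, bindings) for every place the pattern matches."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        bindings = match_at(pattern, tokens, start)
        if bindings is not None:
            hits.append((start, bindings))
    return hits
```

With this matcher, the pattern `$a ^ 3 + $a ^ 2` matches the tokens of `y ^ 3 + y ^ 2` but rejects `y ^ 3 + z ^ 2`, whereas replacing both `$a` by `?` would accept either.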
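To give a feel for structural queries over MathML, the Python sketch below uses the limited XPath support in the standard library's ElementTree as a stand-in for XQuery, which Python does not provide out of the box. It assumes presentation MathML in the standard MathML namespace and well-formed `msup` elements with exactly two children; the sample document and function name are illustrative only.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
M = {"m": MATHML_NS}

def bases_raised_to(doc_xml, exponent):
    """Return the base of every <msup> whose exponent is the given number."""
    root = ET.fromstring(doc_xml)
    bases = []
    for msup in root.iterfind(".//m:msup", M):
        base, exp = list(msup)[:2]  # <msup> holds base then superscript
        is_number = exp.tag == "{%s}mn" % MATHML_NS
        if is_number and (exp.text or "").strip() == exponent:
            bases.append("".join(base.itertext()).strip())
    return bases

SAMPLE_MATHML = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <msup><mi>x</mi><mn>2</mn></msup>
    <mo>+</mo>
    <msup><mi>y</mi><mn>3</mn></msup>
  </mrow>
</math>"""

squares = bases_raised_to(SAMPLE_MATHML, "2")
```

The query depends on element structure rather than on how the author happened to type the source, which is the regularity argument made above for MathML.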
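The lookup model can be sketched in Python as a search for a contiguous run of terms in a small table of stored sequences. The table below is a hypothetical miniature with three well-known sequences; it says nothing about how the actual encyclopedia is implemented, and real entries would carry properties, cross-references and citations.

```python
# Hypothetical miniature database of named integer sequences.
SEQUENCES = {
    "Fibonacci numbers": [0, 1, 1, 2, 3, 5, 8, 13, 21, 34],
    "Catalan numbers": [1, 1, 2, 5, 14, 42, 132, 429],
    "Prime numbers": [2, 3, 5, 7, 11, 13, 17, 19, 23],
}

def lookup(terms):
    """Return names of stored sequences containing terms as a contiguous run."""
    n = len(terms)
    hits = []
    for name, seq in SEQUENCES.items():
        if any(seq[i:i + n] == terms for i in range(len(seq) - n + 1)):
            hits.append(name)
    return hits
```

A researcher who computes the values 2, 5, 14 in some unrelated problem can call `lookup([2, 5, 14])` and be pointed to the Catalan numbers without knowing any combinatorial vocabulary, which is the kind of discovery described in the Corless anecdote.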
The model of a specialized database of mathematical objects is also suggestive of the special functions web site created by Wolfram Research. In that case, the database of special functions contains representations both in the Mathematica language as well as MathML and LaTeX. Using Mathematica, the entries in the database can be formally manipulated, and in theory, could be algorithmically mined for relationships and interconnections. Other groups are also working on searching within formal systems, such as the Coq theorem proving environment, as well as in less structured contexts such as raw MathML.
In this vision of math-aware searching, specialized databases of mathematical objects become powerful research tools, as algorithmic searching reveals interconnections between mathematical objects themselves, as well as the research areas in which instances of the objects occur. One possible architecture for such databases would involve automatically extracting non-trivial equations from documents into databases, and piggy-backing on a metadata sharing system such as OAI to allow aggregators to harvest equations into unified collections. Aggregators could then offer specialized search services, and algorithmically enhance the collection by searching for inter-relationships using more sophisticated but still computationally tractable forms of mathematical equivalence. For example, by analogy with the integer sequences and series, one might have databases of polynomials, matrices, continued fractions, and many other classes of mathematical objects with normal or semi-normal forms, or that can be linearly ordered. In this context, it may be useful to differentiate between searching in or for published literature and the building of specialized services such as online encyclopedias or expert systems that combine searching and computation.
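As one illustration of what a normal form buys, the Python sketch below reduces univariate polynomials with integer coefficients and non-negative exponents to a canonical tuple of coefficients, so that differently written but equal polynomials produce the same database key. The input representation (a list of coefficient and exponent pairs) and the function names are assumptions made for the example.

```python
def normal_form(terms):
    """Canonical coefficient tuple, highest degree first.

    terms is an iterable of (coefficient, exponent) pairs in any order;
    repeated exponents are combined and zero terms are dropped.
    """
    coeffs = {}
    for c, e in terms:
        coeffs[e] = coeffs.get(e, 0) + c
    degree = max((e for e, c in coeffs.items() if c != 0), default=-1)
    return tuple(coeffs.get(e, 0) for e in range(degree, -1, -1))

def sort_key(terms):
    """Linear order on polynomials: by degree, then by coefficients."""
    nf = normal_form(terms)
    return (len(nf), nf)

# x^2 + 2x + 1 written two different ways.
p = [(1, 2), (2, 1), (1, 0)]
q = [(1, 0), (1, 1), (1, 2), (1, 1), (5, 3), (-5, 3)]
```

Both `p` and `q` reduce to the same tuple, so an aggregator could detect that two harvested equations involve the same polynomial by comparing keys, and could use `sort_key` to keep the collection in a fixed order for lookup.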
The methods and examples of math searching discussed so far are strongly biased toward the research literature. However, the educational community is a far larger group of users. In educational contexts at the lower levels, a third notion of math-aware searching returns to the idea of fine-grained, per equation metadata. A motivating use case might be a student searching for "a^2 + b^2 = c^2". Even using Google today, this search returns many thousands of hits. The problem in this case is picking out the one, for example, that occurs with worked examples of the Pythagorean theorem appropriate for grades 6-8. The point is that the student's information need revolves more closely around the metadata assertions than the actual formula.
A number of research groups, such as the NSDL MetaTest project, have begun investigating the effectiveness of search systems based on human and automatically generated metadata, especially in comparison with full-text search systems. However, in general metadata devoted to classifying and describing subject matter is expensive to produce and suffers from lack of uniformity and objectivity. This suggests that metadata search systems based on these kinds of data are unlikely to outperform full-text search systems. By contrast, experience from successful, education-oriented Web sites such as the Eisenhower National Clearinghouse and the Math Forum at Drexel indicates that end users do perceive significant value in augmenting text searching with metadata-based constraints involving criteria such as educational function, grade level and the like. Similarly, there has been some work done indicating that metadata on physical units might be used to good effect, particularly in engineering and the sciences. Consequently, it seems clear there is a role for faceted metadata-based searching in many contexts, provided metadata can be created and managed in a cost-effective way.
Another part of the appeal of a fine-grained metadata approach is that it is well-suited to automatic creation via authoring tools. In languages like LaTeX, a natural approach would be to use specialized macro packages, where the macros would contain metadata labels in some fashion. This would facilitate, for example, tagging an equation as containing a Riemann tensor by using a \Riemann macro. Alternatively, and perhaps more importantly in the educational sphere, WYSIWYG authoring tools such as Design Science's MathType editor could facilitate labeling of significant equations via a simple user interface such as pulldown menus. A more sophisticated approach might be to use the local context around an equation together with a clustering algorithm to guess appropriate metadata, which a user could then accept, modify or ignore.
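Augmenting text search with metadata constraints can be sketched as a filter applied on top of a plain text match. The records, facet names and grade ranges in this Python example are hypothetical, and the substring test stands in for a real full-text engine.

```python
# Hypothetical records: searchable text plus document-level metadata facets.
RECORDS = [
    {"id": 1, "text": "Worked examples of a^2 + b^2 = c^2 for right triangles",
     "grades": (6, 8), "function": "worked example"},
    {"id": 2, "text": "A proof that a^2 + b^2 = c^2 via inner product spaces",
     "grades": (13, 16), "function": "proof"},
    {"id": 3, "text": "Practice problems on fractions",
     "grades": (6, 8), "function": "exercise"},
]

def search(query, grade=None, function=None):
    """Substring text match, narrowed by optional metadata facets."""
    hits = []
    for rec in RECORDS:
        if query not in rec["text"]:
            continue
        low, high = rec["grades"]
        if grade is not None and not (low <= grade <= high):
            continue
        if function is not None and rec["function"] != function:
            continue
        hits.append(rec["id"])
    return hits
```

The formula alone matches both the middle-school material and the advanced proof; adding `grade=7` leaves only the record suited to that student, which reflects the point that the information need turns on the metadata more than on the formula.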
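The macro-based approach could feed per-equation metadata along the following lines. This Python sketch assumes equations are delimited by the standard `equation` environment and that a hypothetical table maps semantic macros such as `\Riemann` to metadata labels; the second macro and both labels are invented for illustration.

```python
import re

# Hypothetical mapping from semantic macro names to metadata labels.
MACRO_LABELS = {
    "Riemann": "Riemann tensor",
    "Christoffel": "Christoffel symbol",
}

EQUATION = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)
MACRO = re.compile(r"\\([A-Za-z]+)")

def equation_metadata(tex_source):
    """Return one record per equation with the labels implied by its macros."""
    records = []
    for body in EQUATION.findall(tex_source):
        labels = sorted({MACRO_LABELS[name]
                         for name in MACRO.findall(body)
                         if name in MACRO_LABELS})
        records.append({"source": body.strip(), "labels": labels})
    return records

SAMPLE_TEX = (r"\begin{equation}\Riemann_{abcd} = 0\end{equation} Some text. "
              r"\begin{equation}x^2 + y^2 = 1\end{equation}")

equation_records = equation_metadata(SAMPLE_TEX)
```

Equations that use no labelled macro simply come back with an empty label list, so authors who ignore the macros lose nothing, while those who use them get fine-grained metadata for free.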
However, as noted at the beginning of the section, all of these visions of math-aware searching must be considered speculative until there is more validation using relatively large collections and representative user groups. There is a growing number of research projects aimed at examining or implementing various of the techniques touched on here, but very few have undergone any degree of real-world use or usability testing. Consequently, there was strong consensus that the availability of test bed document collections, and cooperation in conducting pilot studies is essential if research is to advance quickly to concrete, deployable solutions.
In a round-table discussion, workshop participants summarized their perceptions of the areas of consensus that had emerged in a twelve point agenda for progress. These fall under five headings:
This document is intended as a first step toward SMET community outreach. To further that goal, a web site has also been created as part of the Enhancing the Searching of Mathematics project funded by the National Science Foundation through its National Science Digital Library program. For further information about the Agenda for Progress, what other organizations are doing to implement it, and ways you can contribute, please visit http://www.dessci.com/searching.