Enhancing the Searching of Mathematics
A position paper based on the proceedings of the
Enhancing the Searching of Mathematics Workshop
Robert Miner
Design Science, Inc.
June 8, 2004
- Introduction
- Trends in Scientific Collections
- Metadata and its Management
- Math-aware Searching
- An Agenda for Progress
Introduction
The rise of the World Wide Web has made information available to
knowledge workers on an unparalleled scale. As the availability
of information has increased, the challenge of finding relevant
information has become more central and more difficult. The
success of text-based Web search engines such as Google has
dramatically illustrated the potential of effective information
retrieval to impact the way people conduct research and think
about information management. However, current text-based search
paradigms have a number of limitations, particularly for scientific,
mathematical, engineering and technical (SMET) literature.
One challenge is that the majority of this literature is held in
collections where the dominant economic model for sustainability
involves collecting fees for access. Cataloging information
or federating searches involving these collections requires
negotiating separately with each such organization. Another
challenge is that a significant portion of the information in
SMET documents is non-textual, consisting of figures, tables
and especially mathematical notation. This problem takes an
extreme form for the large and rapidly growing class of documents
that were "born in print" and subsequently digitized, where
even textual information must somehow be recovered from a scanned
image.
At the same time, a window of opportunity for enhancing the searching
of SMET literature is opening. For some years, the demand for
cross-media publishing and the concomitant need for more effective
information reuse and management in large collections of electronic
documents has been moving the publishing industry toward XML-based
workflows. In particular, SMET publishers have begun moving
to XML-based workflows using MathML, a standard for representing
mathematical expressions in XML. As a data format, XML offers
enticing possibilities for enhanced searching. Since SMET workflows
are still transitioning to XML, this is the ideal time to introduce
best practices that will facilitate better searching in the
future.
To explore the possibilities for enhancing searching of SMET literature
in general, and mathematics in particular, a workshop was held
in April of 2004 at the Institute for Mathematics and its Applications
at the University of Minnesota. The workshop was funded by the
National Science Foundation through a National Science Digital
Library grant awarded to Design Science, Inc. The remainder
of this document describes the issues and areas of consensus
identified by that workshop, and lays out an agenda for action.
Trends in Scientific Collections
The workshop brought together a wide variety of SMET content providers,
researchers, software vendors and library scientists. Several
academic society publishers were represented, as well as commercial
publishers, abstracting and indexing services, digitization
and archiving projects, and an assortment of academic and educationally-focused
collections. The level of interest and participation served
to reinforce perceptions that questions of information management,
workflow and added functionality such as searching are both
timely and relevant to the community. However, it was also clear
that enhanced searching of mathematics meant widely divergent
things to different organizations, and during the workshop,
several distinct user groups began to emerge.
One natural grouping consisted of academic society publishers from
highly mathematical fields. Representatives from SIAM, AMS,
CMS, The Albert Einstein Institute (Max Planck Institute for
Gravitational Physics) and the IEEE attended, as well as from
the abstracting and indexing services Math Reviews and Zentralblatt
MATH, which work closely with society publishers. They all publish
documents containing many involved and complicated mathematical
expressions. Their users are primarily academic researchers,
whose interests are strongly biased toward the apparatus
of scholarship: access to the peer-reviewed literature, citations,
abstracts, and bibliographies. The material is very dense, and
more suited to reading in print than on the screen, so PDF is
widely used for electronic publication.
Another important characteristic of this community is that many of its
authors use TeX and LaTeX. Dr. Michael Doob of the University
of Manitoba reports that author submissions to the Canadian
Math Society went from around 10% TeX in 1990, with a large
proportion of submissions as manuscripts, to 100% LaTeX
submissions today. Other publishers of research mathematics
and theoretical physics show similar trends. Consequently, many
of these publishers share the distinctive characteristic of
having TeX-based production workflows. In other disciplines,
most production workflows have been based on XML, SGML or proprietary
page layout software such as QuarkXPress. For example, a survey
of commercial SMET publishers in areas other than mathematics
suggests that overall, perhaps 80% of submissions are in Word format,
with equations rendered in Equation Editor or MathType™
format, and that the use of LaTeX is relatively rare.
Some math-intensive society publishers are planning to migrate or
are in the process of migrating to XML-based workflows using
MathML, a standard XML-based markup language used to encode
mathematical expressions. Even among those publishers
that are not planning to change their production workflows
in the immediate future, many are looking at XML+MathML for
information management purposes at some point. MathML is a highly structured
representation of mathematics that carries more information than
LaTeX. It contains vocabularies both for describing the visual
presentation of expressions and for indicating their semantic
content. However, it is also a low-level representation, more
akin to PostScript than TeX. Thus, MathML is almost always generated
and processed via software, and is not suitable for hand authoring.
For automatic processing, the regularity, structure and ability
to represent both presentation and semantic aspects of expressions
are highly appealing. Consequently, there was strong interest
in LaTeX to XML+MathML conversion strategies at the workshop,
as was evidenced by the attention given to the presentation
on the Hermes project, a part of the larger MoWGLI project funded
by the European Union.
Another group that was well-represented, and which has become increasingly
significant in recent years, consists of "retro-digitization"
projects. Projects such as JSTOR, NUMDAM, ERAM and others have
made tremendous strides toward scanning and digitizing research
literature which originally appeared in print, stretching back
into the 19th century in some cases. For these groups, creating
and maintaining accurate cataloging and abstract information
is important.
A number of projects also employ OCR and other techniques to extract
text for full-text searching. Several are also looking into
at least limited use of XML+MathML for some of their holdings,
such as abstracts. However, the challenge of augmenting traditional
OCR processes so that they can infer the information needed
to mark up document structure, equations, tables and other non-textual
data is substantial. A presentation by Dr. Masakazu Suzuki of
Kyushu University on his INFTY system for mathematical formula
recognition was therefore both significant and encouraging.
The system, which is already commercially available, appears
to yield impressive results with low error rates.
One final group worthy of note consisted of those content providers
who publish to the web in HTML or XML as opposed to PDF. This
includes many educational content providers and content providers
targeting broader audiences, where graphic design, interactivity,
and seamless integration with other Web content are important.
For this group, XML+MathML is also an appealing format, though
most organizations are planning to use XML+MathML as a source
format and to generate HTML (or XHTML) for publication, for
example using XSLT.
There was a strong consensus that XML and MathML will be important
for SMET publishing, especially for information management purposes.
The corollary is that effective conversion software is a prerequisite
to migrating large bodies of existing material to that format.
In fact, the process has begun. The Grainger Library at UIUC
has converted around 200,000 pages of SMET research literature
into XML+MathML under a Digital Library Initiative grant. Over
time, documents will increasingly be generated natively
as XML+MathML. Some estimates suggest as many as 100,000 pages
a year of research literature will be published, processed or
archived in that format as early as 2005.
Metadata and its Management
A major theme that emerged from the workshop
was metadata and its management. Scholarly publishing has long
relied on bibliographic cataloging information for accessing
and referencing literature. In the electronic world, cataloging
information such as author, title, and date of publication is
regarded as metadata. While precise definitions of metadata
are disputed by experts, broadly speaking, metadata consists
of assertions made by third parties describing the content of
an item of information. In practice, most metadata records apply
at the document level, as opposed to describing items within
a document, such as a picture, but such "fine-grained" metadata
is not unheard of. Another useful distinction is that between
objective and subjective metadata. Bibliographic cataloging
information and intellectual property rights are examples of
objective metadata, while grade-level, subject area, and overall
quality are examples of subjective metadata.
Properly speaking, metadata records are not
part of a document, but are instead stored, maintained and shared
separately. For pragmatic reasons, certain kinds of objective
metadata, such as the author, are often embedded within the
documents to which they apply, usually in some sort of special
header field. However, even in these cases, such metadata is
often extracted into separate records during production. The
point is that to be useful, cards belong in the card catalog
and not in the books on the shelves, to use a pre-digital metaphor.
For digital libraries, the role of the card
catalog is filled by databases of metadata records. Such databases
have obvious utility for searching and information retrieval.
One frequent kind of question knowledge workers ask is "what
are all the articles by author X?" and a metadata database is
ideal for answering such questions. However, even this simple
example illustrates two major challenges that must be addressed
when using metadata for information retrieval. The first difficulty
is that not all the articles by X may be in the same database.
The second difficulty is that in practice, there is wide variation
in how metadata is recorded. As an extreme example, there are
dozens of spellings for the name of the Russian mathematician
Chebychev, or Chebeshev, or Tschebyscheff, or ...
To attack the first issue, the Open Archives
Initiative (OAI) has developed a protocol for sharing metadata
so that unified metadata databases can be developed. When libraries
were exclusively physical places, it was reasonable that their
card catalogs reflected their own holdings. But in digital libraries,
boundaries between collections are artificial, and from the
point of view of the searcher, what is needed is a single unified
metadata store. The basic idea of OAI is that content providers
set up OAI servers which return XML metadata records in response
to standardized queries. Aggregators then periodically "harvest"
metadata from multiple collections in order to create unified
databases, and offer search services built on top of them.
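The harvesting step described above can be sketched in a few lines of Python. This is a minimal illustration, not a full harvester: it assumes the repository speaks OAI-PMH 2.0 and returns unqualified Dublin Core (the `oai_dc` metadata prefix). The base URL, the record identifier and the sample response are invented for the example, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses that carry unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a repository's base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract identifier, title and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    harvested = []
    for record in root.iterfind(".//oai:record", NS):
        harvested.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": record.findtext(".//dc:title", namespaces=NS),
            "creators": [c.text for c in record.iterfind(".//dc:creator", NS)],
        })
    return harvested


# Invented sample response; a real harvester would fetch list_records_url(...).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:article-1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical article</dc:title>
          <dc:creator>Author X</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

An aggregator would run the same parsing over the responses of many repositories and merge the results into one database. A complete harvester would also need to handle resumption tokens and incremental harvesting, which this sketch omits.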
Note that only metadata records are shared,
so that content providers who restrict access to their holdings
can be included in search services without losing control over
their content. A searcher without proper credentials following
a link to an item in a protected collection is typically redirected
to a page with information about obtaining access to the collection.
Of course, for some organizations such as indexing and abstracting
services, or reviewing services such as Eisenhower National
Clearinghouse, metadata is itself important intellectual property
that carries value. However, even in these cases, it is typically
possible to share some basic metadata, such as a title, merely
to indicate that an item is in a collection, and that authorized
users can obtain further information about it. This is precisely
the strategy that most of these organizations have adopted to
date.
OAI has gathered substantial momentum in the
digital library community. The protocol specification has gone
through several versions and is now relatively mature and stable.
A variety of open source software is available for setting up
an OAI server, and for small collections, a mechanism called
the OAI Static Repository has been devised whereby metadata
can be shared by placing an XML file on a web site and registering
it with a gateway. At the workshop, most content providers had
either already set up OAI servers or were considering doing
so in the near future, and there was strong consensus that OAI
would become increasingly dominant as a means of metadata sharing.
The second broad metadata problem, lack of uniformity,
has more facets and is therefore less subject to system solutions.
One such facet, touched on above, is the identification of equivalent
variations in names, spellings, abbreviations of journal titles,
and so on. This is an old problem in library science, and is
merely exacerbated by the scale of metadata sharing in digital
libraries. The principal line of attack is the compilation of
name authority databases. For example, the US Library of Congress
and the "Deutsche Bibliothek" maintain large name authority
databases, and both institutions participate in the Virtual
International Authority File (VIAF) project. The aim of this
project will be to explore virtually combining the name authority
files of both institutions into a single name authority service.
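To make the idea of a name authority file concrete, here is a small sketch in Python. The record identifier, the preferred form and the list of variants are assumptions made up for the illustration. Real authority files are far larger and carry much richer records.

```python
import unicodedata

# Hypothetical authority file: one record per person, listing known variants.
AUTHORITY = {
    "person:0001": {
        "preferred": "Chebyshev, Pafnuty",
        "variants": ["Chebychev", "Chebeshev", "Tschebyscheff", "Tchebycheff"],
    },
}


def normalize(name):
    """Strip accents, punctuation and case so spellings compare loosely."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isalpha()).lower()


def build_index(authority):
    """Map every normalized surname variant to its authority record ID."""
    index = {}
    for record_id, entry in authority.items():
        surname = entry["preferred"].split(",")[0]
        for form in [surname, *entry["variants"]]:
            index[normalize(form)] = record_id
    return index


INDEX = build_index(AUTHORITY)


def resolve(name):
    """Return the authority record ID for a surname, or None if unknown."""
    return INDEX.get(normalize(name))


print(resolve("Tschebyscheff"))  # same record as "Chebyshev"
print(resolve("Gauss"))          # not in this toy file
```

The point of the design is that searches and metadata records refer to the record identifier rather than to any one spelling, so adding a newly discovered variant requires no change to the records themselves.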
A facet more specific to digital libraries is
lack of uniformity in electronic formats for metadata. There
have been a number of attempts to standardize metadata formats
within different communities. Perhaps the most notable is the
Dublin Core standard developed in 1995 under the auspices of
NCSA and OCLC. Dublin Core defines a basic set of about a dozen
metadata elements such as title, creator, publisher, description
and so on. Dublin Core is by far the most widely used metadata
standard in the publishing and digital library communities.
However, as its name suggests, many if not most organizations
using it have extended it in incompatible ways to record additional
metadata specific to their collections. Extensions have proliferated
to the extent that the European Committee for Standardization (CEN) has
formed a working group that tries to keep these extensions
to Dublin Core organized and has produced an agreement on the
presentation of Dublin Core-based application profiles. It is
now working on a machine-readable version of that agreement. Further,
while Dublin Core specifies a set of metadata elements, it does
not specify a particular electronic format for recording this
data, and there are a number in use, including the W3C RDF format
and a direct XML encoding. Consequently, creating "crosswalks"
or translation guidelines among varying metadata formats and
element sets used by various SMET document collections remains
a major issue.
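A crosswalk of the kind mentioned above is, at its simplest, a mapping table between element sets. The Python sketch below assumes an invented local schema. Only the target names (title, creator, publisher, description, date) are Dublin Core elements. Fields with no counterpart are returned separately, which is where most of the real difficulty lies.

```python
# Hypothetical local element names mapped to Dublin Core element names.
CROSSWALK = {
    "article_title": "title",
    "author_name": "creator",
    "journal_publisher": "publisher",
    "abstract": "description",
    "pub_date": "date",
}


def to_dublin_core(local_record, crosswalk=CROSSWALK):
    """Translate a local record; return (dc_record, unmapped_fields)."""
    dc_record = {}
    unmapped = {}
    for field, value in local_record.items():
        target = crosswalk.get(field)
        if target is None:
            unmapped[field] = value  # no Dublin Core counterpart in this table
            continue
        values = value if isinstance(value, list) else [value]
        # Dublin Core elements are repeatable, so every element holds a list.
        dc_record.setdefault(target, []).extend(values)
    return dc_record, unmapped


record = {
    "article_title": "A hypothetical article",
    "author_name": ["A. Author", "B. Author"],
    "msc_code": "11A41",  # collection-specific field with no mapping here
}
print(to_dublin_core(record))
```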
The topic of improving the quality, interoperability
and management of metadata generated much discussion at the
workshop. While there was broad consensus on the momentum and
value of OAI, there was little sense that the problems of metadata
interoperability had easy or short term solutions. Major content
providers have been working toward convergence of metadata formats
and crosswalks for interoperability, both on a case-by-case
basis, and under the umbrella of various standards organizations.
While progress has been slow and difficult, it has been relatively
steady. For example, the major math abstracting services Math
Reviews and Zentralblatt are nearing an agreement on a common
format for their metadata records. Consequently, raising awareness
of the issues and continued incremental improvement are likely
the best that can be done, given the economic considerations
that constrain content providers with very large collections
stretching back for many years into the past. In areas such
as mathematics where the useful lifespan of research literature
routinely runs to decades, such economic constraints cannot
be ignored.
Historically, the collection and management
of metadata is expensive, as it generally involves a great deal
of hand work on the part of subject experts. Consequently, another
focus in the workshop discussions was techniques for lowering
the cost of metadata without sacrificing quality. One approach
is to enlist the aid of authors in identifying and/or checking
at least objective bibliographic metadata, for example, by incorporating
identification and verification into either the submission process
or the proofing process associated with scholarly publication.
IEEE and the arXiv both use electronic submission processes
that take some steps in this direction. Another avenue is to
try to incorporate metadata creation into authoring tools. However,
this must be done with care. If metadata creation is purely
optional, it will likely be ignored, while at the same time
authors will resent intrusive techniques. Furthermore, many
kinds of technical documents are edited by multiple people over
time, which makes metadata management more complex. Nonetheless,
the prevailing view was that there is a good deal of potential
for facilitating easy identification of basic metadata such
as author and title as part of the authoring process. For example,
a standardized set of LaTeX macros for title, author, abstract,
MSC classification, date, and so on might be used.
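As an illustration of how such macros could feed a metadata record, the following Python sketch pulls the arguments of a few macros out of LaTeX source. The macros `\title`, `\author` and `\date` are standard LaTeX. The `\msc` macro is a hypothetical stand-in for a standardized classification macro, and the regular expression deliberately ignores nested braces.

```python
import re

# \title, \author and \date exist in standard LaTeX; \msc is an invented
# placeholder for a community-agreed classification macro.
FIELDS = ("title", "author", "date", "msc")


def extract_metadata(latex_source):
    """Collect the brace arguments of the known metadata macros."""
    metadata = {}
    for field in FIELDS:
        # Matches \field{...} where the argument contains no nested braces.
        pattern = re.compile(r"\\" + field + r"\{([^{}]*)\}")
        values = [v.strip() for v in pattern.findall(latex_source)]
        if values:
            metadata[field] = values
    return metadata


SAMPLE = r"""
\title{On a Hypothetical Theorem}
\author{A. Author}
\author{B. Author}
\msc{11A41}
"""

print(extract_metadata(SAMPLE))
```

Because authors can redefine macros freely, a production tool would more plausibly hook into the LaTeX run itself than scan the source text. The sketch only shows what becomes possible once the macro names are agreed upon.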
Another area of active research is automatic
metadata generation. Because human-generated metadata is so
expensive, even minimally adequate automatic generation of metadata
can be attractive on cost-benefit grounds. However, studies
such as the "Breaking the Metadata Bottleneck" NSDL project
conducted by Dr. Elizabeth Liddy of Syracuse University have
shown that the quality of human-generated metadata varies widely,
and in the case of certain kinds of well-defined, objective
metadata, automatic algorithms can actually perform better.
Another reason automatic algorithms are interesting is that
they might be able to make much finer-grained metadata economically
viable, for example adding metadata at the individual equation
level, as opposed to the document level. Whether or not such
fine-grained metadata would be generally useful, however, is
a question which leads us into the topic of the next section.
Math-aware Searching
While it is relatively clear to everyone that
mathematical notation is not accessible to traditional text-based
search to any meaningful degree, there is much less agreement
on what a math-aware search ought to look like. Discussions
at the workshop covered ideas ranging from fine-grained per-equation
metadata, to combined text and equation search, to data mining
of specialized databases of mathematical objects. The discussion
is further complicated by the paucity of functional, real-world
examples of math-aware search. As a result, the need for test
bed collections, development of use cases, and further research
and usability testing of possible search techniques were the
areas where strongest consensus emerged.
The most obvious notion of math-aware search
is to simply extend the keyword search models used by virtually
all popular Web search engines. That is, one would type in a
collection of text keywords and mathematical expressions, and
the search engine would return a list of documents in which
they occur. This is the model used by a math-aware search engine
developed by Dr. Abdou Youssef of George Washington University
for the Digital Library of Mathematical Functions (DLMF). The
DLMF documents are coded in LaTeX using special macro packages.
For searching purposes, the mathematical expressions are first
converted to structured text, e.g. "x begin_superscript 2 end_superscript,"
which is then indexed by a conventional text search engine.
Queries are entered in a special linear query language similar
to LaTeX. To perform the search, the query is also converted
into text, and the resulting text search is performed against
the index. While preliminary indications are that this yields
surprisingly effective math searching, the DLMF search engine
is not yet online, so there is not yet a substantial body of
usage data.
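The conversion of an expression into indexable text can be illustrated with a toy converter. The Python sketch below is not the DLMF implementation. It handles only a small LaTeX-like subset (letters, numbers, control words, `^`, `_` and brace groups), and its token names simply follow the "begin_superscript ... end_superscript" example given above.

```python
import re

# Control words (\alpha), single letters, integers, or any other symbol.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S")


def tokenize(expr):
    """Split a LaTeX-like expression into tokens."""
    return TOKEN.findall(expr)


def read_argument(tokens, i):
    """Return (argument tokens, next index) for a script starting at i."""
    if i >= len(tokens):
        return [], i
    if tokens[i] != "{":
        return [tokens[i]], i + 1
    depth, j = 1, i + 1
    while j < len(tokens) and depth:
        depth += {"{": 1, "}": -1}.get(tokens[j], 0)
        j += 1
    return tokens[i + 1 : j - 1], j


def to_text(tokens):
    """Turn tokens into index terms, spelling out script structure."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("^", "_"):
            name = "superscript" if tok == "^" else "subscript"
            arg, i = read_argument(tokens, i + 1)
            out += [f"begin_{name}", *to_text(arg), f"end_{name}"]
        elif tok in ("{", "}"):
            i += 1  # bare grouping braces produce no index term
        else:
            out.append(tok.lstrip("\\"))
            i += 1
    return out


def index_text(expr):
    """Structured text that a conventional text engine could index."""
    return " ".join(to_text(tokenize(expr)))


print(index_text("x^2"))
print(index_text(r"x^{n+1} + \alpha_k"))
```

Queries would pass through the same function before being handed to the text engine, so that query and index agree on the structured vocabulary.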
The DLMF technique of piggy-backing on a text-search
engine has some interesting side effects. A common capability
of text search engines is control of proximity searching, where
search terms must appear within a certain distance of each
other, for example within 5 words. Using this capability, it
is possible to form queries such as "find all expressions of
the form (x^2 + ?)" where ? represents any single term. On the
other hand, a query such as "find all expressions matching $a^3
+ $a^2", where both occurrences of $a denote the same variable
name, is not possible. Of course, these examples merely scratch the surface of
how one might specify abstractions and constraints in a math
search query. Setting aside the non-trivial questions of devising
an effective query language or graphical query editor, a more
basic question is how sophisticated queries need to be in order
to be effective. Research indicates that most users of search
engines don't bother to inform themselves about the subtleties
of the query language, nor do they use even basic features very
often if they can't be learned quickly and remembered easily.
Consequently, real world usability data is clearly important
in determining the success of any math search paradigm that
targets a general audience.
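The difference between the two example queries comes down to whether the matcher must remember what a placeholder matched. The Python sketch below uses a hypothetical pattern notation over token lists: `?` is an anonymous wildcard, and names beginning with `$` must bind to the same token everywhere they occur. It illustrates the distinction only and is not a proposal for a query language.

```python
def match(pattern, tokens):
    """Match two equal-length token lists; return bindings, or None."""
    if len(pattern) != len(tokens):
        return None
    bindings = {}
    for p, t in zip(pattern, tokens):
        if p == "?":
            continue  # anonymous wildcard: any single token is accepted
        if p.startswith("$"):
            # First occurrence records the token; later ones must agree.
            if bindings.setdefault(p, t) != t:
                return None
        elif p != t:
            return None  # literal tokens must match exactly
    return bindings


same_var = ["$a", "^", "3", "+", "$a", "^", "2"]
print(match(same_var, ["x", "^", "3", "+", "x", "^", "2"]))  # binds $a to x
print(match(same_var, ["x", "^", "3", "+", "y", "^", "2"]))  # no match
print(match(["x", "^", "2", "+", "?"], ["x", "^", "2", "+", "y"]))
```

The wildcard case needs no state, which is why it can be approximated with proximity operators. The named-variable case needs the `bindings` dictionary, and a plain text index has no counterpart to it.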
Another aspect unique to the DLMF model is that
it takes advantage of exceptionally regular notation across
the collection and specialized source code macros. In general,
however, TeX and LaTeX with their powerful macro mechanisms
and ability to redefine basic language constructs on the fly,
are notorious for lack of regularity in source code. Consequently,
a natural question to ask is whether the analogous strategy
of piggy-backing on an XML-based XQuery search engine indexing
XML+MathML source would perform better. In particular, MathML
enforces a certain regularity of structure, and admits a fair
degree of normalization. Also, the XQuery language is more powerful
than text-based query languages in its ability to define constraints
and relationships between terms.
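To give a feel for structural queries over MathML, here is a small Python sketch. It uses the standard library's ElementTree as a stand-in, since XQuery itself is not available there, and the sample expression is invented. The second function expresses a constraint between terms ("two powers with the same base") that is awkward to state in a purely textual query.

```python
import xml.etree.ElementTree as ET

# Namespace prefix in the form ElementTree uses for qualified tag names.
MML = "{http://www.w3.org/1998/Math/MathML}"

# Invented presentation MathML for x^3 + x^2.
DOC = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <msup><mi>x</mi><mn>3</mn></msup>
    <mo>+</mo>
    <msup><mi>x</mi><mn>2</mn></msup>
  </mrow>
</math>"""


def powers(root):
    """Yield (base, exponent) text pairs for every simple msup element."""
    for msup in root.iter(MML + "msup"):
        base, exponent = list(msup)[:2]
        yield base.text, exponent.text


def has_repeated_base(root):
    """True if some mrow directly contains two powers with the same base."""
    for mrow in root.iter(MML + "mrow"):
        bases = [list(child)[0].text for child in mrow if child.tag == MML + "msup"]
        if len(bases) != len(set(bases)):
            return True
    return False


root = ET.fromstring(DOC)
print(list(powers(root)))
print(has_repeated_base(root))
```

The sketch relies on the regularity the text mentions: an `msup` element always has a base child followed by a script child, so the query does not need to guess at author-specific conventions.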
In both these models, there is substantial appeal
in leveraging existing technology which incorporates high quality
text search. However, another intriguing math-search model takes
a different approach entirely. This model is exemplified by
the Online Encyclopedia of Integer Sequences and Series, created
by Neil Sloane of AT&T Research. Using a web interface, a
user enters a sequence of integers. If the sequence matches
an entry in the database, a page is returned detailing the
basic properties of the sequence, known relations to other sequences,
and references to instances of the sequence in the literature.
In a talk at the Future of Mathematical Communication II conference
at MSRI, mathematician Rob Corless told of solving a problem
in dynamical systems using results from an article on combinatorics
which he located using the Online Encyclopedia. The point here
is that Corless thought it very unlikely that he would have
found the information using text-based keyword searching; he
wouldn't have known any of the appropriate keywords, since he
was unfamiliar with the specific area of combinatorics, and
unaware of the connection.
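The lookup model can be sketched in Python with a toy database. The three entries below are well-known sequences chosen for illustration, and the matching rule (the query must appear as a consecutive run of terms) is an assumption of this sketch rather than a description of how the actual service works.

```python
# Tiny illustrative database; a real one holds many thousands of entries
# together with properties, cross-references and literature citations.
SEQUENCES = {
    "Fibonacci numbers": [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55],
    "Catalan numbers": [1, 1, 2, 5, 14, 42, 132, 429, 1430],
    "Prime numbers": [2, 3, 5, 7, 11, 13, 17, 19, 23, 29],
}


def contains_run(terms, query):
    """True if query occurs as a consecutive run inside terms."""
    n = len(query)
    return any(terms[i:i + n] == query for i in range(len(terms) - n + 1))


def lookup(query):
    """Return the names of all stored sequences containing the query."""
    return sorted(name for name, terms in SEQUENCES.items()
                  if contains_run(terms, query))


print(lookup([2, 5, 14, 42]))
print(lookup([2, 3, 5]))
```

Note that the searcher supplies no keywords at all. The mathematical object itself is the query, which is what allows connections to be found across fields with different vocabularies.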
The model of a specialized database of mathematical
objects is also suggestive of the special functions web site
created by Wolfram Research. In that case, the database of special
functions contains representations both in the Mathematica language
as well as MathML and LaTeX. Using Mathematica, the entries
in the database can be formally manipulated, and in theory,
could be algorithmically mined for relationships and interconnections.
Other groups are also working on searching within formal systems,
such as the Coq theorem proving environment, as well as in less
structured contexts such as raw MathML.
In this vision of math-aware searching, specialized
databases of mathematical objects become powerful research tools,
as algorithmic searching reveals interconnections between mathematical
objects themselves, as well as the research areas in which instances
of the objects occur. One possible architecture for such databases
would involve automatically extracting non-trivial equations
from documents into databases, and piggy-backing on a metadata
sharing system such as OAI to allow aggregators to harvest equations
into unified collections. Aggregators could then offer specialized
search services, and algorithmically enhance the collection
by searching for inter-relationships using more sophisticated
but still computationally tractable forms of mathematical equivalence.
For example, by analogy with the integer sequences and series,
one might have databases of polynomials, matrices, continued
fractions, and many other classes of mathematical objects with
normal or semi-normal forms, or that can be linearly ordered.
In this context, it may be useful to differentiate between searching
in or for published literature and the building of specialized
services such as online encyclopedias or expert systems that
combine searching and computation.
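The role of normal forms can be illustrated with univariate integer polynomials. In the Python sketch below, the term representation and the source labels are assumptions made for the example. Two differently written polynomials that are equal after collecting like terms receive the same key, so an index keyed on the normal form reveals that they are the same object.

```python
from collections import defaultdict


def normal_form(terms):
    """Canonical key for a polynomial given as (coefficient, exponent) pairs.

    Like terms are combined, zero terms dropped, and the remaining
    (exponent, coefficient) pairs sorted by descending exponent.
    """
    combined = defaultdict(int)
    for coeff, exp in terms:
        combined[exp] += coeff
    return tuple(sorted(((e, c) for e, c in combined.items() if c != 0),
                        reverse=True))


def build_index(equations):
    """Group source labels by the normal form of their polynomial."""
    index = defaultdict(list)
    for source, terms in equations:
        index[normal_form(terms)].append(source)
    return index


# Hypothetical extracted polynomials: x^2 + 2x + 1 written two ways.
EQUATIONS = [
    ("article A, eq. 3", [(1, 2), (2, 1), (1, 0)]),
    ("article B, eq. 7", [(1, 0), (1, 1), (1, 2), (1, 1)]),
    ("article C, eq. 1", [(1, 3), (-1, 0)]),
]

index = build_index(EQUATIONS)
print(index[normal_form([(1, 2), (2, 1), (1, 0)])])
```

This only captures equivalence under reordering and collection of terms. Richer notions of equivalence, such as factored versus expanded forms, would need computer algebra and are beyond this sketch.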
The methods and examples of math searching discussed
so far are strongly biased toward the research literature. However,
the educational community is a far larger group of users. In
educational contexts at the lower levels, a third notion of
math-aware searching returns to the idea of fine-grained, per
equation metadata. A motivating use case might be a student
searching for "a^2 + b^2 = c^2". Even using Google today, this
search returns many thousands of hits. The problem in this case
is picking out the one, for example, that occurs with worked
examples of the Pythagorean theorem appropriate for grades 6-8.
The point is that the student's information need revolves more
closely around the metadata assertions than the actual formula.
A number of research groups, such as the NSDL
MetaTest project, have begun investigating the effectiveness
of search systems based on human and automatically generated
metadata, especially in comparison with full-text search systems.
However, in general metadata devoted to classifying and describing
subject matter is expensive to produce and suffers from lack
of uniformity and objectivity. This suggests that metadata search
systems based on these kinds of data are unlikely to outperform
full-text search systems. By contrast, experience from successful,
education-oriented Web sites such as the Eisenhower National
Clearinghouse and the Math Forum at Drexel indicates that end
users do perceive significant value in augmenting text searching
with metadata-based constraints involving criteria such as educational
function, grade level and the like. Similarly, there has been
some work done indicating that metadata on physical units might
be used to good effect, particularly in engineering and the
sciences. Consequently, it seems clear there is a role for faceted
metadata-based searching in many contexts, provided metadata
can be created and managed in a cost effective way.
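A faceted search of this kind amounts to a text match narrowed by metadata constraints. The Python sketch below uses invented records and facet names (grade levels and educational function). A real system would sit on top of a full-text engine rather than a substring test.

```python
# Invented records; "grade_levels" and "function" are hypothetical facets.
RECORDS = [
    {"id": "r1",
     "text": "Worked examples of the Pythagorean theorem a^2 + b^2 = c^2",
     "grade_levels": {6, 7, 8}, "function": "worked example"},
    {"id": "r2",
     "text": "A proof that a^2 + b^2 = c^2 holds in inner product spaces",
     "grade_levels": {13, 14}, "function": "proof"},
    {"id": "r3",
     "text": "Quiz on a^2 + b^2 = c^2 and right triangles",
     "grade_levels": {8, 9}, "function": "assessment"},
]


def search(records, phrase, grade=None, function=None):
    """Return IDs of records containing phrase and satisfying the facets."""
    hits = []
    for rec in records:
        if phrase not in rec["text"]:
            continue  # text criterion failed
        if grade is not None and grade not in rec["grade_levels"]:
            continue  # wrong grade level
        if function is not None and rec["function"] != function:
            continue  # wrong educational function
        hits.append(rec["id"])
    return hits


print(search(RECORDS, "a^2 + b^2 = c^2"))
print(search(RECORDS, "a^2 + b^2 = c^2", grade=7, function="worked example"))
```

The formula alone matches every record. It is the metadata constraints that single out the one resource suited to the student in the use case above.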
Another part of the appeal of a fine-grained
metadata approach is that it is well-suited to automatic creation
via authoring tools. In languages like LaTeX, a natural approach
would be to use specialized macro packages, where the macros
would contain metadata labels in some fashion. This would facilitate,
for example, tagging an equation as containing a Riemann tensor
by using a \Riemann macro. Alternatively, and perhaps more importantly
in the educational sphere, WYSIWYG authoring tools such as Design
Science's MathType editor could facilitate labeling of significant
equations via a simple user interface such as pull-down menus.
A more sophisticated approach might be to use the local
context around an equation together with a clustering algorithm
to guess appropriate metadata, which a user could then accept,
modify or ignore.
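The macro-based approach can be sketched as a scan that turns semantic macros into per-equation labels. In the Python sketch below, the macro-to-label table is hypothetical (only `\Riemann` is mentioned in the text), and equations are assumed to be written in plain `equation` environments.

```python
import re

# Hypothetical table mapping semantic macro names to metadata labels.
MACRO_LABELS = {
    "Riemann": "Riemann tensor",
    "Christoffel": "Christoffel symbol",
}

EQUATION = re.compile(r"\\begin\{equation\}(.*?)\\end\{equation\}", re.DOTALL)
MACRO = re.compile(r"\\([A-Za-z]+)")


def equation_metadata(latex_source):
    """Return one record per equation listing the labels its macros imply."""
    records = []
    for number, body in enumerate(EQUATION.findall(latex_source), start=1):
        labels = sorted({MACRO_LABELS[name] for name in MACRO.findall(body)
                         if name in MACRO_LABELS})
        records.append({"equation": number, "labels": labels})
    return records


SAMPLE = r"""
\begin{equation}
\Riemann{a}{b}{c}{d} = 0
\end{equation}
\begin{equation}
x^2 + y^2 = 1
\end{equation}
"""

print(equation_metadata(SAMPLE))
```

The author does no extra work beyond choosing the semantic macro, which is the appeal of the approach. The records produced here are the fine-grained, per-equation metadata discussed earlier in this section.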
However, as noted at the beginning of the section,
all of these visions of math-aware searching must be considered
speculative until there is more validation using relatively
large collections and representative user groups. There is a
growing number of research projects aimed at examining or implementing
various of the techniques touched on here, but very few have
undergone any degree of real-world use or usability testing.
Consequently, there was strong consensus that the availability
of test bed document collections and cooperation in conducting
pilot studies are essential if research is to advance quickly
to concrete, deployable solutions.
An Agenda for Progress
In a round-table discussion, workshop participants
summarized their perceptions of the areas of consensus that
had emerged in a twelve-point agenda for progress. These fall
under five headings:
Metadata
- Content providers should expose metadata via OAI.
- Content providers should work towards convergence
of metadata formats.
Information Management
- Content providers should consider XML and
MathML for information management purposes, and more generally
work toward structured data with semantic content.
- Research and development on conversion technologies
should be a priority.
Search / Discovery / Identification
- Content providers should make data available
for test bed purposes either via open collections or by arranging
appropriate consent.
- The community should work to develop use
cases, e.g. exact matching, search by property, searching
formulae, etc.
- Search service providers should expose standardized
interfaces through Web services.
Enriching Content
- Metadata workflow should be managed.
- Tool vendors should work toward generation
of metadata through the authoring process.
- Content publishers should work toward validation
and generation of metadata as part of the publication process.
SMET community outreach
- Researchers and developers should inform the community about the potential
and possibilities of math-aware searching.
- Researchers and developers should better inform themselves of SMET community
requirements at all levels from elementary education through
advanced research.
This document is intended as a first step toward SMET community outreach.
To further that goal, a web site has also been created as part
of the Enhancing the Searching of Mathematics project funded
by the National Science Foundation through its National Science
Digital Library program. For further information about the Agenda
for Progress, what other organizations are doing to implement
it, and ways you can contribute, please visit http://www.dessci.com/searching.