The Mathdex search engine

Friday, December 8, 2006 - 10:45am - 11:15am
EE/CS 3-180
Robert Miner (Design Science, Inc.)
The talk will describe the architecture and implementation of the Mathdex search engine, a math-aware search engine under development by Design Science. Users can
use mathematical expressions as query terms along with the usual text query terms. Math search terms are entered via a graphical equation editor applet. Ranked
results are returned, with the rank for math expressions based on structural similarity.

The Mathdex search engine is implemented as an extension to the popular Apache Lucene search engine. Content is converted to a common XHTML + MathML format for indexing.
MathML terms are normalized and stored as sequences of text-encoded tokens in the Lucene index. Query terms are similarly tokenized, and custom code performs the
search by issuing low-level atomic Lucene queries. Final rankings are computed from the atomic term queries, with weightings based on analysis of the MathML.
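
As a rough illustration of the idea of storing MathML as text-encoded tokens and ranking by structural similarity, the following Python sketch flattens a MathML fragment into a token sequence and scores a query against a document by weighted n-gram overlap. It is not Mathdex's actual encoding or ranking, neither of which is specified here, and it does not use Lucene. The names tokenize, ngrams and score, the token format, and the weights are all hypothetical choices made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{...}mi' -> 'mi'."""
    return tag.rsplit("}", 1)[-1]


def tokenize(mathml):
    """Flatten a MathML fragment into text tokens by depth-first traversal.

    Hypothetical encoding: leaves become 'name:text', containers become
    'name(' ... ')'. A real system would also normalize equivalent forms.
    """
    tokens = []

    def visit(node):
        name = local_name(node.tag)
        if len(node) == 0:
            # Leaf element such as <mi>x</mi> or <mn>2</mn>.
            tokens.append(f"{name}:{(node.text or '').strip()}")
        else:
            tokens.append(f"{name}(")
            for child in node:
                visit(child)
            tokens.append(")")

    visit(ET.fromstring(mathml))
    return tokens


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined into single strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score(query_tokens, doc_tokens, weights=((1, 1.0), (2, 2.0), (3, 4.0))):
    """Weighted n-gram overlap: longer shared runs count for more.

    The weights are assumed values for illustration only.
    """
    total = 0.0
    for n, weight in weights:
        shared = Counter(ngrams(query_tokens, n)) & Counter(ngrams(doc_tokens, n))
        total += weight * sum(shared.values())
    return total


query = tokenize("<math><msup><mi>x</mi><mn>2</mn></msup></math>")
doc_a = tokenize(
    "<math><mrow><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mn>1</mn></mrow></math>"
)
doc_b = tokenize("<math><mrow><mi>y</mi><mo>+</mo><mn>1</mn></mrow></math>")

# doc_a contains the query as a subexpression, so it should outrank doc_b.
ranked = sorted([("doc_a", doc_a), ("doc_b", doc_b)],
                key=lambda item: score(query, item[1]), reverse=True)
```

In this sketch each n-gram plays the role of an atomic term, and the per-length weights stand in for the weightings derived from analysis of the MathML. In a Lucene-based system the tokens would be indexed as terms and the overlap counts would come from term queries against the index.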

We hope eventually to index a large corpus of electronic documents. To begin, we are attempting to convert and index the arXiv, using Hermes and LaTeXML, two
promising LaTeX to XHTML+MathML translators. We are also arranging to index several other collections whose unpublished XML+MathML source is available
by special arrangement. In addition, we are running a customized version of the Apache Nutch web crawler to index online documents containing math in a variety of formats.