Enriching Metadata for XML Journal Articles through Extraction of MathML

Monday, April 26, 2004 - 1:30pm - 2:00pm
Keller 3-180
Tim Cole (University of Illinois at Urbana-Champaign)
Two automated approaches to enriching metadata for XML journal articles are being investigated. In the first approach, we extract all occurrences of MathML contained in the full text of articles in a sample corpus of XML-encoded sci-tech journal literature published by ACM, AIP, and IEEE-CS (the articles include legacy SGML ISO 12083 math fragments previously converted to MathML). We then filter the extracted fragments, normalize those recognized as potentially useful for search and discovery, and add the normalized fragments to the qualified Dublin Core metadata records describing the articles. The second approach adopts the hierarchical browse vocabulary of the Wolfram Functions Website as a controlled vocabulary for descriptive metadata. Function name strings from this vocabulary that occur in a journal article are added to its metadata record, along with their frequency of occurrence. Both approaches have the potential to enhance the discoverability of journal articles and to facilitate linkages between journal literature and reference mathematics literature (e.g., the Wolfram Functions Website).
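To make the two approaches concrete, the following is a minimal Python sketch, not the speaker's actual implementation. It assumes articles carry MathML in the standard W3C MathML namespace. The normalization rule (drop attributes, namespace prefixes, and whitespace), the sample article, the three-entry vocabulary, and the record keys are all hypothetical choices made for illustration; the abstract does not specify the filtering rules or the Dublin Core elements used.

```python
import re
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical stand-in for a controlled vocabulary of function names.
FUNCTION_NAMES = ["BesselJ", "Gamma", "Zeta"]

# Hypothetical sample article; real corpus markup will differ.
SAMPLE = """<article xmlns:m="http://www.w3.org/1998/Math/MathML">
  <title>On the Gamma function</title>
  <p>The Gamma function satisfies
    <m:math><m:mi>x</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:math>
  and Gamma appears again.</p>
</article>"""


def _normalize(elem):
    """Serialize an element without attributes, namespaces, or whitespace.

    This normalization is an assumption made for the sketch.
    """
    tag = elem.tag.split("}")[-1]          # strip the "{namespace}" part
    text = (elem.text or "").strip()
    children = "".join(_normalize(child) for child in elem)
    return f"<{tag}>{text}{children}</{tag}>"


def extract_mathml(article_xml):
    """Approach 1: return every <math> fragment in normalized form."""
    root = ET.fromstring(article_xml)
    return [_normalize(m) for m in root.iter(f"{{{MATHML_NS}}}math")]


def count_function_names(article_xml, vocabulary):
    """Approach 2: count whole-word occurrences of each vocabulary name."""
    root = ET.fromstring(article_xml)
    full_text = " ".join(root.itertext())
    counts = {}
    for name in vocabulary:
        hits = len(re.findall(rf"\b{re.escape(name)}\b", full_text))
        if hits:
            counts[name] = hits
    return counts


def build_record(article_xml, vocabulary):
    """Combine both results into a simple dict (keys are hypothetical)."""
    root = ET.fromstring(article_xml)
    return {
        "title": root.findtext("title"),
        "mathml_fragments": extract_mathml(article_xml),
        "function_name_counts": count_function_names(article_xml, vocabulary),
    }
```

Calling build_record(SAMPLE, FUNCTION_NAMES) gathers the normalized fragments and the name counts for one article. A production pipeline would also need the filtering step that discards fragments with no discovery value (for example, single-symbol expressions), which this sketch omits.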