I’ve been running a series of Apache Solr indexing benchmarks on a Dublin Core [1] encoded XML corpus. Batch indexing was trivial using the Data Import Handler [2]; however, there is currently no delta-import support for XPathEntityProcessor. It is only possible with SqlEntityProcessor. I initially wanted to write my own adapter to convert my XML files to Solr’s ingest format, but it turns out that what I wanted can easily be done with XsltUpdateRequestHandler. Here’s how I did it.
The illustration below is based on an Apache Solr 4.0.0 [3] setup in which I created a separate core, w5 (see the directory structure below). Assuming an appropriate schema is already present, the only thing left to do is to create an XSLT stylesheet and place it in the conf/xslt directory of that core.
apache-solr-4.0.0
├── :
:
├── example
│   ├── :
:
│   ├── example-DIH
│   │   ├── :
│   │   └── solr
:
│   │       └── w5
│   │           ├── conf
│   │           │   └── xslt
│   │           └── data
│   │               └── index
Input File
<?xml version="1.0"?>
<record>
  <header>
    <identifier>oai:union.ndltd.org:GEORGIA/oai:digitalarchive.gsu.edu:chemistry_diss-1010</identifier>
    <datestamp>2012-02-06T03:13:14Z</datestamp>
    <setSpec>GEORGIA</setSpec>
  </header>
  <metadata>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:title>The Extent of Perturbation of Skin Models by Transdermal Penetration Enhancers Investigated by 31P NMR and Fluorescence Spectroscopy</dc:title>
        :
      </oai_dc:dc>
    </metadata>
  </metadata>
  :
</record>
XSLT Stylesheet
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oaidc='http://www.openarchives.org/OAI/2.0/oai_dc/'
    xmlns:oai_dc='http://www.openarchives.org/OAI/2.0/oai_dc/'
    xmlns:dc='http://purl.org/dc/elements/1.1/'
    xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="/">
    <add><xsl:apply-templates select="record" /></add>
  </xsl:template>

  <xsl:template match="record">
    <doc>
      <xsl:apply-templates select="//metadata/oai_dc:dc/dc:title" />
    </doc>
  </xsl:template>

  <!--field column="dc-title" xpath="/record/metadata/dc/title" /-->
  <xsl:template match="dc:title">
    <field name="dc-title"><xsl:value-of select="." /></field>
  </xsl:template>

</xsl:stylesheet>
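To sanity-check the mapping the stylesheet performs (each dc:title becomes a dc-title field inside add/doc), here is a minimal Python sketch of the same selection using only the standard library. It is not part of the Solr setup and is not an XSLT processor; the function name record_to_solr_add and the trimmed SAMPLE record are hypothetical, introduced purely for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the sample record and the stylesheet.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, trimmed-down record with the same nesting as the input file.
SAMPLE = """<record>
  <metadata>
    <metadata>
      <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
                 xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
        <dc:title>Sample title</dc:title>
      </oai_dc:dc>
    </metadata>
  </metadata>
</record>"""


def record_to_solr_add(record_xml: str) -> str:
    """Build a Solr <add> message with one dc-title field per dc:title."""
    record = ET.fromstring(record_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Mirrors the stylesheet's //metadata/oai_dc:dc/dc:title selection.
    for title in record.findall(".//metadata/oai_dc:dc/dc:title", NS):
        field = ET.SubElement(doc, "field", name="dc-title")
        field.text = title.text
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(record_to_solr_add(SAMPLE))
```

Comparing this output against what Solr actually indexes is a quick way to spot a namespace mismatch in the stylesheet, which otherwise tends to fail silently by producing an empty doc.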
Adding documents to the index is pretty much straightforward; just issue an HTTP request with the appropriate parameters (note that the URL must be quoted so that the shell does not interpret the & character):
curl 'http://localhost:8983/solr/w5/update?commit=true&tr=xsltfile.xsl' --data-binary @xmlfile.xml -H 'Content-type: text/xml; charset=utf-8'
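If the request needs to be issued from a script rather than the shell, the same call could be sketched in Python with the standard library. This is only an illustration under the assumptions of this post (host, port, core name w5, and the file names from the curl command); the helper names build_update_url and post_record are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update_url(base_url: str, core: str, stylesheet: str, commit: bool = True) -> str:
    """Assemble the update URL with the commit and tr (stylesheet) parameters."""
    params = {"commit": "true" if commit else "false", "tr": stylesheet}
    return f"{base_url.rstrip('/')}/{core}/update?{urlencode(params)}"


def post_record(url: str, xml_path: str) -> int:
    """POST the raw XML file to Solr and return the HTTP status code."""
    with open(xml_path, "rb") as handle:
        body = handle.read()
    request = Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    url = build_update_url("http://localhost:8983/solr", "w5", "xsltfile.xsl")
    print(url)
    # Uncomment when a Solr instance is actually running at that address:
    # print(post_record(url, "xmlfile.xml"))
```

Building the query string with urlencode sidesteps the quoting problem that the & character causes on the command line.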
Bibliography
[1] http://dublincore.org/documents/dces/
[2] http://wiki.apache.org/solr/DataImportHandler
[3] http://lucene.apache.org/solr/downloads.html