Apache Solr DIH Indexing Benchmarks

I have been benchmarking Apache Solr batch indexing of a linearly increasing workload of XML documents. I am using Solr’s Data Import Handler (DIH) on a Pentium(R) Dual-Core CPU E5200 @ 2.50GHz with 4 GB of RAM. I ran each workload 5 times and picked the fastest successful run for each workload. The smallest workload (100 documents) took 1.11 seconds, while the largest took 18,261 seconds (~5 hours). Interestingly, throughput plummets once the workload goes beyond 256,000 documents; however, there may be other hidden factors that influenced this outcome. I intend to re-run these experiments with randomly generated workloads.

Test Corpus

I harvested Dublin Core encoded XML documents from the NDLTD portal via OAI-PMH [1]. I am indexing all 15 repeatable Dublin Core elements.
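
To make the shape of the corpus concrete, here is a minimal Python 3 sketch of how the 15 Dublin Core elements could be pulled out of one harvested record. It is not the harvesting code from [1]; the function name dc_fields and the sample record are made up for illustration, and only the element names and the Dublin Core namespace come from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dc_fields(record_xml):
    """Map each Dublin Core element name to the list of its values.

    Every element is optional and repeatable, so each value is a list
    that may be empty.
    """
    root = ET.fromstring(record_xml)
    fields = {name: [] for name in DC_ELEMENTS}
    for name in DC_ELEMENTS:
        for node in root.iter("{%s}%s" % (DC_NS, name)):
            if node.text and node.text.strip():
                fields[name].append(node.text.strip())
    return fields


# Hypothetical record, not taken from the NDLTD corpus.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Thesis</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>Indexing</dc:subject>
  <dc:subject>Search</dc:subject>
</oai_dc:dc>"""

fields = dc_fields(SAMPLE)
print(fields["subject"])  # ['Indexing', 'Search']
print(fields["rights"])   # []
```

The empty list for rights shows why record structure varies so much across the corpus: any of the 15 elements can be missing or repeated.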

Approach

I created a separate Apache Solr core for each of the 15 workloads and used the following benchmarking technique.

  • For each of the 15 workloads
    • Loop 5 times
      • Index documents using DIH
      • Get ‘Time taken’ from Solr response when status is ‘idle’
      • Get index size
      • Delete Index
      • Commit
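
The ‘Get index size’ step is not covered by the Python script shown in this post. One way it could be done is to add up the file sizes in the core's index directory; the Python 3 sketch below assumes that approach, and the helper name index_size_bytes is hypothetical.

```python
import os
import tempfile


def index_size_bytes(index_dir):
    """Return the total size in bytes of all files under index_dir."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(index_dir):
        for filename in filenames:
            total += os.path.getsize(os.path.join(dirpath, filename))
    return total


# Demonstration on a throwaway directory standing in for an index.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "segment.bin"), "wb") as handle:
        handle.write(b"x" * 1024)
    print(index_size_bytes(demo_dir))  # 1024
```

In a real run the argument would be the data/index directory of the core being measured; where that directory lives depends on how Solr is installed, so the path has to be supplied by hand.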

I used a series of primitive Python and shell scripts to automate the benchmarking process; in case you are interested, the Python code is below.

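import time
import urllib2
import urlparse
from urllib2 import urlopen
from xml.dom.minidom import parseString


def amfumu():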
    """Entry point for module.
    Spawns different workload models --wN-- based on file limits.

    keyword arguments:
    none

    Usage: python -c "import simplyctperformance; simplyctperformance.amfumu()"

    """
    # (core name, initial wait in seconds before polling the import status)
    workloadstest = (
        ('w1', 5), ('w2', 10), ('w3', 10), ('w4', 15), ('w5', 20),
        ('w6', 30), ('w7', 40), ('w8', 70), ('w9', 180), ('w10', 1440),
        ('w11', 1440), ('w12', 1800), ('w13', 3600), ('w14', 9000),
        ('w15', 18000))
    for workload in workloadstest:
        workloadname = workload[0]
        workloadsleep = workload[1]
        print "START: ", workloadname
        coreurl = "http://localhost:8983/solr/" + workloadname + "/"
        print coreurl
        for manamba in range(1, 6):
            # purge any existing index data
            solrissuequery(coreurl, 'delete')
            solrissuequery(coreurl, 'commit')
            # issue solr import query
            solrissuequery(coreurl, 'import')
            time.sleep(workloadsleep)
            # poll every 20 seconds until the import has finished;
            # solrimportstatus is not part of this listing
            while solrimportstatus(coreurl) is False:
                # print "Status: Busy... Waiting..." + str(workloadsleep) + " seconds"
                time.sleep(20)
            solrstatusmesseges(coreurl)
        print "END: ", workloadname


def solrissuequery(coreurl, query):
    """Sends Solr HTTP request to Solr server.

    keyword arguments:
    coreurl --Solr core base URL, e.g. http://host:port/core/
    query --commit, delete or import

    """
    # charset is a parameter of Content-type, not a separate HTTP header
    headers = {"Content-type": "text/xml; charset=utf-8"}
    if query == 'import':
        querycontext = "dataimport?command=full-import"
        solrquery = urlparse.urljoin(coreurl, querycontext)
        urlopen(solrquery)
    elif query == 'delete':
        querycontext = "update"
        solrquery = urlparse.urljoin(coreurl, querycontext)
        solrrequest = urllib2.Request(solrquery, '<delete><query>*:*</query></delete>', headers)
        solrresponse = urllib2.urlopen(solrrequest)
        solrresponse.read()
    elif query == 'commit':
        querycontext = "update"
        solrquery = urlparse.urljoin(coreurl, querycontext)
        solrrequest = urllib2.Request(solrquery, '<commit/>', headers)
        solrresponse = urllib2.urlopen(solrrequest)
        solrresponse.read()


def solrstatusmesseges(coreurl):
    """Prints out name value pairs of Solr import results.

    keyword arguments:
    coreurl --Solr core base URL, e.g. http://host:port/core/

    """
    querycontext = "dataimport"
    solrquery = urlopen(urlparse.urljoin(coreurl, querycontext))
    solrresponse = solrquery.read()
    solrxml = parseString(solrresponse)
    for solrnode in solrxml.getElementsByTagName('lst'):
        if str(solrnode.getAttribute('name')) == 'statusMessages':
            for solrstatus in solrnode.getElementsByTagName('str'):
                if str(solrstatus.getAttribute('name')) != "":
                    print solrstatus.getAttributeNode('name').nodeValue, ":", solrstatus.firstChild.data, ",",
            print "\n"
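
The listing above calls solrimportstatus, which is not included in it. The following is a minimal sketch of what such a helper might look like, not the original function. It is written for Python 3 (the listing above is Python 2), and it assumes the DIH status response carries a str element named status whose value is idle once the import has finished. The helper import_is_idle and the sample response are made up for illustration.

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from xml.dom.minidom import parseString


def import_is_idle(status_xml):
    """Return True when the DIH status document reports 'idle'."""
    document = parseString(status_xml)
    for node in document.getElementsByTagName("str"):
        if node.getAttribute("name") == "status" and node.firstChild is not None:
            return node.firstChild.data.strip() == "idle"
    # No status element found: treat the import as not finished.
    return False


def solrimportstatus(coreurl):
    """Fetch the DIH status of a core and report whether it is idle."""
    with urlopen(urljoin(coreurl, "dataimport")) as response:
        return import_is_idle(response.read())


# Hypothetical, heavily trimmed status response.
SAMPLE_STATUS = """<response>
  <str name="status">idle</str>
  <lst name="statusMessages">
    <str name="Time taken">0:0:1.110</str>
  </lst>
</response>"""

print(import_is_idle(SAMPLE_STATUS))                          # True
print(import_is_idle(SAMPLE_STATUS.replace("idle", "busy")))  # False
```

Keeping the XML check separate from the HTTP call means the parsing logic can be tried out without a running Solr server.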

Results

[Figure: Apache Solr DIH Index Benchmarks]

The workload size (Top Left) and the index size (Top Right) increase linearly. The time it takes to index the documents (Bottom Right) grows disproportionately after the 256k workload mark, resulting in a drop in throughput (Bottom Left). Incidentally, I ran out of memory on the last two workloads and had to explicitly allow Apache Solr to use up to 2 GB of memory.
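
For reference, throughput here means documents indexed per second, computed from the fastest successful run of each workload. The Python 3 sketch below shows that arithmetic for the smallest workload. Only the 100 documents and the 1.11 seconds come from my measurements; the other run times in the example are made up, and the function names are hypothetical.

```python
def fastest_successful(run_times):
    """Return the smallest elapsed time, ignoring failed runs (None)."""
    successful = [seconds for seconds in run_times if seconds is not None]
    if not successful:
        raise ValueError("no successful runs")
    return min(successful)


def throughput(documents, seconds):
    """Return documents indexed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return documents / seconds


# Five runs of the 100-document workload; None marks a failed run.
# Only the 1.11 s figure is a real measurement, the rest are placeholders.
runs = [1.25, 1.11, None, 1.30, 1.18]
best = fastest_successful(runs)
print(round(throughput(100, best), 2))  # 90.09
```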

Conclusion

My initial thoughts are that the inconsistent structure of the documents (Dublin Core elements are all optional) could have made the documents in the larger batches take slightly longer to process. I might have to randomize my workload samples and see where that takes me.
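
As a rough idea of what randomizing the samples could look like, the Python 3 sketch below draws a reproducible random subset of document files for one workload. This is a plan, not something I have run on the corpus; the function name sample_workload, the file names, and the fixed seed are all assumptions.

```python
import random


def sample_workload(document_paths, size, seed=0):
    """Pick `size` distinct documents at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(document_paths), size)


# Hypothetical file names standing in for harvested records.
corpus = ["record-%05d.xml" % number for number in range(1000)]
workload = sample_workload(corpus, 100)
print(len(workload), len(set(workload)))  # 100 100
```

Using a fixed seed keeps each workload identical across the 5 repeated runs, so run-to-run differences would still reflect Solr rather than the sample.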

Bibliography

[1] http://lightonphiri.org/blog/metadata-harvesting-via-oai-pmh-using-python
[2] http://lucene.apache.org/solr
[3] http://wiki.apache.org/solr/DataImportHandler
[4] http://stackoverflow.com/q/13835334/664424