Metadata Harvesting via OAI-PMH Using Python

I am conducting my last set of experiments, basically a series of performance evaluations, and I will be using metadata harvested via OAI-PMH [5] from the NDLTD portal [1] as my dataset. It fits in perfectly with what needs to be evaluated: 1,985,695 well-structured records. Incidentally, I had a very interesting chat with my supervisor the other day when we discussed my experimental plan, and one of the things that came up was the question of how to figure out whether a particular dataset is a prime candidate for performance evaluations.

I had initially planned to use the pyoai module [2], and spent the long weekend trying to figure out how to work with it, primarily because I wanted to take advantage of its support for the Pairtree specification [3, 4]. When I woke up today and drew up my schedule, though, I decided to write my own script(s) instead. They are basic, and perhaps a little primitive, but here is how I did it.
OAI-PMH Protocol

The OAI-PMH protocol specification is available at [5], and technical implementation details are described at [6]. I am only interested in one of the six verbs, ListRecords, and the logical flow of requests I envisaged making simply involves checking whether a resumptionToken is part of each response, as detailed in the specification [5]. A point to note is that each NDLTD response contains a total of 1,000 records.
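
For illustration, the two request forms involved look roughly like the sketch below. The token value is made up, since actual tokens are opaque strings chosen by the server; per the specification, the follow-up request passes only the resumptionToken (no metadataPrefix).

# a minimal sketch of the two ListRecords request forms (token value below is made up)
initialRequest = 'http://union.ndltd.org/OAI-PMH/?verb=ListRecords&metadataPrefix=oai_dc'

# once a response carries a non-empty <resumptionToken> element, the follow-up request
# repeats the verb and passes only that token; the element may also carry a
# completeListSize attribute reporting the total number of records in the set
followUpRequest = 'http://union.ndltd.org/OAI-PMH/?verb=ListRecords&resumptionToken=some-opaque-token'

# harvesting stops when a response arrives with no (or an empty) resumptionToken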

Python Scripts

XML Response

# required modules (Python 2)
import datetime
import urllib
import urllib2
from xml.dom.minidom import parseString

# function for returning the parsed DOM of an OAI-PMH response, given a request URL
def oaipmh_response(URL):
 response = urllib2.urlopen(URL)
 data = response.read()
 response.close()

 dom = parseString(data)
 return dom

Resumption Token

# function for getting the value of the resumptionToken from an OAI-PMH response
def oaipmh_resumptionToken(URL):
 response = urllib2.urlopen(URL)
 data = response.read()
 response.close()

 dom = parseString(data)
 print "START: "+str(datetime.datetime.now()) # log when this page's harvest started
 tokens = dom.getElementsByTagName('resumptionToken')
 # the final response carries no (or an empty) resumptionToken, so return "" to end the harvest loop
 if not tokens or tokens[0].firstChild is None:
  return ""
 return tokens[0].firstChild.nodeValue

Output Writer

# function for writing a harvested response to an output file
def write_xml_file(inputData, outputFile):
 oaipmhResponse = open(outputFile, mode="w")
 oaipmhResponse.write(inputData.encode("utf-8")) # encode so non-ASCII characters in records do not break the write
 oaipmhResponse.close()
 print "END: "+str(datetime.datetime.now()) # log when this page's harvest finished

Main Code

# main code
baseURL = 'http://union.ndltd.org/OAI-PMH/'
getRecordsURL = str(baseURL+'?verb=ListRecords&metadataPrefix=oai_dc')

# initial parse phase
resumptionToken = oaipmh_resumptionToken(getRecordsURL) # get initial resumptionToken
print "Resumption Token: "+resumptionToken
outputFile = 'page-0.xml' # define initial file to use for writing response
write_xml_file(oaipmh_response(getRecordsURL).toxml(), outputFile)

# loop parse phase
pageCounter = 1
while resumptionToken != "":
 print "Resumption Token: "+resumptionToken
 resumptionToken = urllib.urlencode({'resumptionToken':resumptionToken}) # create resumptionToken URL parameter
 print "URL ENCODED TOKEN: "+resumptionToken
 getRecordsURLLoop = str(baseURL+'?verb=ListRecords&'+resumptionToken)
 oaipmhXML = oaipmh_response(getRecordsURLLoop).toxml()
 outputFile = 'page-'+str(pageCounter)+'.xml' # create file name to use for writing response
 write_xml_file(oaipmhXML, outputFile) # write response to output file

 resumptionToken = oaipmh_resumptionToken(getRecordsURLLoop)
 pageCounter += 1 # increment page counter

Harvesting Observations

I started harvesting the records at 14:49 GMT+2, and at the time of this writing, 16:58 GMT+2, I had harvested a total of 924,000 records, a mere 46.53% of the total 1,985,695 records. Incidentally, I did some extrapolations, and at this rate I estimate the processing will take 4.73 hours in total, so I should be done around 19:00 GMT+2.

Total Records:       1,985,695
Records per File:    1,000
Total # of Files:    1,986
Estimated Size (GB): 7.74
ETA (Hours):         4.7
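
The extrapolation itself is just a simple proportion; below is a rough sketch of the arithmetic, assuming a constant harvest rate (the elapsed time used here is approximate, so the result only roughly matches the 4.7-hour figure above).

# rough extrapolation of total harvest time, assuming a constant rate
harvested = 924000             # records harvested so far
total = 1985695                # total records reported by NDLTD
elapsedHours = 2 + 9/60.0      # 14:49 to 16:58 is roughly 2 hours 9 minutes

fractionDone = harvested/float(total)            # ~0.4653
estimatedTotalHours = elapsedHours/fractionDone  # roughly 4.6 hours end to end
print "Estimated total hours: "+str(estimatedTotalHours)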

[1] http://www.ndltd.org
[2] http://pypi.python.org/pypi/pyoai/2.4.4
[3] https://confluence.ucop.edu/display/Curation/PairTree
[4] http://pypi.python.org/pypi/Pairtree
[5] http://www.openarchives.org
[6] http://www.oaforum.org/tutorial/english/page3.htm
