I am conducting my last set of experiments, basically a series of performance evaluations, and I will be using metadata, harvested via OAI-PMH [5], from the NDLTD portal [1] as my dataset. It fits in perfectly with what needs to be evaluated: 1,985,695 well-structured records. Incidentally, I had a very interesting chat with my supervisor the other day when we discussed my experimental plan, and one of the things that came up was the question of figuring out whether or not a particular dataset is a prime candidate for performance evaluations.
I had initially planned to use the pyoai module [2], and I spent the long weekend trying to figure out how to play around with it, primarily because I wanted to take advantage of its support for the Pairtree specification [3,4]. But when I woke up today and was drawing up my schedule, I decided I was going to write my own script(s). They are basic, and perhaps a little primitive, I assure you; but here's how I did it.
OAI-PMH Protocol
The OAI-PMH protocol specification is available here [5], and technical implementation details are here [6]. I am basically only interested in one (1) of the six (6) verbs, the ListRecords verb, and the logical flow of requests I envisaged making merely involved checking whether a resumptionToken element was part of each response, as detailed in the specification [5]. A point to note is that each NDLTD response contains a batch of 1,000 records.
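To make that flow concrete, here is a minimal sketch; the response fragment, token value, and attribute values are illustrative inventions rather than actual NDLTD output, but the element name and the shape of the follow-up URL follow the specification [5]:

from xml.dom.minidom import parseString

# illustrative ListRecords response fragment; completeListSize, cursor and the
# token value are made up for demonstration
sampleResponse = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken completeListSize="1985695" cursor="0">sample-token-0001</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

dom = parseString(sampleResponse)
token = dom.getElementsByTagName('resumptionToken')[0].firstChild.nodeValue
# the next pull simply appends the token (and drops metadataPrefix)
print 'http://union.ndltd.org/OAI-PMH/?verb=ListRecords&resumptionToken='+token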
Python Scripts
XML Response
import datetime
import urllib
import urllib2
from xml.dom.minidom import parseString

# function for returning dom response after parsing oai-pmh URL
def oaipmh_response(URL):
    file = urllib2.urlopen(URL)
    data = file.read()
    file.close()
    dom = parseString(data)
    return dom
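As a quick sanity check before kicking off a long harvest, the same function can be pointed at the much lighter Identify verb; a minimal sketch, assuming the endpoint is reachable:

# Identify returns a single small response; handy for confirming the endpoint is up
dom = oaipmh_response('http://union.ndltd.org/OAI-PMH/?verb=Identify')
print dom.getElementsByTagName('repositoryName')[0].firstChild.nodeValue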
Resumption Token
# function for getting the value of the resumptionToken after parsing oai-pmh URL
def oaipmh_resumptionToken(URL):
    file = urllib2.urlopen(URL)
    data = file.read()
    file.close()
    dom = parseString(data)
    print "START: "+str(datetime.datetime.now())
    tokenList = dom.getElementsByTagName('resumptionToken')
    # the final response carries no (or an empty) resumptionToken, which ends the harvest loop
    if not tokenList or tokenList[0].firstChild is None:
        return ""
    return tokenList[0].firstChild.nodeValue
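A quick way to eyeball the first token, assuming the function above is in scope; an empty string here would mean the whole result set fits in a single response:

token = oaipmh_resumptionToken('http://union.ndltd.org/OAI-PMH/?verb=ListRecords&metadataPrefix=oai_dc')
print "first token: "+token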
Output Writer
# function for writing to output files
def write_xml_file(inputData, outputFile):
    oaipmhResponse = open(outputFile, mode="wb") # binary mode, since inputData is a UTF-8 bytestring
    oaipmhResponse.write(inputData)
    oaipmhResponse.close()
    print "END: "+str(datetime.datetime.now())
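A tiny usage sketch tying the two helpers together for a single page (the filename is just an example); note that toxml('utf-8') returns an encoded bytestring, which is why the writer opens the file in binary mode:

page = oaipmh_response('http://union.ndltd.org/OAI-PMH/?verb=ListRecords&metadataPrefix=oai_dc')
write_xml_file(page.toxml('utf-8'), 'sample-page.xml')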
Main Code
# main code
baseURL = 'http://union.ndltd.org/OAI-PMH/'
getRecordsURL = str(baseURL+'?verb=ListRecords&metadataPrefix=oai_dc')

# initial parse phase
resumptionToken = oaipmh_resumptionToken(getRecordsURL) # get initial resumptionToken
print "Resumption Token: "+resumptionToken
outputFile = 'page-0.xml' # define initial file to use for writing response
write_xml_file(oaipmh_response(getRecordsURL).toxml('utf-8'), outputFile)

# loop parse phase
pageCounter = 1
while resumptionToken != "":
    print "URL ENCODED TOKEN: "+resumptionToken
    tokenParameter = urllib.urlencode({'resumptionToken': resumptionToken}) # create resumptionToken URL parameter
    getRecordsURLLoop = str(baseURL+'?verb=ListRecords&'+tokenParameter)
    oaipmhXML = oaipmh_response(getRecordsURLLoop).toxml('utf-8')
    outputFile = 'page-'+str(pageCounter)+'.xml' # create file name to use for writing response
    write_xml_file(oaipmhXML, outputFile) # write response to output file
    resumptionToken = oaipmh_resumptionToken(getRecordsURLLoop) # get next resumptionToken
    pageCounter += 1 # increment page counter
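To keep an eye on progress while the harvest runs, here is a hedged sketch (it assumes the page-N.xml naming above) that counts record elements across the files written so far:

import glob
from xml.dom.minidom import parse

total = 0
for fileName in glob.glob('page-*.xml'):
    dom = parse(fileName)
    total += len(dom.getElementsByTagName('record'))
print "records harvested so far: "+str(total)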
Harvesting Observations
I started harvesting the records at 14:49 GMT+2, and at the time of writing (16:58 GMT+2) I had harvested a total of 924,000 records, a mere 46.53% of the total 1,985,695 records. Incidentally, I did some extrapolations, and at this rate I estimate the processing will take 4.73 hours, so I should be done around 19:00 GMT+2.
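For what it's worth, the extrapolation is just a linear projection; a sketch of the arithmetic, with the snapshot figures above hard-coded (small differences from my quoted 4.73 hours come down to rounding of the snapshot times):

elapsedHours = 2.0 + 9.0/60.0 # 14:49 to 16:58
harvested = 924000
totalRecords = 1985695
ratePerHour = harvested / elapsedHours
print "rate: %.0f records/hour" % ratePerHour
print "projected total time: %.2f hours" % (totalRecords / ratePerHour)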
Total Records       | 1,985,695
Records per File    | 1,000
Total # of Files    | 1,986
Estimated Size (GB) | 7.74
ETA (Hours)         | 4.7
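The estimated size can be sanity-checked against the files already on disk; a minimal sketch, again assuming the page-N.xml naming:

import glob, os

totalBytes = sum(os.path.getsize(f) for f in glob.glob('page-*.xml'))
print "harvest size so far: %.2f GB" % (totalBytes / 1024.0**3)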
[1] http://www.ndltd.org
[2] http://pypi.python.org/pypi/pyoai/2.4.4
[3] https://confluence.ucop.edu/display/Curation/PairTree
[4] http://pypi.python.org/pypi/Pairtree
[5] http://www.openarchives.org
[6] http://www.oaforum.org/tutorial/english/page3.htm