Metadata Harvesting via OAI-PMH Using Python

I am conducting my last set of experiments, basically a series of performance evaluations, and I will be using metadata harvested via OAI-PMH [5] from the NDLTD portal [1] as my dataset. It fits in perfectly with what needs to be evaluated: 1,985,695 well-structured records. Incidentally, I had a very interesting chat with my supervisor the other day when we discussed my experimental plan, and one of the things that came up was the question of figuring out whether or not a particular dataset is a prime candidate for performance evaluations.

I had initially planned to use the pyoai module [2], and spent the long weekend trying to figure out how to play around with it, primarily because I wanted to take advantage of its support for the Pairtree specification [3,4]. So I woke up today and, as I was drawing up today's schedule, I decided I was going to write my own script(s) instead. They are basic, and perhaps a little primitive, I assure you; but here's how I did it.

OAI-PMH Protocol

The OAI-PMH protocol specification is available here [5], and technical implementation details are here [6]. I am only interested in one (1) of the six (6) verbs, the ListRecords verb, and the logical flow of pull requests that I envisaged merely involves checking whether a resumptionToken is part of each response and, if so, using it to request the next batch of records, as detailed in the specification [5]. A point to note is that each NDLTD response contains a total of 1,000 records.
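The flow described above can be sketched offline as follows. This is only an illustration, written for Python 3 (the scripts in this post target Python 2); the trimmed sample response, the `example.org` base URL and the two helper names are assumptions made for the example and are not part of the NDLTD setup.

```python
from urllib.parse import urlencode
from xml.dom.minidom import parseString

# Hypothetical, heavily trimmed ListRecords response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <resumptionToken>token-page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def extract_resumption_token(xml_text):
    """Return the resumptionToken text, or "" when it is absent or empty."""
    dom = parseString(xml_text)
    tokens = dom.getElementsByTagName("resumptionToken")
    if not tokens or tokens[0].firstChild is None:
        return ""  # last page: no further request is needed
    return tokens[0].firstChild.nodeValue.strip()


def next_request_url(base_url, token):
    """Build the follow-up ListRecords request for a given token."""
    # resumptionToken is an exclusive argument: only the verb accompanies it.
    return base_url + "?" + urlencode({"verb": "ListRecords", "resumptionToken": token})


token = extract_resumption_token(SAMPLE_RESPONSE)
print(next_request_url("http://example.org/oai", token))
```

Returning an empty string for a missing or empty token keeps the stopping condition to a single comparison in the calling loop.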

Python Scripts

XML Response

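import datetime
import urllib
import urllib2
from xml.dom.minidom import parseString

# function for returning the DOM response after parsing an OAI-PMH URL
def oaipmh_response(URL):
 file = urllib2.urlopen(URL)
 data = file.read()
 file.close()

 dom = parseString(data)
 return dom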

Resumption Token

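# function for getting the value of resumptionToken after parsing an OAI-PMH URL
def oaipmh_resumptionToken(URL):
 file = urllib2.urlopen(URL)
 data = file.read()
 file.close()

 dom = parseString(data)
 print "START: "+str(datetime.datetime.now()) # argument truncated in the original; a timestamp is assumed
 tokens = dom.getElementsByTagName('resumptionToken')
 # the final response carries an empty (or no) resumptionToken element
 if len(tokens) == 0 or tokens[0].firstChild is None:
  return ""
 return tokens[0].firstChild.nodeValue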
Output Writer

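# function for writing a response to an output file
def write_xml_file(inputData, outputFile):
 oaipmhResponse = open(outputFile, mode="w")
 oaipmhResponse.write(inputData.encode('utf-8')) # toxml() returns unicode
 oaipmhResponse.close()
 print "END: "+str(datetime.datetime.now()) # argument truncated in the original; a timestamp is assumed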

Main Code

# main code
baseURL = ''
getRecordsURL = str(baseURL+'?verb=ListRecords&metadataPrefix=oai_dc')

# initial parse phase
resumptionToken = oaipmh_resumptionToken(getRecordsURL) # get initial resumptionToken
print "Resumption Token: "+resumptionToken
outputFile = 'page-0.xml' # define initial file to use for writing response
write_xml_file(oaipmh_response(getRecordsURL).toxml(), outputFile)

# loop parse phase
pageCounter = 1
while resumptionToken != "":
 print "Resumption Token: "+resumptionToken
 resumptionToken = urllib.urlencode({'resumptionToken':resumptionToken}) # create resumptionToken URL parameter
 print "URL ENCODED TOKEN: "+resumptionToken
 getRecordsURLLoop = str(baseURL+'?verb=ListRecords&'+resumptionToken)
 oaipmhXML = oaipmh_response(getRecordsURLLoop).toxml()
 outputFile = 'page-'+str(pageCounter)+'.xml' # create file name to use for writing response
 write_xml_file(oaipmhXML, outputFile) # write response to output file

 resumptionToken = oaipmh_resumptionToken(getRecordsURLLoop)
 pageCounter += 1 # increment page counter
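
One thing worth noting about the loop above is that every page is requested twice: once by oaipmh_response and once by oaipmh_resumptionToken. The sketch below is one possible variant, written for Python 3, that parses each response only once. The names `harvest`, `fetch` and `FAKE_PAGES` are hypothetical and introduced purely for illustration; `fetch` stands in for the real HTTP call so the loop can be tried without a network connection.

```python
from urllib.parse import urlencode
from xml.dom.minidom import parseString


def harvest(base_url, fetch, metadata_prefix="oai_dc"):
    """Yield (page_number, xml_text) pairs, requesting each page only once.

    `fetch` is any callable that takes a URL and returns the response body
    as text; it is injected so the loop can be exercised offline.
    """
    url = base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    page = 0
    while url:
        xml_text = fetch(url)
        yield page, xml_text
        tokens = parseString(xml_text).getElementsByTagName("resumptionToken")
        has_token = bool(tokens) and tokens[0].firstChild is not None
        if has_token:
            token = tokens[0].firstChild.nodeValue
            url = base_url + "?" + urlencode({"verb": "ListRecords", "resumptionToken": token})
        else:
            url = None  # an absent or empty token marks the final page
        page += 1


# Two fake pages standing in for a real repository (illustrative only).
FAKE_PAGES = {
    "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc":
        "<r><record/><resumptionToken>abc</resumptionToken></r>",
    "http://example.org/oai?verb=ListRecords&resumptionToken=abc":
        "<r><record/><resumptionToken/></r>",
}

pages = list(harvest("http://example.org/oai", FAKE_PAGES.__getitem__))
print(len(pages))
```

Halving the number of requests matters at this scale, since close to 2,000 pages of 1,000 records each are needed to cover the full NDLTD dataset.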

Harvesting Observations

I started harvesting the records at 14:49 GMT+2, and at the time of this writing, 16:58 GMT+2, I had harvested a total of 924,000 records; a mere 46.53% of the total 1,985,695 records. Incidentally, I did some extrapolations and, at this rate, I estimate the whole harvest will take about 4.73 hours, so I should be done around 19:30 GMT+2.
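
The extrapolation can be reproduced with a few lines of Python 3. This is a sketch that assumes a constant harvesting rate; the calendar date is a placeholder, since only the times of day are given. With the exact elapsed time of 129 minutes the estimate comes to roughly 4.62 hours, while rounding the elapsed time up to 2.2 hours reproduces the 4.73 hour figure quoted above.

```python
from datetime import datetime, timedelta

TOTAL_RECORDS = 1985695
harvested = 924000
start = datetime(2000, 1, 1, 14, 49)  # placeholder date; only the time matters
now = datetime(2000, 1, 1, 16, 58)

fraction_done = harvested / TOTAL_RECORDS
elapsed_hours = (now - start).total_seconds() / 3600
total_hours = elapsed_hours / fraction_done  # assumes a constant rate
finish = start + timedelta(hours=total_hours)

print(f"{fraction_done:.2%} done after {elapsed_hours:.2f} h")
print(f"estimated total: {total_hours:.2f} h, finishing at {finish:%H:%M}")
print(f"with elapsed time rounded to 2.2 h: {2.2 / fraction_done:.2f} h")
```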

The UNZA CSA dataset is a carefully labeled dataset of 273 student assessment score results for a Computer Systems and Architecture course offered at The University of Zambia. This dataset was created by Lighton Phiri and is, in part, based on work done as part of an undergraduate capstone research project [1].
Dataset [CSV]
Jupyter Notebook [IPYNB]
—Ethical Consideration—
• The original UNZA-centric student identifiers were replaced by MD5 hashes.
• The student names were replaced with randomly assigned Zambian names, extracted from commenters on The ZambianWatchDog Facebook page. As such, the randomly assigned names might not correspond to original student genders.
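
As an illustration of the first step, replacing an identifier with its MD5 hash might look like the Python sketch below. The identifier format and the `mask_student_id` helper are assumptions made for the example; the exact procedure used to build the dataset (for instance, whether a salt was added) is not described here.

```python
import hashlib


def mask_student_id(student_id):
    """Return the MD5 hex digest of an identifier as a 32-character string."""
    return hashlib.md5(str(student_id).encode("utf-8")).hexdigest()


masked = mask_student_id("2018000001")  # hypothetical identifier, not a real one
print(masked)
```

The same input always yields the same digest, so records belonging to one student remain linkable across sources after masking.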
—Dataset Fields—
The dataset comprises 69 fields, associated with the following aspects: Demographics, Background Information, Course Workload, Moodle Interaction Logs and Assessment Scores.

Demographics
The student demographics were extracted from the Student Information System.

• StudentID---MD5 hash representing the unique identifier for each observation
• StudentName---Masked student full names
• AcademicYear---The cohort specific to each observation. There are a total of four cohorts, associated with the enrollment years 2018 (201701), 2019 (201801), 2020 (201901) and 2021 (202001)
• DateOfBirth---Student date of birth
• Gender---Student gender
• YearOfStudy---Year of study. Typically 1st year; however, there are instances where an observation might be associated with 2nd year
• School---The school/faculty within which the student is registered
• Program---The programme the student is pursuing
• MajorDescription---The description of the major programme the student is pursuing
• MinorDescription---The description of the minor programme the student is pursuing
• Status---The registration status of the student
• Sponsor---The entity funding the student's education
• Nationality---The nationality of the student
• Comment---The status of the student at the end of the year
• CampusAccommodation---Flag indicating if the student has campus accommodation
• Category---Category of the student
• Mode---Mode of study of the student

Background Information
The background information was elicited from a preliminary survey administered to students at the beginning of the course.

• SurveyHomeTownSuburb---Student hometown
• SurveyProgramMinor---Student minor programme
• SurveyMinorMotivation---Motivation for choosing the minor programme
• SurveyMajorMotivation---Motivation for choosing the major programme
• SurveyStudyComputersHighschool---Status indicating if the student formally took a computing subject in high school
• SurveyPriorComputerTraining---Status indicating if the student has undergone any formal computing training
• SurveyPriorComputerTrainingDetails---The details of the formal computing training undertaken by the student
• SurveyExperienceUsingComputers---Period within which the student has had experience working/using computers
• SurveyOwnComputer---Status indicating if the student owns a computer
• SurveyAboutYou---Random comment made by student

Course Workload
The course workloads were extracted from the Student Information System, with the MinorProgram, MinorClassification and CourseWorkload fields derived.

• Courses---Total number of courses the student is enrolled in
• MinorProgram---Student minor programme
• MinorClassification---Classification of the student minor programme
• CourseWorkload---Computed course workload score

Moodle Interaction Logs
The Moodle interaction logs were extracted from Moodle logs.

• MoodleHits---Total number of daily unique Moodle hits during the academic year
• MoodleHitsWeight---Computed daily unique Moodle hits weight
• MoodleLogComponentAssignment---Total number of hits to the Moodle Assignment component
• MoodleLogComponentChoice---Total number of hits to the Moodle Choice component
• MoodleLogComponentFile---Total number of hits to the Moodle File component
• MoodleLogComponentFolder---Total number of hits to the Moodle Folder component
• MoodleLogComponentForum---Total number of hits to the Moodle Forum component
• MoodleLogComponentOverviewReport---Total number of hits to the Moodle Overview Report component
• MoodleLogComponentSystem---Total number of hits to the Moodle System component
• MoodleLogComponentURL---Total number of hits to the Moodle URL component
• MoodleLogComponentUserReport---Total number of hits to the Moodle User Report component
• MoodleLogComponentUserTours---Total number of hits to the Moodle User Tours component

Assessment Scores
Assessment scores were extracted from spreadsheets compiled by the course instructors for the course.

ICT 1110 assessments are clustered into three main components: quizzes, tests and the final examination. The assessment scores have all been scaled such that scores are between 0 and 100.

• Quiz1---Quiz on History of Computing
• Quiz2---Quiz on Classification of Computers
• Quiz3---Quiz on Abstraction in Computing
• Quiz4---Quiz on History of Computing
• Quiz5---Quiz on Computer Software
• Quiz6---Quiz on Von Neumann Model
• Quiz7---Quiz on Central Processing Unit
• Quiz8---Quiz on Peripherals
• Quiz9---Quiz on I/O Subsystem
• Quiz10---Quiz on Computer Primary Memory
• Quiz11---Quiz on File Organisation and Filesystems
• Quiz12---Quiz on Computer Secondary Storage
• Quiz13---Quiz on Number Systems and Representation
• Quiz14---Quiz on Number Systems and Representation
• Quiz15---Quiz on MIPS Instruction Set Architecture
• Quiz16---Quiz on MIPS Instruction Set Architecture
• Quiz17---Quiz on MIPS Instruction Set Architecture
• Quiz18---Quiz on MIPS Datapath and Control
• Quiz19---Quiz on MIPS Datapath and Control
• Quiz20---Quiz on Digital Logic Structures
• Test1---First test in term 1
• Test2---Second test in term 1
• Test3---First test in term 2
• Test4---Second test in term 2
• MakeUpTest---Make up assessment administered for various reasons
• FinalExamination---Final examination
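
The following Python sketch illustrates the scaling and the clustering of fields into the three components. It assumes a simple linear rescaling of raw marks (the method actually used is not stated), and both the sample observation and the `scale_score` helper are hypothetical.

```python
from statistics import mean


def scale_score(raw, max_marks):
    """Rescale a raw mark to the 0-100 range (linear scaling is assumed)."""
    return 100.0 * raw / max_marks


# Hypothetical observation with only a few of the 26 assessment fields.
row = {"Quiz1": 80.0, "Quiz2": 60.0, "Test1": 55.0, "Test2": 65.0, "FinalExamination": 70.0}

# Group fields by name prefix; MakeUpTest does not start with "Test",
# so it would need to be handled separately.
components = {
    "quizzes": [v for k, v in row.items() if k.startswith("Quiz")],
    "tests": [v for k, v in row.items() if k.startswith("Test")],
    "final": [row["FinalExamination"]],
}
summary = {name: mean(values) for name, values in components.items()}

print(scale_score(12, 20))
print(summary)
```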
—Exploratory Data Analysis—

[1] Chaibela, M., Chisha, I., Pungwa, D., Siabbaba, D., & Simukoko, B. (2021). Performance Predictor: A Data Mining and Machine Learning Software for Student Performance Outcomes. The University of Zambia. URL: