XPath Expressions Using XMLlint Navigating Shell

I work with a lot of XML documents and, until recently, used XML Copy Editor both as my XML editor and for executing XPath expressions. I recently started working on a project that, in part, involves converting metadata stored in a legacy MS Access database to XML documents. I wanted an easier way of verifying the XML metadata files, and XPath was the natural tool for the job. I had previously used XMLlint for validating XML documents and, at times, for formatting them (using the --format switch), and was rather intrigued to learn (months and months after first using it) that it has a powerful navigating shell that lets you run XPath expressions.

Here’s how I normally play around with it; for the record, the examples all make use of a portion of metadata records harvested from the ETD Union Catalog [1], as described below.

Profile of Dataset Used in Examples

  • Harvested via OAI-PMH from ETD Union Catalog [1]
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
 <responseDate>2012-09-03T03:49:29Z</responseDate>
 <request verb="ListRecords" metadataPrefix="oai_dc">
 http://union.ndltd.org:8080/union.OAI-PMH/ </request>
 <ListRecords>
 <record>
:
:
  • Only 1000 records were harvested (see the resumption token at the end of the file)
  • XML files can be downloaded below (dataset1.xml has no namespaces in its root node, while dataset2.xml does)

Ingesting/Parsing XML Documents

phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell union-catalog.xml
/ > base
union-catalog.xml
/ >

Counting Nodes

phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell 20120915-xmllint-post-union-catalog-01.xml
/ > xpath count(//OAI-PMH)
Object is a number : 1
/ >
/ > xpath count(//record)
Object is a number : 1000
/ >
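
For anyone who wants to script the same check, here is a rough Python sketch of the node count using only the standard library. The inline SAMPLE document and its oai:example identifiers are made up for illustration (three records instead of 1000); they are not part of the harvested dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for the harvested file (no namespaces).
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <record><header><identifier>oai:example:3</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of: xpath count(//record)
record_count = len(root.findall(".//record"))

# Rough equivalent of: xpath count(//OAI-PMH)
# iter() includes the root element itself, which findall(".//...") would not.
root_count = sum(1 for _ in root.iter("OAI-PMH"))

print(record_count, root_count)
```

ElementTree only understands a small subset of XPath, so functions such as count() are replaced here by plain Python.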

Navigating the XML Tree

  • Navigate to the root element
/ > cd OAI-PMH
OAI-PMH > pwd
/OAI-PMH
OAI-PMH >
  • Navigate to the first record node (union catalog records are wrapped in record elements); the relative path fails because the current node is already OAI-PMH, so use an absolute path
OAI-PMH > cd OAI-PMH/ListRecords/record[1]
OAI-PMH/ListRecords/record[1] is an empty Node Set
OAI-PMH > cd /OAI-PMH/ListRecords/record[1]
record > pwd
/OAI-PMH/ListRecords/record[1]
record > cat header/identifier
 -------
<identifier>oai:union.ndltd.org:ADTP/100073</identifier>
record >
  • Navigate to the last record node
record > cd /OAI-PMH/ListRecords/record[last()]
record > pwd
/OAI-PMH/ListRecords/record[1000]
record >
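
The same navigation can be approximated in Python. This is only a sketch over a made-up three-record SAMPLE document (the oai:example identifiers are hypothetical). It relies on the positional predicates [1] and [last()] that ElementTree's limited XPath support provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for the harvested file (no namespaces).
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <record><header><identifier>oai:example:3</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

# fromstring() returns the root element, so we start "inside" OAI-PMH,
# much like the shell after "cd OAI-PMH".
root = ET.fromstring(SAMPLE)

# Repeating the root name in a relative path finds nothing, just as
# "cd OAI-PMH/ListRecords/record[1]" did from within OAI-PMH.
wrong = root.find("OAI-PMH/ListRecords/record[1]")

# Paths relative to the root element work as expected.
first = root.find("ListRecords/record[1]")
last = root.find("ListRecords/record[last()]")

first_id = first.findtext("header/identifier")
last_id = last.findtext("header/identifier")
print(wrong, first_id, last_id)
```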

Displaying Information

  • View the second-to-last record node
record > cat /OAI-PMH/ListRecords/record[last()-1]
 -------
<record>
 <header>
 <identifier>oai:union.ndltd.org:ADTP/173330</identifier>
 <datestamp>2011-09-07T02:15:34Z</datestamp>
 <setSpec>ADTP</setSpec>
 </header>
 <metadata>
 <dc>
 <title>Tool support for social risk mitigation in agile projects</title>
 <creator>Licorish, Sherlock Anthony</creator>
 <subject>Software engineering</subject>
 <subject>Agile methodologies</subject>
 <subject>Project management</subject>
 <subject>Risk management</subject>
 <subject>Software tools</subject>
:
:
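
A rough Python counterpart of cat, again on a made-up three-record SAMPLE document (the identifiers are hypothetical): select the second-to-last record with the [last()-1] predicate and serialize it back to text.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for the harvested file (no namespaces).
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
    <record><header><identifier>oai:example:3</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of: cat /OAI-PMH/ListRecords/record[last()-1]
second_last = root.find("ListRecords/record[last()-1]")

# encoding="unicode" makes tostring() return a str rather than bytes.
xml_text = ET.tostring(second_last, encoding="unicode")
print(xml_text)
```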

Faceted Browsing

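  • Find out how many records have a subject related to Software (in XPath 1.0, contains() converts a node set to the string value of its first node, so this expression tests only the first subject of each record; a predicate such as subject[contains(., "Software")] tests every subject)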
/ > xpath count(/OAI-PMH/ListRecords/record[contains(./metadata/dc/subject, "Software")])
Object is a number : 1
/ > cd /OAI-PMH/ListRecords/record[contains(./metadata/dc/subject, "Software")]
record > pwd
/OAI-PMH/ListRecords/record[999]
record >
  • Go to the first of the record nodes with a subject related to Engineering
record > cd /OAI-PMH/ListRecords/record[contains(./metadata/dc/subject, "Engineering")][1]
record > pwd
/OAI-PMH/ListRecords/record[313]
record > cat .
 -------
<record>
:
 <title>Reliability analysis of degrading uncertain structures with applications to fatigue and fracture under random loading</title>
 <creator>Beck, André T.</creator>
 <subject>Reliability (Engineering)</subject>
 <subject>Metals Fatigue</subject>
 <subject>Structural stability Mathematical models</subject>
 <subject>Fracture mechanics</subject>
 <description>School of Engineering Includes bibliographical references (leaves 248-256)</description>
:
</record>
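
One thing worth keeping in mind with these facet queries: in XPath 1.0, contains(./metadata/dc/subject, "Software") looks only at the first subject of each record, because a node set passed to a string function is reduced to the string value of its first node. The Python sketch below mimics both behaviours by hand, since ElementTree has no contains() function. The SAMPLE data and both helper functions are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "Software" is the first subject in record 1 only,
# and a later subject in record 2.
SAMPLE = """<OAI-PMH><ListRecords>
  <record><metadata><dc>
    <subject>Software engineering</subject>
    <subject>Risk management</subject>
  </dc></metadata></record>
  <record><metadata><dc>
    <subject>Project management</subject>
    <subject>Software tools</subject>
  </dc></metadata></record>
  <record><metadata><dc>
    <subject>Fracture mechanics</subject>
  </dc></metadata></record>
</ListRecords></OAI-PMH>"""

root = ET.fromstring(SAMPLE)
records = root.findall("ListRecords/record")


def first_subject_matches(record, word):
    """Mimic record[contains(./metadata/dc/subject, word)]: first subject only."""
    first = record.find("metadata/dc/subject")
    return first is not None and word in "".join(first.itertext())


def any_subject_matches(record, word):
    """Mimic record[metadata/dc/subject[contains(., word)]]: every subject."""
    subjects = record.findall("metadata/dc/subject")
    return any(word in "".join(s.itertext()) for s in subjects)


first_only = sum(first_subject_matches(r, "Software") for r in records)
any_subject = sum(any_subject_matches(r, "Software") for r in records)
print(first_only, any_subject)
```

On this sample the two counts differ, which is the kind of gap to watch for when a facet count looks lower than a grep suggests.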

Searching

  • Search the document for the word Software (grep searches the subtree under the current node)
/ > grep Software
/OAI-PMH/ListRecords/record[37]/metadata/dc/description : t-- 385 Includes bibliographical references (lea...
/OAI-PMH/ListRecords/record[999]/metadata/dc/subject[1] : t-- 20 Software engineering
/OAI-PMH/ListRecords/record[999]/metadata/dc/subject[5] : t-- 14 Software tools
/OAI-PMH/ListRecords/record[999]/metadata/dc/description : t-- 2406 Software engineering techniques have bee...
/ >
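
A simplified Python take on the shell's grep, for illustration only. The grep() helper and the SAMPLE document are hypothetical, and unlike xmllint this sketch reports plain tag paths without positional indexes or node sizes.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-record sample (no namespaces).
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record>
      <metadata>
        <dc>
          <subject>Software engineering</subject>
          <subject>Project management</subject>
          <subject>Software tools</subject>
        </dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def grep(element, needle, path=""):
    """Yield (path, text) for each element whose own text contains needle."""
    here = f"{path}/{element.tag}"
    if element.text and needle in element.text:
        yield here, element.text.strip()
    # Recurse into children so the whole subtree is searched.
    for child in element:
        yield from grep(child, needle, here)


root = ET.fromstring(SAMPLE)
hits = list(grep(root, "Software"))
for path, text in hits:
    print(path, ":", text)
```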

Handling Namespaces

  • Namespaces in root node of dataset2.xml (see downloads section of post)
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
 http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
:
:
  • Register the namespaces declared on the root element (the default namespace gets the defaultns prefix)
phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell 20120915-xmllint-post-union-catalog.xml
/ > cd OAI-PMH
OAI-PMH is a 0 Node Set
/ > dir
DOCUMENT
version=1.0
encoding=UTF-8
URL=20120915-xmllint-post-union-catalog.xml
standalone=true
/ > setrootns
/ > cd defaultns:OAI-PMH
OAI-PMH > pwd
/*
OAI-PMH > dir
ELEMENT OAI-PMH
 default namespace href=http://www.openarchives.org/OAI/2.0/
 namespace xsi href=http://www.w3.org/2001/XMLSchema-instanc...
 ATTRIBUTE schemaLocation
 TEXT
 content=http://www.openarchives.org/OAI/2.0/ ...
OAI-PMH > 
  • Explicitly register namespaces
phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell 20120915-xmllint-post-union-catalog.xml
/ > cd OAI-PMH
OAI-PMH is a 0 Node Set
/ > setns a=http://www.openarchives.org/OAI/2.0/
/ > cd a:OAI-PMH
OAI-PMH > pwd
/*
OAI-PMH >
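
The same namespace pitfall shows up in Python, where the fix looks a lot like setns: map a prefix of your choosing to the namespace URI and use that prefix in every step of the path. A minimal sketch follows, with a made-up one-record SAMPLE document; the prefix a is arbitrary, as it is in the shell.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document with the OAI-PMH default namespace.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)

# Prefix-to-URI mapping, comparable to: setns a=http://www.openarchives.org/OAI/2.0/
NS = {"a": "http://www.openarchives.org/OAI/2.0/"}

# Without a prefix nothing matches, like "cd OAI-PMH" reporting an empty node set.
unprefixed = root.findall("ListRecords/record")

# Every step needs the prefix, because every element is in the default namespace.
prefixed = root.findall("a:ListRecords/a:record", NS)
identifier = prefixed[0].findtext("a:header/a:identifier", namespaces=NS)

print(len(unprefixed), len(prefixed), identifier)
```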

Looking for more Help

  • StackOverflow [2]
  • Man Pages: man xmllint
  • XMLlint Shell Help
/ > help
 base             display XML base of the node
 setbase URI      change the XML base of the node
 bye              leave shell
 cat [node]       display node or current node
 cd [path]        change directory to path or to root
 dir [path]       dumps informations about the node (namespace, attributes, content)
 du [path]        show the structure of the subtree under path or the current node
 exit             leave shell
 help             display this help
 free             display memory usage
 load [name]      load a new document with name
 ls [path]        list contents of path or the current directory
 set xml_fragment replace the current node content with the fragment parsed in context
 xpath expr       evaluate the XPath expression in that context and print the result
 setns nsreg      register a namespace to a prefix in the XPath evaluation context
                  format for nsreg is: prefix=[nsuri] (i.e. prefix= unsets a prefix)
 setrootns        register all namespace found on the root element
                  the default namespace if any uses 'defaultns' prefix
 pwd              display current working directory
 quit             leave shell
 save [name]      save this document to name or the original name
 write [name]     write the current node to the filename
 validate         check the document for errors
 relaxng rng      validate the document agaisnt the Relax-NG schemas
 grep string      search for a string in the subtree
/ >

Bibliography

[1] http://union.ndltd.org/OAI-PMH/?verb=ListRecords&metadataPrefix=oai_dc
[2] http://stackoverflow.com/search?q=xmllint
[3] http://www.xmlsoft.org
[4] http://www.w3.org/TR/xslt
[5] http://linux.byexamples.com/archives/565/your-xml-friend-xpath-command-line-xmllint
[6] http://chihungchan.blogspot.com/2011/01/xmllint-for-xml-namspace.html