I work with a lot of XML documents and, until recently, used XML Copy Editor both as my XML editor and for executing XPath expressions. I recently started working on a project that, in part, involves converting metadata stored in a legacy MS Access database to XML documents. I wanted an easier way of verifying the XML metadata files and naturally turned to XPath. I had previously used xmllint for validating XML documents and, at times, for formatting them (using the --format switch), and was rather intrigued to learn (months and months after first using it) that it has a powerful interactive shell that lets one navigate a document and run XPath expressions.
Here’s how I normally play around with it; for the record, the examples all make use of a portion of metadata records harvested from the ETD Union Catalog [1], as described below.
Profile of Dataset Used in Examples
- Harvested via OAI-PMH from ETD Union Catalog [1]
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2012-09-03T03:49:29Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://union.ndltd.org:8080/union.OAI-PMH/</request>
  <ListRecords>
    <record>
    :
    :
- Only 1000 records were harvested (see the resumption token at the end of the file)
- The XML files can be downloaded below (dataset1.xml has no namespaces in its root node, while dataset2.xml does)
Ingesting/Parsing XML Documents
phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell union-catalog.xml
/ > base
union-catalog.xml
/ >
Counting Nodes
phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell 20120915-xmllint-post-union-catalog-01.xml
/ > xpath count(//OAI-PMH)
Object is a number : 1
/ >
/ > xpath count(//record)
Object is a number : 1000
/ >
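The same counts can be reproduced outside the shell when a check needs to be scripted. The following is only a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the three-record SAMPLE document and the variable names are made up for illustration and stand in for the harvested file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the harvested file: same element names, three records.
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><header><identifier>id-1</identifier></header></record>
    <record><header><identifier>id-2</identifier></header></record>
    <record><header><identifier>id-3</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)

# Rough equivalent of `xpath count(//record)`: iter() walks the whole tree.
record_count = sum(1 for _ in root.iter("record"))

# Rough equivalent of `xpath count(//OAI-PMH)`: iter() includes the root element itself.
root_count = sum(1 for _ in root.iter("OAI-PMH"))

print(record_count, root_count)
```

For a real file, ET.parse(filename).getroot() would replace ET.fromstring(SAMPLE).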
Navigating the XML Tree
- Navigate to the root element
/ > cd OAI-PMH
OAI-PMH > pwd
/OAI-PMH
OAI-PMH >
- Navigate to the first record node (union catalog records are wrapped in record elements); a path relative to the current node fails here, so an absolute path is used
OAI-PMH > cd OAI-PMH/ListRecords/record[1]
OAI-PMH/ListRecords/record[1] is an empty Node Set
OAI-PMH > cd /OAI-PMH/ListRecords/record[1]
record > pwd
/OAI-PMH/ListRecords/record[1]
record > cat header/identifier
 -------
<identifier>oai:union.ndltd.org:ADTP/100073</identifier>
record >
- Navigate to last record node
record > cd /OAI-PMH/ListRecords/record[last()]
record > pwd
/OAI-PMH/ListRecords/record[1000]
record >
Displaying Information
- View the second-to-last record node
record > cat /OAI-PMH/ListRecords/record[last()-1]
 -------
<record>
  <header>
    <identifier>oai:union.ndltd.org:ADTP/173330</identifier>
    <datestamp>2011-09-07T02:15:34Z</datestamp>
    <setSpec>ADTP</setSpec>
  </header>
  <metadata>
    <dc>
      <title>Tool support for social risk mitigation in agile projects</title>
      <creator>Licorish, Sherlock Anthony</creator>
      <subject>Software engineering</subject>
      <subject>Agile methodologies</subject>
      <subject>Project management</subject>
      <subject>Risk management</subject>
      <subject>Software tools</subject>
      :
      :
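The positional predicates used above ([1], [last()] and [last()-1]) are also understood by the limited XPath subset in Python's xml.etree.ElementTree, which makes it easy to sketch the same lookups in a script. This is an illustration only: the SAMPLE document and the identifier_at helper are hypothetical. Paths are relative to the root element, so they start at ListRecords rather than /OAI-PMH.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-record document using the same element names as the dataset.
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><header><identifier>id-1</identifier></header></record>
    <record><header><identifier>id-2</identifier></header></record>
    <record><header><identifier>id-3</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)


def identifier_at(predicate):
    """Return the header identifier of the record selected by a positional predicate."""
    record = root.find(f"ListRecords/record[{predicate}]")
    return record.findtext("header/identifier")


first_id = identifier_at("1")               # like cd /OAI-PMH/ListRecords/record[1]
last_id = identifier_at("last()")           # like cd /OAI-PMH/ListRecords/record[last()]
second_last_id = identifier_at("last()-1")  # like cat /OAI-PMH/ListRecords/record[last()-1]

print(first_id, last_id, second_last_id)
```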
Faceted Browsing
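- Find out how many records have a subject related to Software (in XPath 1.0, contains() applied to a node set only looks at the first node, so this expression tests each record's first subject element)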
/ > xpath count(/OAI-PMH/ListRecords/record[contains(./metadata/dc/subject, "Software")])
Object is a number : 1
/ > cd /OAI-PMH/ListRecords/record[contains(./metadata/dc/subject, "Software")]
record > pwd
/OAI-PMH/ListRecords/record[999]
record >
- Go to the first record of all record nodes with a subject related to Engineering
record > cd /OAI-PMH/ListRecords/record[contains(./metadata/dc/subject, "Engineering")][1]
record > pwd
/OAI-PMH/ListRecords/record[313]
record > cat .
 -------
<record>
  :
  <title>Reliability analysis of degrading uncertain structures with applications to fatigue and fracture under random loading</title>
  <creator>Beck, André T.</creator>
  <subject>Reliability (Engineering)</subject>
  <subject>Metals Fatigue</subject>
  <subject>Structural stability Mathematical models</subject>
  <subject>Fracture mechanics</subject>
  <description>School of Engineering Includes bibliographical references (leaves 248-256)</description>
  :
</record>
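When a facet count looks suspiciously low, it helps to compare "first subject contains the word" against "any subject contains the word". The sketch below does that in Python; it is an illustration under assumptions, not the xmllint behaviour itself. ElementTree has no contains() function, so the filtering is done in plain Python, and the SAMPLE document and helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: only the first one has "Software" in its FIRST subject.
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><metadata><dc>
      <subject>Software engineering</subject>
      <subject>Risk management</subject>
    </dc></metadata></record>
    <record><metadata><dc>
      <subject>Project management</subject>
      <subject>Software tools</subject>
    </dc></metadata></record>
    <record><metadata><dc>
      <subject>Fracture mechanics</subject>
    </dc></metadata></record>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)
records = root.findall("ListRecords/record")


def first_subject_contains(record, word):
    """Mimic contains(./metadata/dc/subject, word): only the first subject is tested."""
    first = record.find("metadata/dc/subject")
    return first is not None and word in (first.text or "")


def any_subject_contains(record, word):
    """Test every subject element of the record."""
    return any(word in (s.text or "") for s in record.findall("metadata/dc/subject"))


first_only = [r for r in records if first_subject_contains(r, "Software")]
any_match = [r for r in records if any_subject_contains(r, "Software")]

print(len(first_only), len(any_match))
```

In the xmllint shell, the "any subject" variant can be written by moving the test into a predicate on subject, for example record[metadata/dc/subject[contains(., "Software")]].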
Searching
- Search the whole document for the word Software
/ > grep Software
/OAI-PMH/ListRecords/record[37]/metadata/dc/description : t-- 385 Includes bibliographical references (lea...
/OAI-PMH/ListRecords/record[999]/metadata/dc/subject[1] : t-- 20 Software engineering
/OAI-PMH/ListRecords/record[999]/metadata/dc/subject[5] : t-- 14 Software tools
/OAI-PMH/ListRecords/record[999]/metadata/dc/description : t-- 2406 Software engineering techniques have bee...
/ >
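A similar text search is easy to sketch in Python for cases where the hits need further processing. This is a rough approximation only, with a made-up SAMPLE document and a hypothetical grep helper: it reports a path, the text length and a preview for each element whose text contains the word. Unlike the shell output above, it always prints a positional index on every path step below the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-record document for illustration.
SAMPLE = """<OAI-PMH>
  <ListRecords>
    <record><metadata><dc>
      <subject>Software engineering</subject>
      <subject>Software tools</subject>
      <description>No match in this text</description>
    </dc></metadata></record>
  </ListRecords>
</OAI-PMH>"""


def grep(root, word):
    """Return (path, text length, preview) for every element whose text contains word."""
    hits = []

    def walk(element, path):
        if element.text and word in element.text:
            hits.append((path, len(element.text), element.text[:40]))
        seen = {}  # per-tag counters used to build positional indexes
        for child in element:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}")
    return hits


hits = grep(ET.fromstring(SAMPLE), "Software")
for path, length, preview in hits:
    print(f"{path} : {length} {preview}")
```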
Handling Namespaces
- Namespaces in root node of dataset2.xml (see downloads section of post)
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  :
  :
- Register the namespaces declared on the root element (the default namespace is bound to the defaultns prefix)
phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell 20120915-xmllint-post-union-catalog.xml
/ > cd OAI-PMH
OAI-PMH is a 0 Node Set
/ > dir
DOCUMENT
version=1.0
encoding=UTF-8
URL=20120915-xmllint-post-union-catalog.xml
standalone=true
/ > setrootns
/ > cd defaultns:OAI-PMH
OAI-PMH > pwd
/*
OAI-PMH > dir
ELEMENT OAI-PMH
default namespace href=http://www.openarchives.org/OAI/2.0/
namespace xsi href=http://www.w3.org/2001/XMLSchema-instanc...
ATTRIBUTE schemaLocation
TEXT
content=http://www.openarchives.org/OAI/2.0/ ...
OAI-PMH >
- Explicitly register namespaces
phiri@lphiri.cs.uct.ac.za:~/Projects/Masters/evaluation/performance/datasets/union-catalog$ xmllint --format --shell 20120915-xmllint-post-union-catalog.xml
/ > cd OAI-PMH
OAI-PMH is a 0 Node Set
/ > setns a=http://www.openarchives.org/OAI/2.0/
/ > cd a:OAI-PMH
OAI-PMH > pwd
/*
OAI-PMH >
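The same pitfall shows up in most XPath tooling: once the root declares a default namespace, unprefixed names stop matching. As a hedged illustration in Python's xml.etree.ElementTree (the SAMPLE document is made up, and the prefix a is as arbitrary here as it is in the setns command above), a prefix-to-URI mapping plays the role of setns:

```python
import xml.etree.ElementTree as ET

# Hypothetical document with the same default namespace as dataset2.xml.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record/>
    <record/>
  </ListRecords>
</OAI-PMH>"""

root = ET.fromstring(SAMPLE)

# Unprefixed names match nothing, just like `cd OAI-PMH` in the shell session.
unqualified = root.findall("ListRecords/record")

# Binding a prefix to the namespace URI is the counterpart of `setns a=...`.
namespaces = {"a": "http://www.openarchives.org/OAI/2.0/"}
qualified = root.findall("a:ListRecords/a:record", namespaces)

print(len(unqualified), len(qualified))
print(root.tag)  # ElementTree stores the namespace inside the tag name
```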
Looking for more Help
- StackOverflow [2]
- Man Pages: man xmllint
- XMLlint Shell Help
/ > help
base                display XML base of the node
setbase URI         change the XML base of the node
bye                 leave shell
cat [node]          display node or current node
cd [path]           change directory to path or to root
dir [path]          dumps informations about the node (namespace, attributes, content)
du [path]           show the structure of the subtree under path or the current node
exit                leave shell
help                display this help
free                display memory usage
load [name]         load a new document with name
ls [path]           list contents of path or the current directory
set xml_fragment    replace the current node content with the fragment parsed in context
xpath expr          evaluate the XPath expression in that context and print the result
setns nsreg         register a namespace to a prefix in the XPath evaluation context
                    format for nsreg is: prefix=[nsuri] (i.e. prefix= unsets a prefix)
setrootns           register all namespace found on the root element
                    the default namespace if any uses 'defaultns' prefix
pwd                 display current working directory
quit                leave shell
save [name]         save this document to name or the original name
write [name]        write the current node to the filename
validate            check the document for errors
relaxng rng         validate the document agaisnt the Relax-NG schemas
grep string         search for a string in the subtree
/ >
Bibliography
[1] http://union.ndltd.org/OAI-PMH/?verb=ListRecords&metadataPrefix=oai_dc
[2] http://stackoverflow.com/search?q=xmllint
[3] http://www.xmlsoft.org
[4] http://www.w3.org/TR/xslt
[5] http://linux.byexamples.com/archives/565/your-xml-friend-xpath-command-line-xmllint
[6] http://chihungchan.blogspot.com/2011/01/xmllint-for-xml-namspace.html