Public Rant – STOP Creating PDF Documents With Empty Metadata Fields

Creating PDF documents with empty metadata fields is bad form. If you are one of those perpetuating this vile practice, PLEASE STOP IT! First off, there is a very good reason why those PDF document metadata fields are there: for the most part, they help search engines like ‘The Google’ to appropriately index and structure your document [1, 2]. In addition, that meta information gives potential users of your documents the chance to quickly get a sense of what the document is about.

The Problem—Asking ‘The Google’

To illustrate the search-centric adverse effects of leaving PDF metadata fields empty, I did a simple search on the UCT institutional repository [3]. While not all the PDF documents curated within the repository have empty metadata fields, notice how most of them clearly do not have proper descriptive tags—you can easily tell from the generic ‘View/Open – University of Cape Town’ titles… this is just terrible!

Stop creating PDF documents with empty metadata

Some Useful Ways of Injecting Metadata

Using a Word Processor

LibreOffice Writer is my word processor of choice, and while there are some fundamental differences between it and other mainstream word processing applications like Microsoft Word, I would imagine the process of injecting metadata fields is somewhat similar. In LibreOffice Writer, go to File > Properties > Description tab; from there, fill out the relevant metadata fields. Note though that LibreOffice Writer by default pulls author details from the ‘User Data’ configuration (go to Tools > Options > User Data).

Stop creating PDF documents with empty metadata

Using LaTeX

Using the hyperref package, make sure the following key/value pairs are in the preamble. For the longest time, I never used the ‘CreationDate’ key, so the creation date was reset every time I regenerated a PDF document.

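\usepackage{hyperref}
% Descriptive metadata written into the PDF document information dictionary
\hypersetup{
  pdfinfo={
    Title={Injecting Descriptive Metadata into PDF Documents},
    Author={Lighton Phiri},
    Creator={Lighton Phiri},
    Subject={Publication Etiquette},
    Keywords={Metadata, PDF, Preservation},
    % Fixed creation date (D:YYYYMMDDHHmmSS) so it is not reset on every rebuild
    CreationDate={D:20141010105922}
  }
}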
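
The CreationDate value above follows the PDF date layout D:YYYYMMDDHHmmSS. Rather than typing it by hand, a string in that shape can be produced from the command line and pasted into the preamble. This is only a minimal sketch, assuming a system with the standard `date` command; the function name pdf_date is my own and not part of any tool.

```shell
#!/bin/sh
# pdf_date (hypothetical helper name): print the current local time as a
# PDF date string of the form D:YYYYMMDDHHmmSS, e.g. D:20141010105922.
# Only the standard `date` utility and its usual format codes are used.
pdf_date() {
    date +"D:%Y%m%d%H%M%S"
}

# Print a value suitable for the CreationDate key
pdf_date
```

Run it once when the document is first created and keep the resulting value fixed, otherwise the creation date would still change on every rebuild.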

Using pdftk

So far, I have found pdftk [4] to be by far the most flexible way to inject metadata; I find it particularly useful when working with scanned PDF outputs. I religiously use the pdftk command line tool on Ubuntu, but there are Windows installers available as well [4]. My current pdftk workflow is fairly simple: I have a model metadata file with placeholder descriptive metadata. All I do is make the appropriate modifications to it and then inject the metadata into the desired document.

Creating a model metadata file

phiri@phiri-PROLINE-DH55TC:~/Sandbox$ pdftk model_pdf_document.pdf dump_data output model_pdf_document_metadata.txt

Injecting metadata into working document

phiri@phiri-PROLINE-DH55TC:~/Sandbox$ pdftk candidate_file.pdf update_info model_pdf_document_metadata.txt output candidate_file_output.pdf
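
If there is no model PDF to dump from, the metadata file can also be written by hand. The sketch below assumes the InfoBegin / InfoKey / InfoValue record layout that recent pdftk releases print with dump_data; older releases may differ, so compare against your own dump first. The helper name write_info and the file name metadata.txt are made up for illustration.

```shell
#!/bin/sh
# write_info FILE TITLE AUTHOR (hypothetical helper): write a minimal
# pdftk-style info file with one record each for Title and Author.
# ASSUMPTION: records use the InfoBegin / InfoKey / InfoValue layout
# printed by dump_data in recent pdftk versions.
write_info() {
    cat > "$1" <<EOF
InfoBegin
InfoKey: Title
InfoValue: $2
InfoBegin
InfoKey: Author
InfoValue: $3
EOF
}

write_info metadata.txt "Injecting Descriptive Metadata into PDF Documents" "Lighton Phiri"

# Inject the metadata only when pdftk is installed and the input PDF exists
if command -v pdftk >/dev/null 2>&1 && [ -f candidate_file.pdf ]; then
    pdftk candidate_file.pdf update_info metadata.txt output candidate_file_output.pdf
fi
```

Running dump_data on the output file afterwards is a quick way to confirm that the new Title and Author values were actually written.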

Bibliography

[1] https://acrobatusers.com/tutorials/make-your-pdfs-work-well-google-and-other-search-engines
[2] http://searchengineland.com/eleven-tips-for-optimizing-pdfs-for-search-engines-12156
[3] http://open.uct.ac.za
[4] http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit