December 2, 2010
Moving Beyond the Ubiquitous PDF: Is PDF/A a Viable Preservation File Format?
In Getting to Durham Compliance, Sarah wrote, "I am concerned that these online journals are becoming PDF dumping grounds with little to no metadata or access points contained within them to assist with the 'access' part of 'open access,'" a passage I quoted in Moving Beyond the Ubiquitous PDF for Durham Statement Compliance. My take on the dumping of PDF files focused on how PDFs are not accessible to all if not properly tagged. I would add that this extends well beyond law review articles to include downloadable PDFs from legal publishers and our own AALL, which promotes "access to all."
Daniel W. Noonan, Amy McCrory and Elizabeth L. Black address the matter at hand in their Nov./Dec. 2010 D-Lib Magazine article, PDF/A: A Viable Addition to the Preservation Toolkit. From the abstract:
PDF/A, the archival version of the PDF file format, is an International Standards Organization (ISO) vetted, open source tool that can be added to the librarian's and archivist's preservation toolkit. This article describes the format itself, the lessons learned as the authors investigated the tools readily available for creating PDF/A files and the design of the pilot to test implementation of the use of the format in The Ohio State University's repository, the Knowledge Bank. Further, we identify issues in conversion of diverse original formats; strategies for time-saving batch conversion; and considerations in deciding whether to attempt full or partial compliance with the standard.
The authors explain: "The PDF/A-1 standard provides two levels of compliance:"
PDF/A-1a denotes "full compliance" and ensures the preservation of a document's logical structure and content text stream in natural reading order. The text extraction is especially important when the document must be displayed on a mobile device (for example a PDA) or other devices in accordance with Section 508 of the US Rehabilitation Act. In such cases, the text must be reorganized on the limited screen size (re-flow). This feature is also known as "Tagged PDFs."
PDF/A-1b denotes "minimal compliance" and ensures that the text (and additional content) can be correctly displayed, but does not guarantee that extracted text will be legible or comprehensible. It therefore does not guarantee compliance with Section 508.
Currently ISO is developing Part 2 (or PDF/A-2) of the standard, that addresses some of the added features within Adobe® Acrobat® PDF versions 1.5, 1.6, and 1.7. PDF/A-2 should be backwards compatible, i.e., all valid PDF/A-1 documents should also be compliant with PDF/A-2, whereas, PDF/A-2 compliant files will not necessarily be PDF/A-1 compliant.
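For anyone wondering which level an existing file claims, a PDF/A file declares its part and conformance level in its embedded XMP metadata, using the pdfaid:part and pdfaid:conformance properties. What follows is only a rough sketch, not anything taken from the D-Lib article: the function names are my own, and it assumes the XMP packet sits uncompressed in the file, as is typical for PDF/A. It also reports only what the file declares about itself. Confirming that a file really meets the standard takes a proper PDF/A validator.

```python
import re
from typing import Optional

# Match both XMP serializations of the PDF/A identification properties:
# attribute form (pdfaid:part="1") and element form (<pdfaid:part>1</pdfaid:part>).
_PART = re.compile(rb'pdfaid:part\s*(?:=\s*["\']|>)\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\']|>)\s*([A-Za-z])')


def declared_pdfa_level(data: bytes) -> Optional[str]:
    """Return the level the file declares (e.g. 'PDF/A-1a'), or None.

    Hypothetical helper: it scans raw bytes for the XMP properties and
    does NOT validate the file against the standard.
    """
    part = _PART.search(data)
    conf = _CONF.search(data)
    if not (part and conf):
        return None
    return f"PDF/A-{part.group(1).decode()}{conf.group(1).decode().lower()}"


def claims_tagged_structure(level: Optional[str]) -> bool:
    """Only conformance level 'a' requires tagged, logically structured text."""
    return level is not None and level.endswith("a")


if __name__ == "__main__":
    import sys

    # Usage: python pdfa_level.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            level = declared_pdfa_level(fh.read())
        print(path, level or "no PDF/A declaration found")
```

Run over a folder of law review PDFs, something like this would at least show how many files claim level "a" (tagged, and so a candidate for screen reader use) as opposed to level "b" or nothing at all.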
The authors provide illustrative test cases involving conversion to PDF/A from existing PDF files, from word-processing documents, and from print via scanning, and conclude "our tests [show] that PDF/A is most appropriate for files that are primarily text documents, and that it is significantly easier to get files into PDF/A form if those files are born digital or when one has control over making them digital."
Given the mindless uploading of PDF files that are not properly tagged, and are therefore inaccessible to the screen readers relied on by the blind and by those with ADD/ADHD who concentrate better when using them, we should applaud and take note of the authors' objectives and recommendations:
We are seeking full PDF/A-1a compliance because these PDFs will also be accessible. The PDF/A-1a standard requires the text to be tagged and marked up, which enables screen reader technology to correctly read the document to disabled persons, thus making it an accessible PDF. Therefore, PDF/A-1a is preferred because it is better for archival and accessibility reasons. There are situations where PDF/A-1a is not attainable, such as with documents that are primarily image-based and do not have alternate text identified; in these cases PDF/A-1b is acceptable, at least from an archival and preservation point-of-view.
PDF/A is a valuable, viable preservation tool that we have recommended as an addition to the digital preservation toolkit at The Ohio State University Libraries. For other organizations we recommend the conversion to PDF/A directly from a born digital object; when that is not a possibility, convert via scanning, conducting the conversion at the point of input. For efficiency, utilize batch processing for conversion workflows, and develop policy, standards, and procedures for your organization/institution.
Perhaps we can also hope that AALL's tendency to upload PDFs that do not comply with Section 508 will end with the adoption of PDF/A-1a, and that the implementation of the Durham Statement for law reviews will also take this route, should it ever get off the ground.
Unfortunately, law reviews first have to get over the whole ego thing before saving a few trees. See David Walker's LLB post, Unmasking the Ego at Durham ("The conclusion reached at Durham as to why most journals are reluctant to move to an all digital open-access format is the fear of losing prestigious authors. The journal editors seem to believe that if they go all-digital, prestigious authors might be view the electronic-only journal to be less prestigious than a journal in print and not want to submit their work to the electronic journal. In turn, they believe that losing prestigious authors would make their journals less prestigious. So the fight is against two egos: (1) the egos of the authors; and (2) the egos of the editors.")
Meanwhile AALL remains happy to promote the logging industry by publishing LLJ and Spectrum in hardcopy. So much for setting an example in the second decade of the 21st century.
For more, see the PDF/A Competence Center. See also, Christopher Wren's Legal Skills Prof Blog post, Federal courts moving to archival standard for PDF (PDF/A).
On "Dark Archives" for Law Review Preservation. Subscribers to AALL listservs may have noticed the recent announcement that the Legal Information Preservation Alliance and Berkeley Electronic Press have partnered to create the Law Review Preservation Program, "the first comprehensive long-term archiving solution for law reviews published online." For those without access to the AALL listservs, the full text is available here. This is a "dark archive for long-term preservation" that promises the following:
In the event that a law review is no longer available from any university or publisher, it will be triggered from CLOCKSS under an open-access Creative Commons license, guaranteeing that law review articles will remain in the public domain forever.
I could not determine whether the PDF/A format will be the standard for this venture. [JH]