Owing in no small part to its free and open source licensing model, iText is one of the most popular and widely implemented Java libraries for PDF file creation and manipulation. I interviewed Bruno to discuss his thoughts about PDF/UA and how he envisions implementers using iText to manage tagged PDF.
iText has supported creation of tagged PDF for some time, but according to Bruno Lowagie, the original developer of iText, the publication of ISO 14289 (PDF/UA) has provided a solid technical basis for achieving consistent results when implementing tagged PDF. The new standard is helping iText’s developers specify and prioritize additional development of tagged PDF-related features, which will soon translate into more accessible PDF files delivered to end users.
While iText is planning to formally announce support for PDF/UA compliance by Q3 of 2013 once documentation and various convenience features are added, Bruno emphasized that it will already be possible in 13Q1 to use iText to create PDF/UA documents from scratch using iText’s basic building blocks.
Please describe your product or suite of products, and how it (or they) use PDF/UA.
At the core of iText, you’ll find a proven enterprise-grade software library that interfaces with every different aspect of PDF.
iText allows developers to generate documents using high-level objects and convenience methods, but it also permits access to PDF at the lowest level using COS objects and AIM methods, and even lets them rewrite entire content streams.
PDF/UA has been supported for a long time when creating a PDF document at the lowest level, but this requires plenty of programming and know-how. More recently, we started supporting the automatic creation of Tagged PDF based on the high-level objects that are used to create the PDF. This was the first step towards PDF/UA support.
How do you see iText’s role in the electronic document industry?
The iText software library is traditionally used to create or manipulate PDF documents in automated processes. Typically an iText project is deployed in web applications, where content needs to be served dynamically to a browser. In these cases content isn’t available in advance: it’s calculated based on user input or real-time database information. iText can be used to build a standalone solution from scratch, but we also have many customers who are using iText to fill a need for advanced PDF technology in Enterprise Information Management (EIM), as well as many other BI/BA, BPM and ECM products.
How does PDF/UA support work in iText?
One of the traditional ways to create documents with iText is to use basic building blocks such as Paragraph, Image, List and so on. All these objects implement the Element interface, but now we’ve also introduced the IAccessibleElement interface, with methods that allow you to add attributes and to set a role. These attributes and the role are used to create Tagged PDF, the basis for PDF/UA.
We also updated the core of iText to ensure the natural reading order of the content was aligned with the structure tree. We still need to do some work on merging Structured Tree Roots when concatenating PDFs, and so on.
Does the product create PDF/UA files, process PDF/UA files, or what?
Currently we’re focusing on PDF/UA creation. We’re also working on document manipulation (such as splitting and merging) that maintains PDF/UA status. We already support conversion of Tagged PDF into XML. This functionality is useful in the context of PDF/UA and, for now, we think it’s sufficient. We don’t have the ambition to create a PDF Accessibility Checker.
How does iText manipulate existing PDF files with respect to tagged PDF?
When I talk about manipulating PDFs, I mean splitting/merging existing files.
All other manipulations can be done via low-level methods provided by iText, but it goes without saying that one’s knowledge of PDF needs to be really solid to do this correctly.
For example, we can change structure element attributes, or parse page contents and remove some tagged parts of the content. Depending on demand, you’ll probably see us adding convenience methods for operations that turn out to be useful.
One important note: when flattening a form containing AcroForm fields, we won’t rewrite the complete content stream for now. Flattening an AcroForm form to PDF/UA won’t be supported in the next couple of releases.
As I discussed in a recent blog post, the key requirement of PDF/UA from the assistive technology user’s point of view is: “Content shall be marked in the structure tree with semantically appropriate tags in a logical reading order.” How does iText address this requirement?
This is how it works:
As soon as you set the Tagged flag with the PdfWriter.setTagged() method, you can focus on using iText’s high-level objects (the so-called basic building blocks). iText will use the order in which you add elements (such as Paragraph, Phrase, etc…) to the document to create an appropriate structure tree automatically, so iText implementers are responsible for getting that right. Note that default roles are chosen depending on the type of the high-level object, but you can set custom roles when needed.
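To make the idea of "insertion order becomes reading order" concrete, here is a minimal, library-free sketch. It is not iText code: the class TaggedDocumentSketch, its Element type, and the default role table are hypothetical names chosen for illustration, and the role names (P, L, Figure, H1) are simply standard PDF structure types. It only models the behaviour described above, namely that each element gets a default role from its type unless a custom role is set, and that the structure tree lists elements in the order they were added.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free model of "insertion order becomes reading order". Not iText code. */
final class TaggedDocumentSketch {

    /** A content element with an optional custom role (hypothetical model type). */
    static final class Element {
        final String type;   // e.g. "Paragraph", "List", "Image"
        final String text;
        String customRole;   // null means "use the default role for this type"

        Element(String type, String text) {
            this.type = type;
            this.text = text;
        }

        /** Overrides the default role and returns this element for chaining. */
        Element withRole(String role) {
            this.customRole = role;
            return this;
        }
    }

    private final List<Element> elements = new ArrayList<>();

    /** Elements are tagged in exactly the order in which they are added. */
    void add(Element e) {
        elements.add(e);
    }

    /** Assumed default role per element type (illustrative mapping only). */
    static String defaultRole(String type) {
        switch (type) {
            case "Paragraph": return "P";
            case "List":      return "L";
            case "Image":     return "Figure";
            default:          return "Span";
        }
    }

    /** Returns the roles of the top-level structure elements, in reading order. */
    List<String> readingOrder() {
        List<String> roles = new ArrayList<>();
        for (Element e : elements) {
            roles.add(e.customRole != null ? e.customRole : defaultRole(e.type));
        }
        return roles;
    }

    public static void main(String[] args) {
        TaggedDocumentSketch doc = new TaggedDocumentSketch();
        doc.add(new Element("Paragraph", "Invoice 2013-001").withRole("H1"));
        doc.add(new Element("Paragraph", "Thank you for your order."));
        doc.add(new Element("Image", "logo.png"));
        System.out.println(doc.readingOrder()); // [H1, P, Figure]
    }
}
```

The point of the model is that nothing reorders the elements afterwards: if a heading is added after the paragraph it introduces, a screen reader will meet it in that wrong order.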
With the setTagged() method and four other calls, you can be sure that your PDF will pass as PDF/UA according to the PDF Accessibility Checker. This is available in iText 5.4.0.
Does iText support the PrintField attribute?
One could add the attribute using low-level operations, but not many people use iText to create forms, so we didn’t plan any support for the PrintField attribute.
We do have the necessary infrastructure to parse all content streams and add the flattened field content at the appropriate place. That could be ready by 13Q3.
Can you say if you are planning to implement the PDF Association’s forthcoming Matterhorn Protocol?
Currently, we base our work on the ISO standard. We’re interested in whatever other document is published, but we can’t tell in advance if we’re going to implement it.
If your product includes verification features, will you require the user to verify each affected object, or address all such objects at once?
Verification is outside the scope of iText.
WCAG 2.0 has been around since 2008. Why didn’t you produce software to support that standard; why did you wait for PDF/UA?
When we first looked at the description of WCAG in the context of PDF, we felt it wasn’t as clear as ISO 32000-1, so we stuck to whatever is explained about accessibility in the PDF standard. Today, the PDF Accessibility Checker recognizes the PDF documents we produce as PDF/UA compliant, not as WCAG 2.0 compliant.
Apart from accessibility, what do you see as the most likely value end users can get from PDF/UA support in creation or processing software?
Things like PDF to HTML conversion are obvious, but the ability to extract useful data from a PDF document, making a document readable not only by humans but also by machines, will be a huge step forward in business analytics and business process automation.
The big challenge for Enterprise Information Management systems today is the existence of a plethora of unstructured documents. At iText, we’re doing projects that involve extracting data from traditional PDF documents. For instance: read all the lines from a bank statement in PDF and store them in a database; find a national number on the first page of a document and route it to the correct destination; and so on. Without structure, these kinds of operations are often difficult to implement and error-prone. Let’s change this by creating documents that contain structure!
In the second edition of my book iText in Action, I present an example where I convert an XML file with the first paragraphs of Moby Dick to PDF and back. Because of the use of Tagged PDF, the resulting XML is identical to the original one (even using the same custom tags). Suppose you create PDF invoices from XML: wouldn’t it be great if the party receiving the PDF invoice were able to extract the original invoice data from it?
I know that all of this has been possible for a long time, if only the documents were created as Tagged PDFs, but I hope that the buzz about PDF/UA will result in a higher percentage of structured documents.
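The round trip described here depends on the structure tree keeping the original tag names. As a rough illustration only, the following library-free sketch serializes a small in-memory tree back to XML. The Node type, the invoice tag names and the sample values are all hypothetical; this is not the example from the book and it does not read a real PDF.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free sketch: a structure tree with custom tags written back out as XML. */
final class StructureToXmlSketch {

    /** One structure element: a role (tag name), optional text, and child elements. */
    static final class Node {
        final String role;
        final String text; // null for pure containers
        final List<Node> kids = new ArrayList<>();

        Node(String role, String text) {
            this.role = role;
            this.text = text;
        }

        /** Appends a child and returns this node for chaining. */
        Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes the node and its descendants as XML, in document order. */
    static String toXml(Node n) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(n.role).append('>');
        if (n.text != null) {
            sb.append(escape(n.text));
        }
        for (Node kid : n.kids) {
            sb.append(toXml(kid));
        }
        sb.append("</").append(n.role).append('>');
        return sb.toString();
    }

    public static void main(String[] args) {
        Node invoice = new Node("invoice", null)
                .add(new Node("number", "2013-001"))
                .add(new Node("total", "125.00"));
        System.out.println(toXml(invoice));
        // <invoice><number>2013-001</number><total>125.00</total></invoice>
    }
}
```

Because each element carries its own tag name, the receiver can recover the fields by name instead of guessing them from positions on the page.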
What can you tell me about your release plans?
Better PDF/UA support was scheduled for 13Q2 / 13Q3, but we advanced the development to 2012 due to a customer request which gave the project a higher priority. We’ve made good progress and we’re testing PDF/UA support with a handful of selected customers. The functionality may be released earlier than 13Q3, but we don’t consider a product as released until we’ve documented it, so the release date remains 13Q3.
If there’s one major challenge to single out in healthcare IT today, it would be leveraging the growth and usage of big data. While consumer IT made big advances in the past decade to get a handle on data by marking up content, indexing it, and annotating it for use, enterprise IT, and healthcare IT in particular, still needs to catch up on making data actionable.
A typical healthcare office handles tens of thousands of documents for patient records and for legal, finance, and billing processes. In pharma and biotech, a typical FDA drug review process involves multiple stages of trials, testing, applications, marketing and manufacturing for the new drug, all requiring a mind-blowing amount of paperwork. In all these cases, either the collected data is not timely or relevant, or it is not easy to access, to archive for the future, or to bring into compliance with legal standards.
This article provides insights into how using the Portable Document Format (PDF) and accompanying tools within healthcare organizations can be a powerful way to help solve the unstructured data challenge, speed up processes, and reduce the costs for document handling.
We will explain why PDF, with its ability to contain data structure and interactivity, is the perfect document format for meeting the archiving, accessibility and compliance requirements of the healthcare industry. We will also examine the building blocks of a solution that helps create such compliant PDF documents, and deep dive into the ways to organize and structure PDFs.
C2E1_SimplePdf.java creates a simple "Quick brown fox jumps over the lazy dog" PDF with some images, but without any structure. This results in a regular PDF.
C2E2_TaggedPdf.java uses the same code as the first example, but now we ask iText to introduce structure. This results in a Tagged PDF.
C2E3_PdfA3b.java adapts the first example, so that it conforms to the PDF/A-3 standard, level B (for Basic). The resulting PDF is not a Tagged PDF.
C2E4_PdfA3a.java adapts the third example, so that it conforms to the PDF/A-3 standard, level A (for Accessibility). The resulting PDF is a Tagged PDF.
Files:
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import sandbox.WrapToTest;

/**
 * Creates a simple PDF with images and text.
 */
@WrapToTest
public class C2E1_SimplePdf {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox1.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";

    /**
     * Creates a simple PDF with images and text.
     * @param args no arguments needed.
     * @throws IOException
     * @throws DocumentException
     */
    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E1_SimplePdf().createPdf(DEST);
    }

    /**
     * Creates a simple PDF with images and text
     * @param dest the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        document.open();
        Paragraph p = new Paragraph();
        p.setFont(new Font(Font.FontFamily.HELVETICA, 20));
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        p.add(c);
        c = new Chunk(" jumps over the lazy ");
        p.add(c);
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        p.add(c);
        document.add(p);
        document.close();
    }
}
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Creates a Tagged PDF with images and text.
 */
public class C2E2_TaggedPdf {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox2.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";

    /**
     * Creates a tagged PDF with images and text.
     * @param args no arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E2_TaggedPdf().createPdf(DEST);
    }

    /**
     * Creates a tagged PDF with images and text.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //TAGGED PDF
        //Make document tagged
        writer.setTagged();
        //==========
        document.open();
        Paragraph p = new Paragraph();
        p.setFont(new Font(Font.FontFamily.HELVETICA, 20));
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        p.add(c);
        c = new Chunk(" jumps over the lazy ");
        p.add(c);
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        p.add(c);
        document.add(p);
        document.close();
    }
}
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.ICC_Profile;
import com.itextpdf.text.pdf.PdfAConformanceLevel;
import com.itextpdf.text.pdf.PdfAWriter;
import com.itextpdf.text.pdf.PdfWriter;
import sandbox.WrapToTest;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Creates a PDF that conforms with PDF/A-3 Level B.
 */
@WrapToTest
public class C2E3_PdfA3b {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox3.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";
    /** A path to a color profile. */
    public static final String ICC = "resources/data/sRGB_CS_profile.icm";
    /** A font that will be embedded. */
    public static final String FONT = "resources/fonts/FreeSans.ttf";

    /**
     * Creates a PDF that conforms with PDF/A-3 Level B.
     * @param args No arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E3_PdfA3b().createPdf(DEST);
    }

    /**
     * Creates a PDF that conforms with PDF/A-3 Level B.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        //PDF/A-3b
        //Create PdfAWriter with the required conformance level
        PdfAWriter writer = PdfAWriter.getInstance(document, new FileOutputStream(dest), PdfAConformanceLevel.PDF_A_3B);
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //Create XMP metadata
        writer.createXmpMetadata();
        //====================
        document.open();
        //PDF/A-3b
        //Set output intents
        ICC_Profile icc = ICC_Profile.getInstance(new FileInputStream(ICC));
        writer.setOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
        //===================
        Paragraph p = new Paragraph();
        //PDF/A-3b
        //Embed font
        p.setFont(FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20));
        //=============
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        p.add(c);
        c = new Chunk(" jumps over the lazy ");
        p.add(c);
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        p.add(c);
        document.add(p);
        document.close();
    }
}
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.ICC_Profile;
import com.itextpdf.text.pdf.PdfAConformanceLevel;
import com.itextpdf.text.pdf.PdfAWriter;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.PdfWriter;
import sandbox.WrapToTest;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Creates a PDF that conforms with PDF/A-3 Level A.
 */
@WrapToTest
public class C2E4_PdfA3a {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox4.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";
    /** A path to a color profile. */
    public static final String ICC = "resources/data/sRGB_CS_profile.icm";
    /** A font that will be embedded. */
    public static final String FONT = "resources/fonts/FreeSans.ttf";

    /**
     * Creates a PDF that conforms with PDF/A-3 Level A.
     * @param args no arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E4_PdfA3a().createPdf(DEST);
    }

    /**
     * Creates a PDF that conforms with PDF/A-3 Level A.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        //PDF/A-3a
        //Create PdfAWriter with the required conformance level
        PdfAWriter writer = PdfAWriter.getInstance(document, new FileOutputStream(dest), PdfAConformanceLevel.PDF_A_3A);
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //====================
        //TAGGED PDF
        //Make document tagged
        writer.setTagged();
        //===============
        //PDF/UA
        //Set document metadata
        writer.setViewerPreferences(PdfWriter.DisplayDocTitle);
        document.addLanguage("en-US");
        document.addTitle("Some title");
        writer.createXmpMetadata();
        //=====================
        document.open();
        //PDF/A-3b
        //Set output intents
        ICC_Profile icc = ICC_Profile.getInstance(new FileInputStream(ICC));
        writer.setOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
        //===================
        Paragraph p = new Paragraph();
        //PDF/UA
        //Embed font
        p.setFont(FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20));
        //==================
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Fox"));
        //==============
        p.add(c);
        p.add(new Chunk(" jumps over the lazy "));
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Dog"));
        //==================
        p.add(c);
        document.add(p);
        document.close();
    }
}
We have a number of dynamically generated PDFs on our site that were created using iText 2.1.7.
However, we also have a large number of users that have disabilities and use screen readers, like JAWS,
to render our PDFs. We use the setTagged() method to tag the PDFs, but some elements of the PDF appear
out of order. Some even become more jumbled after calling setTagged()!
I read about PDF/UA in a 2013 interview about iText with Bruno Lowagie,
and this seems like something that might help with our problem.
However, I have not been able to find a good example of how to generate a PDF/UA document.
Can you provide an example?
Please take a look at the PdfUA example.
It explains step by step what is needed to be compliant with PDF/UA.
A similar example was presented at the iText Summit in 2014 and at JavaOne.
Watch the iText Summit video tutorial.
public void createPdf(String dest) throws IOException, DocumentException {
Document document = new Document(PageSize.A4.rotate());
PdfWriter writer =
PdfWriter.getInstance(document, new FileOutputStream(dest));
writer.setPdfVersion(PdfWriter.VERSION_1_7);
//TAGGED PDF
//Make document tagged
writer.setTagged();
//===============
//PDF/UA
//Set document metadata
writer.setViewerPreferences(PdfWriter.DisplayDocTitle);
document.addLanguage("en-US");
document.addTitle("English pangram");
writer.createXmpMetadata();
//=====================
document.open();
Paragraph p = new Paragraph();
//PDF/UA
//Embed font
Font font =
FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20);
p.setFont(font);
//==================
Chunk c = new Chunk("The quick brown ");
p.add(c);
Image i = Image.getInstance(FOX);
c = new Chunk(i, 0, -24);
//PDF/UA
//Set alt text
c.setAccessibleAttribute(PdfName.ALT, new PdfString("Fox"));
//==============
p.add(c);
p.add(new Chunk(" jumps over the lazy "));
i = Image.getInstance(DOG);
c = new Chunk(i, 0, -24);
//PDF/UA
//Set alt text
c.setAccessibleAttribute(PdfName.ALT, new PdfString("Dog"));
//==================
p.add(c);
document.add(p);
p = new Paragraph("\n\n\n\n\n\n\n\n\n\n\n\n", font);
document.add(p);
List list = new List(true);
list.add(new ListItem("quick", font));
list.add(new ListItem("brown", font));
list.add(new ListItem("fox", font));
list.add(new ListItem("jumps", font));
list.add(new ListItem("over", font));
list.add(new ListItem("the", font));
list.add(new ListItem("lazy", font));
list.add(new ListItem("dog", font));
document.add(list);
document.close();
}
You make the document tagged with the setTagged() method, but that's not sufficient.
You also need to set document metadata: the document title needs to be displayed and you need to indicate the language used in the document.
XMP metadata is mandatory.
Furthermore, you need to embed all fonts. When you have images, you need an alternate description.
In the example, we replace the words "dog" and "fox" by an image.
To make sure that these images are "read out loud" correctly, we need to use the setAccessibleAttribute() method.
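The requirements listed above can be summarised as a small checklist. The sketch below is library-free and purely illustrative: the PdfUaChecklistSketch class and its DocumentSettings fields are hypothetical names, not part of iText, and the check covers only the points named in this answer. It is in no way a PDF/UA validator, since the standard contains many more requirements.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free checklist of the PDF/UA prerequisites named in the answer above. */
final class PdfUaChecklistSketch {

    /** What the generating code has (or has not) done; a hypothetical model type. */
    static final class DocumentSettings {
        boolean tagged;            // writer.setTagged() was called
        boolean displayDocTitle;   // viewer preference DisplayDocTitle was set
        String title;              // document.addTitle(...)
        String language;           // document.addLanguage(...)
        boolean xmpMetadata;       // writer.createXmpMetadata() was called
        boolean allFontsEmbedded;  // every font was created as embedded
        int imagesWithoutAltText;  // images lacking an alternate description
    }

    /** Returns a human-readable list of the prerequisites that are still missing. */
    static List<String> missing(DocumentSettings s) {
        List<String> problems = new ArrayList<>();
        if (!s.tagged) {
            problems.add("document is not tagged");
        }
        if (s.title == null || s.title.isEmpty()) {
            problems.add("no document title");
        }
        if (!s.displayDocTitle) {
            problems.add("viewer is not told to display the title");
        }
        if (s.language == null || s.language.isEmpty()) {
            problems.add("no document language");
        }
        if (!s.xmpMetadata) {
            problems.add("no XMP metadata");
        }
        if (!s.allFontsEmbedded) {
            problems.add("not all fonts are embedded");
        }
        if (s.imagesWithoutAltText > 0) {
            problems.add(s.imagesWithoutAltText + " image(s) without alternate description");
        }
        return problems;
    }

    public static void main(String[] args) {
        // A document that was only tagged, which the answer says is not sufficient.
        DocumentSettings onlyTagged = new DocumentSettings();
        onlyTagged.tagged = true;
        for (String problem : missing(onlyTagged)) {
            System.out.println(problem);
        }
    }
}
```

Each entry in the checklist corresponds to one of the commented steps in the createPdf() method shown above.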
At the end of the example, I added a numbered list.
In another question, you claim that the list is not read out loud correctly by JAWS.
If you check the PDF file created with the above example, more specifically pdfua.pdf,
you'll discover that JAWS reads the document as expected, with the numbers and the text in the right order.
The reason why "it doesn't work" when you try this, is simple. You are using a version of iText that is 3 years older than the PDF/UA standard.
Also: in the version you are using, you are responsible for creating the tag structure at the lowest PDF level when you use the setTagged() method.
In more recent versions, iText takes care of this at a high level. You need the latest iText version to achieve what you want.
/**
 * Example written by Bruno Lowagie in answer to:
 * http://stackoverflow.com/questions/28222277/how-can-i-generate-a-pdf-ua-compatible-pdf-with-itext
 */
package sandbox.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.Image;
import com.itextpdf.text.List;
import com.itextpdf.text.ListItem;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.PdfWriter;
import sandbox.WrapToTest;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Creates an accessible PDF with images and text.
 */
@WrapToTest
public class PdfUA {
    /** The resulting PDF. */
    public static final String DEST = "results/pdfa/pdfua.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";
    /** A font that will be embedded. */
    public static final String FONT = "resources/fonts/FreeSans.ttf";

    /**
     * Creates an accessible PDF with images and text.
     * @param args no arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new PdfUA().createPdf(DEST);
    }

    /**
     * Creates an accessible PDF with images and text.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //TAGGED PDF
        //Make document tagged
        writer.setTagged();
        //===============
        //PDF/UA
        //Set document metadata
        writer.setViewerPreferences(PdfWriter.DisplayDocTitle);
        document.addLanguage("en-US");
        document.addTitle("English pangram");
        writer.createXmpMetadata();
        //=====================
        document.open();
        Paragraph p = new Paragraph();
        //PDF/UA
        //Embed font
        Font font = FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20);
        p.setFont(font);
        //==================
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Fox"));
        //==============
        p.add(c);
        p.add(new Chunk(" jumps over the lazy "));
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Dog"));
        //==================
        p.add(c);
        document.add(p);
        p = new Paragraph("\n\n\n\n\n\n\n\n\n\n\n\n", font);
        document.add(p);
        List list = new List(true);
        list.add(new ListItem("quick", font));
        list.add(new ListItem("brown", font));
        list.add(new ListItem("fox", font));
        list.add(new ListItem("jumps", font));
        list.add(new ListItem("over", font));
        list.add(new ListItem("the", font));
        list.add(new ListItem("lazy", font));
        list.add(new ListItem("dog", font));
        document.add(list);
        document.close();
    }
}
/*
 * This example was written in answer to the following question:
 * http://stackoverflow.com/questions/34036200
 */
package sandbox.pdfa;

import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfString;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import sandbox.WrapToTest;

@WrapToTest
public class AddAltTags {
    public static final String SRC = "resources/pdfs/no_alt_attribute.pdf";
    public static final String DEST = "results/pdfa/added_alt_attributes.pdf";

    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new AddAltTags().manipulatePdf(SRC, DEST);
    }

    public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfDictionary catalog = reader.getCatalog();
        PdfDictionary structTreeRoot = catalog.getAsDict(PdfName.STRUCTTREEROOT);
        manipulate(structTreeRoot);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
        stamper.close();
    }

    public void manipulate(PdfDictionary element) {
        if (element == null)
            return;
        if (PdfName.FIGURE.equals(element.get(PdfName.S))) {
            element.put(PdfName.ALT, new PdfString("Figure without an Alt description"));
        }
        PdfArray kids = element.getAsArray(PdfName.K);
        if (kids == null) return;
        for (int i = 0; i < kids.size(); i++)
            manipulate(kids.getAsDict(i));
    }
}
I know that iText can generate tagged PDF documents from scratch, but is it possible to insert alternative text for images in an existing tagged PDF without changing anything else?
I need to implement this feature in a program without using GUI applications such as Adobe Acrobat Pro.
In this example, we take a PDF with images of a fox and a dog where the Alt keys are missing: no_alt_attribute.pdf
Structure element without /Alt key
Code can't recognize a fox or a dog, so we create a new document with Alt attributes saying "Figure without an Alt description": added_alt_attributes.pdf
Structure element with /Alt key
We add this description by walking through the structure tree, looking for structural elements marked as /Figure elements:
public void manipulatePdf(String src, String dest)
throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfDictionary catalog = reader.getCatalog();
PdfDictionary structTreeRoot =
catalog.getAsDict(PdfName.STRUCTTREEROOT);
manipulate(structTreeRoot);
PdfStamper stamper = new PdfStamper(
reader, new FileOutputStream(dest));
stamper.close();
}
public void manipulate(PdfDictionary element) {
if (element == null)
return;
if (PdfName.FIGURE.equals(element.get(PdfName.S))) {
element.put(PdfName.ALT,
new PdfString("Figure without an Alt description"));
}
PdfArray kids = element.getAsArray(PdfName.K);
if (kids == null) return;
for (int i = 0; i < kids.size(); i++)
manipulate(kids.getAsDict(i));
}
You can easily port this Java example to C#:
Get the root dictionary from the PdfReader object,
Get the root of the structure tree (a dictionary),
Loop over all the kids of every branch of that tree,
When a leaf is a figure, add an /Alt entry.
Once this is done, use PdfStamper to save the altered file.
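For readers who want to see the traversal on its own before porting it, here is a library-free sketch of the same recursive walk. PDF dictionaries are modelled as plain Java maps with the keys "S", "K" and "Alt"; the class AltTextWalkSketch and its element() helper are hypothetical and exist only for this illustration. Unlike the example above, this sketch leaves an existing Alt entry untouched and returns the number of figures it changed; both choices are assumptions, not behaviour of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Library-free model of walking a structure tree and adding Alt text to figures. */
final class AltTextWalkSketch {

    /** Builds a dictionary-like element with structure type "S" and optional kids "K". */
    static Map<String, Object> element(String structureType, Object... kids) {
        Map<String, Object> dict = new HashMap<>();
        dict.put("S", structureType);
        if (kids.length > 0) {
            dict.put("K", new ArrayList<>(Arrays.asList(kids)));
        }
        return dict;
    }

    /**
     * Adds an "Alt" entry to every Figure element that lacks one.
     * Returns the number of elements that were changed.
     */
    @SuppressWarnings("unchecked")
    static int addAltToFigures(Map<String, Object> element, String altText) {
        if (element == null) {
            return 0;
        }
        int changed = 0;
        if ("Figure".equals(element.get("S")) && !element.containsKey("Alt")) {
            element.put("Alt", altText);
            changed++;
        }
        Object kids = element.get("K");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                // Kids that are not dictionaries (for example content references) are skipped.
                if (kid instanceof Map) {
                    changed += addAltToFigures((Map<String, Object>) kid, altText);
                }
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Object> fox = element("Figure");
        Map<String, Object> dog = element("Figure");
        dog.put("Alt", "Dog");
        Map<String, Object> root = element("Document", element("P", fox, dog));

        int changed = addAltToFigures(root, "Figure without an Alt description");
        System.out.println(changed + " figure(s) changed");   // 1 figure(s) changed
        System.out.println(fox.get("Alt"));                   // Figure without an Alt description
        System.out.println(dog.get("Alt"));                   // Dog
    }
}
```

Note that in a real PDF the /K entry is not always an array: it can also be a single dictionary or a marked-content reference, so a production port may need to handle those cases as well.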
In iText 7, the same walk through the structure tree, looking for structure elements marked as /Figure elements, looks like this:
public void manipulatePdf(String src, String dest) throws IOException {
PdfDocument pdfDoc = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
PdfDictionary catalog = pdfDoc.getCatalog().getPdfObject();
PdfDictionary structTreeRoot = catalog.getAsDictionary(PdfName.StructTreeRoot);
manipulate(structTreeRoot);
pdfDoc.close();
}
public void manipulate(PdfDictionary element) {
if (element == null) {
return;
}
if (PdfName.Figure.equals(element.get(PdfName.S))) {
element.put(PdfName.Alt, new PdfString("Figure without an Alt description"));
}
PdfArray kids = element.getAsArray(PdfName.K);
if (kids == null) {
return;
}
for (int i = 0; i < kids.size(); i++) {
manipulate(kids.getAsDictionary(i));
}
}
You can easily port this Java example to C#:
Get the root dictionary from the PdfDocument object,
Get the root of the structure tree (a dictionary),
Loop over all the kids of every branch of that tree,
When an element is a figure, add an /Alt entry.
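The steps above don't depend on iText-specific types, which is why the port to another language is straightforward. As a minimal sketch, assume the structure tree is modeled with plain java.util maps and lists: the keys "S", "K" and "Alt" mirror the PDF keys, but the class AltTextWalker and its methods are hypothetical and not part of iText. Unlike the example above, this sketch leaves an existing Alt entry untouched and reports how many entries it added.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF structure tree: each element is a map with
// an "S" entry (structure type), an optional "Alt" entry and an optional "K"
// entry (list of kids). None of these names are iText classes.
class AltTextWalker {

    // Builds one structure element of the given type with the given kids.
    static Map<String, Object> element(String type, Object... kids) {
        Map<String, Object> e = new LinkedHashMap<>();
        e.put("S", type);
        if (kids.length > 0) {
            e.put("K", new ArrayList<>(Arrays.asList(kids)));
        }
        return e;
    }

    // Walks every branch and adds altText to each Figure that has no Alt yet.
    // Returns the number of entries that were added.
    static int addMissingAlt(Map<String, Object> element, String altText) {
        if (element == null) {
            return 0;
        }
        int added = 0;
        if ("Figure".equals(element.get("S")) && !element.containsKey("Alt")) {
            element.put("Alt", altText);
            added++;
        }
        Object kids = element.get("K");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                // Kids that are not elements (e.g. marked-content ids) are skipped.
                if (kid instanceof Map) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> child = (Map<String, Object>) kid;
                    added += addMissingAlt(child, altText);
                }
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Map<String, Object> fox = element("Figure");
        Map<String, Object> dog = element("Figure");
        dog.put("Alt", "Dog");
        Map<String, Object> root =
                element("Document", element("P", element("Span", 0), fox, dog));
        System.out.println(addMissingAlt(root, "Figure without an Alt description"));
        System.out.println(fox.get("Alt"));
    }
}
```

Skipping figures that already carry an Alt entry is a design choice of this sketch; it keeps a second run from overwriting descriptions that someone has since written by hand.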
/*
    This file is part of the iText (R) project.
    Copyright (c) 1998-2016 iText Group NV
*/
/*
 * This example was written in answer to the following question:
 * http://stackoverflow.com/questions/34036200
 */
package com.itextpdf.samples.sandbox.pdfa;

import com.itextpdf.kernel.pdf.*;
import com.itextpdf.samples.GenericTest;
import com.itextpdf.test.annotations.type.SampleTest;
import org.junit.experimental.categories.Category;

import java.io.File;
import java.io.IOException;

@Category(SampleTest.class)
public class AddAltTags extends GenericTest {
    public static final String DEST = "./target/test/resources/sandbox/pdfa/add_alt_tags.pdf";
    public static final String SRC = "./src/test/resources/pdfs/no_alt_attribute.pdf";

    public static void main(String[] args) throws Exception {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new AddAltTags().manipulatePdf(DEST);
    }

    public void manipulatePdf(String dest) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(dest));
        PdfDictionary catalog = pdfDoc.getCatalog().getPdfObject();
        PdfDictionary structTreeRoot = catalog.getAsDictionary(PdfName.StructTreeRoot);
        manipulate(structTreeRoot);
        pdfDoc.close();
    }

    public void manipulate(PdfDictionary element) {
        if (element == null) {
            return;
        }
        if (PdfName.Figure.equals(element.get(PdfName.S))) {
            element.put(PdfName.Alt, new PdfString("Figure without an Alt description"));
        }
        PdfArray kids = element.getAsArray(PdfName.K);
        if (kids == null) {
            return;
        }
        for (int i = 0; i < kids.size(); i++) {
            manipulate(kids.getAsDictionary(i));
        }
    }
}
/*
    This file is part of the iText (R) project.
    Copyright (c) 1998-2016 iText Group NV
*/
/**
 * Example written by Bruno Lowagie in answer to:
 * http://stackoverflow.com/questions/28222277/how-can-i-generate-a-pdf-ua-compatible-pdf-with-itext
 */
package com.itextpdf.samples.sandbox.pdfa;

import com.itextpdf.io.font.PdfEncodings;
import com.itextpdf.io.image.ImageDataFactory;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.pdf.*;
import com.itextpdf.kernel.xmp.XMPException;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.*;
import com.itextpdf.samples.GenericTest;
import com.itextpdf.test.annotations.type.SampleTest;
import org.junit.experimental.categories.Category;

import java.io.File;
import java.io.IOException;

@Category(SampleTest.class)
public class PdfUA extends GenericTest {
    public static final String DEST = "./target/test/resources/sandbox/pdfa/pdf_ua.pdf";
    public static final String DOG = "./src/test/resources/img/dog.bmp";
    public static final String FONT = "./src/test/resources/font/FreeSans.ttf";
    public static final String FOX = "./src/test/resources/img/fox.bmp";

    public static void main(String[] args) throws Exception {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new PdfUA().manipulatePdf(DEST);
    }

    public void manipulatePdf(String dest) throws IOException, XMPException {
        PdfDocument pdfDoc = new PdfDocument(new PdfWriter(dest, new WriterProperties().setPdfVersion(PdfVersion.PDF_1_7)));
        Document document = new Document(pdfDoc, new PageSize(PageSize.A4).rotate());
        //TAGGED PDF
        //Make document tagged
        pdfDoc.setTagged();
        //===============
        //PDF/UA
        //Set document metadata
        pdfDoc.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
        pdfDoc.getCatalog().setLang(new PdfString("en-US"));
        PdfDocumentInfo info = pdfDoc.getDocumentInfo();
        info.setTitle("English pangram");
        //=====================
        Paragraph p = new Paragraph();
        //PDF/UA
        //Embed font
        PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
        p.setFont(font);
        //==================
        Text c = new Text("The quick brown ");
        p.add(c);
        Image i = new Image(ImageDataFactory.create(FOX));
        //PDF/UA
        //Set alt text
        i.getAccessibilityProperties().setAlternateDescription("Fox");
        //==============
        p.add(i);
        p.add(" jumps over the lazy ");
        i = new Image(ImageDataFactory.create(DOG));
        //PDF/UA
        //Set alt text
        i.getAccessibilityProperties().setAlternateDescription("Dog");
        //==================
        p.add(i);
        document.add(p);
        p = new Paragraph("\n\n\n\n\n\n\n\n\n\n\n\n").setFont(font).setFontSize(20);
        document.add(p);
        List list = new List();
        list.add((ListItem) new ListItem("quick").setFont(font).setFontSize(20));
        list.add((ListItem) new ListItem("brown").setFont(font).setFontSize(20));
        list.add((ListItem) new ListItem("fox").setFont(font).setFontSize(20));
        list.add((ListItem) new ListItem("jumps").setFont(font).setFontSize(20));
        list.add((ListItem) new ListItem("over").setFont(font).setFontSize(20));
        list.add((ListItem) new ListItem("the").setFont(font).setFontSize(20));
        list.add((ListItem) new ListItem("lazy").setFont(font).setFontSize(20));
        list.add((ListItem) new ListItem("dog").setFont(font).setFontSize(20));
        document.add(list);
        document.close();
    }
}
We have a number of dynamically generated PDFs on our site that were created using iText 2.1.7.
However, we also have a large number of users that have disabilities and use screen readers, like JAWS,
to render our PDFs. We use the setTagged() method to tag the PDFs, but some elements of the PDF appear
out of order. Some even become more jumbled after calling setTagged()!
I read about PDF/UA in a 2013 interview about iText with Bruno Lowagie,
and this seems like something that might help with our problem.
However, I have not been able to find a good example of how to generate a PDF/UA document.
Can you provide an example?
Please take a look at the PdfUA example.
It explains step by step what is needed to be compliant with PDF/UA.
A similar example was presented at the iText Summit in 2014 and at JavaOne.
Watch the iText Summit video tutorial.
public void manipulatePdf(String dest) throws IOException, XMPException {
PdfDocument pdfDoc = new PdfDocument(new PdfWriter(dest, new WriterProperties().setPdfVersion(PdfVersion.PDF_1_7)));
Document document = new Document(pdfDoc, new PageSize(PageSize.A4).rotate());
//TAGGED PDF
//Make document tagged
pdfDoc.setTagged();
//===============
//PDF/UA
//Set document metadata
pdfDoc.getCatalog().setViewerPreferences(new PdfViewerPreferences().setDisplayDocTitle(true));
pdfDoc.getCatalog().setLang(new PdfString("en-US"));
PdfDocumentInfo info = pdfDoc.getDocumentInfo();
info.setTitle("English pangram");
//=====================
Paragraph p = new Paragraph();
//PDF/UA
//Embed font
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
p.setFont(font);
//==================
Text c = new Text("The quick brown ");
p.add(c);
Image i = new Image(ImageDataFactory.create(FOX));
//PDF/UA
//Set alt text
i.getAccessibilityProperties().setAlternateDescription("Fox");
//==============
p.add(i);
p.add(" jumps over the lazy ");
i = new Image(ImageDataFactory.create(DOG));
//PDF/UA
//Set alt text
i.getAccessibilityProperties().setAlternateDescription("Dog");
//==================
p.add(i);
document.add(p);
p = new Paragraph("\n\n\n\n\n\n\n\n\n\n\n\n").setFont(font).setFontSize(20);
document.add(p);
List list = new List();
list.add((ListItem) new ListItem("quick").setFont(font).setFontSize(20));
list.add((ListItem) new ListItem("brown").setFont(font).setFontSize(20));
list.add((ListItem) new ListItem("fox").setFont(font).setFontSize(20));
list.add((ListItem) new ListItem("jumps").setFont(font).setFontSize(20));
list.add((ListItem) new ListItem("over").setFont(font).setFontSize(20));
list.add((ListItem) new ListItem("the").setFont(font).setFontSize(20));
list.add((ListItem) new ListItem("lazy").setFont(font).setFontSize(20));
list.add((ListItem) new ListItem("dog").setFont(font).setFontSize(20));
document.add(list);
document.close();
}
You make the document tagged with the setTagged() method, but that's not sufficient.
You also need to set document metadata: the document title needs to be displayed and you need to indicate the language used in the document.
XMP metadata is mandatory.
Furthermore, you need to embed all fonts. When you have images, you need an alternate description.
In the example, we replace the words "dog" and "fox" with images.
To make sure that these images are "read out loud" correctly, we need to use the getAccessibilityProperties().setAlternateDescription() method.
At the end of the example, I added a numbered list.
In another question, you claim that the list is not read out loud correctly by JAWS.
If you check the PDF file created with the above example, more specifically pdfua.pdf,
you'll discover that JAWS reads the document as expected, with the numbers and the text in the right order.
The reason why "it doesn't work" when you try this, is simple. You are using a version of iText that is 3 years older than the PDF/UA standard.
Also: in the version you are using, you are responsible for creating the tag structure at the lowest PDF level when you use the setTagged() method.
In more recent versions, iText takes care of this at a high level. You need the latest iText version to achieve what you want.
In chapters 1 to 4, we created PDF documents using iText 7. In chapters 5 and 6, we manipulated and reused existing PDF documents. All the PDFs we dealt with in those chapters complied with ISO 32000, which is the core standard for PDF. ISO 32000 isn't the only ISO standard for PDF; there are many sub-standards that were created for specific purposes. In this chapter, we'll highlight two:
ISO 14289 is better known as PDF/UA. UA stands for Universal Accessibility. PDFs that comply with the PDF/UA standard can be consumed by anyone, including people who are blind or visually impaired.
ISO 19005 is better known as PDF/A. A stands for Archiving. The goal of this standard is the long-term preservation of digital documents.
In this chapter, we'll learn more about PDF/A and PDF/UA by creating a series of PDF/A and PDF/UA files.
Creating accessible PDF documents
Before we start with a PDF/UA example, let's take a closer look at the problem we want to solve. In chapter 1, we created a document that included images. In the sentence "Quick brown fox jumps over the lazy dog", we replaced the words "fox" and "dog" by images representing a fox and a dog. When this file is read out loud, a machine doesn't know that the first image represents a fox and that the second image represents a dog, hence the file will be read as "Quick brown jumps over the lazy."
In an ordinary PDF, content is painted to a canvas. We might use high-level objects such as List and Table, but once the PDF is created, there is no structure left. A list is a sequence of lines and a text snippet in a list item doesn't know that it's part of a list. A table is just a bunch of lines and text added at absolute positions on a page. A text snippet in a table doesn't know it belongs to a cell in a specific column and a specific row.
Unless we make the PDF a tagged PDF, the document doesn't contain any semantic structure. When there's no semantic structure, the PDF isn't accessible. To be accessible, the document needs to be able to distinguish which part of a page is actual content, and which part is an artifact that isn't part of the actual content (e.g. a header, a page number). A line of text needs to know if it's a title, if it's part of a paragraph, and so on. We can add all of this information to the page by creating a structure tree and by defining content as marked content. This sounds complex, but if you use iText 7's high-level objects, it's sufficient to introduce the method setTagged(). By defining a PdfDocument as a tagged document, the structure we introduce by using objects such as List, Table and Paragraph will be reflected in the Tagged PDF.
This is only one requirement to make a PDF accessible. The QuickBrownFox_PDFUA example will help us understand the other requirements.
PdfDocument pdf = new PdfDocument(new PdfWriter(dest, new WriterProperties().addXmpMetadata()));
Document document = new Document(pdf);
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/UA example");
//Fonts need to be embedded
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
We create a PdfDocument and a Document, but this time we tell the PdfWriter to automatically add XMP metadata using the addXmpMetadata() method of WriterProperties. In PDF/UA, it is mandatory to have the same metadata stored in the PDF as XML. This XML may not be compressed. Processors that don't "understand" PDF must be able to detect this XMP metadata and process it. An XMP stream is created automatically based on the entries in the Info dictionary. This Info dictionary is a PDF object that includes such data as the title of the document. In addition to this requirement, we make sure that we comply with PDF/UA by introducing some extra features:
We tell the PdfDocument that we're going to create Tagged PDF (line 4),
We add a language specifier. In our case, the document knows that the main language used in this document is American English (line 5).
We change the viewer preferences so that the title of the document is always displayed in the top bar of the PDF viewer (line 6-7). Obviously, this implies that we add a title to the metadata of the document (line 8-9).
All fonts need to be embedded (line 11). There are some other requirements relating to fonts, but it would lead us too far right now to discuss these in detail.
All the content needs to be tagged. When an image is encountered, we need to provide a description of that image using alt text (line 17 and line 22).
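The requirements listed above can be kept track of with a small checklist. The following is only a sketch in plain Java: PdfUaChecklist and its fields are hypothetical names, not iText classes, and the class is not a PDF/UA validator. It merely records which of the listed requirements a program has taken care of and reports the ones that are still open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bookkeeping class; it does not inspect a PDF file.
class PdfUaChecklist {
    boolean xmpMetadata;      // XMP metadata is written
    boolean tagged;           // the document is a Tagged PDF
    String language;          // e.g. "en-US"
    String title;             // title in the document metadata
    boolean displayDocTitle;  // viewer preference to show the title
    boolean fontsEmbedded;    // all fonts are embedded
    int images;               // number of images added
    int imagesWithAltText;    // number of images that have alt text

    // Returns a description of every requirement that is not met yet.
    List<String> missing() {
        List<String> gaps = new ArrayList<>();
        if (!xmpMetadata) gaps.add("XMP metadata");
        if (!tagged) gaps.add("tagging");
        if (language == null || language.isEmpty()) gaps.add("language");
        if (title == null || title.isEmpty()) gaps.add("title");
        if (!displayDocTitle) gaps.add("display title preference");
        if (!fontsEmbedded) gaps.add("embedded fonts");
        if (imagesWithAltText < images) {
            gaps.add("alt text for " + (images - imagesWithAltText) + " image(s)");
        }
        return gaps;
    }

    public static void main(String[] args) {
        PdfUaChecklist check = new PdfUaChecklist();
        check.xmpMetadata = true;
        check.tagged = true;
        check.language = "en-US";
        check.title = "iText7 PDF/UA example";
        check.displayDocTitle = true;
        check.fontsEmbedded = true;
        check.images = 2;
        check.imagesWithAltText = 1; // the dog has no description yet
        System.out.println(check.missing());
    }
}
```

An empty list only means that none of the listed items was forgotten; whether the alt text actually describes the right image still has to be judged by a person.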
We have now created a PDF/UA document. When we look at the resulting page in Figure 7.1, we don't see much difference, but if we open the Tags panel, we see that the document has a specific structure.
Figure 7.1: a PDF/UA document and its structure
We see that the <Document> consists of a <P>aragraph that is composed of four parts, two <Span>s and two <Figure>s. We'll create a more complex PDF/UA document later in this chapter, but let's take a look at what makes PDF/A special first.
Creating PDFs for long-term preservation, part 1
Part 1 of ISO 19005 was released in 2005. It was defined as a subset of version 1.4 of Adobe's PDF specification (which, at that time, wasn't an ISO standard yet). ISO 19005-1 introduced a series of obligations and restrictions:
The document needs to be self-contained: all fonts need to be embedded; external movie, sound or other binary files are not allowed.
The document needs to contain metadata in the eXtensible Metadata Platform (XMP) format: ISO 16684 (XMP) describes how to embed XML metadata into a binary file, so that software that doesn't know how to interpret the binary data format can still extract the file's metadata.
Functionality that isn't future-proof isn't allowed: the PDF can't contain any JavaScript and may not be encrypted.
ISO 19005-1:2005 (PDF/A-1) defined two conformance levels:
Level B ("basic"): ensures that the visual appearance of a document will be preserved for the long term.
Level A ("accessible"): ensures that the visual appearance of a document will be preserved for the long term, but also introduces structural and semantic properties. The PDF needs to be a Tagged PDF.
The QuickBrownFox_PDFA_1b example shows how we can create a "Quick brown fox" PDF that complies with PDF/A-1b.
//Initialize PDFA document with output intent
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_1B,
new PdfOutputIntent("Custom", "", "http://www.color.org",
The first thing that catches the eye is that we are no longer using a PdfDocument instance. Instead, we create a PdfADocument instance. The PdfADocument constructor needs a PdfWriter as its first parameter, but also a conformance level (in this case PdfAConformanceLevel.PDF_A_1B) and a PdfOutputIntent. This output intent tells the document how to interpret the colors that will be used in the document. In line 10, we make sure that the font we're using is embedded.
Figure 7.2: a PDF/A-1 level B document
Looking at the PDF shown in Figure 7.2, we see a blue ribbon with the text "This file claims compliance with the PDF/A standard and has been opened read-only to prevent modification." Allow me to explain two things about this sentence:
This doesn't mean that the PDF is, in effect, compliant with the PDF/A standard. It only claims it is. To be sure, you need to open the Standards panel in Adobe Acrobat. When you click on the "Verify Conformance" link, Acrobat will verify if the document is what it claims to be. In this case, we read "Status: verification succeeded"; we have successfully created a document complying with PDF/A-1B.
The document has been opened read-only, not because you aren't allowed to modify it (PDF/A is not a way to protect a PDF against modification), but because any modification might turn it into a PDF that no longer complies with the PDF/A standard. It's not trivial to update a PDF/A without breaking its PDF/A status.
Let's adapt our example, and create a PDF/A-1 level A document with the QuickBrownFox_PDFA_1a example.
//Initialize PDFA document with output intent
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_1A,
new PdfOutputIntent("Custom", "", "http://www.color.org",
We've changed PdfAConformanceLevel.PDF_A_1B into PdfAConformanceLevel.PDF_A_1A in line 3. We've made the PdfADocument a Tagged PDF (line 8) and we've added some alt text for the images. Figure 7.3 is somewhat confusing.
Figure 7.3: a PDF/A-1 level A document
When we look at the Standards panel, we see that the document thinks it conforms to PDF/A-1A and to PDF/UA-1. We don't have a "Verify Conformance" link, so we have to use Preflight. Preflight informs us that there were "No problems found" when executing the "Verify compliance with PDF/A-1a" profile. We can't verify the PDF/UA compliance because PDF/UA involves some requirements that can't be verified by a machine. For instance: a machine wouldn't notice if we switched the description of the image of the fox with the description of the image of the dog. That would make the document inaccessible, as the document would spread false information to people depending on screen readers. In any case, we know that our document doesn't comply with the PDF/UA standard because we omitted a number of essential elements (such as the language).
From the start, it was determined that approved parts of ISO 19005 could never become invalid. New, subsequent parts would only define new, useful features. That's what happened when part 2 and part 3 were created.
Creating PDFs for long-term preservation, part 2 and 3
ISO 19005-2:2011 (PDF/A-2) was introduced to have a PDF/A standard that was based
on the ISO standard (ISO 32000-1) instead of on Adobe's PDF specification. PDF/A-2 also adds
a handful of features that were introduced in PDF 1.5, 1.6 and 1.7:
Useful additions include: support for JPEG2000, Collections, object-level XMP, and optional content.
Useful improvements include: better support for transparency, comment types and annotations, and digital signatures.
PDF/A-2 also defines an extra level besides Level A and Level B:
Level U ("Unicode"): ensures that the visual appearance of a document will be preserved for the long term, and that all text is stored in Unicode.
ISO 19005-3:2012 (PDF/A-3) was an almost identical copy of PDF/A-2.
There was only one difference with PDF/A-2: in PDF/A-3, attachments don't need to be PDF/A.
You can attach any file to a PDF/A-3, for instance: an XLS file containing calculations of which the results are used in the document,
the original Word document that was used to create the PDF document, and so on. The document itself needs to conform
to all the obligations and restrictions of the PDF/A specification, but these obligations and restrictions do not apply to its attachments.
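The parts and conformance levels described so far can be summarized in a few lines of code. This is a sketch in plain Java under the assumption that only the three parts and three levels mentioned in this chapter exist; PdfAFlavor is a hypothetical class, not iText's PdfAConformanceLevel, and it encodes only the rules stated above.

```java
// Hypothetical summary of the PDF/A parts and levels described in this chapter.
class PdfAFlavor {
    final int part;    // 1, 2 or 3
    final char level;  // 'A', 'B' or 'U'

    PdfAFlavor(int part, char level) {
        char l = Character.toUpperCase(level);
        boolean knownPart = part >= 1 && part <= 3;
        // Level U was only introduced with part 2.
        boolean knownLevel = l == 'A' || l == 'B' || (l == 'U' && part >= 2);
        if (!knownPart || !knownLevel) {
            throw new IllegalArgumentException("Unknown flavor: PDF/A-" + part + level);
        }
        this.part = part;
        this.level = l;
    }

    // Level A adds structural and semantic properties: the PDF must be tagged.
    boolean requiresTaggedPdf() {
        return level == 'A';
    }

    // Only in PDF/A-3 may attachments be something other than PDF/A.
    boolean allowsNonPdfAAttachments() {
        return part == 3;
    }

    String label() {
        return "PDF/A-" + part + Character.toLowerCase(level);
    }

    public static void main(String[] args) {
        PdfAFlavor flavor = new PdfAFlavor(3, 'A');
        System.out.println(flavor.label()
                + " tagged=" + flavor.requiresTaggedPdf()
                + " anyAttachment=" + flavor.allowsNonPdfAAttachments());
    }
}
```

Read this way, a document that embeds its CSV source needs part 3 for the attachment, and level A if it also has to be tagged.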
In the UnitedStates_PDFA_3a example, we'll create a document that complies with PDF/UA as well as with PDF/A-3A. We choose PDF/A-3, because we're going to add the CSV file that was used as the source for creating the PDF.
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_3A,
new PdfOutputIntent("Custom", "", "http://www.color.org",
Let's examine the different parts of this example.
Line 1-5: We create a PdfADocument (PdfAConformanceLevel.PDF_A_3A) and a Document.
Line 7: Making the PDF a Tagged PDF is a requirement for PDF/UA as well as for PDF/A-3A.
Line 8-12: Setting the language, the document title and the viewer preference to display the title is a requirement for PDF/UA.
Line 14-20: We add a file attachment using specific parameters that are required for PDF/A-3A.
Line 26-27: We embed the fonts, which is a requirement for PDF/UA as well as for PDF/A.
Line 28-38: We've seen this code before in the UnitedStates example in chapter 1 (including the process() method).
Line 40: We close the document.
Figure 7.4 demonstrates how using the Table class with Cell objects added as header cells, and Cell objects added as normal cells, resulted in a structure tree that makes the PDF document accessible.
Figure 7.4: a PDF/A-3 level A document
When we open the Attachments panel as shown in Figure 7.5, we see our original united_states.csv file that we can easily extract from the PDF.
Figure 7.5: a PDF/A-3 level A document and its attachment
The examples in this chapter taught us that PDF/UA or PDF/A documents involve extra requirements when compared to ordinary PDFs. "Can we use iText to convert an existing PDF to a PDF/UA or PDF/A document" is a question that is posted frequently on mailing-lists or user forums. I hope that this chapter explains that iText can't do this automatically.
If you have a document that has a picture of a fox and a dog, iText can't add any missing alt text for those images, because iText can't see that fox nor that dog. iText only sees pixels, it can't interpret the image.
If you are using a font that isn't embedded, iText doesn't know what that font looks like. If you don't provide the corresponding font program, iText can never embed that font.
These are only two examples of many that explain why converting an ordinary PDF to PDF/A or PDF/UA isn't trivial. It's very easy to change the PDF so that it shows a blue bar saying that the document complies with PDF/A, but that doesn't mean that claim is true.
We also need to pay attention when we merge existing PDF/A documents.
Merging PDF/A documents
When merging PDF/A documents, it's very important that every single document that you are adding to PdfMerger is already a PDF/A document. You can't mix PDF/A documents and ordinary PDF documents into one single PDF and hope the result will be a PDF/A document. The same is true for mixing a PDF/A level A document with a PDF/A level B document. One has a structure tree, the other hasn't; you can't expect the resulting PDF to be a PDF/A level A document.
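The rule in the paragraph above boils down to this: the merged file can claim no more than its weakest source. As a minimal sketch in plain Java, assume each source is classified as an ordinary PDF, a level B document or a level A document; MergeConformance and its Level enum are hypothetical names, not part of iText, and the sketch ignores differences between the PDF/A parts.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-merge check; it classifies documents, it does not read them.
class MergeConformance {

    // Ordered from weakest to strongest claim.
    enum Level { NONE, B, A }

    // The weakest level among the sources limits what the result can claim.
    static Level weakest(List<Level> sources) {
        Level result = Level.A;
        for (Level level : sources) {
            if (level.compareTo(result) < 0) {
                result = level;
            }
        }
        return result;
    }

    // True if merging these sources can still yield a document of the target level.
    static boolean canClaim(Level target, List<Level> sources) {
        return target != Level.NONE
                && !sources.isEmpty()
                && weakest(sources).compareTo(target) >= 0;
    }

    public static void main(String[] args) {
        List<Level> bothLevelA = Arrays.asList(Level.A, Level.A);
        List<Level> mixed = Arrays.asList(Level.A, Level.B);
        List<Level> withOrdinaryPdf = Arrays.asList(Level.A, Level.NONE);
        System.out.println(canClaim(Level.A, bothLevelA));
        System.out.println(canClaim(Level.A, mixed));
        System.out.println(canClaim(Level.B, withOrdinaryPdf));
    }
}
```

A check like this only guards against mixing levels by accident; whether each source really conforms still has to be verified when the documents are read.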
Figure 7.6 shows how we merged the two PDF/A level A documents we created in the previous sections.
Figure 7.6: merging 2 PDF/A level A documents
When we look at the structure of the tags, we see that the <P>aragraph is now followed by a <Table>. The MergePDFADocuments example shows how it's done.
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_1A,
new PdfOutputIntent("Custom", "", "http://www.color.org",
This example is assembled using parts of two examples we've already seen before:
Lines 1 to 11 are almost identical to the first part of the UnitedStates_PDFA_3a example we've used in the previous section, except that we now use PdfAConformanceLevel.PDF_A_1A and that we don't need a Document object.
Lines 12 to 25 are identical to the last part of the 88th_Oscar_Combine example of the previous chapter. Note that we use a PdfDocument instance instead of a PdfADocument; the PdfADocument will check if the source documents comply.
There's a lot more to be said about PDF/UA and PDF/A, and even about other sub-standards. For instance: there's a German standard for invoicing called ZUGFeRD that is built on top of PDF/A-3, but let's save that for another tutorial.
Summary
In this chapter, we've discovered that there's more to PDF than meets the eye. We've learned how to introduce structure into our documents so that they are accessible for the blind and the visually impaired. We've also made sure that our PDFs were self-contained, for instance by embedding fonts, so that our documents can be archived for the long term.
We'll need several other tutorials to explore the functionality covered in this tutorial in more depth, but these seven chapters should already give you a good impression of what you can do with iText 7.
C2E1_SimplePdf creates a simple "Quick brown fox jumps over the lazy dog" PDF with some images, but without any structure. This results in a regular PDF.
C2E2_TaggedPdf.java uses the same code as the first example, but now we ask iText to introduce structure. This results in a Tagged PDF.
C2E3_PdfA3b.java adapts the first example, so that it conforms to the PDF/A-3 standard, level B (for Basic). The resulting PDF is not a Tagged PDF.
C2E4_PdfA3a.java adapts the third example, so that it conforms to the PDF/A-3 standard, level A (for Accessibility). The resulting PDF is a Tagged PDF.
Files:
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import sandbox.WrapToTest;

/**
 * Creates a simple PDF with images and text.
 */
@WrapToTest
public class C2E1_SimplePdf {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox1.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";

    /**
     * Creates a simple PDF with images and text.
     * @param args no arguments needed.
     * @throws IOException
     * @throws DocumentException
     */
    static public void main(String args[]) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E1_SimplePdf().createPdf(DEST);
    }

    /**
     * Creates a simple PDF with images and text
     * @param dest the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        document.open();
        Paragraph p = new Paragraph();
        p.setFont(new Font(Font.FontFamily.HELVETICA, 20));
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        p.add(c);
        c = new Chunk(" jumps over the lazy ");
        p.add(c);
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        p.add(c);
        document.add(p);
        document.close();
    }
}
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Creates a Tagged PDF with images and text.
 */
public class C2E2_TaggedPdf {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox2.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";

    /**
     * Creates a tagged PDF with images and text.
     * @param args no arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    static public void main(String args[]) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E2_TaggedPdf().createPdf(DEST);
    }

    /**
     * Creates a tagged PDF with images and text.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //TAGGED PDF
        //Make document tagged
        writer.setTagged();
        //==========
        document.open();
        Paragraph p = new Paragraph();
        p.setFont(new Font(Font.FontFamily.HELVETICA, 20));
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        p.add(c);
        c = new Chunk(" jumps over the lazy ");
        p.add(c);
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        p.add(c);
        document.add(p);
        document.close();
    }
}
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;

import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.ICC_Profile;
import com.itextpdf.text.pdf.PdfAConformanceLevel;
import com.itextpdf.text.pdf.PdfAWriter;
import com.itextpdf.text.pdf.PdfWriter;
import sandbox.WrapToTest;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Creates a PDF that conforms with PDF/A-3 Level B.
 */
@WrapToTest
public class C2E3_PdfA3b {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox3.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";
    /** A path to a color profile. */
    public static final String ICC = "resources/data/sRGB_CS_profile.icm";
    /** A font that will be embedded. */
    public static final String FONT = "resources/fonts/FreeSans.ttf";

    /**
     * Creates a PDF that conforms with PDF/A-3 Level B.
     * @param args No arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    static public void main(String args[]) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E3_PdfA3b().createPdf(DEST);
    }

    /**
     * Creates a PDF that conforms with PDF/A-3 Level B.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        //PDF/A-3b
        //Create PdfAWriter with the required conformance level
        PdfAWriter writer = PdfAWriter.getInstance(document, new FileOutputStream(dest), PdfAConformanceLevel.PDF_A_3B);
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //Create XMP metadata
        writer.createXmpMetadata();
        //====================
        document.open();
        //PDF/A-3b
        //Set output intents
        ICC_Profile icc = ICC_Profile.getInstance(new FileInputStream(ICC));
        writer.setOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
        //===================
        Paragraph p = new Paragraph();
        //PDF/A-3b
        //Embed font
        p.setFont(FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20));
        //=============
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        p.add(c);
        c = new Chunk(" jumps over the lazy ");
        p.add(c);
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        p.add(c);
        document.add(p);
        document.close();
    }
}
/*
 * This code sample was written in the context of the tutorial:
 * ZUGFeRD: The future of Invoicing
 */
package zugferd.pdfa;
import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.Image;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.ICC_Profile;
import com.itextpdf.text.pdf.PdfAConformanceLevel;
import com.itextpdf.text.pdf.PdfAWriter;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.PdfWriter;
import sandbox.WrapToTest;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
/**
 * Creates a PDF that conforms with PDF/A-3 Level A.
 */
@WrapToTest
public class C2E4_PdfA3a {
    /** The resulting PDF. */
    public static final String DEST = "results/zugferd/pdfa/quickbrownfox4.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";
    /** A path to a color profile. */
    public static final String ICC = "resources/data/sRGB_CS_profile.icm";
    /** A font that will be embedded. */
    public static final String FONT = "resources/fonts/FreeSans.ttf";
    /**
     * Creates a PDF that conforms with PDF/A-3 Level A.
     * @param args no arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    static public void main(String args[]) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new C2E4_PdfA3a().createPdf(DEST);
    }
    /**
     * Creates a PDF that conforms with PDF/A-3 Level A.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        //PDF/A-3a
        //Create PdfAWriter with the required conformance level
        PdfAWriter writer = PdfAWriter.getInstance(document, new FileOutputStream(dest), PdfAConformanceLevel.PDF_A_3A);
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //====================
        //TAGGED PDF
        //Make document tagged
        writer.setTagged();
        //===============
        //PDF/UA
        //Set document metadata
        writer.setViewerPreferences(PdfWriter.DisplayDocTitle);
        document.addLanguage("en-US");
        document.addTitle("Some title");
        writer.createXmpMetadata();
        //=====================
        document.open();
        //PDF/A-3b
        //Set output intents
        ICC_Profile icc = ICC_Profile.getInstance(new FileInputStream(ICC));
        writer.setOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
        //===================
        Paragraph p = new Paragraph();
        //PDF/UA
        //Embed font
        p.setFont(FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20));
        //==================
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Fox"));
        //==============
        p.add(c);
        p.add(new Chunk(" jumps over the lazy "));
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Dog"));
        //==================
        p.add(c);
        document.add(p);
        document.close();
    }
}
We have a number of dynamically generated PDFs on our site that were created using iText 2.1.7.
However, we also have a large number of users that have disabilities and use screen readers, like JAWS,
to render our PDFs. We use the setTagged() method to tag the PDFs, but some elements of the PDF appear
out of order. Some even become more jumbled after calling setTagged()!
I read about PDF/UA in a 2013 interview about iText with Bruno Lowagie,
and this seems like something that might help with our problem.
However, I have not been able to find a good example of how to generate a PDF/UA document.
Can you provide an example?
Please take a look at the PdfUA example.
It explains step by step what is needed to be compliant with PDF/UA.
A similar example was presented at the iText Summit in 2014 and at JavaOne.
Watch the iText Summit video tutorial.
public void createPdf(String dest) throws IOException, DocumentException {
    Document document = new Document(PageSize.A4.rotate());
    PdfWriter writer =
        PdfWriter.getInstance(document, new FileOutputStream(dest));
    writer.setPdfVersion(PdfWriter.VERSION_1_7);
    //TAGGED PDF
    //Make document tagged
    writer.setTagged();
    //===============
    //PDF/UA
    //Set document metadata
    writer.setViewerPreferences(PdfWriter.DisplayDocTitle);
    document.addLanguage("en-US");
    document.addTitle("English pangram");
    writer.createXmpMetadata();
    //=====================
    document.open();
    Paragraph p = new Paragraph();
    //PDF/UA
    //Embed font
    Font font =
        FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20);
    p.setFont(font);
    //==================
    Chunk c = new Chunk("The quick brown ");
    p.add(c);
    Image i = Image.getInstance(FOX);
    c = new Chunk(i, 0, -24);
    //PDF/UA
    //Set alt text
    c.setAccessibleAttribute(PdfName.ALT, new PdfString("Fox"));
    //==============
    p.add(c);
    p.add(new Chunk(" jumps over the lazy "));
    i = Image.getInstance(DOG);
    c = new Chunk(i, 0, -24);
    //PDF/UA
    //Set alt text
    c.setAccessibleAttribute(PdfName.ALT, new PdfString("Dog"));
    //==================
    p.add(c);
    document.add(p);
    p = new Paragraph("\n\n\n\n\n\n\n\n\n\n\n\n", font);
    document.add(p);
    List list = new List(true);
    list.add(new ListItem("quick", font));
    list.add(new ListItem("brown", font));
    list.add(new ListItem("fox", font));
    list.add(new ListItem("jumps", font));
    list.add(new ListItem("over", font));
    list.add(new ListItem("the", font));
    list.add(new ListItem("lazy", font));
    list.add(new ListItem("dog", font));
    document.add(list);
    document.close();
}
You make the document tagged with the setTagged() method, but that's not sufficient.
You also need to set document metadata: the document title needs to be displayed and you need to indicate the language used in the document.
XMP metadata is mandatory.
Furthermore, you need to embed all fonts. When you have images, you need an alternate description.
In the example, we replace the words "dog" and "fox" with images.
To make sure that these images are "read out loud" correctly, we need to use the setAccessibleAttribute() method.
At the end of the example, I added a numbered list.
In another question, you claim that the list is not read out loud correctly by JAWS.
If you check the PDF file created with the above example, more specifically pdfua.pdf,
you'll discover that JAWS reads the document as expected, with the numbers and the text in the right order.
The reason why "it doesn't work" when you try this, is simple. You are using a version of iText that is 3 years older than the PDF/UA standard.
Also: in the version you are using, you are responsible for creating the tag structure at the lowest PDF level when you use the setTagged() method.
In more recent versions, iText takes care of this at a high level. You need the latest iText version to achieve what you want.
/**
 * Example written by Bruno Lowagie in answer to:
 * http://stackoverflow.com/questions/28222277/how-can-i-generate-a-pdf-ua-compatible-pdf-with-itext
 */
package sandbox.pdfa;
import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.FontFactory;
import com.itextpdf.text.Image;
import com.itextpdf.text.List;
import com.itextpdf.text.ListItem;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.PdfWriter;
import sandbox.WrapToTest;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
/**
 * Creates an accessible PDF with images and text.
 */
@WrapToTest
public class PdfUA {
    /** The resulting PDF. */
    public static final String DEST = "results/pdfa/pdfua.pdf";
    /** An image resource. */
    public static final String FOX = "resources/images/fox.bmp";
    /** An image resource. */
    public static final String DOG = "resources/images/dog.bmp";
    /** A font that will be embedded. */
    public static final String FONT = "resources/fonts/FreeSans.ttf";
    /**
     * Creates an accessible PDF with images and text.
     * @param args no arguments needed
     * @throws IOException
     * @throws DocumentException
     */
    static public void main(String args[]) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new PdfUA().createPdf(DEST);
    }
    /**
     * Creates an accessible PDF with images and text.
     * @param dest the path to the resulting PDF
     * @throws IOException
     * @throws DocumentException
     */
    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document(PageSize.A4.rotate());
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
        writer.setPdfVersion(PdfWriter.VERSION_1_7);
        //TAGGED PDF
        //Make document tagged
        writer.setTagged();
        //===============
        //PDF/UA
        //Set document metadata
        writer.setViewerPreferences(PdfWriter.DisplayDocTitle);
        document.addLanguage("en-US");
        document.addTitle("English pangram");
        writer.createXmpMetadata();
        //=====================
        document.open();
        Paragraph p = new Paragraph();
        //PDF/UA
        //Embed font
        Font font = FontFactory.getFont(FONT, BaseFont.WINANSI, BaseFont.EMBEDDED, 20);
        p.setFont(font);
        //==================
        Chunk c = new Chunk("The quick brown ");
        p.add(c);
        Image i = Image.getInstance(FOX);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Fox"));
        //==============
        p.add(c);
        p.add(new Chunk(" jumps over the lazy "));
        i = Image.getInstance(DOG);
        c = new Chunk(i, 0, -24);
        //PDF/UA
        //Set alt text
        c.setAccessibleAttribute(PdfName.ALT, new PdfString("Dog"));
        //==================
        p.add(c);
        document.add(p);
        p = new Paragraph("\n\n\n\n\n\n\n\n\n\n\n\n", font);
        document.add(p);
        List list = new List(true);
        list.add(new ListItem("quick", font));
        list.add(new ListItem("brown", font));
        list.add(new ListItem("fox", font));
        list.add(new ListItem("jumps", font));
        list.add(new ListItem("over", font));
        list.add(new ListItem("the", font));
        list.add(new ListItem("lazy", font));
        list.add(new ListItem("dog", font));
        document.add(list);
        document.close();
    }
}
/*
 * This example was written in answer to the following question:
 * http://stackoverflow.com/questions/34036200
 */
package sandbox.pdfa;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfString;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import sandbox.WrapToTest;
@WrapToTest
public class AddAltTags {
    public static final String SRC = "resources/pdfs/no_alt_attribute.pdf";
    public static final String DEST = "results/pdfa/added_alt_attributes.pdf";
    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new AddAltTags().manipulatePdf(SRC, DEST);
    }
    public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfDictionary catalog = reader.getCatalog();
        PdfDictionary structTreeRoot = catalog.getAsDict(PdfName.STRUCTTREEROOT);
        manipulate(structTreeRoot);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
        stamper.close();
    }
    public void manipulate(PdfDictionary element) {
        if (element == null)
            return;
        if (PdfName.FIGURE.equals(element.get(PdfName.S))) {
            element.put(PdfName.ALT, new PdfString("Figure without an Alt description"));
        }
        PdfArray kids = element.getAsArray(PdfName.K);
        if (kids == null) return;
        for (int i = 0; i < kids.size(); i++)
            manipulate(kids.getAsDict(i));
    }
}
I know that iText can generate tagged PDF documents from scratch, but is it possible to insert alternative text for images in an existing tagged PDF without changing anything else?
I need to implement this feature in a program without using GUI applications such as Adobe Acrobat Pro.
In this example, we take a PDF with images of a fox and a dog where the Alt keys are missing: no_alt_attribute.pdf
Structure element without /Alt key
Code can't recognize a fox or a dog, so we create a new document with Alt attributes saying "Figure without an Alt description": added_alt_attributes.pdf
Structure element with /Alt key
We add this description by walking through the structure tree, looking for structural elements marked as /Figure elements:
public void manipulatePdf(String src, String dest)
    throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary catalog = reader.getCatalog();
    PdfDictionary structTreeRoot =
        catalog.getAsDict(PdfName.STRUCTTREEROOT);
    manipulate(structTreeRoot);
    PdfStamper stamper = new PdfStamper(
        reader, new FileOutputStream(dest));
    stamper.close();
}
public void manipulate(PdfDictionary element) {
    if (element == null)
        return;
    if (PdfName.FIGURE.equals(element.get(PdfName.S))) {
        element.put(PdfName.ALT,
            new PdfString("Figure without an Alt description"));
    }
    PdfArray kids = element.getAsArray(PdfName.K);
    if (kids == null) return;
    for (int i = 0; i < kids.size(); i++)
        manipulate(kids.getAsDict(i));
}
You can easily port this Java example to C#:
Get the root dictionary from the PdfReader object,
Get the root of the structure tree (a dictionary),
Loop over all the kids of every branch of that tree,
When a leaf is a figure, add an /Alt entry.
Once this is done, use PdfStamper to save the altered file.
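The walk described in these steps can be tried out without a PDF at hand. The sketch below is plain Java and does not use iText: it models structure elements as java.util.Map objects instead of PdfDictionary, and the names StructTreeSketch, element and addMissingAlt are hypothetical. It also makes two choices that the example above does not: it descends when the K entry holds a single element rather than an array, and it leaves an existing Alt entry untouched.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a structure tree; not part of iText. */
final class StructTreeSketch {

    /** Builds a model structure element with structure type S and kids K. */
    static Map<String, Object> element(String type, Object kids) {
        Map<String, Object> e = new LinkedHashMap<>();
        e.put("S", type);
        if (kids != null) {
            e.put("K", kids);
        }
        return e;
    }

    /**
     * Walks the tree depth-first and adds a placeholder Alt entry to every
     * Figure element that has none. Returns the number of entries added.
     */
    static int addMissingAlt(Object node, String placeholder) {
        if (!(node instanceof Map)) {
            return 0; // null, or a marked-content id (a number): nothing to do
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> element = (Map<String, Object>) node;
        int added = 0;
        if ("Figure".equals(element.get("S")) && !element.containsKey("Alt")) {
            element.put("Alt", placeholder);
            added++;
        }
        Object kids = element.get("K");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                added += addMissingAlt(kid, placeholder);
            }
        } else {
            added += addMissingAlt(kids, placeholder); // a single kid, or none
        }
        return added;
    }
}
```

Returning the number of added entries makes it easy to log how many figures still need a real, human-written description afterwards.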
In chapters 1 to 4, we created PDF documents using iText 7. In chapters 5 and 6, we manipulated and reused existing PDF documents. All the PDFs we dealt with in those chapters were PDF documents that complied with ISO 32000, which is the core standard for PDF. ISO 32000 isn't the only ISO standard for PDF; there are many different sub-standards that were created for specific reasons. In this chapter, we'll highlight two:
ISO 14289 is better known as PDF/UA. UA stands for Universal Accessibility. PDFs that comply with the PDF/UA standard can be consumed by anyone, including people who are blind or visually impaired.
ISO 19005 is better known as PDF/A. A stands for Archiving. The goal of this standard is the long-term preservation of digital documents.
In this chapter, we'll learn more about PDF/A and PDF/UA by creating a series of PDF/A and PDF/UA files.
Creating accessible PDF documents
Before we start with a PDF/UA example, let's take a closer look at the problem we want to solve. In chapter 1, we created a document that included images. In the sentence "Quick brown fox jumps over the lazy dog", we replaced the words "fox" and "dog" by images representing a fox and a dog. When this file is read out loud, a machine doesn't know that the first image represents a fox and that the second image represents a dog, hence the file will be read as "Quick brown jumps over the lazy."
In an ordinary PDF, content is painted to a canvas. We might use high-level objects such as List and Table, but once the PDF is created, there is no structure left. A list is a sequence of lines and a text snippet in a list item doesn't know that it's part of a list. A table is just a bunch of lines and text added at absolute positions on a page. A text snippet in a table doesn't know it belongs to a cell in a specific column and a specific row.
Unless we make the PDF a tagged PDF, the document doesn't contain any semantic structure. When there's no semantic structure, the PDF isn't accessible. To be accessible, the document needs to be able to distinguish which part of a page is actual content, and which part is an artifact that isn't part of the actual content (e.g. a header, a page number). A line of text needs to know if it's a title, if it's part of a paragraph, and so on. We can add all of this information to the page by creating a structure tree and by defining content as marked content. This sounds complex, but if you use iText 7's high-level objects, it's sufficient to introduce the method setTagged(). By defining a PdfDocument as a tagged document, the structure we introduce by using objects such as List, Table and Paragraph will be reflected in the Tagged PDF.
This is only one requirement to make a PDF accessible. The QuickBrownFox_PDFUA example will help us understand the other requirements.
PdfDocument pdf = new PdfDocument(new PdfWriter(dest, new WriterProperties().addXmpMetadata()));
Document document = new Document(pdf);
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/UA example");
//Fonts need to be embedded
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
Paragraph p = new Paragraph();
p.setFont(font);
p.add(new Text("The quick brown "));
Image foxImage = new Image(ImageFactory.getImage(FOX));
//PDF/UA: Set alt text
foxImage.getAccessibilityProperties().setAlternateDescription("Fox");
p.add(foxImage);
p.add(" jumps over the lazy ");
Image dogImage = new Image(ImageFactory.getImage(DOG));
//PDF/UA: Set alt text
dogImage.getAccessibilityProperties().setAlternateDescription("Dog");
p.add(dogImage);
document.add(p);
document.close();
We create a PdfDocument and a Document, but this time we tell the PdfWriter to automatically add XMP metadata using the addXmpMetadata() method of WriterProperties. In PDF/UA, it is mandatory to have the same metadata stored in the PDF as XML. This XML may not be compressed. Processors that don't "understand" PDF must be able to detect this XMP metadata and process it. An XMP stream is created automatically based on the entries in the Info dictionary. This Info dictionary is a PDF object that includes such data as the title of the document. In addition to this requirement, we make sure that we comply with PDF/UA by introducing some extra features:
We tell the PdfDocument that we're going to create Tagged PDF (line 4),
We add a language specifier. In our case, the document knows that the main language used in this document is American English (line 5).
We change the viewer preferences so that the title of the document is always displayed in the top bar of the PDF viewer (line 6-7). Obviously, this implies that we add a title to the metadata of the document (line 8-9).
All fonts need to be embedded (line 11). There are some other requirements relating to fonts, but it would lead us too far right now to discuss these in detail.
All the content needs to be tagged. When an image is encountered, we need to provide a description of that image using alt text (line 17 and line 22).
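To make the idea of "the same metadata stored as XML" more concrete, here is a rough sketch of what a minimal, uncompressed XMP packet carrying only the document title could look like. It is plain Java and does not use iText; XmpSketch, escapeXml and minimalPacket are hypothetical names, and the packet that iText writes will generally contain more properties than this one.

```java
/** Hypothetical helper that shows the shape of an XMP packet; not iText code. */
final class XmpSketch {

    /** Escapes the characters that are special in XML text content. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Builds a plain-text XMP packet that carries only the document title. */
    static String minimalPacket(String title) {
        return "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
             + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
             + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
             + "  <rdf:Description rdf:about=\"\"\n"
             + "    xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
             + "   <dc:title>\n"
             + "    <rdf:Alt>\n"
             + "     <rdf:li xml:lang=\"x-default\">" + escapeXml(title) + "</rdf:li>\n"
             + "    </rdf:Alt>\n"
             + "   </dc:title>\n"
             + "  </rdf:Description>\n"
             + " </rdf:RDF>\n"
             + "</x:xmpmeta>\n"
             + "<?xpacket end=\"w\"?>";
    }
}
```

Because the packet is ordinary text delimited by the xpacket markers, a tool that knows nothing about PDF can scan the file for those markers and read the title.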
We have now created a PDF/UA document. When we look at the resulting page in Figure 7.1, we don't see much difference, but if we open the Tags panel, we see that the document has a specific structure.
Figure 7.1: a PDF/UA document and its structure
We see that the <Document> consists of a <P>aragraph that is composed of four parts, two <Span>s and two <Figure>s. We'll create a more complex PDF/UA document later in this chapter, but let's take a look at what makes PDF/A special first.
Creating PDFs for long-term preservation, part 1
Part 1 of ISO 19005 was released in 2005. It was defined as a subset of version 1.4 of Adobe's PDF specification (which, at that time, wasn't an ISO standard yet). ISO 19005-1 introduced a series of obligations and restrictions:
The document needs to be self-contained: all fonts need to be embedded; external movie, sound or other binary files are not allowed.
The document needs to contain metadata in the eXtensible Metadata Platform (XMP) format: ISO 16684 (XMP) describes how to embed XML metadata into a binary file, so that software that doesn't know how to interpret the binary data format can still extract the file's metadata.
Functionality that isn't future-proof isn't allowed: the PDF can't contain any JavaScript and may not be encrypted.
ISO 19005-1:2005 (PDF/A-1) defined two conformance levels:
Level B ("basic"): ensures that the visual appearance of a document will be preserved for the long term.
Level A ("accessible"): ensures that the visual appearance of a document will be preserved for the long term, but also introduces structural and semantic properties. The PDF needs to be a Tagged PDF.
The QuickBrownFox_PDFA_1b example shows how we can create a "Quick brown fox" PDF that complies with PDF/A-1b.
//Initialize PDFA document with output intent
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_1B,
new PdfOutputIntent("Custom", "", "http://www.color.org","sRGB IEC61966-2.1", new FileInputStream(INTENT)));
Document document = new Document(pdf);
//Fonts need to be embedded
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
Paragraph p = new Paragraph();
p.setFont(font);
p.add(new Text("The quick brown "));
Image foxImage = new Image(ImageFactory.getImage(FOX));
p.add(foxImage);
p.add(" jumps over the lazy ");
Image dogImage = new Image(ImageFactory.getImage(DOG));
p.add(dogImage);
document.add(p);
document.close();
The first thing that catches the eye is that we are no longer using a PdfDocument instance. Instead, we create a PdfADocument instance. The PdfADocument constructor needs a PdfWriter as its first parameter, but also a conformance level (in this case PdfAConformanceLevel.PDF_A_1B) and a PdfOutputIntent. This output intent tells the document how to interpret the colors that will be used in the document. In the line that follows the "Fonts need to be embedded" comment, we make sure that the font we're using is embedded.
Figure 7.2: a PDF/A-1 level B document
Looking at the PDF shown in Figure 7.2, we see a blue ribbon with the text "This file claims compliance with the PDF/A standard and has been opened read-only to prevent modification." Allow me to explain two things about this sentence:
This doesn't mean that the PDF is, in effect, compliant with the PDF/A standard. It only claims it is. To be sure, you need to open the Standards panel in Adobe Acrobat. When you click on the "Verify Conformance" link, Acrobat will verify if the document is what it claims to be. In this case, we read "Status: verification succeeded"; we have successfully created a document complying with PDF/A-1B.
The document has been opened read-only, but not because you aren't allowed to modify it: PDF/A is not a way to protect a PDF against modification. Adobe Acrobat presents it as read-only because any modification might change the PDF into a PDF that no longer complies with the PDF/A standard. It's not trivial to update a PDF/A without breaking its PDF/A status.
Let's adapt our example, and create a PDF/A-1 level A document with the QuickBrownFox_PDFA_1a example.
//Initialize PDFA document with output intent
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_1A,
new PdfOutputIntent("Custom", "", "http://www.color.org",
"sRGB IEC61966-2.1", new FileInputStream(INTENT)));
Document document = new Document(pdf);
//Setting some required parameters
pdf.setTagged();
//Fonts need to be embedded
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
Paragraph p = new Paragraph();
p.setFont(font);
p.add(new Text("The quick brown "));
Image foxImage = new Image(ImageFactory.getImage(FOX));
//Set alt text
foxImage.getAccessibilityProperties().setAlternateDescription("Fox");
p.add(foxImage);
p.add(" jumps over the lazy ");
Image dogImage = new Image(ImageFactory.getImage(DOG));
//Set alt text
dogImage.getAccessibilityProperties().setAlternateDescription("Dog");
p.add(dogImage);
document.add(p);
document.close();
We've changed PdfAConformanceLevel.PDF_A_1B into PdfAConformanceLevel.PDF_A_1A in line 3. We've made the PdfADocument a Tagged PDF (line 8) and we've added some alt text for the images. Figure 7.3 is somewhat confusing.
Figure 7.3: a PDF/A-1 level A document
When we look at the Standards panel, we see that the document claims to conform to PDF/A-1A and to PDF/UA-1. We don't have a "Verify Conformance" link, so we have to use Preflight. Preflight informs us that there were "No problems found" when executing the "Verify compliance with PDF/A-1a" profile. We can't verify the PDF/UA compliance because PDF/UA involves some requirements that can't be verified by a machine. For instance: a machine wouldn't notice if we switched the description of the image of the fox with the description of the image of the dog. That would make the document inaccessible, as the document would spread false information to people who depend on screen readers. In any case, we know that our document doesn't comply with the PDF/UA standard because we omitted a number of essential elements (such as the language).
From the start, it was determined that approved parts of ISO 19005 could never become invalid. New, subsequent parts would only define new, useful features. That's what happened when part 2 and part 3 were created.
Creating PDFs for long-term preservation, part 2 and 3
ISO 19005-2:2011 (PDF/A-2) was introduced to have a PDF/A standard that was based
on the ISO standard (ISO 32000-1) instead of on Adobe's PDF specification. PDF/A-2 also adds
a handful of features that were introduced in PDF 1.5, 1.6 and 1.7:
Useful additions include: support for JPEG2000, Collections, object-level XMP, and optional content.
Useful improvements include: better support for transparency, comment types and annotations, and digital signatures.
PDF/A-2 also defines an extra level besides Level A and Level B:
Level U ("Unicode"): ensures that the visual appearance of a document will be preserved for the long term, and that all text is stored in Unicode.
ISO 19005-3:2012 (PDF/A-3) was an almost identical copy of PDF/A-2.
There was only one difference with PDF/A-2: in PDF/A-3, attachments don't need to be PDF/A.
You can attach any file to a PDF/A-3, for instance: an XLS file containing calculations of which the results are used in the document,
the original Word document that was used to create the PDF document, and so on. The document itself needs to conform
to all the obligations and restrictions of the PDF/A specification, but these obligations and restrictions do not apply to its attachments.
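The relationship between the parts and their conformance levels can be summarized in a small lookup. This is only a sketch of the rules as stated above, in plain Java without iText; PdfAPartsSketch and its methods are hypothetical names, and it covers only parts 1 to 3 of ISO 19005 as discussed here.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical summary of PDF/A parts 1 to 3; not part of iText. */
final class PdfAPartsSketch {

    enum Level { A, B, U }

    /** Rejects part numbers that this sketch does not cover. */
    private static void requireKnownPart(int part) {
        if (part < 1 || part > 3) {
            throw new IllegalArgumentException("Unknown PDF/A part: " + part);
        }
    }

    /** Conformance levels defined by the given part, as described in the text. */
    static Set<Level> levelsOf(int part) {
        requireKnownPart(part);
        if (part == 1) {
            return EnumSet.of(Level.A, Level.B); // PDF/A-1 has no Level U
        }
        return EnumSet.allOf(Level.class);       // PDF/A-2 and PDF/A-3 add Level U
    }

    /** Only PDF/A-3 allows attachments that are not PDF/A themselves. */
    static boolean allowsArbitraryAttachments(int part) {
        requireKnownPart(part);
        return part == 3;
    }
}
```

A check such as PdfAPartsSketch.allowsArbitraryAttachments(3) is the reason for picking part 3 whenever a source file such as a spreadsheet has to travel inside the PDF.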
In the UnitedStates_PDFA_3a example, we'll create a document that complies with PDF/UA as well as with PDF/A-3A. We choose PDF/A-3, because we're going to add the CSV file that was used as the source for creating the PDF.
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_3A,
new PdfOutputIntent("Custom", "", "http://www.color.org",
"sRGB IEC61966-2.1", new FileInputStream(INTENT)));
Document document = new Document(pdf, PageSize.A4.rotate());
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/A-3 example");
//Add attachment
PdfDictionary parameters = new PdfDictionary();
parameters.put(PdfName.ModDate, new PdfDate().getPdfObject());
PdfFileSpec fileSpec = PdfFileSpec.createEmbeddedFileSpec(
pdf, Files.readAllBytes(Paths.get(DATA)), "united_states.csv",
"united_states.csv", new PdfName("text/csv"), parameters,
PdfName.Data, false);
fileSpec.put(new PdfName("AFRelationship"), new PdfName("Data"));
pdf.addFileAttachment("united_states.csv", fileSpec);
PdfArray array = new PdfArray();
array.add(fileSpec.getPdfObject().getIndirectReference());
pdf.getCatalog().put(new PdfName("AF"), array);
//Embed fonts
PdfFont font = PdfFontFactory.createFont(FONT, true);
PdfFont bold = PdfFontFactory.createFont(BOLD_FONT, true);
// Create content
Table table = new Table(new float[]{4, 1, 3, 4, 3, 3, 3, 3, 1});
table.setWidthPercent(100);
BufferedReader br = new BufferedReader(new FileReader(DATA));
String line = br.readLine();
process(table, line, bold, true);
while ((line = br.readLine()) != null) {
process(table, line, font, false);
}
br.close();
document.add(table);
//Close document
document.close();
Let's examine the different parts of this example.
Line 1-5: We create a PdfADocument (PdfAConformanceLevel.PDF_A_3A) and a Document.
Line 7: Making the PDF a Tagged PDF is a requirement for PDF/UA as well as for PDF/A-3A.
Line 8-12: Setting the language, the document title and the viewer preference to display the title is a requirement for PDF/UA.
Line 14-20: We add a file attachment using specific parameters that are required for PDF/A-3A.
Line 26-27: We embed the fonts which is a requirement for PDF/UA as well as for PDF/A.
Line 28-38: We've seen this code before in the UnitedStates example in chapter 1 (including the process() method).
Line 40: We close the document.
Figure 7.4 demonstrates how using the Table class with Cell objects added as header cells, and Cell objects added as normal cells, resulted in a structure tree that makes the PDF document accessible.
Figure 7.4: a PDF/A-3 level A document
When we open the Attachments panel as shown in Figure 7.5, we see our original united_states.csv file that we can easily extract from the PDF.
Figure 7.5: a PDF/A-3 level A document and its attachment
The examples in this chapter taught us that PDF/UA or PDF/A documents involve extra requirements when compared to ordinary PDFs. "Can we use iText to convert an existing PDF to a PDF/UA or PDF/A document" is a question that is posted frequently on mailing-lists or user forums. I hope that this chapter explains that iText can't do this automatically.
If you have a document that has a picture of a fox and a dog, iText can't add any missing alt text for those images, because iText can't see that fox or that dog. iText only sees pixels; it can't interpret the image.
If you are using a font that isn't embedded, iText doesn't know what that font looks like. If you don't provide the corresponding font program, iText can never embed that font.
These are only two examples of many that explain why converting an ordinary PDF to PDF/A or PDF/UA isn't trivial. It's very easy to change the PDF so that it shows a blue bar saying that the document complies with PDF/A, but that doesn't mean that claim is true.
We also need to pay attention when we merge existing PDF/A documents.
Merging PDF/A documents
When merging PDF/A documents, it's very important that every single document that you are adding to PdfMerger is already a PDF/A document. You can't mix PDF/A documents and ordinary PDF documents into one single PDF and hope the result will be a PDF/A document. The same is true for mixing a PDF/A level A document with a PDF/A level B document. One has a structure tree, the other hasn't; you can't expect the resulting PDF to be a PDF/A level A document.
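A guard of this kind can be written before any pages are copied. The sketch below is plain Java without iText and is deliberately strict: it accepts a merge only when every source claims exactly the target level. MergeCheckSketch, Claim and canMergeAs are hypothetical names, and how the claimed level of each source is determined is left outside the sketch. Whether a Level A source may be merged into a Level B result is not covered by the text above, so the sketch simply refuses it.

```java
import java.util.List;

/** Hypothetical pre-merge check; not part of iText. */
final class MergeCheckSketch {

    /** What a source file claims to be; NONE stands for an ordinary PDF. */
    enum Claim { NONE, PDF_A_LEVEL_B, PDF_A_LEVEL_A }

    /**
     * Returns true only if there is at least one source and every source
     * claims exactly the target level.
     */
    static boolean canMergeAs(Claim target, List<Claim> sources) {
        if (target == Claim.NONE) {
            throw new IllegalArgumentException("Target must be a PDF/A level");
        }
        if (sources.isEmpty()) {
            return false;
        }
        for (Claim source : sources) {
            if (source != target) {
                return false; // an ordinary PDF or a different level spoils the result
            }
        }
        return true;
    }
}
```

Keep in mind that such a check only compares claims; as explained earlier, a claim still has to be verified with a tool such as Preflight.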
Figure 7.6 shows how we merged the two PDF/A level A documents we created in the previous sections.
Figure 7.6: merging 2 PDF/A level A documents
When we look at the structure of the tags, we see that the <P>aragraph is now followed by a <Table>. The MergePDFADocuments example shows how it's done.
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
PdfAConformanceLevel.PDF_A_1A,
new PdfOutputIntent("Custom", "", "http://www.color.org",
"sRGB IEC61966-2.1", new FileInputStream(INTENT)));
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/A-1a example");
//Create PdfMerger instance
PdfMerger merger = new PdfMerger(pdf);
//Add pages from the first document
PdfDocument firstSourcePdf = new PdfDocument(new PdfReader(SRC1));
merger.addPages(firstSourcePdf, 1, firstSourcePdf.getNumberOfPages());
//Add pages from the second pdf document
PdfDocument secondSourcePdf = new PdfDocument(new PdfReader(SRC2));
merger.addPages(secondSourcePdf, 1, secondSourcePdf.getNumberOfPages());
//Merge
merger.merge();
//Close the documents
firstSourcePdf.close();
secondSourcePdf.close();
pdf.close();
This example is assembled using parts of two examples we've already seen before:
Lines 1 to 11 are almost identical to the first part of the UnitedStates_PDFA_3a example we've used in the previous section, except that we now use PdfAConformanceLevel.PDF_A_1A and that we don't need a Document object.
Lines 12 to 25 are identical to the last part of the 88th_Oscar_Combine example of the previous chapter. Note that the source documents are opened as ordinary PdfDocument instances rather than as PdfADocument instances; the target PdfADocument will check if the source documents comply.
There's a lot more to be said about PDF/UA and PDF/A, and even about other sub-standards. For instance: there's a German standard for invoicing called ZUGFeRD that is built on top of PDF/A-3, but let's save that for another tutorial.
Summary
In this chapter, we've discovered that there's more to PDF than meets the eye. We've learned how to introduce structure into our documents so that they are accessible for the blind and the visually impaired. We've also made sure that our PDFs were self-contained, for instance by embedding fonts, so that our documents can be archived for the long term.
We'll need several other tutorials to explore the functionality introduced in this tutorial in more depth, but these seven chapters should already give you a good impression of what you can do with iText 7.
I know that iText can generate tagged PDF documents from scratch, but is it possible to insert alternative text for images in an existing tagged PDF without changing anything else?
I need to implement this feature in a program without using GUI applications such as Adobe Acrobat Pro.
In this example, we take a PDF with images of a fox and a dog where the Alt keys are missing: no_alt_attribute.pdf
Structure element without /Alt key
Code can't recognize a fox or a dog, so we create a new document with Alt attributes saying "Figure without an Alt description": added_alt_attributes.pdf
Structure element with /Alt key
We add this description by walking through the structure tree, looking for structural elements marked as /Figure elements:
public void manipulatePdf(String src, String dest) throws IOException {
    PdfDocument pdfDoc = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
    PdfDictionary catalog = pdfDoc.getCatalog().getPdfObject();
    PdfDictionary structTreeRoot = catalog.getAsDictionary(PdfName.StructTreeRoot);
    manipulate(structTreeRoot);
    pdfDoc.close();
}
public void manipulate(PdfDictionary element) {
    if (element == null) {
        return;
    }
    if (PdfName.Figure.equals(element.get(PdfName.S))) {
        element.put(PdfName.Alt, new PdfString("Figure without an Alt description"));
    }
    PdfArray kids = element.getAsArray(PdfName.K);
    if (kids == null) {
        return;
    }
    for (int i = 0; i < kids.size(); i++) {
        manipulate(kids.getAsDictionary(i));
    }
}
You can easily port this Java example to C#:
Get the root dictionary from the PdfDocument object,
Get the root of the structure tree (a dictionary),
Loop over all the kids of every branch of that tree,
When a leaf is a figure, add an /Alt entry.
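The walk described in these steps can also be tried without a PDF at hand. The following sketch models structure elements as plain Java maps instead of iText's PdfDictionary, so the AltTagWalk class and its addMissingAlt and element methods are illustrative names, not part of iText. It makes two choices that go beyond the example above, and both should be read as assumptions about what a more defensive version might do: it leaves an existing /Alt entry untouched, and it also descends when /K holds a single dictionary rather than an array (the PDF specification allows both forms).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a structure tree: each element is a Map keyed by
// entry name ("S", "Alt", "K"). This is a model, not the iText API.
final class AltTagWalk {
    static final String PLACEHOLDER = "Figure without an Alt description";

    // Walks the tree and returns the number of /Alt entries that were added.
    static int addMissingAlt(Object node) {
        if (node instanceof List) {
            // /K as an array: visit every kid
            int added = 0;
            for (Object kid : (List<?>) node) {
                added += addMissingAlt(kid);
            }
            return added;
        }
        if (!(node instanceof Map)) {
            return 0; // an integer marked-content ID, or a missing /K
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> element = (Map<String, Object>) node;
        int added = 0;
        // Only fill the gap; an existing description is never overwritten.
        if ("Figure".equals(element.get("S")) && !element.containsKey("Alt")) {
            element.put("Alt", PLACEHOLDER);
            added++;
        }
        // /K may be a single element or an array; the recursion handles both.
        return added + addMissingAlt(element.get("K"));
    }

    // Helper to build a model element with a structure type and optional kids.
    static Map<String, Object> element(String type, Object kids) {
        Map<String, Object> e = new LinkedHashMap<>();
        e.put("S", type);
        if (kids != null) {
            e.put("K", kids);
        }
        return e;
    }

    public static void main(String[] args) {
        Map<String, Object> described = element("Figure", 0);
        described.put("Alt", "A quick brown fox");
        List<Object> kids = new ArrayList<>();
        kids.add(element("P", 1));
        kids.add(described);
        kids.add(element("Div", element("Figure", 2))); // single-dictionary /K
        Map<String, Object> root = element("Document", kids);

        System.out.println(addMissingAlt(root));  // prints 1
        System.out.println(described.get("Alt")); // prints A quick brown fox
    }
}
```

Running the walk a second time adds nothing, which is the property you want if the goal is to insert alternative text without changing anything else.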
/*
    This file is part of the iText (R) project.
    Copyright (c) 1998-2016 iText Group NV
*/
/*
 * This example was written in answer to the following question:
 * http://stackoverflow.com/questions/34036200
 */
package com.itextpdf.samples.sandbox.pdfa;

import com.itextpdf.kernel.pdf.*;
import com.itextpdf.samples.GenericTest;
import com.itextpdf.test.annotations.type.SampleTest;
import org.junit.experimental.categories.Category;

import java.io.File;
import java.io.IOException;

@Category(SampleTest.class)
public class AddAltTags extends GenericTest {
    public static final String DEST = "./target/test/resources/sandbox/pdfa/add_alt_tags.pdf";
    public static final String SRC = "./src/test/resources/pdfs/no_alt_attribute.pdf";

    public static void main(String[] args) throws Exception {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new AddAltTags().manipulatePdf(DEST);
    }

    public void manipulatePdf(String dest) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(dest));
        PdfDictionary catalog = pdfDoc.getCatalog().getPdfObject();
        PdfDictionary structTreeRoot = catalog.getAsDictionary(PdfName.StructTreeRoot);
        manipulate(structTreeRoot);
        pdfDoc.close();
    }

    public void manipulate(PdfDictionary element) {
        if (element == null) {
            return;
        }
        if (PdfName.Figure.equals(element.get(PdfName.S))) {
            element.put(PdfName.Alt, new PdfString("Figure without an Alt description"));
        }
        PdfArray kids = element.getAsArray(PdfName.K);
        if (kids == null) {
            return;
        }
        for (int i = 0; i < kids.size(); i++) {
            manipulate(kids.getAsDictionary(i));
        }
    }
}