PDFix Blog

PDF responsive revolution

We all dislike reading PDF documents on our smart phones. We pinch in, pinch out, scroll around and attempt to read what was originally designed to be shown on a screen over 10 times larger. Then there’s filling in a PDF form on your smart phone, which is close to impossible. So it appears that our beloved smart phone isn’t smart enough to handle viewing PDFs. We have a solution.

Hyper-accurately extracted content can make a PDF responsive

The PDF format was not originally designed to be flexible when viewing. It was designed to simulate printed or displayed content similar to reading a book, magazine or newspaper. But smart phones and other small devices are simply too small to view the entire page, so the user has to zoom and scroll to view the content.

When the ISO standard for PDF was developed, it did not take small devices into account. PDF was born in 1993 and became an ISO standard in 2008, when smart phones and portable devices were still gaining traction. For PDF as it’s currently designed, this sounds like the beginning of the end for the trillions of PDFs out there, given they don’t behave on mobile devices. The happy news is that a PDF is actually a data container. Because of this, we aren’t limited to viewing a PDF precisely as it was created for a large screen. When the content is carefully extracted and the logical reading order created, the PDF can be presented on devices of any size in a user-friendly manner. This is also true for PDF forms, which can now be filled out as easily as a responsively designed web form.

Using the PDFix API and our secret sauce algorithm, we can transform your PDFs into responsive documents.

This technology can be used to delight your customers with a fantastic UX, build web pages with embedded PDF and mine for data in PDF with hyper-accurate extraction. Last but not least, we can also rebuild AcroForms into a responsive design so that your users can fill out PDF forms directly on their mobile device in a cleaner and simpler way. A mobile app for reading your PDF in a responsive state is coming soon for iPhone and Android. Follow us or subscribe for updates to be the first to know when we launch.

The magic of data hidden in a PDF

It’s a common myth that the PDF is the final format of a document, where the data and content can only be rendered on a screen or printed on paper. This is far from correct and we will explain why.

John Warnock had a vision. “This project’s goal is to solve a fundamental problem that confronts today’s companies. The problem is concerned with our ability to communicate visual material between different computer applications and systems.” So was born the Portable Document Format.

Over 23 years ago Adobe Systems created the PDF, and at the same time gave everybody “Reader” – a PDF Viewing engine. The engine was restricted in functionality, which gave rise to the belief it was a secure end point of content. This belief has yet to change.

The humble PDF has many more powerful tricks up its sleeve. Designed for viewing and printing, a PDF is at its core a fixed-layout document that ensures pixel-perfect rendering on any screen or paper. Text is not wrapped but placed at precise positions in the document, and words are often not stored as complete strings. Each letter can be written independently. Tables can be written as a stream of lines and numbers or text. Each character, or glyph, is an individual entity that does not belong to those around it, even though the page looks exactly as the author intended when viewed or printed.
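To make the "each glyph is an individual entity" point concrete, here is a minimal sketch of how independently placed glyphs could be grouped back into lines and words. It is an illustration only, not the PDFix algorithm: the Glyph record, the glyphs_to_lines function and both threshold parameters are hypothetical names and values chosen for this example, and coordinates are assumed to be in PDF points with y growing upward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in points
    y: float      # baseline position, in points (PDF y grows upward)
    width: float  # advance width, in points


def glyphs_to_lines(glyphs, y_tolerance=2.0, gap_ratio=0.3):
    """Group independently placed glyphs into text lines, top to bottom.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    # Visit glyphs from the top of the page downward.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= y_tolerance:
            lines[-1][1].append(g)   # same baseline: same line
        else:
            lines.append([g.y, [g]])  # new baseline: new line

    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left to right within a line
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A horizontal gap wider than a fraction of the previous
            # glyph is treated as a word break.
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        result.append(text)
    return result
```

Real documents add rotated text, ligatures, kerning and right-to-left scripts, so a fixed gap threshold like this is only a starting point.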

PDF tagging provides fantastic assistance in recognizing this structured content. Yet tagged PDFs make up less than 18% of the trillions of PDF files in existence, and PDF tagging still doesn’t guarantee the correctness of the PDF structure.

It is common practice, when the content is needed, to physically print the PDF, run it through Optical Character Recognition (OCR) and manipulate the content by hand. Even if the goal is to extract only the text, OCR misses a vast array of information inside the PDF that can actually assist in the correct extraction of the content.

For the successful extraction of data from within a PDF’s structure, the tool must be able to find the semantics, the content structure and its logical reading order. The tool must also find the words, lines, paragraphs, tables, columns, rows, and each individual cell. It must detect vector graphics that make sense only when presented as a whole. Does your extraction tool do this? You don’t want your content changed or manipulated when extracted from a PDF.
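As a rough illustration of what "logical reading order" means for a multi-column page, the sketch below sorts text blocks column by column and then top to bottom. It is a simplified assumption of how such ordering could work, not a description of any particular tool: the Block record, the reading_order function and the column_tolerance value are hypothetical, and coordinates are assumed to be in PDF points with y growing upward.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    left: float  # left edge of the block, in points
    top: float   # top edge of the block, in points (PDF y grows upward)


def reading_order(blocks, column_tolerance=20.0):
    """Return block texts column by column, each column top to bottom.

    Blocks whose left edges lie within column_tolerance of the first
    block of a column are assumed to belong to that column.
    """
    columns = []  # each entry is [column_left, [blocks in that column]]
    for b in sorted(blocks, key=lambda b: b.left):
        if columns and b.left - columns[-1][0] <= column_tolerance:
            columns[-1][1].append(b)
        else:
            columns.append([b.left, [b]])

    ordered = []
    for _, col in columns:
        # Higher top value means higher on the page in PDF coordinates.
        ordered.extend(sorted(col, key=lambda b: -b.top))
    return [b.text for b in ordered]
```

A naive top-to-bottom sort would interleave the two columns line by line. Full-width headings, tables and figures need extra handling that this sketch leaves out.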

So where does this leave us? How do we go about getting the content from the more than 3.2 trillion PDF documents that are out there?

There are many tools available for getting the unstructured content from a PDF, both commercial and open source, but not as many for getting it in a structured form. Structured extraction tools deal with extraction in different ways, with varied and often poorer results. In the ideal case, words, lines, paragraphs, tables, graphs and pictures, as well as mathematical formulas, are not only available when viewing the PDF but can also be used for machine processing and learning and reused in other applications.

In the case of machine processing, extracted data can be transformed into other formats like XML, JSON or CSV. It can also be used directly for big data processing and machine learning. By working with extracted content that has been correctly structured, it is possible, for example, to read a bank account number and the payable amount on an invoice, identify social security numbers from an application form, read numbers or even find the place in a document where a signature is required and who has to sign it.
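Here is a small sketch of that idea, assuming the text lines have already been extracted in the correct reading order. The function names, the "Amount due" label and the simplified IBAN-style pattern are assumptions made for this example, and the invoice lines in the usage note are made-up sample data. Only the Python standard library is used.

```python
import csv
import io
import json
import re

# Simplified, assumed patterns: an IBAN-like account number written in
# groups of four characters, and an amount following an "Amount due" label.
ACCOUNT_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}\b")
AMOUNT_RE = re.compile(r"(?i)amount due[:\s]*([0-9][0-9,]*\.\d{2})")


def extract_invoice_fields(lines):
    """Pick an account number and a payable amount out of extracted lines."""
    text = "\n".join(lines)
    account = ACCOUNT_RE.search(text)
    amount = AMOUNT_RE.search(text)
    return {
        "account": account.group(0).replace(" ", "") if account else None,
        "amount_due": float(amount.group(1).replace(",", "")) if amount else None,
    }


def to_json(record):
    """Serialize one extracted record as JSON."""
    return json.dumps(record, sort_keys=True)


def to_csv(record):
    """Serialize one extracted record as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(record))
    writer.writeheader()
    writer.writerow(record)
    return buf.getvalue()
```

For example, calling extract_invoice_fields(["Invoice 2016-001", "IBAN: SK31 1200 0000 1987 4263 7541", "Amount due: 1,250.50"]) yields a record that can be passed straight to to_json or to_csv. Pattern matching like this only works reliably when the extraction keeps a label and its value together, which is why the structure matters.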

Extracted content can also benefit the visually impaired, if done correctly. The extracted data can be used to add further structured information and semantics to create PDF/UA (Universal Accessibility) compliant files without any human input. The usual process for creating a PDF/UA compliant file is a laborious task, involving human intervention to ensure the file is correct. This is usually a very expensive process and unfortunately limited to a small subset of PDF files.

Extracted data can be used in other formats. Because the original content of a PDF is retained, it can be adapted to the end user’s device, taking a simple PDF and making it readable on portable devices without pinching and zooming. This responsive approach is becoming the norm in web pages, so why are we leaving the humble PDF in the 90s? Through the correct extraction of data, and building the semantics of the document, we can also make PDF responsive.
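Once paragraphs are available in reading order, adapting them to a small screen is conceptually simple. The sketch below re-wraps extracted paragraphs to a given width using Python's standard textwrap module. The reflow function name and the idea of measuring screen width in characters are assumptions made for this illustration; a real responsive viewer would lay out styled text, images and forms rather than plain text.

```python
import textwrap


def reflow(paragraphs, screen_columns):
    """Re-wrap paragraphs (already in reading order) to a screen width.

    screen_columns is the assumed screen width measured in characters.
    """
    wrapped = []
    for p in paragraphs:
        # Collapse the fixed-layout line breaks, then wrap to the new width.
        normalized = " ".join(p.split())
        wrapped.append(textwrap.fill(normalized, width=screen_columns))
    return "\n\n".join(wrapped)
```

The same paragraphs can then be rendered at 30 characters wide for a phone or 100 for a desktop without touching the source PDF.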

On the surface the myth may seem true, but delve a little deeper into the unused power of a PDF and you will find it is far from a final format. There are numerous ways to get data from a PDF, but not all tools are equally effective, correct or accurate. We here at PDFix have made it our mission to make sure that the information in your PDF isn’t changed or broken. We simply help you gain access to it more easily.