PDFs parsing using YOLOV3

April 30, 2020 — 8 min

PDF files or Portable Document Format are a type of files developed by Adobe in order to enable the creation of various forms of content. Particularly, it allows a consistent safety regarding the change in its content. A PDF file can host different types of data: text, images, media, ...etc. It is a tag-structured file which makes it easy to parse it just like an HTML page.


With that being said and for the sake of structure, we can separate the PDF files into two classes:

  • Text-based files: containing text that can be copied and pasted
  • Image-based files: contained images such as scanned documents

In this article, we will go through the main python libraries which enable PDF files parsing both text-based and image-based ones which will be OCRised and then processed as a text-based file. We will also cover in the last chapter how to use the object detection algorithm YOLOV3 in order to parse tables.

Tabe of contents

The summary is as follows:

  1. Image-based pdf files

    1.1. OCRMYPDF

  2. Text-based pdf files

    2.1. PyPDF2
    2.2. PDF2IMG
    2.4. Camelot
    2.5. Camelot mixed with YOLOV3

For the sake of illustration all along this article, we will use this pdf file.

1. Image-based pdf files

1.1. OCRMYPDF

Ocrmypdf is a python package which allow to turn an image-based pdf into a text-based one, where text can be selected, copied and pasted.
In order to install ocrmypdf you can use brew for macOS and Linux using the command line:

brew install ocrmypdf


Once the package is installed, you can ocrise your pdf file by running the following command line:

ocrmypdf input_file.pdf output_file.pdf

where:

  • ocrmypdf: the path variable
  • input_file.pdf: the image-based pdf file
  • output_file.pdf: the output text-based file
ocrmypdf


Once the pdf is turned into a text-based one, it can be treated using all the libraries detailed below.
For further details on ocrmypdf, please visit the following the official website.

2. Text-based pdf files

In this section, we will mainly focus on three python libraries that allow to extract the content of a text-based pdf file.

2.1. PyPDF2

PyPDF2 is a python tool which enables us to parse basic information about the pdf file such the author the title…etc. It also allows the get the text of a given page along with splitting pages and opening encrypted files under the assumption of having the password.
PyPDF2 can be installed used pip by running the following command line:

pip install PyPDF2

We sum all the functionalities listed above the following python scripts:

  • Reading a pdf file:
from PyPDF2 import PdfFileWriter, PdfFileReader
PDF_PATH = "boeing.pdf"
pdf_doc = PdfFileReader(open(PDF_PATH, "rb"))
  • Extracting document information:
print("---------------PDF's info---------------")
print(pdf_doc.documentInfo)
print("PDF is encrypted: " + str(pdf_doc.isEncrypted))
print("---------------Number of pages---------------")
print(pdf_doc.numPages)
>> ---------------PDF's info---------------
>> {'/Producer': 'WebFilings', '/Title': '2019 12 Dec 31 8K Press Release Exhibit 99.1', '/CreationDate': 'D:202001281616'}
>> PDF is encrypted: False
>> ---------------Number of pages---------------
>> 14
  • Splitting documents page by page:
#indexation starts at 0
pdf_page_1 = pdf_doc.getPage(0)
pdf_page_4 = pdf_doc.getPage(3)
print(pdf_page_1)
print(pdf_page_4)
>> {'/Type': '/Page', '/Parent': IndirectObject(1, 0), '/MediaBox': [0, 0, 612, 792], '/Resources': IndirectObject(2, 0), '/Rotate': 0, '/Contents': IndirectObject(4, 0)}
>> {'/Type': '/Page', '/Parent': IndirectObject(1, 0), '/MediaBox': [0, 0, 612, 792], '/Resources': IndirectObject(2, 0), '/Rotate': 0, '/Contents': IndirectObject(10, 0)}
  • Extracting text from a page:
text = pdf_page_1.extractText()
print(text[:500])
>> '1Boeing Reports Fourth-Quarter ResultsFourth Quarter 2019 Financial results continue to be significantly impacted by the 737 MAX grounding Revenue of $17.9 billion, GAAP loss per share of ($1.79) and core (non-GAAP)* loss per share of ($2.33) Full-Year 2019 Revenue of $76.6€billion, GAAP loss per share of ($1.12) and core (non-GAAP)* loss per share of ($3.47) Operating cash flow of ($2.4)€billion; cash and marketable securities of $10.0 billion Total backlog of $463 billion, including over 5,400'
  • Merging documents page by page:
new_pdf = PdfFileWriter()
new_pdf.addPage(pdf_page_1)
new_pdf.addPage(pdf_page_4)
new_pdf.write(open("new_pdf.pdf", "wb"))
print(new_pdf)
>> <PyPDF2.pdf.PdfFileWriter object at 0x11e23cb10>
  • Cropping pages:
print("Upper Left: ", pdf_page_1.cropBox.getUpperLeft())
print("Lower Right: ", pdf_page_1.cropBox.getLowerRight())

x1, y1 = 0, 550
x2, y2 = 612, 320

cropped_page = pdf_page_1
cropped_page.cropBox.upperLeft = (x1, y1)
cropped_page.cropBox.lowerRight = (x2, y2)

cropped_pdf = PdfFileWriter()
cropped_pdf.addPage(cropped_page)
cropped_pdf.write(open("cropped.pdf", "wb"))
cropping
  • Encrypting and decrypting PDF files:
PASSWORD = "password_123"
encrypted_pdf = PdfFileWriter()
encrypted_pdf.addPage(pdf_page_1)
encrypted_pdf.encrypt(PASSWORD)
encrypted_pdf.write(open("encrypted_pdf.pdf", "wb"))

read_encrypted_pdf = PdfFileReader(open("encrypted_pdf.pdf", "rb"))
print(read_encrypted_pdf.isEncrypted)
if read_encrypted_pdf.isEncrypted:
    read_encrypted_pdf.decrypt(PASSWORD)
print(read_encrypted_pdf.documentInfo)
>> True
>> {'/Producer': 'PyPDF2'}
encrypted

For further information on PyPDF2, please visit the official website.

2.2. PDF2IMG

PDF2IMG is a python library which allows to turn pdf pages into images that can be processed, for instance, by computer vision algorithms.
PDF2IMG can be installed using pip by running the following command line:

pip install pdf2image

We can set the first page and the last page to be transformed to images from the pdf file.

from pdf2image import convert_from_path
import matplotlib.pyplot as plt
img_page = convert_from_path(PDF_PATH, first_page=0, last_page=0)[0]
print(img_page)
plt.imshow(img_page)
>> <PIL.PpmImagePlugin.PpmImageFile image mode=RGB size=1700x2200 at 0x11DF397D0>
pdf2img

2.3. Camelot

Camelot is a python library specialised in parsing tables of pdfs pages. It can be installed using pip by running the following command line:

pip install camelot-py[cv]

The output of the parsing is a pandas dataframe which is very useful for the data processing.

import camelot
output_camelot = camelot.read_pdf(
    filepath="output_ocr.pdf", pages=str(0), flavor="stream"
)
print(output_camelot)
table = output_camelot[0]
print(table)
print(table.parsing_report)
>> TableList n=1>
>> <Table shape=(18, 8)>
>> {'accuracy': 93.06, 'whitespace': 40.28, 'order': 1, 'page': 0}

When the pdf page contains text, the output of camelot will be a dataframe containing text in the first columns and later the desired table. With some basic processing we can extract it as follows:

camelot

Camelot offers two flavors lattice and stream, I advise to use stream since it is more flexible to tables structure.

2.4. Camelot mixed with YOLOV3

Camelot offers the option of specifying the regions to process through the variable table_areas="x1,y1,x2,y2" where (x1, y1) is left-top and (x2, y2) right-bottom in PDF coordinate space. When filled out, the result of the parsing is significantly enhanced.

Explaining the basic idea

One way to automize the parsing of tables is to train an algorithm capable of returning the coordinates of the bounding boxes circling the tables, as detailed in the following pipeline:

pipeline

If the primitive pdf page is image-based, we can use ocrmypdf to turn into a text-based one in order to be able to get the text inside the table. We, then, carry out the following operations:

  • Transform a pdf page into an image one using pdf2img
  • Use a trained algorithm to detect the regions of tables.
  • Normalize the bounding boxes, using the image dimension, which enables use to get the regions in the pdf space using the pdf dimensions obtained through PyPDF2.
  • Feed the regions to camelot and get the corresponding pandas data-frames.


When detecting a table in pdf image we expand the bounding boxe in order to guarante its full inclusion, as follows:

correction

Tables detection

The algorithm which allows the detection of tables, is nothing but yolov3, I advise your to read my previous article about objects detection.
We finetune the algorithm to detect tables and retrain all the architecture. To do so, we carry out the following steps:

  • Create a training database using Makesense a tool which enables labelling and exporting in yolo’s format:
makesense
  • Train a yolov3 repository modified to fit our purpose on AWS EC2, we get the following results:
results

NB: following the same steps, we can train the algorithms to detect any other object in a pdf page such as graphics and images which can be extracted from the image page. You can check my project in my github.