Python, PDFs, and Windows Subsystem for Linux


Intro:

PDF (Portable Document Format) files are everywhere, and importing a popular Python package like PDF2Image, PDFtoText, or PopplerQt5 is a common approach to dealing with them. Unfortunately, many users who are not working on a Linux machine report that these packages return errors, because they rely on Poppler.

Never heard of Poppler?

Poppler is a utility for rendering PDFs. It is common on Linux systems, but not on Windows. So, naturally, if we want to use Poppler and its associated packages on Windows, we need to bridge the gap.
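A quick way to see whether Poppler is reachable from Python is to look for its command-line tools on the PATH. The sketch below is only an illustration: the helper name `find_poppler_tools` is made up for this article, and it assumes that finding `pdftoppm` (which pdf2image calls) and `pdftotext` is a good enough signal that poppler-utils is installed.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdftotext")):
    """Map each Poppler command-line tool to its path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(tool, "->", path if path else "not found")
```

On a stock Windows install, every entry will most likely be reported as not found, which is the gap the rest of this article closes.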

Let’s visit Google and see what our options are…

A quick Google (StackOverflow) search reveals that many other people are having this problem and are still looking for solutions.

The Problem:

Poppler and Python’s PDF libraries, which leverage Linux utilities, don’t play well with Windows.

When we look for solutions, many of them are outdated, ineffective, or too difficult to follow.

The Solution:

Of the proposed solutions, one appears to work well:

Windows Subsystem for Linux (WSL).

In fact, because Windows Subsystem for Linux is so powerful, it’s also a great solution for other problems that require Linux tools on a Windows machine.

So, what is WSL?

Windows Subsystem for Linux is a compatibility layer for running Linux binary executables natively on Windows 10. It recently entered version two (WSL 2), which introduced a real Linux kernel. To put it plainly, WSL makes it feel like you’re working on a real Linux machine (and you are).
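Because WSL looks so much like an ordinary Linux machine, a script sometimes needs a hint about where it is running. The sketch below relies on a common heuristic rather than an official API: WSL kernels usually include the word "microsoft" in their release string. The function name `looks_like_wsl` is hypothetical.

```python
import platform


def looks_like_wsl(kernel_release=None):
    """Heuristic: WSL kernels usually report 'microsoft' in their release string."""
    if kernel_release is None:
        kernel_release = platform.uname().release
    return "microsoft" in kernel_release.lower()


if __name__ == "__main__":
    print("Running under WSL?", looks_like_wsl())
```

Passing the release string as an argument makes it easy to try the check against sample values without being inside WSL.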

Installation and Usage Guide — WSL


In this section, we will install and set up WSL in five short steps. Afterwards, we will install and set up Poppler in a few more.

Step 1:

Run Windows PowerShell as an administrator.


Step 2:

Enable WSL by executing the ‘Enable-WindowsOptionalFeature’ command:

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux


Step 3:

Activate the changes by restarting your computer.

Note that Microsoft says, “This reboot is required in order to ensure that WSL can initiate a trusted execution environment.”

Step 4:

Now that you’re back from the restart, WSL is enabled on your system and you are ready to install a Linux distribution.

Go to the Windows Store and search for WSL.

[Figure: Getting WSL from the Windows Store]

Step 5 (final):

Click Ubuntu and choose Install. Note that mine is already installed, so you will have to use some imagination here.


Installation and Usage Guide — Poppler:

Step 1:

Enter WSL through a terminal, like this one in VS Code. Notice that once you enter WSL, the terminal prompt will change. You are now operating within a Linux machine! Exciting!

[Figure: Enter WSL]

Step 2:

Run the following commands at the WSL prompt. Note that you can ignore the steps that deal with Tesseract-OCR and PyTesseract; these are for the demo project that I share at the end of the article.

# Author:  Matthew E. Miller
# Date: 1/1/2020
# Medium: https://medium.com/@matthew_earl_miller (where this is being published)
# Github: https://github.com/matmill5
# Linkedin: https://www.linkedin.com/in/matthew-miller-engineer/
# StackOverflow: https://stackoverflow.com/users/11937169/matthew-e-miller?tab=profile

# Command 1: Enter Windows Subsystem for Linux
PS C:\Users\Matthew\Desktop\Project> wsl

# Command 2: Cleanup
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ sudo apt-get clean

# Command 3: Update
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ sudo apt-get update

# Command 4: Get Python 3 on your WSL
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ sudo apt install python3

# Command 5: Get pip for Python 3
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ sudo apt install python3-pip

# Command 6: Get poppler-utils
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ sudo apt install poppler-utils

# Command 7: Get pdf2image (dependent on Poppler and the inspiration for this article)
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ pip3 install pdf2image

# Command 8: Get pathlib (optional: it already ships with Python 3's standard library)
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ pip3 install pathlib

# Command 9: Get pytesseract (if you're doing OCR)
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ pip3 install pytesseract

# Command 10: Get tesseract-ocr (if you're doing OCR)
user@machine_name:/mnt/c/Users/Matthew/Desktop/Project$ sudo apt-get install tesseract-ocr
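
The prompts in these commands show the same project folder under two names: `C:\Users\Matthew\Desktop\Project` in PowerShell and `/mnt/c/Users/Matthew/Desktop/Project` inside WSL. As a rough sketch of that mapping, the hypothetical helper below translates a drive-letter Windows path into its `/mnt/<drive>` form. It assumes WSL's default mount point and does not handle network (UNC) paths.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl_path(windows_path):
    """Translate a drive-letter Windows path to its default /mnt/<drive> location in WSL."""
    path = PureWindowsPath(windows_path)
    drive = path.drive.rstrip(":").lower()
    if len(drive) != 1:
        raise ValueError("expected a path with a drive letter, got: " + windows_path)
    # path.parts[0] is the anchor (for example 'C:\\'); the rest are folder names
    return str(PurePosixPath("/mnt", drive, *path.parts[1:]))


# prints /mnt/c/Users/Matthew/Desktop/Project
print(windows_to_wsl_path(r"C:\Users\Matthew\Desktop\Project"))
```

This is handy when a script receives a path copied from Windows Explorer but runs inside WSL.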


Step 3 — Testing (final):

Run a program with your newly acquired, ready-to-use Poppler utilities.
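Before running anything heavier, it can help to confirm that the Python packages from Step 2 are importable in the interpreter you plan to use. The following is a minimal sketch, not part of the original demo: `missing_modules` is a made-up helper, and the default module list assumes you installed pdf2image and pytesseract (Pillow, imported as `PIL`, comes along with pdf2image).

```python
import importlib.util


def missing_modules(names=("pdf2image", "pytesseract", "PIL")):
    """Return the module names that cannot be imported in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If a module shows up as missing, the usual cause is that pip installed it for a different Python version than the one running the script.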

I’ve created this demo script, which you can use if you don’t have your own. You will still need a PDF to work with, though.

# Tesseract OCR demo (Python 3): convert PDF pages to JPEGs, then OCR each image
import io
import os

import pytesseract
from pdf2image import convert_from_path

# If you need to assign tesseract to path
# pytesseract.pytesseract.tesseract_cmd = r'C:\Users\Matthew\AppData\Local\Tesseract-OCR\tesseract.exe'

pdf_path = 'pdfs/A Production Implementation of an Associative Arran Processor -STARAN - Rudolph.pdf'
output_filename = "results.txt"
max_pages = 20  # only the first 20 pages are processed

# pdf2image calls Poppler to render each page as a PIL image
pages = convert_from_path(pdf_path)

# Page images go in a folder named after the first 20 characters of the PDF's name
pdf_name = pdf_path.split('/')[-1]
sub_dir = "images/" + pdf_name.replace('.pdf', '')[0:20] + "/"
os.makedirs(sub_dir, exist_ok=True)

for pg_cntr, page in enumerate(pages[:max_pages], start=1):
    filename = "pg_" + str(pg_cntr) + '_' + pdf_name.replace('.pdf', '.jpg')
    page.save(sub_dir + filename)
    # Append this page's OCR text to the results file
    with io.open(output_filename, 'a+', encoding='utf8') as f:
        f.write("======================================================== PAGE " + str(pg_cntr) + " ========================================================\n")
        f.write(pytesseract.image_to_string(sub_dir + filename) + "\n")
        f.write("======================================================== ========================= ========================================================\n")


This code works by converting each page of the PDF to a JPG. Then it runs OCR on each image and writes the results to an output file.
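
The same convert-then-OCR idea can also be wrapped in a small function. The version below is a sketch rather than the demo itself: the names `page_image_name` and `ocr_pdf` are made up for illustration, it uses pdf2image's `first_page` and `last_page` parameters so that Poppler renders only the pages needed, and it hands each PIL image straight to `pytesseract.image_to_string` instead of saving a JPG first.

```python
import os


def page_image_name(pdf_path, page_number):
    """Build a per-page JPEG name in the demo's style, e.g. pg_1_report.jpg."""
    stem, _ = os.path.splitext(os.path.basename(pdf_path))
    return "pg_{}_{}.jpg".format(page_number, stem)


def ocr_pdf(pdf_path, max_pages=20):
    """Yield (page_number, text) for the first max_pages pages of a PDF."""
    # Imported here so the helper above works without the OCR stack installed
    import pytesseract
    from pdf2image import convert_from_path

    # first_page/last_page limit how many pages Poppler has to render
    pages = convert_from_path(pdf_path, first_page=1, last_page=max_pages)
    for page_number, page in enumerate(pages, start=1):
        # image_to_string accepts a PIL image, so no intermediate file is needed
        yield page_number, pytesseract.image_to_string(page)
```

Looping over `ocr_pdf(...)` with the path to one of your own PDFs gives the text page by page. If you still want the JPGs on disk, `page_image_name` reproduces the naming pattern from the demo.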

Conclusion:

That’s it. You are now certified in Poppler-on-Windows.

Enjoy the spoils of war! You have gained some seriously powerful new skills. You are well on your way to becoming a more flexible developer (if you aren’t already).

Newly Acquired Skills:

  • Ability to successfully manipulate PDFs with Python.
  • Access to PDF2Image, PDFToText, or other Poppler-utils.
  • Windows Subsystem for Linux. ** A seriously powerful dev-tool **

Now What… What Can You Build?

It’s so important to experiment with these new skills and solidify your understanding. True understanding comes with experience.

My Poppler-On-Windows Project:

I built an OCR application to help document the historical work of emeritus professor and famous computer scientist Dr. Kenneth E. Batcher. It uses a PDF-to-image tool for JPEG conversion. Then it runs OCR on each image and writes the results to an output file. Since this proof of concept works well enough, it will eventually be used on document scans instead of PDFs.
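
Moving from PDFs to document scans mostly means skipping the Poppler conversion step and feeding image files to Tesseract directly. The following is only a sketch of how that might look, not the project's actual code: the function names, the folder layout, and the list of image extensions are all assumptions.

```python
import os

# Assumed set of scan formats; adjust to whatever your scanner produces
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".tif", ".tiff")


def list_scans(folder):
    """Return the image files in folder, sorted by path."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


def ocr_scans(folder):
    """Yield (path, text) for every scanned image in folder."""
    # Imported lazily; needs pytesseract, Pillow, and the tesseract-ocr package
    import pytesseract
    from PIL import Image

    for path in list_scans(folder):
        with Image.open(path) as image:
            yield path, pytesseract.image_to_string(image)
```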