Despite being over 30 years old, PDF remains the backbone of enterprise document workflows. Contracts, invoices, official correspondence, and scanned archive records all circulate as PDFs. Yet the format's reputation as an "unchangeable image of paper" hides how many layers are involved in working with it programmatically.
In this article we cover the core concepts of processing PDFs at the code level, common scenarios, and the mature tools available in the Python ecosystem.
Why Is PDF a "Difficult" Format?
Unlike formats such as Word or HTML, PDF is a presentation format; it describes not the content itself, but how that content should appear on the page. In a typical untagged PDF file there is no such thing as a "paragraph". Instead, you find chunks of text placed at specific coordinates, drawing instructions, and embedded font definitions.
This has two important consequences:
- "Reading" text and "extracting" text are different things. A PDF may look visually flawless while the text inside it is stored in an order that has nothing to do with the reading order.
- A scanned PDF without an OCR text layer contains no text at all; it consists only of page images, and accessing the text requires OCR.
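To make the first point concrete, here is a minimal sketch of rebuilding reading order from positioned words. It assumes a single-column page and words shaped like the dicts pdfplumber's extract_words() returns ("text", "x0", "top"); the function name words_to_lines and the y_tolerance value are illustrative choices, not part of any library.

```python
from typing import Dict, List


def words_to_lines(words: List[Dict], y_tolerance: float = 3.0) -> List[str]:
    """Group positioned words into lines, top to bottom, then left to right.

    Each word is a dict with "text", "x0" (left edge) and "top" (distance
    from the top of the page). Multi-column layouts are NOT handled.
    """
    lines: List[List[Dict]] = []
    for word in sorted(words, key=lambda w: w["top"]):
        # Same line if the word is vertically close to the line's first word.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words by their left edge.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Words in the order they might be stored in the file, not in reading order.
stored = [
    {"text": "Total:", "x0": 50.0, "top": 120.4},
    {"text": "Invoice", "x0": 50.0, "top": 100.0},
    {"text": "42.00", "x0": 110.0, "top": 120.0},
    {"text": "#1001", "x0": 105.0, "top": 100.2},
]
print(words_to_lines(stored))  # ['Invoice #1001', 'Total: 42.00']
```

Real documents with columns, rotated text, or footnotes need more than a sort by coordinates, which is exactly why extraction libraries exist.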
💡 Key distinction: Telling apart a "born-digital" PDF from a "scanned" PDF is the first step in determining which tool you reach for and when.
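One rough way to make this distinction in code is to count the characters found on each page. The sketch below is a heuristic under stated assumptions: classify_pdf, count_chars, and the min_chars threshold of 20 are hypothetical names and values chosen for illustration, and a scan that already carries an OCR text layer will be reported as born-digital.

```python
from typing import List


def classify_pdf(chars_per_page: List[int], min_chars: int = 20) -> str:
    """Classify a document from the number of text characters on each page."""
    if not chars_per_page:
        return "empty"
    pages_with_text = sum(1 for n in chars_per_page if n >= min_chars)
    if pages_with_text == len(chars_per_page):
        return "born-digital"
    if pages_with_text == 0:
        return "scanned"
    return "mixed"  # e.g. a digital contract with scanned annexes


def count_chars(path: str) -> List[int]:
    """Count extractable characters per page with pdfplumber."""
    import pdfplumber  # third-party; imported lazily so classify_pdf stays standalone

    with pdfplumber.open(path) as pdf:
        return [len(page.chars) for page in pdf.pages]


print(classify_pdf([1500, 900]))   # born-digital
print(classify_pdf([0, 0]))        # scanned
print(classify_pdf([1200, 0, 3]))  # mixed
```

A typical pipeline would call classify_pdf(count_chars(path)) and route "scanned" and "mixed" documents through OCR.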
Core Categories of Operations
Programmatic PDF tasks can be grouped under five main headings:
📦 Structural Operations
Merging, splitting, rotating pages, deleting/inserting pages. These rearrange the file's structure without touching its content.
🔍 Content Extraction
Extracting text, tables, images, and metadata. The challenge changes entirely depending on whether the source is digital or scanned.
🏭 Generation
Creating PDFs from scratch or filling templates. Dynamic generation of documents such as invoices, reports, and certificates.
📝 Form Operations
Reading and filling AcroForm fields. Common in enterprise application and approval processes.
🔒 Security and Integrity
Encryption, password protection, digital signing, and watermarking. Critical in environments with legal and regulatory requirements.
The Python Ecosystem: Choosing the Right Tool
Python offers a rich range of libraries for PDF processing. But they do not all do the same job; each excels in a different area.
pikepdf — Low-Level Structural Operations
pikepdf is a Python wrapper built on top of the mature C++ qpdf library. It provides direct, safe access to a PDF file's internal structure. It is preferred for merging, splitting, repair, and password operations. Its success in recovering corrupted PDFs is notable.
import pikepdf

# Merge two PDFs: append the appendix pages to the contract.
with pikepdf.open("contract.pdf") as pdf:
    with pikepdf.open("appendix.pdf") as appendix:
        pdf.pages.extend(appendix.pages)
        # Save while the source file is still open.
        pdf.save("merged.pdf")

# Encrypt with owner and user passwords (R=6 selects AES-256).
with pikepdf.open("report.pdf") as pdf:
    pdf.save(
        "protected.pdf",
        encryption=pikepdf.Encryption(
            owner="admin",
            user="user",
            R=6,
        ),
    )
pikepdf's philosophy is to manipulate the file safely without trying to "understand" the content. It is not designed for semantic tasks like text extraction.
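Splitting follows the same pattern as merging. The sketch below is one possible approach, not the library's prescribed one: split_ranges and split_pdf are hypothetical helper names, and out_pattern is assumed to be a format string such as "part_{}.pdf".

```python
from typing import List, Tuple


def split_ranges(page_count: int, pages_per_file: int) -> List[Tuple[int, int]]:
    """Return (start, stop) page index pairs, stop exclusive, for each chunk."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, page_count))
        for start in range(0, page_count, pages_per_file)
    ]


def split_pdf(src_path: str, out_pattern: str, pages_per_file: int) -> List[str]:
    """Write consecutive chunks of src_path to files named by out_pattern."""
    import pikepdf  # third-party; imported lazily

    written = []
    with pikepdf.open(src_path) as src:
        ranges = split_ranges(len(src.pages), pages_per_file)
        for number, (start, stop) in enumerate(ranges, start=1):
            part = pikepdf.Pdf.new()
            part.pages.extend(src.pages[start:stop])
            out_path = out_pattern.format(number)
            part.save(out_path)  # saved while the source is still open
            written.append(out_path)
    return written


print(split_ranges(5, 2))  # [(0, 2), (2, 4), (4, 5)]
```

Keeping the range arithmetic separate from the file handling makes the off-by-one-prone part easy to test without any PDF on disk.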
Text and Data Extraction
For extracting text from born-digital PDFs, pdfplumber and PyMuPDF (fitz) stand out. pdfplumber is particularly strong at table detection and cell-level positional data extraction; PyMuPDF is the stronger choice for speed and page rendering.
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    tables = page.extract_tables()
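The tables returned by extract_tables() are plain lists of rows, with None for empty cells. A small sketch of turning one table into records keyed by its header row follows; table_to_records is a hypothetical helper, and it assumes the first row really is the header, which is not guaranteed for every document.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Convert a table (list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


sample = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.50"],
    [None, None, None],
    ["Gadget", "1", None],
]
print(table_to_records(sample))
```

All values stay strings here; converting quantities and prices to numbers is a separate validation step that depends on the document's locale.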
Text cannot be extracted directly from scanned documents; OCR must be applied first. This is where Tesseract-based solutions or commercial OCR engines come in. OCR quality depends heavily on the resolution of the scan and the language of the document. For languages with accented characters in particular, selecting the correct language pack noticeably affects the result.
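As an illustration of the language-pack point, the sketch below renders each page with pdfplumber and passes it to Tesseract through pytesseract. It assumes the Tesseract binary and the requested language data are installed; tesseract_lang and ocr_pdf are hypothetical helper names, and 300 DPI is a common starting point rather than a rule.

```python
from typing import Iterable, List


def tesseract_lang(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', e.g. ['eng', 'deu'] -> 'eng+deu'."""
    cleaned = [code.strip().lower() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_pdf(path: str, languages: Iterable[str], resolution: int = 300) -> List[str]:
    """Return the OCR text of every page in a scanned PDF."""
    import pdfplumber   # third-party; renders pages to images
    import pytesseract  # third-party wrapper around the Tesseract binary

    lang = tesseract_lang(languages)
    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # .original is the rendered page as a PIL image.
            image = page.to_image(resolution=resolution).original
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return texts


print(tesseract_lang(["eng", "deu"]))  # eng+deu
```

For a mixed-language archive, a call such as ocr_pdf("scan.pdf", ["eng", "deu"]) lets Tesseract consider both language models on every page.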
pyHanko — Digital Signing
When it comes to PAdES-compliant digital signing, pyHanko is the most comprehensive open-source solution in the Python ecosystem. It meets enterprise requirements such as visual signatures, timestamping, long-term validation (LTV), and HSM integration.
from pyhanko.sign import signers, fields
from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter

# Load the signing key and certificate from PEM files.
signer = signers.SimpleSigner.load(
    "private_key.pem", "certificate.pem",
    key_passphrase=b"password"
)

# The input file must stay open until signing has finished.
with open("document.pdf", "rb") as f:
    w = IncrementalPdfFileWriter(f)
    # Add a visible signature field; box is given in page coordinates.
    fields.append_signature_field(
        w, fields.SigFieldSpec(
            sig_field_name="Signature",
            box=(50, 50, 250, 120)
        )
    )
    with open("signed.pdf", "wb") as out:
        signers.sign_pdf(
            w,
            signers.PdfSignatureMetadata(
                field_name="Signature"
            ),
            signer=signer,
            output=out
        )
💡 PAdES levels: Embedding the timestamp and validation information is critical so that the signature remains valid even after the certificate expires. For legal documents to be archived long-term, at least the B-LT level should be targeted.
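A possible shape for a B-LT signature with pyHanko is sketched below. Treat it as a starting point under assumptions: b_lt_metadata_kwargs and sign_b_lt are hypothetical helpers, tsa_url must point to a timestamp authority you are allowed to use, and the validation context will usually need your own trust roots unless the certificate chains to a root it already trusts.

```python
from typing import Any, Dict


def b_lt_metadata_kwargs(field_name: str) -> Dict[str, Any]:
    """Keyword arguments for PdfSignatureMetadata that request PAdES B-LT."""
    return {
        "field_name": field_name,
        "md_algorithm": "sha256",
        "embed_validation_info": True,  # store revocation data in the file
        "use_pades_lta": False,         # set True to aim for B-LTA instead
    }


def sign_b_lt(src: str, dst: str, signer: Any, tsa_url: str,
              field_name: str = "Signature") -> None:
    """Sign src with a timestamp and embedded validation info, write to dst."""
    # Third-party imports, done lazily so the helper above stays standalone.
    from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
    from pyhanko.sign import signers, timestamps
    from pyhanko.sign.fields import SigSeedSubFilter
    from pyhanko_certvalidator import ValidationContext

    meta = signers.PdfSignatureMetadata(
        subfilter=SigSeedSubFilter.PADES,
        # allow_fetching lets the validator download OCSP/CRL responses.
        validation_context=ValidationContext(allow_fetching=True),
        **b_lt_metadata_kwargs(field_name),
    )
    timestamper = timestamps.HTTPTimeStamper(tsa_url)
    with open(src, "rb") as inf, open(dst, "wb") as outf:
        writer = IncrementalPdfFileWriter(inf)
        signers.sign_pdf(writer, meta, signer=signer,
                         timestamper=timestamper, output=outf)


print(b_lt_metadata_kwargs("Signature"))
```

Because signing at this level contacts the timestamp authority and revocation services, network failures at signing time are one of the edge cases worth planning for.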
Apryse (formerly PDFTron) — The Commercial Industry Standard
Although open-source tools meet many needs, commercial SDKs like Apryse come into play in enterprise scenarios that require high-fidelity rendering, complex form handling, redaction, advanced OCR, and broad format conversions.
Apryse delivers a wide range of capabilities, from rendering to signing, under a single consistent API, and runs across multiple platforms (Windows, Linux, mobile, browser).
⚠️ Licensing: OEM keys are often platform-locked. The deployment architecture (for example, running on a Linux server) must be planned in line with the license terms. It is important to clearly assess whether you have truly reached the limits of the open-source stack.
For Decision-Makers: Build vs. Buy
Your technical team can integrate most of these libraries within a few days. But the real question is not the library itself, but the engineering around it:
- What percentage of scanned documents will be OCR'd correctly
- Whether the signing infrastructure complies with regulations
- How high-volume batch processing scales
- Resilience in edge cases (corrupted files, unusual fonts, mixed-language content)
The true cost is hidden in these items.
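Much of that surrounding engineering is about making sure one bad file does not stop a whole batch. A minimal sketch using only the standard library is shown below; process_batch and fake_handler are hypothetical names, and the thread pool is an assumption that suits I/O-bound work (CPU-heavy steps such as OCR would more likely use ProcessPoolExecutor).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def process_batch(
    paths: Iterable[str],
    handler: Callable[[str], object],
    max_workers: int = 4,
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Run handler on every path; collect results and failures separately."""
    results: Dict[str, object] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(handler, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one corrupted file must not stop the batch
                failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_handler(path: str) -> int:
    """Stand-in for real PDF processing; fails on 'corrupt' files."""
    if "corrupt" in path:
        raise ValueError("cannot parse xref table")
    return len(path)


ok, failed = process_batch(["a.pdf", "corrupt.pdf", "report.pdf"], fake_handler)
print(ok)
print(failed)
```

The failures dictionary is what turns edge cases into something measurable: it can feed a retry queue, a manual review list, or a report on how often each error type occurs.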
A practical approach: For standard needs such as structural operations and born-digital text extraction, the open-source stack (pikepdf + pdfplumber/PyMuPDF + pyHanko) is usually sufficient and economical. When high-fidelity rendering, enterprise redaction, multi-format conversion, and large-scale OCR are involved, the total cost of ownership of a commercial SDK is often lower than maintaining your own solution.
Conclusion
Although PDF processing may look as simple as "open the file and take what's inside," beneath it lie layers such as font encodings, page coordinate systems, signature cryptography, and OCR accuracy.
The key to choosing the right tool is to break your task into the categories above and use the most appropriate, most mature solution for each. While the Python ecosystem handles the vast majority of these tasks elegantly, commercial SDKs enter the picture as requirements for regulation, scale, and accuracy increase.
A well-designed PDF processing layer becomes an invisible but critical part of enterprise document workflows — and building this layer correctly from the start is always cheaper than paying for it later as technical debt.
YesPDF delivers all the processing categories covered in this article — structural operations, text extraction, OCR, digital signing, and security — in a single enterprise platform. Fully on-premise, GDPR-compliant.