
PDF OCR Explained: How Optical Character Recognition Works for Scanned PDFs

PDFWhisk Editorial Team · 22 May 2026 · 7 min read


Quick answer

Scanned PDFs are a common problem. You receive a document that looks like it contains text, but when you try to select a word, the whole page highlights as a single image. The text is not really there; it is just a photograph of text. You cannot search it, copy from it, or have a screen reader read it aloud. Optical character recognition, or OCR, fixes that by analysing the image and identifying the characters it contains.


In this guide


What OCR actually does
What affects OCR accuracy
OCR tools that are actually useful
What to do after OCR
OCR and PDF accessibility


What OCR actually does

OCR is a computer vision process. The software analyses an image pixel by pixel, looking for patterns that match known character shapes. It identifies letter outlines, distinguishes characters from background, handles variation in font, size, and rotation, and assembles identified characters into words and lines. Modern OCR uses neural networks trained on millions of character examples, which is why accuracy has improved dramatically in recent years.

The output is a text layer: a set of recognised text strings, each with coordinates on the page. In a PDF, this text layer is placed invisibly over the original page images, so the document looks identical but now contains searchable, selectable text. The image quality of the scan is not affected.
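To make the idea of a text layer concrete, here is a minimal sketch of how recognised words and their page coordinates could be represented and searched. The `Word` class, the `find_word` helper, and the sample values are all hypothetical illustrations; real PDF text layers are stored in a different internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognised word and its bounding box in page coordinates (points)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def find_word(text_layer, query):
    """Return every Word whose text matches the query, ignoring case."""
    needle = query.lower()
    return [w for w in text_layer if w.text.lower() == needle]


# Hypothetical text layer for one line of a scanned invoice.
text_layer = [
    Word("Invoice", 72.0, 90.0, 58.0, 14.0),
    Word("number", 134.0, 90.0, 52.0, 14.0),
    Word("10482", 190.0, 90.0, 40.0, 14.0),
]

hits = find_word(text_layer, "invoice")
print(hits)
```

Because each word carries its position, a viewer can highlight the matching region of the page image when you search or select text.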

What affects OCR accuracy

OCR is not magic. The quality of the output depends on the quality of the input and the characteristics of the document.

Scan resolution is the most important factor. 300 DPI is the minimum recommended for reliable OCR on standard text. At 150 DPI, fine details in letterforms may be lost, reducing accuracy. Below 150 DPI, error rates rise significantly. If you are scanning a document specifically to run OCR, scan at 300 DPI even if you plan to compress the file size afterwards.
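The effect of resolution is easy to see with some arithmetic. The sketch below estimates how many pixels are available for a line of 10 pt type at different scan resolutions. The function names are illustrative, and the calculation uses the full em box of the font, which is an assumption that overstates the height of the actual letterforms.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def glyph_height_px(font_size_pt, dpi):
    """Approximate pixel height of the font's em box at a given scan resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


def page_size_px(width_in, height_in, dpi):
    """Pixel dimensions of a scanned page of the given physical size."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 300):
    print(
        dpi, "DPI:",
        round(glyph_height_px(10, dpi), 1), "px per 10 pt em,",
        page_size_px(8.5, 11, dpi), "px for a US Letter page",
    )
```

At 300 DPI a 10 pt em box is roughly 42 pixels tall; at 150 DPI it is roughly 21, and lowercase letters occupy only part of that height, which leaves few pixels for fine strokes.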

Image quality and contrast matter. Clean, high-contrast black text on white paper gives near-perfect results with modern OCR. Faded ink, yellowed paper, coffee stains, or pages that were folded and crinkled reduce accuracy. Grey-on-white text or coloured text on coloured backgrounds is harder to recognise.

Font type makes a difference. Clearly printed standard typefaces such as Times, Helvetica, Arial, and Courier are OCR-friendly. Decorative fonts, stylised script, and handwriting are much harder. Handwriting OCR is a separate, specialised problem that general OCR tools handle poorly.

Layout complexity affects how well the tool can determine reading order. A single column of text on a plain white page is simple. A newspaper-style multi-column layout with mixed text blocks, photos, sidebars, and captions is hard. OCR may produce correct individual words but in the wrong order, particularly across column boundaries.
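The column problem can be illustrated with a toy example. The sketch below compares a naive top-to-bottom sort with a column-aware sort on a hypothetical two-column page. It assumes the column boundary (`column_split_x`) is already known, which real OCR engines have to work out for themselves through layout analysis.

```python
def naive_order(words):
    """Sort strictly top-to-bottom, then left-to-right, ignoring columns."""
    return [w[0] for w in sorted(words, key=lambda w: (w[2], w[1]))]


def column_order(words, column_split_x):
    """Read the whole left column first, then the right column."""
    def key(word):
        text, x, y = word
        column = 0 if x < column_split_x else 1
        return (column, y, x)
    return [w[0] for w in sorted(words, key=key)]


# Hypothetical two-column page: (text, x, y) with y increasing down the page.
words = [
    ("Rates", 50, 100), ("rose", 110, 100), ("Rain", 350, 100), ("is", 400, 100),
    ("again", 50, 120), ("today.", 110, 120), ("expected", 350, 120), ("soon.", 430, 120),
]

print(" ".join(naive_order(words)))
# Rates rose Rain is again today. expected soon.
print(" ".join(column_order(words, column_split_x=300)))
# Rates rose again today. Rain is expected soon.
```

Every word is recognised correctly in both cases; only the ordering differs, which is exactly the failure you see when copying text out of a badly segmented multi-column scan.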

OCR tools that are actually useful

Adobe Acrobat (paid) has the best OCR implementation for professional use. It handles multi-column layouts well, supports a wide range of languages, and allows manual correction of recognised text. For converting archives of scanned documents, Acrobat's batch processing is the most reliable option.

Google Drive offers free OCR as part of opening a PDF with Google Docs. Upload the scanned PDF to Google Drive, right-click and open with Google Docs, and Google automatically runs OCR. The resulting document contains the recognised text, though layout is usually not preserved for complex documents. For simple typed letters and single-column text, it works well and is completely free.

Tesseract is an open-source OCR engine originally developed by HP, later sponsored by Google, and now maintained as a community open-source project. It runs on the command line and can be integrated into automated workflows. Accuracy is good for clean documents. It requires some technical setup but is free, and because it runs entirely locally, your documents are never uploaded to a third-party service.
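As a rough sketch of what a Tesseract workflow step might look like, the helper below assembles a command line that asks Tesseract to write a PDF with a text layer. It assumes Tesseract 4 or later is installed, and the file names are placeholders. Tesseract reads image files rather than PDFs, so the pages of a scanned PDF would first need to be exported as images. The helper `build_tesseract_command` is illustrative, not part of Tesseract itself.

```python
def build_tesseract_command(image_path, output_base, lang="eng", dpi=300):
    """Build the argument list for a searchable-PDF run of the tesseract CLI."""
    return [
        "tesseract",
        image_path,          # input image, for example one scanned page
        output_base,         # output path without extension; tesseract adds .pdf
        "-l", lang,          # recognition language
        "--dpi", str(dpi),   # resolution hint for images without DPI metadata
        "pdf",               # built-in config that writes a PDF with a text layer
    ]


# Placeholder file names; substitute your own scanned page.
command = build_tesseract_command("scan_page1.png", "scan_page1_ocr")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the input image exists. Building the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.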

ABBYY FineReader and IRIS PDF are commercial alternatives with strong reputations for accuracy, particularly with complex layouts and forms. These are mainly relevant for organisations that need to process large volumes of scanned documents regularly.

What to do after OCR

After OCR, verify the output. Common error patterns include:

Confusion between visually similar characters: O and 0, l and 1, rn and m
Incorrect word boundaries in dense text
Numbers from tables being transposed or merged
Punctuation attached to adjacent words

For documents where accuracy matters, particularly contracts, financial figures, medical information, or anything that will be used for data entry, checking the OCR output against the original image is essential.
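Part of this checking can be narrowed down automatically. The sketch below flags tokens that mix letters and digits, which is where O/0 and l/1 confusions tend to show up. It is a simple heuristic under that assumption: it also flags legitimate reference codes, and it cannot detect errors such as rn read as m, so it only points a reviewer at places to look.

```python
import re

# Tokens made of both letters and digits are frequent OCR slips (O/0, l/1).
MIXED = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")


def flag_suspicious(text):
    """Return tokens that deserve a manual check against the page image."""
    flagged = []
    for token in text.split():
        core = token.strip(".,;:()\"'")  # drop surrounding punctuation
        if MIXED.match(core):
            flagged.append(core)
    return flagged


sample = "Tota1 due: 1,250.00 by 3O June for invoice A4417."
print(flag_suspicious(sample))
# ['Tota1', '3O', 'A4417']
```

Here `Tota1` and `3O` are genuine OCR errors, while `A4417` is a valid code that a human reviewer would quickly dismiss.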

OCR and PDF accessibility

Running OCR is the first step in making a scanned PDF accessible. Once a text layer exists, screen readers can read the document instead of encountering a blank image. However, OCR alone does not produce a fully accessible PDF: structure tags, alt text for images, reading order, and document language still need to be set. OCR is the prerequisite, not the complete solution.

Compressing a scanned PDF after OCR

Scanned documents are large because each page is an image. After adding a text layer via OCR, you can compress the PDF to reduce the file size for email or portal upload. The compression targets the image data in the pages; the text layer added by OCR is very small and is not significantly affected. Most scanned PDFs compress well (a 50 to 80 percent reduction is common), so running OCR and then compressing gives you a searchable document at a manageable file size.
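As a quick worked example of that range, the sketch below estimates the size you might expect after compression. The 24 MB starting size is a made-up figure, and actual results depend on the scan content and the compression settings.

```python
def compressed_size_range(original_mb, low_reduction=0.50, high_reduction=0.80):
    """Return (smallest, largest) expected size in MB for a reduction range."""
    smallest = original_mb * (1 - high_reduction)
    largest = original_mb * (1 - low_reduction)
    return round(smallest, 2), round(largest, 2)


# Hypothetical 24 MB scanned PDF.
print(compressed_size_range(24.0))
# (4.8, 12.0)
```

A file in that range is more likely to fit under typical email or portal upload limits than the original scan.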
