Good OCR starts before the software ever sees a page. If the scan is crooked, noisy, or over-compressed, no engine—no matter how clever—will read it well. The upside is that small, deliberate choices at capture and cleanup can lift accuracy by double digits. Here’s a field-tested guide on how to optimize scanned documents for better OCR results without turning it into a science project.
Start with the scan: resolution, color, and optics
For text documents, 300 dpi is the baseline. Jump to 400–600 dpi for small fonts, thin serifs, or archival materials with faint print. Grayscale usually beats pure black-and-white for capture; it preserves subtle edges and makes later binarization more effective. Color helps when you need to drop out form lines or stamps by isolating a channel.
Set scanners to disable heavy “auto-enhance” filters that crush shadows or oversharpen. If your device offers descreen for halftones, enable it when scanning magazines or newspapers. Clean the glass, use a solid black backing for thin paper, and square the page against the guide—simple habits that stop skew and bleed-through before they start.
Phone photos can work if you control geometry and light. Use a scanning app with edge detection, hold the camera parallel to the page, and light from two sides to avoid glare. A cheap copy stand does wonders for consistency, especially with books where curvature demands a steady, centered shot.
| Content | Recommended dpi | Color mode |
|---|---|---|
| Standard printed text | 300 | Grayscale |
| Small fonts/fine print | 400–600 | Grayscale or color |
| Forms with colored lines | 300–400 | Color (for channel dropout) |
Prep the page before pixels
Remove staples, flatten folds, and brush away dust. Highlighters smear; if possible, rescan a clean copy or switch to a darker pen that doesn’t glow in grayscale. If you’re handling onionskin or thin paper, place a black sheet behind it to block show-through.
For books, avoid pressing hard into the gutter; curved lines become broken characters. Use a cradle or props to open the spread gently, then apply dewarping later. When originals are low-contrast, consider reprinting to a fresh copy—starting with a cleaner source wins every time.
File formats and compression that OCR loves
Save masters in lossless or near-lossless formats. For pure text, TIFF with CCITT Group 4 compression is compact and crisp. For mixed pages, PNG keeps detail without artifacts. When you need a deliverable, produce a searchable PDF and keep the raw images in case you need to reprocess.
Beware aggressive JPEG. It smears edges and invents blocks that engines mistake for dots and dashes. If you must use JPEG, export at high quality and never re-save repeatedly. For PDFs, Mixed Raster Content (MRC) can shrink files by separating text from backgrounds while preserving legibility.
| Format | Best use | Notes |
|---|---|---|
| TIFF (G4) | Clean black-and-white text | Lossless, tiny files, archival-friendly |
| PNG (grayscale/color) | General scanning, mixed pages | No artifacts, larger than JPEG |
| PDF (MRC) | Distribution of mixed documents | Small size, keeps text crisp |
| JPEG | Photo-heavy pages | Use high quality; avoid re-encoding |
| PDF/A | Long-term archiving | Standardized rendering, embed fonts/OCR |
Clean up digitally: deskew, despeckle, and balance
Before OCR, fix geometry. Auto-crop, deskew, and, for book pages, dewarp curved lines so baselines are straight. Remove borders and punch holes so the engine focuses on content, not high-contrast clutter at the edges.
Tame noise with gentle despeckling and background normalization. Adaptive thresholding (like Sauvola) handles uneven lighting better than a blunt global cutoff. A light unsharp mask can help, but stop before halos appear—oversharpened text fractures into nonsense.
Free tools can take you far. NAPS2 or VueScan for capture, ScanTailor Advanced for cleanup, and ImageMagick or OpenCV scripts for batch pipelines. Keep settings in a profile so you can repeat success on big stacks without guesswork.
Layout and language matter
Engines need hints. Specify the correct language pack and character set; turning on multiple languages you don’t need invites confusion. If the page has two columns, tables, or side notes, enable layout analysis or draw zones so the reading order makes sense.
For forms, drop out colored lines by removing the red or blue channel, then OCR the remaining black text. For scripts with diacritics or CJK characters, favor higher dpi and disable aggressive noise removal that eats small marks. When recognition consistently misreads domain terms, load a custom dictionary.
Run OCR with intent
Pick an engine that fits your task. Tesseract is flexible and scriptable; ABBYY FineReader and Acrobat bring strong layout retention and table detection. For small type, feed 400–600 dpi images; for clean large text, 300 dpi suffices and runs faster.
Use zoning for tables and receipts, and tell the software where numbers live to curb letter substitutions. If images include stamps or watermarks, mask them or reduce their opacity—bold overlays pull attention away from letterforms. Save outputs as searchable PDFs plus a text or TSV layer for audits.
Quality control you can measure
Don’t trust a pass that you haven’t spot-checked. Sample pages from the start, middle, and end of a batch and review the OCR text alongside the image. Confidence scores, when available, help you flag weak areas automatically.
Plug a spell-checker against language dictionaries and a whitelist of names, product codes, or addresses. Pattern checks—regular expressions for invoice numbers or dates—catch transpositions that look real to a spell-checker but fail business rules. Feed what you learn back into preprocessing rather than correcting every page by hand.
Post-processing that saves hours
Fix common artifacts at scale. Merge hyphenated line breaks that split words across columns, normalize ligatures (fi, fl) to standard letters, and standardize quotation marks and dashes. If your output will be edited, export to DOCX; if it’s for search and archive, keep a PDF/A with the text layer embedded.
Keep originals alongside outputs. When a new project demands better results, you’ll be glad you can rerun the pipeline with improved settings. Documentation beats memory—record dpi, color mode, filters, and OCR options with each batch.
A quick, real-world workflow
Last year I digitized a box of 1990s invoices printed in faint dot-matrix. The winning recipe was 400 dpi grayscale scans, gentle descreen, deskew, then adaptive thresholding to crisp the type without erasing perforations. Tesseract with the correct language pack and a small dictionary of vendor names did the rest.
Here’s the streamlined path I now use for similar work:
- Clean originals; scan at 400 dpi grayscale with descreen on.
- Auto-crop, deskew, and remove borders and hole punches.
- Normalize background; apply light despeckle and unsharp mask.
- For forms, drop color channels to remove ruling lines.
- Zone tables/columns; set language and custom dictionary.
- OCR to searchable PDF and TSV; sample-check low-confidence zones.
- Run hyphen-merge and ligature normalization; archive as PDF/A plus source images.
That pipeline lifted accuracy from “good enough if you squint” to reliably searchable text. Use it as a template, then tailor it to your pages. With a steady capture, a light touch in cleanup, and deliberate engine settings, you’ll know how to optimize scanned documents for better OCR results and keep rework to a minimum.

