Best API for Converting PDFs to Searchable Text with OCR
Scanned contracts, invoices, receipts, legal filings, medical records, and legacy documents all share one problem: the text is trapped in images. OCR (Optical Character Recognition) APIs extract that text programmatically, turning unsearchable PDFs into structured data you can index, analyze, and act on. The cloud giants have invested heavily in this space, but the accuracy, pricing, and feature gaps between providers are significant. Here is how they stack up.
Feature Comparison
| Feature | AWS Textract | Google Cloud Vision | Azure Doc Intelligence | Tesseract (open-source) |
|---|---|---|---|---|
| Cost per Page | $0.0015 (text) / $0.015 (tables) | $0.0015 per page | $0.001-0.01 per page | Free |
| Accuracy (clean docs) | 97-99% | 97-99% | 97-99% | 90-95% |
| Accuracy (poor scans) | 90-95% | 88-93% | 92-96% | 75-85% |
| Languages Supported | 6 (English, Spanish, French, German, Italian, Portuguese) | 200+ | 100+ | 100+ |
| Table Extraction | Yes (excellent) | Basic | Yes (excellent) | No |
| Form/Key-Value Pairs | Yes | No (use Document AI) | Yes | No |
| Handwriting | Yes | Yes | Yes | Limited |
| Max File Size | 500 MB (async) | 20 MB (sync) | 500 MB | Unlimited (local) |
| Free Tier | 1,000 pages/month (12 mo) | 1,000 pages/month | 500 pages/month | Unlimited |
AWS Textract: Best for Structured Document Extraction
Amazon Textract goes beyond basic OCR. While it can extract raw text from any document, its real strength is understanding document structure. The AnalyzeDocument API identifies tables, forms, and key-value pairs, while the dedicated AnalyzeExpense and AnalyzeID APIs handle invoices, receipts, and identity documents. For a scanned invoice, Textract does not just read the text; it tells you which number is the total, which is the invoice number, and which is the vendor name.
Pricing breakdown:
- DetectDocumentText: $0.0015 per page. Basic text extraction with bounding boxes and confidence scores.
- AnalyzeDocument (tables and forms): $0.015 per page. Extracts structured data from tables and form fields.
- AnalyzeExpense: $0.01 per page. Specialized extraction for receipts and invoices.
- AnalyzeID: $0.01 per page. Extracts data from identity documents (passports, driver licenses).
The async API handles PDFs up to 500 MB and 3,000 pages, making it suitable for processing large document archives. When a job finishes, Textract can publish a completion notification to an SNS topic; you then fetch the results from the GetDocumentAnalysis endpoint, which you can also poll directly. For smaller documents (up to 10 MB), the synchronous API returns results in a single request.
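The async flow can be sketched as follows. This is a minimal illustration rather than production code: the helper name `analyze_pdf_async` and its polling parameters are hypothetical, `client` is assumed to behave like `boto3.client("textract")`, and the sketch polls instead of subscribing to SNS.

```python
import time


def analyze_pdf_async(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Start an async Textract job on a PDF in S3 and collect all result blocks.

    `client` is assumed to expose the same methods as boto3.client("textract").
    """
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS (an SNS subscription avoids this loop).
    for _ in range(max_polls):
        page = client.get_document_analysis(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {page['JobStatus']}")

    # Results for large documents are paginated; follow NextToken until it is absent.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks
```

With real credentials, the call would look something like `analyze_pdf_async(boto3.client("textract"), "my-bucket", "scans/contract.pdf")`, where the bucket and key are placeholders.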
Where Textract excels:
- Tables: Textract's table extraction is the best in class. It correctly identifies rows, columns, merged cells, and header rows in complex table layouts. If your documents contain financial tables, inventory lists, or any tabular data, Textract consistently outperforms competitors.
- AWS integration: Native integration with S3, Lambda, and Step Functions makes it easy to build document processing pipelines. Drop a PDF in S3, trigger a Lambda, process with Textract, store results in DynamoDB.
- Confidence scores: Every extracted element includes a confidence score, letting you flag low-confidence results for human review.
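As a rough sketch of the human-review idea, the function below splits Textract LINE blocks by confidence. The function name and the 90.0 threshold are assumptions for illustration; Textract reports confidence on a 0-100 scale, and a sensible cutoff depends on your documents.

```python
def split_by_confidence(blocks, threshold=90.0):
    """Separate Textract LINE blocks into accepted text and text needing review."""
    accepted, needs_review = [], []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, TABLE, CELL and other block types
        confident = block.get("Confidence", 0.0) >= threshold
        (accepted if confident else needs_review).append(block["Text"])
    return accepted, needs_review
```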
Limitations: Textract requires an AWS account and familiarity with AWS SDKs. The API responses are verbose JSON structures that require parsing logic to extract the data you actually want. There is no simple "give me the text" mode for the structured analysis endpoints.
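To give a sense of the parsing involved, here is a minimal sketch that flattens FORMS output into a plain dictionary. It assumes the usual response shape (KEY_VALUE_SET blocks linked to WORD blocks through CHILD and VALUE relationships); the helper names are hypothetical, and checkboxes (SELECTION_ELEMENT blocks) are ignored.

```python
def _child_text(block, by_id):
    """Join the text of the WORD blocks that a KEY or VALUE block points to."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks):
    """Flatten Textract FORMS output into a plain {key: value} dict."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # A KEY block references its VALUE block(s) by Id.
                value_text = " ".join(_child_text(by_id[i], by_id) for i in rel["Ids"])
        pairs[_child_text(block, by_id)] = value_text
    return pairs
```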
Best for: Invoice processing, form digitization, table-heavy documents, and organizations already on AWS.
Google Cloud Vision: Best Multilingual Support
Google Cloud Vision's OCR benefits from the same machine learning infrastructure that powers Google Translate and Google Lens. The result is exceptional multilingual OCR that handles over 200 languages, including complex scripts like Chinese, Japanese, Korean, Arabic, Hindi, and Thai. For organizations processing documents in multiple languages, Google Cloud Vision is the clear leader.
The API offers two OCR modes:
- TEXT_DETECTION: Optimized for short text in images (signs, labels, screenshots). Returns text with bounding polygons.
- DOCUMENT_TEXT_DETECTION: Optimized for dense document text. Returns text organized by pages, blocks, paragraphs, words, and symbols with full structural hierarchy.
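The structural hierarchy returned by DOCUMENT_TEXT_DETECTION can be walked with a few nested loops. The sketch below assumes the JSON form of `fullTextAnnotation` already parsed into Python dicts; the function name is hypothetical, and joining words with single spaces is a simplification (the real response carries break hints per symbol, and CJK text usually needs no spaces).

```python
def paragraphs_from_annotation(full_text_annotation):
    """Yield (page_index, paragraph_text) pairs.

    Walks pages > blocks > paragraphs > words > symbols.
    """
    for page_index, page in enumerate(full_text_annotation.get("pages", [])):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                words = [
                    "".join(symbol["text"] for symbol in word.get("symbols", []))
                    for word in paragraph.get("words", [])
                ]
                yield page_index, " ".join(words)
```

If only the raw text is needed, the `text` field of `fullTextAnnotation` already holds it; the walk is mainly useful when paragraph or block boundaries matter.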
For PDF processing specifically, you use the asyncBatchAnnotate endpoint, which accepts PDFs up to 2,000 pages stored in Google Cloud Storage and writes results back to GCS as JSON. Processing speed is typically 1-3 seconds per page.
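A request body for that endpoint might be assembled as shown here. Treat this as a sketch: the function name is hypothetical, the `gs://` URIs are placeholders, and `batch_size` (how many pages go into each output JSON file) is set to an illustrative value.

```python
import json


def build_pdf_ocr_request(input_uri, output_prefix, batch_size=20):
    """Build the JSON body for a Vision files:asyncBatchAnnotate call on one PDF."""
    for uri in (input_uri, output_prefix):
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a gs:// URI, got {uri!r}")
    return {
        "requests": [
            {
                "inputConfig": {
                    "gcsSource": {"uri": input_uri},
                    "mimeType": "application/pdf",
                },
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "outputConfig": {
                    "gcsDestination": {"uri": output_prefix},
                    "batchSize": batch_size,
                },
            }
        ]
    }


body = json.dumps(
    build_pdf_ocr_request("gs://example-bucket/in/contract.pdf",
                          "gs://example-bucket/out/contract-")
)
```

The serialized `body` would then be sent as an authenticated POST; the response names a long-running operation to check before reading the output files from GCS.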
Multilingual edge: Google Cloud Vision can detect the language of text automatically and handles mixed-language documents (e.g., a contract with English headers and Japanese body text) without any configuration. AWS Textract and Azure both handle multilingual documents, but Google's accuracy on CJK scripts and right-to-left languages is measurably higher.
Limitations: Google Cloud Vision's table extraction is basic compared to Textract and Azure. It detects text within table cells but does not reliably reconstruct the table structure (rows, columns, headers). For table-heavy documents, you need to step up to Google Document AI, which is a separate product with higher pricing ($0.01-0.065 per page depending on the processor).
Best for: Multilingual documents, mixed-script text, organizations processing documents in Asian, Middle Eastern, or African languages.
Azure AI Document Intelligence: Best Overall Accuracy
Azure AI Document Intelligence (formerly Form Recognizer) delivers the highest accuracy on degraded documents in our testing. Faded faxes, skewed scans, documents with coffee stains and creases, and low-DPI images all produced better results with Azure than with Textract or Google Cloud Vision. Azure's preprocessing pipeline includes automatic deskewing, noise reduction, and contrast adjustment that the other services handle less effectively.
Pricing tiers:
- Read (basic OCR): $0.001 per page. Extracts text, handwriting, and structure from documents.
- Layout: $0.005 per page. Adds table extraction, selection marks, and document structure analysis.
- Prebuilt models: $0.01 per page. Specialized models for invoices, receipts, W-2s, business cards, and ID documents.
- Custom models: $0.05 per page for training, $0.01 per page for inference. Train on your own document types.
The Read tier at $0.001 per page is the cheapest cloud OCR option for basic text extraction. At this price, processing 100,000 pages costs $100, compared to $150 with Textract and $150 with Google Cloud Vision.
Standout features:
- Custom models: Train extraction models on your specific document types with as few as 5 labeled examples. This is transformative for organizations with proprietary form layouts.
- Handwriting recognition: Azure's handwriting OCR is the most accurate of the three cloud providers, particularly for cursive English and printed handwriting.
- Document structure: The Layout model identifies paragraphs, section headings, page numbers, headers, footers, and footnotes, preserving the logical reading order of complex documents.
- Add-on capabilities: High-resolution extraction, font detection, formula recognition, and barcode reading can be enabled per request.
Limitations: Like Textract, Azure requires cloud platform familiarity. The service has been rebranded several times (Computer Vision OCR to Form Recognizer to Document Intelligence), so older tutorials and documentation may reference deprecated endpoints. At 500 pages/month, its free tier is the smallest of the three cloud providers.
Best for: Poor-quality scans, handwritten documents, organizations needing custom extraction models, and cost-sensitive high-volume processing.
Tesseract and Open-Source Alternatives: Best for Privacy and Control
Tesseract is the most widely used open-source OCR engine. Originally developed by Hewlett-Packard in the 1980s and later maintained by Google, it runs entirely on your own infrastructure. No documents leave your servers, no per-page fees, and no vendor dependency. For organizations with strict data sovereignty requirements (government, healthcare, legal), this matters.
Modern Tesseract (version 4 and later) uses an LSTM-based recognition engine and supports over 100 languages. Accuracy on clean, high-resolution documents is respectable at 90-95%, but it drops significantly on degraded scans, skewed text, or complex layouts.
Running Tesseract on PDFs:
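    # Convert PDF pages to images, then OCR each page.
    # Requires the tesseract and pdftoppm (poppler-utils) binaries on PATH.
    import os
    import subprocess

    def pdf_to_text(pdf_path, output_dir):
        # Make sure the directory for the page images exists
        os.makedirs(output_dir, exist_ok=True)

        # Render every PDF page to a 300 DPI PNG named page-<n>.png
        subprocess.run(
            ["pdftoppm", "-png", "-r", "300",
             pdf_path, os.path.join(output_dir, "page")],
            check=True,  # fail loudly if the conversion breaks
        )

        # OCR each page image; the "stdout" output base makes tesseract
        # print the recognized text instead of writing a .txt file
        full_text = []
        for img in sorted(os.listdir(output_dir)):
            if img.endswith(".png"):
                result = subprocess.run(
                    ["tesseract", os.path.join(output_dir, img), "stdout"],
                    capture_output=True, text=True, check=True,
                )
                full_text.append(result.stdout)
        return "\n".join(full_text)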
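The function above returns plain text. If the goal is a searchable PDF rather than a text dump, Tesseract can write one directly when the `pdf` config is appended to the command. The sketch below only illustrates that invocation; the function names are hypothetical, and it assumes the `tesseract` binary and the requested language data are installed.

```python
import subprocess


def searchable_pdf_command(image_path, output_base, lang="eng"):
    """Build the tesseract command that writes <output_base>.pdf with a text layer."""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """Run tesseract on one page image and return the path of the PDF it writes."""
    subprocess.run(searchable_pdf_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"
```

Each page image yields a one-page PDF, so a multi-page document still needs a merge step afterwards.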
Other open-source options worth considering:
- PaddleOCR: Developed by Baidu, excellent for Chinese, Japanese, and Korean text. Often outperforms Tesseract on Asian-language documents.
- EasyOCR: Python library supporting 80+ languages with a simpler API than Tesseract. Good accuracy on printed text.
- Surya: A newer open-source OCR model with layout detection and table recognition. Approaching cloud API quality on many document types.
When to go open-source: Choose Tesseract or alternatives when your documents cannot leave your infrastructure, when you process high volumes and want to avoid per-page costs, or when you need to customize the OCR pipeline with preprocessing steps tailored to your specific document types.
When to avoid open-source: If your documents include handwriting, complex tables, mixed layouts, or poor scan quality, the cloud APIs deliver dramatically better results. The accuracy gap on degraded documents (75-85% for Tesseract vs 88-96% for the cloud APIs in the comparison table) translates directly to downstream data quality issues.
Cost at Scale: 100,000 Pages per Month
| Service | Basic Text OCR | With Tables/Forms | Notes |
|---|---|---|---|
| AWS Textract | $150 | $1,500 | Table extraction 10x base cost |
| Google Cloud Vision | $150 | $1,000-6,500 (Document AI) | Separate product for structured extraction |
| Azure Doc Intelligence | $100 | $500-1,000 | Cheapest cloud option at every tier |
| Tesseract (self-hosted) | $50-200 (compute) | N/A (no table extraction) | Multi-core CPU recommended for speed |
Hidden cost: Self-hosted Tesseract is "free" but requires infrastructure. Processing 100,000 pages per month at 300 DPI with reasonable speed requires a server with 8+ CPU cores; Tesseract's recognition engine runs on the CPU, so throughput scales by running pages in parallel across cores. Budget $50-200/month in compute costs, plus engineering time for maintenance and error handling.
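The cloud figures in the table are simple per-page multiplication, which the sketch below reproduces. The prices are the ones quoted in this article, not live pricing; the dictionary keys and function name are made up for illustration, and free tiers and volume discounts are ignored.

```python
from decimal import Decimal

# Per-page prices as quoted in this article; verify against current pricing pages.
PRICE_PER_PAGE = {
    "textract_text": Decimal("0.0015"),
    "textract_tables_forms": Decimal("0.015"),
    "google_vision": Decimal("0.0015"),
    "azure_read": Decimal("0.001"),
    "azure_layout": Decimal("0.005"),
    "azure_prebuilt": Decimal("0.01"),
}


def monthly_cost(service, pages):
    """Return the monthly cost in dollars for `pages` pages on `service`."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE[service] * pages


def cost_table(pages):
    """Return {service: cost} for every service, cheapest first."""
    costs = {name: monthly_cost(name, pages) for name in PRICE_PER_PAGE}
    return dict(sorted(costs.items(), key=lambda item: item[1]))
```

Calling `cost_table(100_000)` gives the same ordering as the table, with the Azure Read tier first. `Decimal` is used instead of floats so that fractional-cent prices multiply without rounding noise.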
Accuracy on Real-World Documents
Accuracy varies significantly by document type and condition. Here is how each API tends to perform based on publicly available benchmarks and developer reports:
- Clean printed documents: All three cloud providers deliver excellent results, typically above 97% character accuracy. Tesseract performs well on clean, high-resolution prints but lags behind the cloud APIs.
- Degraded scans and faxes: Azure AI Document Intelligence is widely regarded as the strongest performer on poor-quality scans, thanks to its automatic deskewing and preprocessing. Textract also handles degraded documents well. Tesseract struggles significantly with low-quality inputs.
- Handwritten text: Azure leads in handwriting recognition, particularly for cursive English. Google and Textract offer reasonable handwriting support. Tesseract has very limited handwriting capability.
- Multilingual documents: Google Cloud Vision excels with mixed-language content, especially documents combining Latin scripts with CJK characters. Its automatic language detection handles code-switching better than competitors.
For clean, single-language documents, the accuracy gap between cloud providers is small. The differences become pronounced with degraded scans, handwriting, and multilingual content.
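To compare providers on your own documents, one common way to compute character accuracy is one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below uses that definition; published benchmarks may normalize whitespace or case differently, so treat the numbers as comparable only within your own test set.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum single-character insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Return accuracy in [0, 1] as 1 - edit_distance / len(reference)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

For example, an OCR result that reads "1OO" where the reference says "100" counts as two substitution errors against the length of the full reference string.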
Verdict: Matching APIs to Use Cases
- Invoice and receipt processing: AWS Textract. The AnalyzeExpense API extracts structured invoice fields out of the box, saving significant development time.
- Multilingual document archives: Google Cloud Vision. Unmatched language coverage and automatic language detection.
- Poor-quality scans and handwriting: Azure AI Document Intelligence. Best preprocessing and highest accuracy on degraded documents.
- Cost-sensitive high volume: Azure Read tier at $0.001/page, or self-hosted Tesseract for basic printed text.
- Data sovereignty requirements: Tesseract or PaddleOCR. Documents never leave your infrastructure.
For most teams starting a document processing project, Azure AI Document Intelligence offers the best combination of accuracy, pricing, and features. Start with the Read tier for basic OCR, step up to Layout when you need table extraction, and use custom models when you have proprietary document formats. If you are already invested in AWS or GCP, their respective services are close enough in quality that ecosystem familiarity should guide the decision.
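The Read, Layout, prebuilt, and custom progression can be expressed as a small selection helper. This is only a sketch of the decision logic: the function and its parameters are hypothetical, and the strings it returns are meant to correspond to Document Intelligence model IDs such as `prebuilt-read` and `prebuilt-layout`, which you should confirm against the API version you target.

```python
# Document types with a specialized prebuilt model (illustrative subset).
PREBUILT_MODELS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
}


def pick_azure_model(needs_tables=False, document_type=None, custom_model_id=None):
    """Choose the cheapest model tier that covers the stated requirements."""
    if custom_model_id:
        return custom_model_id  # proprietary formats: use your trained model
    if document_type in PREBUILT_MODELS:
        return PREBUILT_MODELS[document_type]  # specialized field extraction
    if needs_tables:
        return "prebuilt-layout"  # tables, selection marks, structure
    return "prebuilt-read"  # basic OCR
```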