PDF OCR API
Extract text from PDF documents with OCR. Both native text PDFs and scanned PDFs are supported.
🚀 Key Features
- Universal PDF Support - Process both native text and scanned PDFs
- High Accuracy - Advanced text extraction with layout preservation
- URL-Based Processing - Process files directly from URLs without uploads
- Metadata Extraction - Get document metadata including title, author, creation date
- Credit-Based System - Pay only for successful extractions
- Fast Processing - Quick text extraction with minimal latency
- Error Handling - Comprehensive validation and error responses
📋 Endpoint
POST requests only - All PDF extraction requests must use the POST method:
POST https://scrapingapi.qoest.com/v1/pdf
🔑 Authentication
All requests must include your API token in the Authorization header using Bearer authentication:
Authorization: Bearer YOUR_API_TOKEN
📊 Parameters
| Parameter | Required | Type | Description |
|---|---|---|---|
| url | Yes | string | URL pointing to the PDF file |
URL Format Requirements
- Must be a valid HTTP/HTTPS URL
- Must end with the .pdf extension
- File must be publicly accessible
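The URL rules above can be checked on the client before a request is sent. The following is a minimal sketch, not the API's own validator: `is_valid_pdf_url` is a hypothetical helper, and it assumes the extension check applies to the whole URL string and ignores case, which the documentation does not confirm. It cannot tell whether the file is publicly accessible.

```python
from urllib.parse import urlparse


def is_valid_pdf_url(url: str) -> bool:
    """Return True if the URL passes the documented format rules."""
    parsed = urlparse(url)
    # Rule: must be a valid HTTP/HTTPS URL (scheme and host both present).
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Rule: must end with the .pdf extension.
    # Assumption: matched against the full URL, case-insensitively.
    return url.lower().endswith(".pdf")
```

A URL that passes this check can still be rejected by the API, for example when the file requires authentication.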
💰 Pricing
Monthly Subscription Tiers
| Plan | Price | Credits | Cost per Credit |
|---|---|---|---|
| Tier 1 | $10/month | 10,000 credits | $0.001 |
| Tier 2 | $50/month | 55,000 credits | $0.0009 |
| Tier 3 | $100/month | 115,000 credits | $0.00087 |
| Tier 4 | $500/month | 600,000 credits | $0.00083 |
| Tier 5 | $1,000/month | 1,250,000 credits | $0.0008 |
Credit Usage
| Feature | Credits Required |
|---|---|
| PDF Text Extraction | 1 credit per successful extraction |
Usage Examples
- Tier 1 ($10): 10,000 PDF extractions
- Tier 2 ($50): 55,000 PDF extractions
- Tier 3 ($100): 115,000 PDF extractions
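The cost-per-credit column is simply the monthly price divided by the included credits. The sketch below reproduces that arithmetic and picks the lowest-priced plan that covers a given monthly volume, assuming one credit per successful extraction as stated above. `PLANS`, `cost_per_credit`, and `cheapest_plan_for` are illustrative names, not part of the API.

```python
from typing import Optional

# Plan name -> (monthly price in USD, credits included), copied from the pricing table.
PLANS = {
    "Tier 1": (10, 10_000),
    "Tier 2": (50, 55_000),
    "Tier 3": (100, 115_000),
    "Tier 4": (500, 600_000),
    "Tier 5": (1_000, 1_250_000),
}


def cost_per_credit(plan: str) -> float:
    """Monthly price divided by the credits included in the plan."""
    price, credits = PLANS[plan]
    return price / credits


def cheapest_plan_for(extractions_per_month: int) -> Optional[str]:
    """Lowest-priced plan whose credits cover the volume, or None if none does."""
    for name, (_price, credits) in sorted(PLANS.items(), key=lambda item: item[1][0]):
        if credits >= extractions_per_month:
            return name
    return None
```

For example, 60,000 extractions per month exceed Tier 2's 55,000 credits, so `cheapest_plan_for(60_000)` returns `"Tier 3"`.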
📝 Examples
Basic PDF Text Extraction
curl --location 'https://scrapingapi.qoest.com/v1/pdf' \
--header 'Authorization: Bearer YOUR_TOKEN' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/document.pdf"
}'
Response:
{
"text": "This is the extracted text from the PDF document...",
"pages": 5,
"metadata": {
"title": "Document Title",
"author": "Document Author",
"creation_date": "2024-01-15"
}
}
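For callers who prefer Python to curl, the same request can be sketched with the standard library alone. The endpoint, headers, and JSON body mirror the curl example; the function names and the 60-second timeout are assumptions made for illustration.

```python
import json
import urllib.request

API_URL = "https://scrapingapi.qoest.com/v1/pdf"


def build_pdf_request(pdf_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"url": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def extract_pdf(pdf_url: str, token: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body.

    The timeout value is an assumption; the documentation does not state one.
    """
    request = build_pdf_request(pdf_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for non-2xx responses, so error bodies have to be read from the exception.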
Processing Different PDF Types
Native Text PDFs
curl --location 'https://scrapingapi.qoest.com/v1/pdf' \
--header 'Authorization: Bearer YOUR_TOKEN' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/native-text-document.pdf"
}'
Scanned PDFs
curl --location 'https://scrapingapi.qoest.com/v1/pdf' \
--header 'Authorization: Bearer YOUR_TOKEN' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/scanned-document.pdf"
}'
Research Papers
curl --location 'https://scrapingapi.qoest.com/v1/pdf' \
--header 'Authorization: Bearer YOUR_TOKEN' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/research-paper.pdf"
}'
📤 Response Format
Successful Response (200)
{
"text": "Extracted text content from all pages of the PDF document. This includes all readable text from the document with proper formatting and structure preserved where possible.",
"pages": 3,
"metadata": {
"title": "Document Title",
"author": "Author Name",
"creation_date": "2024-01-15",
"page_count": 3,
"file_size": "2.5MB",
"pdf_version": "1.4"
}
}
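A successful response can be read into a small typed structure. This sketch assumes only the three top-level fields shown above (`text`, `pages`, `metadata`). Because the metadata keys differ between the examples in this document, they are kept as a plain dictionary and treated as optional. `PdfResult` and `parse_pdf_result` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class PdfResult:
    text: str
    pages: int
    metadata: Dict[str, Any] = field(default_factory=dict)


def parse_pdf_result(raw: str) -> PdfResult:
    """Parse the JSON body of a 200 response into a PdfResult."""
    data = json.loads(raw)
    return PdfResult(
        text=data["text"],
        pages=int(data["pages"]),
        # Metadata keys vary between examples, so none of them is required here.
        metadata=data.get("metadata") or {},
    )
```

Reading an optional field then looks like `result.metadata.get("author")`, which returns `None` when the PDF carries no author.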
Error Responses
Validation Error (422)
{
"message": "The given data was invalid.",
"errors": {
"url": [
"The url field is required.",
"The url must be a valid URL.",
"The URL must point to a valid PDF file."
]
}
}
Insufficient Credits (403)
{
"message": "Insufficient credits"
}
Processing Failed (400)
{
"message": "Failed to extract data from PDF URL"
}
Authentication Required (401)
{
"message": "Unauthenticated."
}
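The four error shapes above share a `message` field, and only the 422 response adds per-field `errors`. One way to handle them uniformly is a single exception type, sketched below. `PdfApiError`, `STATUS_LABELS`, and `error_from_response` are hypothetical helpers, and the fallback for non-JSON bodies is an assumption about responses this document does not describe.

```python
import json
from typing import Dict, List, Optional

# Short labels for the status codes documented above.
STATUS_LABELS = {
    400: "processing failed",
    401: "authentication required",
    403: "insufficient credits",
    422: "validation error",
}


class PdfApiError(Exception):
    """Error built from a non-200 response (illustrative, not part of the API)."""

    def __init__(self, status: int, message: str,
                 field_errors: Optional[Dict[str, List[str]]] = None) -> None:
        label = STATUS_LABELS.get(status, "unexpected error")
        super().__init__(f"{status} ({label}): {message}")
        self.status = status
        self.message = message
        self.field_errors = field_errors or {}


def error_from_response(status: int, body: str) -> PdfApiError:
    """Turn a status code and response body into a PdfApiError."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}  # Assumption: undocumented errors may not be JSON.
    if not isinstance(payload, dict):
        payload = {}
    return PdfApiError(status, payload.get("message", ""), payload.get("errors"))
```

Callers can then branch on `error.status`, for instance stopping a batch on 403 while logging `error.field_errors` on 422.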
⚠️ Validation Rules
URL Requirements
- Required field: Must be a valid HTTP/HTTPS URL
- PDF URLs: Must end with .pdf extension
- Accessibility: File must be publicly accessible without authentication
- File size: Recommended maximum 50MB for optimal processing
Credit Requirements
- Minimum balance: Must have at least 1 credit to process requests
- Deduction timing: Credits are deducted only after successful processing
- Failed requests: No credits deducted for failed processing attempts
🚨 Common Issues
- Invalid URL Format: Ensure URL ends with .pdf and is publicly accessible
- Insufficient Credits: Check credit balance before making requests
- File Not Found: Verify the URL is correct and file exists
- Password Protected PDFs: Remove password protection before processing
- Large Files: Very large files may time out - consider optimizing file size
- Authentication: Ensure Bearer token is correctly formatted and valid
- Corrupted PDFs: Ensure PDF file is not corrupted or damaged
🎯 Use Cases
Academic Research
Extract text from research papers, books, and academic documents.
curl "https://scrapingapi.qoest.com/v1/pdf" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/research-paper.pdf"}'
Legal Document Processing
Process legal documents, contracts, and compliance materials.
curl "https://scrapingapi.qoest.com/v1/pdf" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/legal-document.pdf"}'
Business Document Analysis
Extract text from reports, proposals, and business documents.
curl "https://scrapingapi.qoest.com/v1/pdf" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/business-report.pdf"}'
Content Management
Process documents for content indexing and search functionality.
curl "https://scrapingapi.qoest.com/v1/pdf" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/manual.pdf"}'
Data Migration
Convert legacy PDF documents to searchable text formats.
curl "https://scrapingapi.qoest.com/v1/pdf" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/legacy-document.pdf"}'
📊 Best Practices
PDF Processing Tips
- Text-based PDFs: Work best with native text (not scanned images)
- Scanned PDFs: Also supported but may have lower accuracy
- File Size: Optimize large PDFs for faster processing
- Password Protection: Remove password protection before processing
- Quality: Higher quality scans produce better text extraction results
Performance Optimization
- Batch Processing: Process multiple files in sequence rather than in parallel
- Error Handling: Add automatic retries for failed requests to target 99%+ effective uptime
- Credit Monitoring: Monitor credit usage to avoid service interruption
- URL Validation: Validate URLs before sending requests
- File Preparation: Ensure PDFs are optimized and accessible
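The sequential-processing and retry advice can be combined in one loop. The sketch below is one possible approach, not a prescribed one: it takes the extraction call as a parameter (`extract`, any function that accepts a URL and returns a result), and the attempt count and exponential backoff delays are assumptions rather than documented values.

```python
import time
from typing import Any, Callable, Dict, Iterable


def process_sequentially(
    urls: Iterable[str],
    extract: Callable[[str], Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Process URLs one at a time, retrying each with exponential backoff.

    Returns a mapping of URL to result, or to the final exception if every
    attempt for that URL failed.
    """
    results: Dict[str, Any] = {}
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                results[url] = extract(url)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    results[url] = exc  # Give up and record the failure.
                else:
                    # Wait base_delay, then 2x, 4x, ... before the next attempt.
                    sleep(base_delay * 2 ** (attempt - 1))
    return results
```

This version retries every exception for simplicity. In practice it makes sense to retry only failures that might succeed later, since a validation error or an empty credit balance will fail the same way on every attempt.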
Supported PDF Types
- Native Text PDFs: Best accuracy and fastest processing
- Scanned PDFs: OCR processing with good accuracy
- Mixed Content: PDFs with both text and images
- Multi-page Documents: Full document processing with page count
👤 User Management
Check User Profile
curl --location 'https://scrapingapi.qoest.com/v1/me' \
--header 'Authorization: Bearer YOUR_TOKEN'
Response:
{
"user": {
"id": 1,
"name": "Your Name",
"email": "user@example.com",
"credits": 9850,
"created_at": "2024-01-15T10:00:00.000000Z",
"updated_at": "2024-01-15T10:00:00.000000Z"
}
}
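Because a request needs at least one credit, the profile response can serve as a pre-flight check before a batch. This sketch reads the `user.credits` field shown above; `remaining_credits` and `can_process` are hypothetical helper names, and fetching the profile itself is left to the caller.

```python
import json


def remaining_credits(profile_json: str) -> int:
    """Read the credit balance from a /v1/me response body."""
    profile = json.loads(profile_json)
    return int(profile["user"]["credits"])


def can_process(profile_json: str, planned_extractions: int = 1) -> bool:
    """True if the balance covers the planned number of extractions.

    Assumes 1 credit per successful extraction, as listed under Credit Usage.
    """
    return remaining_credits(profile_json) >= planned_extractions
```

With the sample profile above (9,850 credits), a batch of 10,000 extractions would not be fully covered.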
Add Credits
curl --location 'https://scrapingapi.qoest.com/v1/add-credits' \
--header 'Authorization: Bearer YOUR_TOKEN' \
--header 'Content-Type: application/json' \
--data '{
"amount": 1000
}'
Response:
{
"message": "Credits added successfully",
"remaining_credits": 10850
}
📚 Related APIs
- Image OCR - Extract text from images
- Web Scraping - Extract website data
- Google Search - Extract search results