PDF to Markdown Converter
Production-ready Python tool for converting PDF documents to Markdown format with support for text extraction, table parsing, and layout preservation.
Features
- Multiple Extraction Modes: Text, table, or mixed content extraction
- Layout Preservation: Maintains original document formatting
- Table Support: Automatically converts PDF tables to Markdown tables
- Error Handling: Comprehensive error handling with detailed logging
- CLI Interface: Simple command-line interface with multiple options
- Type Safety: Full type hints for better code quality
- Production Ready: Logging, error boundaries, and proper resource management
Installation
Basic Installation
pip install pdfplumber
Install with All Dependencies
pip install -r requirements.txt
Development Installation
pip install -r requirements.txt
pip install pytest pytest-cov black mypy pylint
Quick Start
Basic Usage
Convert a PDF to Markdown (by default the output file takes the input filename with a .md extension):
python pdf_to_markdown.py document.pdf
Specify Output File
python pdf_to_markdown.py input.pdf -o output.md
Extract Tables
python pdf_to_markdown.py data.pdf --mode table -o tables.md
Mixed Mode (Text + Tables)
python pdf_to_markdown.py report.pdf --mode mixed -o report.md
Verbose Logging
python pdf_to_markdown.py document.pdf -v
Usage Examples
Example 1: Simple Document Conversion
# Convert a basic PDF document
python pdf_to_markdown.py research_paper.pdf
# Output: research_paper.md
Example 2: Financial Report with Tables
# Extract tables from financial reports
python pdf_to_markdown.py quarterly_report.pdf --mode table -o q3_tables.md
Example 3: Mixed Content Document
# Process document with both text and tables
python pdf_to_markdown.py annual_report.pdf --mode mixed -o full_report.md
Example 4: Batch Processing
# Process multiple PDFs in a directory
mkdir -p markdown  # make sure the output directory exists
for pdf in *.pdf; do
    python pdf_to_markdown.py "$pdf" -o "markdown/${pdf%.pdf}.md"
done
Command-Line Options
usage: pdf_to_markdown.py [-h] [-o OUTPUT] [--mode {text,table,mixed}]
                          [--no-layout] [-v]
                          input

positional arguments:
  input                 Input PDF file path

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output Markdown file path (default: input filename
                        with .md extension)
  --mode {text,table,mixed}
                        Extraction mode (default: text)
  --no-layout           Disable layout preservation
  -v, --verbose         Enable verbose logging
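The help text above maps naturally onto Python's standard `argparse` module. The following is a minimal sketch of a parser that accepts the same options. The parser inside `pdf_to_markdown.py` is not shown in this README, so the function names `build_parser` and `resolve_output` are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdf_to_markdown.py")
    parser.add_argument("input", type=Path, help="Input PDF file path")
    parser.add_argument(
        "-o", "--output", type=Path, default=None,
        help="Output Markdown file path (default: input filename with .md extension)",
    )
    parser.add_argument(
        "--mode", choices=["text", "table", "mixed"], default="text",
        help="Extraction mode (default: text)",
    )
    parser.add_argument("--no-layout", action="store_true",
                        help="Disable layout preservation")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser


def resolve_output(args: argparse.Namespace) -> Path:
    """Apply the documented default: input filename with a .md extension."""
    return args.output if args.output is not None else args.input.with_suffix(".md")
```

Calling `build_parser().parse_args(["report.pdf", "--mode", "mixed"])` and passing the result to `resolve_output` would yield `report.md`, matching the default described in the help text.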
Programmatic Usage
Basic Conversion
from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter, ConversionConfig
# Create converter with default settings
converter = PDFToMarkdownConverter()
# Convert PDF
markdown = converter.convert(
    pdf_path=Path("input.pdf"),
    output_path=Path("output.md")
)
print(f"Converted {len(markdown)} characters")
Custom Configuration
from pathlib import Path
from pdf_to_markdown import (
    PDFToMarkdownConverter,
    ConversionConfig,
    ExtractionMode
)
# Configure converter
config = ConversionConfig(
    mode=ExtractionMode.MIXED,
    preserve_layout=True,
    extract_images=False,
    table_settings={
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 3,
    }
)
# Create converter with custom config
converter = PDFToMarkdownConverter(config)
# Convert (returns the Markdown text)
result = converter.convert(Path("document.pdf"))
Error Handling
import logging
from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter
# Enable detailed logging
logging.basicConfig(level=logging.DEBUG)
converter = PDFToMarkdownConverter()
try:
    markdown = converter.convert(
        pdf_path=Path("input.pdf"),
        output_path=Path("output.md")
    )
    print(f"✓ Success: {len(markdown)} characters converted")
except FileNotFoundError as e:
    print(f"✗ File not found: {e}")
except PermissionError as e:
    print(f"✗ Permission denied: {e}")
except Exception as e:
    # Catch-all for any other conversion failure
    print(f"✗ Conversion failed: {e}")
Output Format
The converter generates Markdown with the following structure:
# DocumentName
*Converted from PDF with N pages*
---
## Page 1
[Page content here]
### Table 1
| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Cell 1 | Cell 2 | Cell 3 |
---
## Page 2
[Next page content...]
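To make the table portion of this structure concrete, here is a small sketch of how rows of cells could be rendered as a Markdown table. The helper name `rows_to_markdown` is hypothetical and not part of the tool's documented API. It assumes the first row is the header and that `None` cells (which table extractors commonly return for empty cells) should become empty strings.

```python
from typing import Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    # Pad short rows so every line has the same number of columns.
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Newlines and pipes would break the table syntax, so neutralize them.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def line(cells: Sequence[Optional[str]]) -> str:
        padded = list(cells) + [None] * (width - len(cells))
        return "| " + " | ".join(clean(c) for c in padded) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])
```

Escaping pipes and flattening newlines inside cells keeps each table row on a single line, which Markdown table syntax requires.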
Extraction Modes
Text Mode (Default)
Extracts only text content from PDFs. Best for:
- Research papers
- Books and articles
- Text-heavy documents
python pdf_to_markdown.py document.pdf --mode text
Table Mode
Extracts only tables from PDFs. Best for:
- Financial reports
- Data sheets
- Statistical documents
python pdf_to_markdown.py data.pdf --mode table
Mixed Mode
Extracts both text and tables. Best for:
- Annual reports
- Comprehensive documents
- Mixed-content files
python pdf_to_markdown.py report.pdf --mode mixed
Configuration Options
Table Extraction Settings
Customize table extraction behavior:
from pdf_to_markdown import PDFToMarkdownConverter, ConversionConfig
table_settings = {
    "vertical_strategy": "lines",    # or "lines_strict", "text", "explicit"
    "horizontal_strategy": "lines",  # or "lines_strict", "text", "explicit"
    "snap_tolerance": 3,             # Tolerance (in PDF points) for line snapping
    "join_tolerance": 3,             # Tolerance for joining line segments
    "edge_min_length": 3,            # Minimum edge length to keep
    "min_words_vertical": 3,         # Min words for vertical detection ("text" strategy)
    "min_words_horizontal": 1,       # Min words for horizontal detection ("text" strategy)
}
config = ConversionConfig(table_settings=table_settings)
converter = PDFToMarkdownConverter(config)
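It can help to sanity-check a settings dictionary before starting a long conversion. The sketch below is a hypothetical helper (`check_table_settings` is not part of this tool). The strategy names it accepts are the ones pdfplumber documents for `vertical_strategy` and `horizontal_strategy`, and the tolerance check is a simple assumption about what counts as a sensible value.

```python
from typing import Any, Dict, List

# Strategy names documented by pdfplumber for table detection.
VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def check_table_settings(settings: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in ("vertical_strategy", "horizontal_strategy"):
        value = settings.get(key, "lines")  # "lines" is pdfplumber's default
        if value not in VALID_STRATEGIES:
            problems.append(f"{key}: unknown strategy {value!r}")
    for key, value in settings.items():
        # Any "*_tolerance" entry should be a non-negative number.
        if key.endswith("_tolerance") and (
            not isinstance(value, (int, float)) or value < 0
        ):
            problems.append(f"{key}: expected a non-negative number, got {value!r}")
    return problems
```

For example, `check_table_settings({"vertical_strategy": "line"})` reports the misspelled strategy instead of letting it reach the extractor.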
Layout Preservation
Enable or disable layout preservation:
# Preserve layout (default)
config = ConversionConfig(preserve_layout=True)
# Disable layout preservation for cleaner text
config = ConversionConfig(preserve_layout=False)
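For readers wondering what such a configuration object might look like internally, the following sketch shows one plausible shape using a dataclass and an enum. Only the field names used in the examples above (`mode`, `preserve_layout`, `extract_images`, `table_settings`) and `ExtractionMode.MIXED` come from this README. The `TEXT` and `TABLE` members, the string values, and the defaults for `extract_images` and `table_settings` are assumptions, and the classes carry a `Sketch` suffix so they are not mistaken for the real ones.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class ExtractionModeSketch(Enum):
    TEXT = "text"    # assumed member, mirrors --mode text
    TABLE = "table"  # assumed member, mirrors --mode table
    MIXED = "mixed"


@dataclass
class ConversionConfigSketch:
    mode: ExtractionModeSketch = ExtractionModeSketch.TEXT  # CLI default is text
    preserve_layout: bool = True   # documented default
    extract_images: bool = False   # assumed default
    # default_factory gives each instance its own dict instead of a shared one
    table_settings: Dict[str, Any] = field(default_factory=dict)


# Enum lookup by value lets a CLI string such as "mixed" select the mode.
config = ConversionConfigSketch(mode=ExtractionModeSketch("mixed"), preserve_layout=False)
```

Using `field(default_factory=dict)` avoids the classic pitfall of a mutable default being shared between configuration instances.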
Troubleshooting
Issue: "pdfplumber not installed"
Solution: Install the required package:
pip install pdfplumber
Issue: "Permission denied"
Solution: Ensure you have read permissions for the PDF and write permissions for the output directory:
chmod +r input.pdf
mkdir -p output && chmod +w output
Issue: "Failed to extract text from page"
Cause: The PDF may consist of scanned images (no text layer) or use non-standard fonts.
Solution: Use OCR preprocessing:
# Install Tesseract OCR
sudo apt-get install tesseract-ocr # Ubuntu/Debian
brew install tesseract # macOS
# Run the PDF through an OCR tool to add a text layer, then convert the result
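Before reaching for OCR, it is worth confirming that the PDF really lacks a text layer. The sketch below separates the decision (`looks_scanned`, a hypothetical helper with an arbitrary 20-character threshold) from the extraction step, which relies on pdfplumber's `open`, `pages`, and `extract_text`. Neither function is part of this tool.

```python
from typing import Iterable, List, Optional


def looks_scanned(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> bool:
    """Heuristic: True when no page yields at least min_chars of text.

    Note that a document with zero pages also counts as "scanned" here.
    """
    return all(len((text or "").strip()) < min_chars for text in page_texts)


def extract_page_texts(pdf_path: str) -> List[Optional[str]]:
    """Collect the raw text of every page using pdfplumber."""
    import pdfplumber  # imported lazily so looks_scanned works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

A call such as `looks_scanned(extract_page_texts("input.pdf"))` returning `True` suggests the file needs OCR preprocessing. The threshold is a guess and may need tuning for documents with very sparse pages.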
Issue: Tables not extracted correctly
Solution: Try adjusting table settings:
config = ConversionConfig(
    table_settings={
        "vertical_strategy": "text",  # Try "text" instead of "lines"
        "snap_tolerance": 5,          # Increase tolerance
    }
)
Issue: Missing content
Solution: Enable verbose logging to diagnose:
python pdf_to_markdown.py document.pdf -v
Performance Considerations
- Large PDFs: Processing time scales linearly with page count (~1-2 seconds per page)
- Complex Tables: Table extraction adds overhead (~50-100ms per table)
- Memory Usage: ~10-50MB per PDF, depending on complexity
- Batch Processing: Use multiprocessing for large batches:
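from multiprocessing import Pool
from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter
def convert_pdf(pdf_path):
    # Each worker process builds its own converter
    converter = PDFToMarkdownConverter()
    return converter.convert(pdf_path, pdf_path.with_suffix('.md'))
# Process multiple PDFs in parallel; the __main__ guard is required on
# platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    pdf_paths = sorted(Path('.').glob('*.pdf'))
    with Pool(processes=4) as pool:
        results = pool.map(convert_pdf, pdf_paths)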
Limitations
- OCR: Does not perform OCR on scanned PDFs (requires external preprocessing)
- Images: Text extraction only; images are not embedded in Markdown
- Fonts: Complex font rendering may not preserve exact formatting
- Encryption: Cannot process password-protected PDFs without decryption
- Forms: Interactive PDF forms are converted to static text
Advanced Usage
Custom Page Processing
from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter
class CustomConverter(PDFToMarkdownConverter):
    def _process_page(self, page, page_num):
        # Add custom preprocessing here, then run the default page handling
        content = super()._process_page(page, page_num)
        # Custom postprocessing
        content = content.replace("specific_term", "**specific_term**")
        return content
converter = CustomConverter()
result = converter.convert(Path("document.pdf"))
Metadata Extraction
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    metadata = pdf.metadata
    print(f"Title: {metadata.get('Title')}")
    print(f"Author: {metadata.get('Author')}")
    print(f"Pages: {len(pdf.pages)}")
Testing
Run tests with pytest:
# Run all tests
pytest
# Run with coverage
pytest --cov=pdf_to_markdown --cov-report=html
# Run specific test
pytest tests/test_converter.py::test_basic_conversion
Contributing
Contributions welcome! Areas for improvement:
- OCR integration for scanned PDFs
- Image extraction and embedding
- Form field extraction
- Enhanced table detection algorithms
- Performance optimization for large PDFs
License
MIT License - See LICENSE file for details
Support
For issues and questions:
- GitHub Issues: Report bugs or request features
- Documentation: Check this readme and inline code documentation
- Logging: Enable verbose mode (-v) for detailed diagnostics
Version History
- 1.0.0 (Current): Initial production release
  - Text and table extraction
  - Multiple extraction modes
  - CLI interface
  - Comprehensive error handling
  - Full type hints