PDF to Markdown Converter

A production-ready Python tool that converts PDF documents to Markdown, with support for text extraction, table parsing, and layout preservation.

Features

Multiple Extraction Modes: Text, table, or mixed content extraction
Layout Preservation: Maintains original document formatting
Table Support: Automatically converts PDF tables to Markdown tables
Error Handling: Comprehensive error handling with detailed logging
CLI Interface: Simple command-line interface with multiple options
Type Safety: Full type hints for better code quality
Production Ready: Logging, error boundaries, and proper resource management

Installation

Basic Installation

pip install pdfplumber

Install with All Dependencies

pip install -r requirements.txt

Development Installation

pip install -r requirements.txt
pip install pytest pytest-cov black mypy pylint

Quick Start

Basic Usage

Convert a PDF to Markdown. By default, the output file takes the input filename with a .md extension:

python pdf_to_markdown.py document.pdf

Specify Output File

python pdf_to_markdown.py input.pdf -o output.md

Extract Tables

python pdf_to_markdown.py data.pdf --mode table -o tables.md

Mixed Mode (Text + Tables)

python pdf_to_markdown.py report.pdf --mode mixed -o report.md

Verbose Logging

python pdf_to_markdown.py document.pdf -v

Usage Examples

Example 1: Simple Document Conversion

# Convert a basic PDF document
python pdf_to_markdown.py research_paper.pdf
# Output: research_paper.md

Example 2: Financial Report with Tables

# Extract tables from financial reports
python pdf_to_markdown.py quarterly_report.pdf --mode table -o q3_tables.md

Example 3: Mixed Content Document

# Process document with both text and tables
python pdf_to_markdown.py annual_report.pdf --mode mixed -o full_report.md

Example 4: Batch Processing

# Process multiple PDFs in a directory
mkdir -p markdown  # create the output directory in case the tool does not
for pdf in *.pdf; do
    python pdf_to_markdown.py "$pdf" -o "markdown/${pdf%.pdf}.md"
done

Command-Line Options

usage: pdf_to_markdown.py [-h] [-o OUTPUT] [--mode {text,table,mixed}]
                          [--no-layout] [-v]
                          input

positional arguments:
  input                 Input PDF file path

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output Markdown file path (default: input filename
                        with .md extension)
  --mode {text,table,mixed}
                        Extraction mode (default: text)
  --no-layout           Disable layout preservation
  -v, --verbose         Enable verbose logging
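
The help text above maps directly onto Python's argparse module. The following is a minimal sketch of a parser that accepts the same options; it is not the tool's actual source. The names build_parser, resolve_args, and log_level are hypothetical, and the mapping of -v to the DEBUG log level is an assumption.

```python
import argparse
import logging
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="pdf_to_markdown.py")
    parser.add_argument("input", type=Path, help="Input PDF file path")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="Output Markdown file path")
    parser.add_argument("--mode", choices=["text", "table", "mixed"],
                        default="text", help="Extraction mode")
    parser.add_argument("--no-layout", action="store_true",
                        help="Disable layout preservation")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser


def resolve_args(argv: list) -> argparse.Namespace:
    """Parse argv and fill in the documented defaults."""
    args = build_parser().parse_args(argv)
    # Documented default: the input filename with a .md extension.
    if args.output is None:
        args.output = args.input.with_suffix(".md")
    # Assumption: -v selects DEBUG, otherwise INFO.
    args.log_level = logging.DEBUG if args.verbose else logging.INFO
    return args


if __name__ == "__main__":
    args = resolve_args(["report.pdf", "--mode", "mixed", "-v"])
    print(args.output, args.mode, args.no_layout)
```

Passing an explicit argument list to resolve_args keeps the parsing logic easy to exercise in tests without touching sys.argv.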

Programmatic Usage

Basic Conversion

from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter

# Create converter with default settings
converter = PDFToMarkdownConverter()

# Convert PDF
markdown = converter.convert(
    pdf_path=Path("input.pdf"),
    output_path=Path("output.md")
)

print(f"Converted {len(markdown)} characters")

Custom Configuration

from pathlib import Path

from pdf_to_markdown import (
    PDFToMarkdownConverter,
    ConversionConfig,
    ExtractionMode
)

# Configure converter
config = ConversionConfig(
    mode=ExtractionMode.MIXED,
    preserve_layout=True,
    extract_images=False,
    table_settings={
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 3,
    }
)

# Create converter with custom config
converter = PDFToMarkdownConverter(config)

# Convert
result = converter.convert(Path("document.pdf"))

Error Handling

from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter
import logging

# Enable detailed logging
logging.basicConfig(level=logging.DEBUG)

converter = PDFToMarkdownConverter()

try:
    markdown = converter.convert(
        pdf_path=Path("input.pdf"),
        output_path=Path("output.md")
    )
    print(f"✓ Success: {len(markdown)} characters converted")
    
except FileNotFoundError as e:
    print(f"✗ File not found: {e}")
    
except PermissionError as e:
    print(f"✗ Permission denied: {e}")
    
except Exception as e:
    print(f"✗ Conversion failed: {e}")

Output Format

The converter generates Markdown with the following structure:

# DocumentName

*Converted from PDF with N pages*

---
## Page 1

[Page content here]

### Table 1

| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Cell 1 | Cell 2 | Cell 3 |

---
## Page 2

[Next page content...]
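
To illustrate the table portion of this structure, here is a minimal sketch of how a list of rows could be rendered in the format shown. It is not the converter's own code: table_to_markdown is a hypothetical helper, and treating the first row as the header is an assumption. Empty cells are handled as None because pdfplumber's table extraction can return None for them.

```python
from typing import Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (first row = header) as a Markdown table."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # None becomes an empty cell; newlines and pipes would break a row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row))
              for row in rows]

    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(table_to_markdown([
        ["Header 1", "Header 2", "Header 3"],
        ["Cell 1", "Cell 2", "Cell 3"],
    ]))
```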

Extraction Modes

Text Mode (Default)

Extracts only text content from PDFs. Best for:

Research papers
Books and articles
Text-heavy documents

python pdf_to_markdown.py document.pdf --mode text

Table Mode

Extracts only tables from PDFs. Best for:

Financial reports
Data sheets
Statistical documents

python pdf_to_markdown.py data.pdf --mode table

Mixed Mode

Extracts both text and tables. Best for:

Annual reports
Comprehensive documents
Mixed-content files

python pdf_to_markdown.py report.pdf --mode mixed
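
The three modes differ only in which content is collected from each page. The sketch below shows one way that dispatch could look, assuming pages expose pdfplumber's extract_text() and extract_tables() methods. The enum members TEXT and TABLE are assumptions inferred from the CLI choices (only ExtractionMode.MIXED appears in this README), and extract_page and FakePage are hypothetical names used for illustration.

```python
from enum import Enum
from typing import Any, Dict


class ExtractionMode(Enum):
    # Only MIXED is confirmed by the README; TEXT and TABLE are assumed.
    TEXT = "text"
    TABLE = "table"
    MIXED = "mixed"


def extract_page(page: Any, mode: ExtractionMode) -> Dict[str, Any]:
    """Collect the text and/or tables that a given mode asks for."""
    result: Dict[str, Any] = {"text": None, "tables": []}
    if mode in (ExtractionMode.TEXT, ExtractionMode.MIXED):
        # Pages without a text layer may yield None or an empty string.
        result["text"] = page.extract_text() or ""
    if mode in (ExtractionMode.TABLE, ExtractionMode.MIXED):
        result["tables"] = page.extract_tables() or []
    return result


class FakePage:
    """Stand-in for a pdfplumber page so the sketch runs without a PDF."""

    def extract_text(self):
        return "Quarterly summary"

    def extract_tables(self):
        return [[["Q1", "Q2"], ["10", "12"]]]


if __name__ == "__main__":
    for mode in ExtractionMode:
        print(mode.value, extract_page(FakePage(), mode))
```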

Configuration Options

Table Extraction Settings

Customize table extraction behavior:

from pdf_to_markdown import PDFToMarkdownConverter, ConversionConfig

table_settings = {
    "vertical_strategy": "lines",      # or "text", "explicit"
    "horizontal_strategy": "lines",    # or "text", "explicit"
    "snap_tolerance": 3,               # Tolerance (in PDF points) for snapping parallel lines together
    "join_tolerance": 3,               # Tolerance for joining line segments
    "edge_min_length": 3,              # Minimum edge length; shorter edges are ignored
    "min_words_vertical": 3,           # Min aligned words for vertical "text" detection
    "min_words_horizontal": 1,         # Min aligned words for horizontal "text" detection
}

config = ConversionConfig(table_settings=table_settings)
converter = PDFToMarkdownConverter(config)
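
When only one or two settings need to change, it can help to overlay them on a full set of defaults and catch misspelled strategy names early. The following is a sketch under assumptions: merge_table_settings and DEFAULT_TABLE_SETTINGS are hypothetical names, the default values are copied from the settings listed above, and the accepted strategies are those pdfplumber documents ("lines", "lines_strict", "text", "explicit"). Whether the converter merges settings this way internally is not stated here.

```python
from typing import Any, Dict, Optional

# Values copied from the settings listed in this section.
DEFAULT_TABLE_SETTINGS: Dict[str, Any] = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
}

# Strategy names documented by pdfplumber.
VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def merge_table_settings(
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Overlay overrides on the defaults and validate the strategy names."""
    merged = {**DEFAULT_TABLE_SETTINGS, **(overrides or {})}
    for key in ("vertical_strategy", "horizontal_strategy"):
        if merged[key] not in VALID_STRATEGIES:
            raise ValueError(
                f"{key} must be one of {sorted(VALID_STRATEGIES)}, "
                f"got {merged[key]!r}"
            )
    return merged


if __name__ == "__main__":
    print(merge_table_settings({"vertical_strategy": "text",
                                "snap_tolerance": 5}))
```

The merged dictionary can then be passed as table_settings when building the configuration.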

Layout Preservation

Enable or disable layout preservation:

# Preserve layout (default)
config = ConversionConfig(preserve_layout=True)

# Disable layout preservation for cleaner text
config = ConversionConfig(preserve_layout=False)

Troubleshooting

Issue: "pdfplumber not installed"

Solution: Install the required package:

pip install pdfplumber

Issue: "Permission denied"

Solution: Ensure you have read permissions for the PDF and write permissions for the output directory:

chmod +r input.pdf
mkdir -p output && chmod +w output

Issue: "Failed to extract text from page"

Cause: The PDF may consist of scanned images or use non-standard fonts.

Solution: Use OCR preprocessing:

# Install Tesseract OCR
sudo apt-get install tesseract-ocr  # Ubuntu/Debian
brew install tesseract              # macOS

# Then run the PDF through an OCR tool to add a text layer before converting it
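
Before reaching for OCR, it may be worth checking whether a page really is a scan. A rough heuristic, sketched below, is that a scanned page contains image objects but no character objects. The sketch relies on the chars and images properties that pdfplumber pages provide; looks_scanned and FakePage are hypothetical names, and the heuristic will misjudge some pages (for example, a blank page or a scan that already carries an OCR text layer).

```python
from typing import Any


def looks_scanned(page: Any) -> bool:
    """Heuristic: images but no character objects suggests a scanned page."""
    has_text = len(page.chars) > 0
    has_images = len(page.images) > 0
    return has_images and not has_text


class FakePage:
    """Stand-in for a pdfplumber page so the sketch runs without a PDF."""

    def __init__(self, chars, images):
        self.chars = chars
        self.images = images


if __name__ == "__main__":
    scan = FakePage(chars=[], images=[{"name": "Im0"}])
    digital = FakePage(chars=[{"text": "A"}], images=[])
    print(looks_scanned(scan), looks_scanned(digital))
```

With a real document, the same function could be applied to each entry of pdf.pages inside a pdfplumber.open(...) block.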

Issue: Tables not extracted correctly

Solution: Try adjusting table settings:

config = ConversionConfig(
    table_settings={
        "vertical_strategy": "text",   # Try "text" instead of "lines"
        "snap_tolerance": 5,            # Increase tolerance
    }
)

Issue: Missing content

Solution: Enable verbose logging to diagnose:

python pdf_to_markdown.py document.pdf -v

Performance Considerations

Large PDFs: Processing time scales linearly with page count (~1-2 seconds per page)
Complex Tables: Table extraction adds overhead (~50-100ms per table)
Memory Usage: ~10-50MB per PDF, depending on complexity
Batch Processing: Use multiprocessing for large batches:

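from multiprocessing import Pool
from pathlib import Path

from pdf_to_markdown import PDFToMarkdownConverter

def convert_pdf(pdf_path):
    # Each worker process creates its own converter
    converter = PDFToMarkdownConverter()
    return converter.convert(pdf_path, pdf_path.with_suffix('.md'))

# The __main__ guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    pdf_paths = sorted(Path('.').glob('*.pdf'))
    # Process multiple PDFs in parallel
    with Pool(processes=4) as pool:
        results = pool.map(convert_pdf, pdf_paths)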
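
One caveat with the parallel example above: if any worker raises, pool.map re-raises the exception in the parent process and the results of the whole batch are lost. A possible pattern, sketched here with hypothetical names (convert_safely, summarize), is to catch errors per file and report them at the end. The convert argument stands in for whatever callable performs the conversion, so the sketch runs without the converter installed.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple

Result = Tuple[Path, Optional[str]]


def convert_safely(pdf_path: Path,
                   convert: Callable[[Path, Path], object]) -> Result:
    """Run one conversion; return (path, None) or (path, error message)."""
    try:
        convert(pdf_path, pdf_path.with_suffix(".md"))
        return pdf_path, None
    except Exception as exc:  # one bad PDF should not abort the batch
        return pdf_path, f"{type(exc).__name__}: {exc}"


def summarize(results: Iterable[Result]) -> Tuple[List[Path], List[Result]]:
    """Split results into succeeded paths and (path, error) failures."""
    results = list(results)
    succeeded = [path for path, err in results if err is None]
    failed = [(path, err) for path, err in results if err is not None]
    return succeeded, failed


if __name__ == "__main__":
    def fake_convert(src: Path, dst: Path) -> str:
        # Simulated converter: fails for one specific file.
        if src.name == "bad.pdf":
            raise ValueError("no pages")
        return "ok"

    results = [convert_safely(Path(name), fake_convert)
               for name in ["a.pdf", "bad.pdf", "c.pdf"]]
    ok, failed = summarize(results)
    print(f"{len(ok)} converted, {len(failed)} failed: {failed}")
```

To use this with a Pool, the error handling would need to live inside the module-level worker function, since the callable passed to pool.map must be picklable.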

Limitations

OCR: Does not perform OCR on scanned PDFs (requires external preprocessing)
Images: Text extraction only; images are not embedded in Markdown
Fonts: Complex font rendering may not preserve exact formatting
Encryption: Cannot process password-protected PDFs without decryption
Forms: Interactive PDF forms are converted to static text

Advanced Usage

Custom Page Processing

from pathlib import Path

from pdf_to_markdown import PDFToMarkdownConverter

class CustomConverter(PDFToMarkdownConverter):
    def _process_page(self, page, page_num):
        # Add custom preprocessing
        content = super()._process_page(page, page_num)
        
        # Custom postprocessing
        content = content.replace("specific_term", "**specific_term**")
        
        return content

converter = CustomConverter()
result = converter.convert(Path("document.pdf"))

Metadata Extraction

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    metadata = pdf.metadata
    print(f"Title: {metadata.get('Title')}")
    print(f"Author: {metadata.get('Author')}")
    print(f"Pages: {len(pdf.pages)}")

Testing

Run tests with pytest:

# Run all tests
pytest

# Run with coverage
pytest --cov=pdf_to_markdown --cov-report=html

# Run specific test
pytest tests/test_converter.py::test_basic_conversion

Contributing

Contributions welcome! Areas for improvement:

OCR integration for scanned PDFs
Image extraction and embedding
Form field extraction
Enhanced table detection algorithms
Performance optimization for large PDFs

License

MIT License - See LICENSE file for details

Support

For issues and questions:

GitHub Issues: Report bugs or request features
Documentation: Check this readme and inline code documentation
Logging: Enable verbose mode (-v) for detailed diagnostics

Version History

1.0.0 (Current): Initial production release
- Text and table extraction
- Multiple extraction modes
- CLI interface
- Comprehensive error handling
- Full type hints
