PDF to Markdown Converter

Production-ready Python tool for converting PDF documents to Markdown format with support for text extraction, table parsing, and layout preservation.

Features

  • Multiple Extraction Modes: Text, table, or mixed content extraction
  • Layout Preservation: Maintains original document formatting
  • Table Support: Automatically converts PDF tables to Markdown tables
  • Error Handling: Comprehensive error handling with detailed logging
  • CLI: Simple command-line interface with multiple options
  • Type Safety: Full type hints for better code quality
  • Production Ready: Logging, error boundaries, and proper resource management

Installation

Basic Installation

pip install pdfplumber

Install with All Dependencies

pip install -r requirements.txt

Development Installation

pip install -r requirements.txt
pip install pytest pytest-cov black mypy pylint

Quick Start

Basic Usage

Convert a PDF to Markdown (by default, the output file takes the input filename with a .md extension):

python pdf_to_markdown.py document.pdf

Specify Output File

python pdf_to_markdown.py input.pdf -o output.md

Extract Tables

python pdf_to_markdown.py data.pdf --mode table -o tables.md

Mixed Mode (Text + Tables)

python pdf_to_markdown.py report.pdf --mode mixed -o report.md

Verbose Logging

python pdf_to_markdown.py document.pdf -v

Usage Examples

Example 1: Simple Document Conversion

# Convert a basic PDF document
python pdf_to_markdown.py research_paper.pdf
# Output: research_paper.md

Example 2: Financial Report with Tables

# Extract tables from financial reports
python pdf_to_markdown.py quarterly_report.pdf --mode table -o q3_tables.md

Example 3: Mixed Content Document

# Process document with both text and tables
python pdf_to_markdown.py annual_report.pdf --mode mixed -o full_report.md

Example 4: Batch Processing

# Process multiple PDFs in a directory
# (create the output directory first in case the tool does not)
mkdir -p markdown
for pdf in *.pdf; do
    python pdf_to_markdown.py "$pdf" -o "markdown/${pdf%.pdf}.md"
done

Command-Line Options

usage: pdf_to_markdown.py [-h] [-o OUTPUT] [--mode {text,table,mixed}]
                          [--no-layout] [-v]
                          input

positional arguments:
  input                 Input PDF file path

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output Markdown file path (default: input filename
                        with .md extension)
  --mode {text,table,mixed}
                        Extraction mode (default: text)
  --no-layout           Disable layout preservation
  -v, --verbose         Enable verbose logging
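
The options listed above could be declared with argparse roughly as follows. This is a minimal sketch reconstructed from the usage text, not the tool's actual source; the helper names build_parser and resolve_output are hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(prog="pdf_to_markdown.py")
    parser.add_argument("input", type=Path, help="Input PDF file path")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="Output Markdown file path")
    parser.add_argument("--mode", choices=["text", "table", "mixed"],
                        default="text", help="Extraction mode")
    parser.add_argument("--no-layout", action="store_true",
                        help="Disable layout preservation")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser


def resolve_output(args: argparse.Namespace) -> Path:
    """Fall back to the input filename with a .md extension."""
    if args.output is not None:
        return args.output
    return args.input.with_suffix(".md")


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["report.pdf", "--mode", "mixed"])
print(args.mode, resolve_output(args))
```

Passing an explicit list to parse_args keeps the sketch runnable outside a shell; a real entry point would call parse_args() with no arguments.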

Programmatic Usage

Basic Conversion

from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter, ConversionConfig

# Create converter with default settings
converter = PDFToMarkdownConverter()

# Convert PDF
markdown = converter.convert(
    pdf_path=Path("input.pdf"),
    output_path=Path("output.md")
)

print(f"Converted {len(markdown)} characters")

Custom Configuration

from pathlib import Path

from pdf_to_markdown import (
    PDFToMarkdownConverter,
    ConversionConfig,
    ExtractionMode
)

# Configure converter
config = ConversionConfig(
    mode=ExtractionMode.MIXED,
    preserve_layout=True,
    extract_images=False,
    table_settings={
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 3,
    }
)

# Create converter with custom config
converter = PDFToMarkdownConverter(config)

# Convert
result = converter.convert(Path("document.pdf"))

Error Handling

from pathlib import Path
from pdf_to_markdown import PDFToMarkdownConverter
import logging

# Enable detailed logging
logging.basicConfig(level=logging.DEBUG)

converter = PDFToMarkdownConverter()

try:
    markdown = converter.convert(
        pdf_path=Path("input.pdf"),
        output_path=Path("output.md")
    )
    print(f"✓ Success: {len(markdown)} characters converted")

except FileNotFoundError as e:
    print(f"✗ File not found: {e}")

except PermissionError as e:
    print(f"✗ Permission denied: {e}")

except Exception as e:
    print(f"✗ Conversion failed: {e}")

Output Format

The converter generates Markdown with the following structure:

# DocumentName

*Converted from PDF with N pages*

---
## Page 1

[Page content here]

### Table 1

| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Cell 1 | Cell 2 | Cell 3 |

---
## Page 2

[Next page content...]
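
To illustrate how the table part of this structure can be produced, here is a minimal sketch that renders a list of rows as a Markdown pipe table. It assumes tables arrive as lists of rows with None for empty cells (the shape pdfplumber returns from its table extraction); the function name table_to_markdown is hypothetical and is not taken from the converter's source.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def table_to_markdown(rows: Sequence[Row]) -> str:
    """Render rows as a Markdown pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells become empty strings; newlines and pipes would break the table.
        if cell is None:
            return ""
        return str(cell).replace("\n", " ").replace("|", "\\|").strip()

    def fmt(row: Row) -> str:
        cells = [clean(cell) for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows to full width
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in rows[1:]]
    return "\n".join(lines)


print(table_to_markdown([
    ["Header 1", "Header 2", "Header 3"],
    ["Cell 1", "Cell 2", "Cell 3"],
]))
```

Escaping the pipe character and flattening newlines keeps each extracted cell on a single table row.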

Extraction Modes

Text Mode (Default)

Extracts only text content from PDFs. Best for:

  • Research papers
  • Books and articles
  • Text-heavy documents

python pdf_to_markdown.py document.pdf --mode text

Table Mode

Extracts only tables from PDFs. Best for:

  • Financial reports
  • Data sheets
  • Statistical documents

python pdf_to_markdown.py data.pdf --mode table

Mixed Mode

Extracts both text and tables. Best for:

  • Annual reports
  • Comprehensive documents
  • Mixed-content files

python pdf_to_markdown.py report.pdf --mode mixed

Configuration Options

Table Extraction Settings

Customize table extraction behavior:

table_settings = {
    "vertical_strategy": "lines",    # or "lines_strict", "text", "explicit"
    "horizontal_strategy": "lines",  # or "lines_strict", "text", "explicit"
    "snap_tolerance": 3,             # Tolerance (in PDF points) for snapping nearby parallel lines
    "join_tolerance": 3,             # Tolerance for joining line segments
    "edge_min_length": 3,            # Minimum edge length; shorter edges are discarded
    "min_words_vertical": 3,         # Min words for vertical detection ("text" strategy)
    "min_words_horizontal": 1,       # Min words for horizontal detection ("text" strategy)
}

config = ConversionConfig(table_settings=table_settings)
converter = PDFToMarkdownConverter(config)

Layout Preservation

Enable or disable layout preservation:

# Preserve layout (default)
config = ConversionConfig(preserve_layout=True)

# Disable layout preservation for cleaner text
config = ConversionConfig(preserve_layout=False)
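
For orientation, the configuration object used in these examples might look like the dataclass below. This is a stand-in sketch, not the module's actual definition: the field names come from the examples in this document, while the enum members TEXT and TABLE and the default for extract_images are assumptions (only ExtractionMode.MIXED, the text default mode, and the layout default are stated here).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class ExtractionMode(Enum):
    """Extraction modes; values mirror the CLI --mode choices."""
    TEXT = "text"    # assumed member name
    TABLE = "table"  # assumed member name
    MIXED = "mixed"


@dataclass
class ConversionConfig:
    """Stand-in for the converter's configuration object."""
    mode: ExtractionMode = ExtractionMode.TEXT
    preserve_layout: bool = True
    extract_images: bool = False  # assumed default
    # default_factory gives every config its own dict instead of a shared one
    table_settings: Dict[str, Any] = field(default_factory=dict)


# Map a CLI-style string onto the enum and override one table setting.
config = ConversionConfig(
    mode=ExtractionMode("mixed"),
    table_settings={"snap_tolerance": 5},
)
print(config)
```

Using default_factory for table_settings avoids the shared-mutable-default pitfall, so changing one config's settings does not leak into another.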

Troubleshooting

Issue: "pdfplumber not installed"

Solution: Install the required package:

pip install pdfplumber

Issue: "Permission denied"

Solution: Ensure you have read permissions for the PDF and write permissions for the output directory:

chmod +r input.pdf
mkdir -p output && chmod +w output

Issue: "Failed to extract text from page"

Cause: The PDF may consist of scanned images (no text layer) or use non-standard fonts.

Solution: Use OCR preprocessing:

# Install Tesseract OCR
sudo apt-get install tesseract-ocr # Ubuntu/Debian
brew install tesseract # macOS

# Then run the PDF through an OCR tool of your choice to add a text layer
# before converting it
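
Before reaching for OCR, it can help to confirm which pages actually lack a text layer. The sketch below is one way to check, assuming page objects that expose extract_text() as pdfplumber pages do; the function name pages_without_text and the min_chars threshold are hypothetical.

```python
from typing import Iterable, List, Optional, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, such as a pdfplumber page."""

    def extract_text(self) -> Optional[str]: ...


def pages_without_text(pages: Iterable[PageLike], min_chars: int = 1) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is empty or very short."""
    missing = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # tolerate None as well as ""
        if len(text.strip()) < min_chars:
            missing.append(number)
    return missing


# Usage with pdfplumber (requires the package and a real file):
#
#     import pdfplumber
#     with pdfplumber.open("input.pdf") as pdf:
#         print(pages_without_text(pdf.pages))
```

Pages reported by this check are likely scanned images and are candidates for OCR preprocessing; an empty result suggests the extraction problem lies elsewhere, for example in font encoding.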

Issue: Tables not extracted correctly

Solution: Try adjusting table settings:

config = ConversionConfig(
    table_settings={
        "vertical_strategy": "text",  # Try "text" instead of "lines"
        "snap_tolerance": 5,          # Increase tolerance
    }
)
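
Rather than adjusting settings by hand, the candidates can be tried in order until one yields tables. The following is a minimal sketch that works directly on a page object exposing extract_tables(table_settings), as pdfplumber pages do; the helper first_settings_with_tables and the CANDIDATES list are hypothetical and not part of the converter.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Table = List[List[Optional[str]]]
Settings = Dict[str, Any]

# Ordered from strictest (ruled lines) to loosest (text alignment).
CANDIDATES: List[Settings] = [
    {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
    {"vertical_strategy": "lines", "horizontal_strategy": "lines",
     "snap_tolerance": 5},
    {"vertical_strategy": "text", "horizontal_strategy": "text"},
]


def first_settings_with_tables(
    page: Any, candidates: Iterable[Settings] = CANDIDATES
) -> Tuple[Optional[Settings], List[Table]]:
    """Return the first settings that produce at least one table, with the tables."""
    for settings in candidates:
        tables = page.extract_tables(settings)
        if tables:
            return settings, tables
    return None, []
```

"At least one table" is a crude success criterion; a text-based strategy can also report spurious tables, so inspect the result before adopting the winning settings in ConversionConfig.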

Issue: Missing content

Solution: Enable verbose logging to diagnose:

python pdf_to_markdown.py document.pdf -v

Performance Considerations

  • Large PDFs: Processing time scales linearly with page count (~1-2 seconds per page)
  • Complex Tables: Table extraction adds overhead (~50-100ms per table)
  • Memory Usage: ~10-50MB per PDF, depending on complexity
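  • Batch Processing: Use multiprocessing for large batches:

from multiprocessing import Pool
from pathlib import Path

from pdf_to_markdown import PDFToMarkdownConverter


def convert_pdf(pdf_path):
    # Each worker builds its own converter and writes next to the source PDF
    converter = PDFToMarkdownConverter()
    return converter.convert(pdf_path, pdf_path.with_suffix('.md'))


if __name__ == "__main__":
    # The guard is needed on platforms that spawn worker processes
    # (Windows, macOS). Process multiple PDFs in parallel.
    with Pool(processes=4) as pool:
        results = pool.map(convert_pdf, sorted(Path('.').glob('*.pdf')))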

Limitations

  • OCR: Does not perform OCR on scanned PDFs (requires external preprocessing)
  • Images: Text extraction only; images are not embedded in Markdown
  • Fonts: Complex font rendering may not preserve exact formatting
  • Encryption: Cannot process password-protected PDFs without decryption
  • Forms: Interactive PDF forms are converted to static text

Advanced Usage

Custom Page Processing

from pathlib import Path

from pdf_to_markdown import PDFToMarkdownConverter


class CustomConverter(PDFToMarkdownConverter):
    def _process_page(self, page, page_num):
        # Add custom preprocessing here, then run the default processing
        content = super()._process_page(page, page_num)

        # Custom postprocessing: bold every occurrence of a term
        content = content.replace("specific_term", "**specific_term**")

        return content


converter = CustomConverter()
result = converter.convert(Path("document.pdf"))

Metadata Extraction

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    metadata = pdf.metadata
    print(f"Title: {metadata.get('Title')}")
    print(f"Author: {metadata.get('Author')}")
    print(f"Pages: {len(pdf.pages)}")

Testing

Run tests with pytest:

# Run all tests
pytest

# Run with coverage
pytest --cov=pdf_to_markdown --cov-report=html

# Run specific test
pytest tests/test_converter.py::test_basic_conversion

Contributing

Contributions welcome! Areas for improvement:

  1. OCR integration for scanned PDFs
  2. Image extraction and embedding
  3. Form field extraction
  4. Enhanced table detection algorithms
  5. Performance optimization for large PDFs

License

MIT License - See LICENSE file for details

Support

For issues and questions:

  • GitHub Issues: Report bugs or request features
  • Documentation: Check this readme and inline code documentation
  • Logging: Enable verbose mode (-v) for detailed diagnostics

Version History

  • 1.0.0 (Current): Initial production release
    • Text and table extraction
    • Multiple extraction modes
    • Command-line interface
    • Comprehensive error handling
    • Full type hints