Analysis and Refactoring Plan for app.py

Current Structure Analysis

Core Components

  1. Configuration and Setup

    • Environment variables loading
    • Logging configuration
    • Application constants
    • Flask app initialization with CORS and rate limiting
  2. File Management

    • File upload handling
    • File type validation
    • Temporary file cleanup
  3. Text Processing

    • Basic chunking without overlap
    • No UUID tracking between chunks
    • Simple prompt structure
  4. API Endpoints

    • /health
    • /upload (with rate limiting)

Current Limitations

  1. Chunking Mechanism

    • No overlap between chunks
    • Potential for breaking words at chunk boundaries
    • No relationship tracking between chunks
  2. Document Management

    • No document-level UUID
    • No chunk sequence tracking
    • Limited metadata
  3. Prompt Structure

    • Basic prompt formatting
    • No structured JSON format
    • Limited context preservation between chunks


Proposed Refactoring Structure

1. Module Organization

```text
app/
├── __init__.py
├── main.py              # Application entry point
├── config/
│   ├── __init__.py
│   ├── settings.py      # Configuration settings
│   └── logging.py       # Logging configuration
├── core/
│   ├── __init__.py
│   ├── document.py      # Document processing
│   ├── chunking.py      # Chunk management with UUID
│   └── prompting.py     # Anthropic prompt generation
├── api/
│   ├── __init__.py
│   ├── routes.py        # API endpoints
│   └── middleware.py    # Rate limiting, error handling
├── utils/
│   ├── __init__.py
│   ├── file_handler.py  # File operations
│   └── validators.py    # Input validation
└── services/
    ├── __init__.py
    └── anthropic.py     # Anthropic API integration
```

2. Key Improvements

Document Processing

```python
# core/document.py
import time
import uuid
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DocumentMetadata:
    doc_uuid: str
    filename: str
    file_type: str
    total_chunks: int
    created_at: float

class Document:
    def __init__(self, content: str, filename: str):
        self.doc_uuid = str(uuid.uuid4())
        self.content = content
        self.metadata = DocumentMetadata(
            doc_uuid=self.doc_uuid,
            filename=filename,
            file_type=self._determine_file_type(filename),
            total_chunks=0,
            created_at=time.time(),
        )

    @staticmethod
    def _determine_file_type(filename: str) -> str:
        # Derive the file type from the extension, e.g. "report.txt" -> "txt"
        return Path(filename).suffix.lstrip(".").lower()
```

Chunk Management

```python
# core/chunking.py
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

from core.document import Document

@dataclass
class ChunkMetadata:
    doc_uuid: str
    chunk_uuid: str
    chunk_sequence: int
    previous_uuid: Optional[str]
    next_uuid: Optional[str]

class ChunkManager:
    def __init__(self, chunk_size: int, overlap_percentage: float = 0.1):
        self.chunk_size = chunk_size
        self.overlap_size = int(chunk_size * overlap_percentage)

    def create_chunks(self, document: Document) -> List[Dict]:
        chunks = []
        previous_uuid = None
        chunk_sequence = 1

        for chunk_text in self._generate_overlapping_chunks(document.content):
            current_uuid = str(uuid.uuid4())
            chunk = self._create_chunk(
                document,
                chunk_text,
                current_uuid,
                previous_uuid,
                chunk_sequence,
            )
            chunks.append(chunk)
            previous_uuid = current_uuid
            chunk_sequence += 1

        # Note: each chunk's next_uuid must be back-filled in a second pass,
        # once the following chunk's UUID is known.
        return chunks
```
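The plan references a `_generate_overlapping_chunks` helper without defining it. One possible implementation, a sketch rather than the final design, slides a window of `chunk_size` characters forward by `chunk_size - overlap_size` so that consecutive chunks share the overlap region:

```python
from typing import Iterator

def generate_overlapping_chunks(content: str, chunk_size: int, overlap_size: int) -> Iterator[str]:
    # Slide a fixed-size window forward by (chunk_size - overlap_size),
    # so each chunk repeats the tail of the previous one.
    step = max(1, chunk_size - overlap_size)
    for start in range(0, len(content), step):
        yield content[start:start + chunk_size]
        if start + chunk_size >= len(content):
            break  # this window already covers the end of the text

chunks = list(generate_overlapping_chunks("a" * 25, chunk_size=10, overlap_size=2))
# step = 8 -> windows start at 0, 8, 16, with lengths 10, 10, 9
```

A character-based window is the simplest variant; a production version would likely snap boundaries to whitespace to avoid splitting words.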

Prompt Generation

```python
# core/prompting.py
from typing import Dict

class PromptGenerator:
    def create_analysis_prompt(self, chunk: Dict, user_query: str) -> Dict:
        meta = chunk["chunk_metadata"]
        return {
            "metadata": {
                "doc_uuid": meta["doc_uuid"],
                "chunk_uuid": meta["chunk_uuid"],
                "chunk_sequence": meta["chunk_sequence"],
                "previous_uuid": meta["previous_uuid"],
                "next_uuid": meta["next_uuid"],
            },
            "content": chunk["chunk_text"],
            "query": user_query,
            "instructions": {
                "context": f"This is chunk {meta['chunk_sequence']}",
                "tasks": [
                    "Analyze the content step by step",
                    "Consider the relationship with previous and next chunks",
                    "Provide structured insights in markdown format",
                ],
            },
        }
```
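Because the generator returns a plain dict, the payload can be serialized directly with `json.dumps` before being handed to the Anthropic service wrapper. A minimal sketch of the expected chunk record shape (the UUID values here are invented for illustration):

```python
import json

# Illustrative chunk record matching the shape PromptGenerator expects;
# the identifiers are placeholders, not real UUIDs.
chunk = {
    "chunk_metadata": {
        "doc_uuid": "doc-123",
        "chunk_uuid": "chunk-456",
        "chunk_sequence": 1,
        "previous_uuid": None,
        "next_uuid": "chunk-789",
    },
    "chunk_text": "First section of the document...",
}

payload = {
    "metadata": chunk["chunk_metadata"],
    "content": chunk["chunk_text"],
    "query": "Summarize the key points",
}
print(json.dumps(payload, indent=2))  # the JSON sent on to the Anthropic service wrapper
```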

3. API Endpoint Preservation

```python
# api/routes.py
from flask import jsonify, request

# `app`, `limiter`, `logger`, and the CHUNK_SIZE / OVERLAP_PERCENTAGE
# constants are provided by the application factory and config modules.

@app.route('/upload', methods=['POST'])
@limiter.limit("10 per minute")
def upload_file():
    """Maintains the same endpoint interface while using the new infrastructure."""
    try:
        file = request.files.get('file')
        if file is None:
            return jsonify({"error": "No file provided"}), 400
        prompt = request.form['prompt']
        context = request.form.get('context', '')

        # Validation remains the same
        validate_file_upload(file)

        # Process using the new structure
        document = Document(file.read().decode('utf-8'), file.filename)
        chunk_manager = ChunkManager(CHUNK_SIZE, OVERLAP_PERCENTAGE)
        chunks = chunk_manager.create_chunks(document)

        # Generate a prompt per chunk and collect the model responses
        prompt_generator = PromptGenerator()
        responses = []

        for chunk in chunks:
            prompt_json = prompt_generator.create_analysis_prompt(chunk, prompt)
            response = process_chunk_with_anthropic(prompt_json)
            responses.append(response)

        return jsonify({
            "status": "success",
            "responses": responses
        })

    except Exception as e:
        logger.exception("Error processing upload")
        return jsonify({
            "error": "Internal server error",
            "details": str(e)
        }), 500
```
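The route calls a `validate_file_upload` helper that the plan places in `utils/validators.py` but never defines. A minimal sketch, written here against a filename and byte count rather than the Flask `FileStorage` object, and with the allowed extensions and size cap as assumptions rather than decided values:

```python
# utils/validators.py (sketch)
ALLOWED_EXTENSIONS = {"txt", "md", "csv"}   # assumption: plain-text formats only
MAX_FILE_BYTES = 5 * 1024 * 1024            # assumption: 5 MB cap

def validate_file_upload(filename: str, size_bytes: int) -> None:
    # Reject files with no extension, a disallowed extension, or an oversized body.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError("File too large")

validate_file_upload("notes.md", 1024)  # passes silently
```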

Implementation Plan

  1. Phase 1: Core Infrastructure

    • Create new directory structure
    • Implement Document and ChunkManager classes
    • Set up configuration management
    • Establish logging infrastructure
  2. Phase 2: Service Layer

    • Implement PromptGenerator
    • Create Anthropic service wrapper
    • Set up file handling utilities
  3. Phase 3: API Layer

    • Migrate existing endpoints
    • Implement error handling
    • Add middleware
  4. Phase 4: Testing and Integration

    • Unit tests for new components
    • Integration tests for API endpoints
    • Performance testing with large documents
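Phase 1's configuration management could centralize the environment-variable loading in a small settings object. A sketch, with the variable names and defaults assumed from the constants used elsewhere in this plan:

```python
# config/settings.py (sketch)
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    chunk_size: int
    overlap_percentage: float
    upload_rate_limit: str

def load_settings() -> Settings:
    # Read each value from the environment, falling back to defaults.
    return Settings(
        chunk_size=int(os.getenv("CHUNK_SIZE", "2000")),
        overlap_percentage=float(os.getenv("OVERLAP_PERCENTAGE", "0.1")),
        upload_rate_limit=os.getenv("UPLOAD_RATE_LIMIT", "10 per minute"),
    )

settings = load_settings()
```

A frozen dataclass keeps configuration immutable after startup and gives every module one typed place to read it from.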

Key Benefits

  1. Improved Maintainability

    • Clear separation of concerns
    • Modular components
    • Better error handling
    • Comprehensive logging
  2. Enhanced Functionality

    • Robust chunk relationship tracking
    • Improved context preservation
    • Better document management
  3. Future-Proofing

    • Easy to extend with new features
    • Simple to modify prompt structure
    • Flexible document processing pipeline

Migration Strategy

  1. Create new structure alongside existing code
  2. Implement new features in parallel
  3. Gradually migrate functionality
  4. Maintain existing endpoints throughout
  5. Add comprehensive tests
  6. Deploy with feature flags if needed
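Step 6's feature flags can start as a single environment toggle that routes uploads through either pipeline; the flag name and the two handler stubs below are placeholders for illustration:

```python
import os

def use_new_pipeline() -> bool:
    # Feature flag: set USE_NEW_PIPELINE=1 to route uploads through the
    # refactored Document/ChunkManager path instead of the legacy code.
    return os.getenv("USE_NEW_PIPELINE", "0") == "1"

def handle_upload(payload: str) -> str:
    if use_new_pipeline():
        return f"new pipeline: {len(payload)} chars"   # placeholder for the refactored path
    return f"legacy pipeline: {len(payload)} chars"    # placeholder for the current app.py path
```

Checking the flag per request (rather than once at import time) lets the toggle be flipped without a redeploy.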

Would you like me to focus on implementing any specific part of this refactoring plan?