Analysis and Refactoring Plan for app.py

Current Structure Analysis

Core Components

  1. Configuration and Setup

    • Environment variables loading
    • Logging configuration
    • Application constants
    • Flask app initialization with CORS and rate limiting
  2. File Management

    • File upload handling
    • File type validation
    • Temporary file cleanup
  3. Text Processing

    • Basic chunking without overlap
    • No UUID tracking between chunks
    • Simple prompt structure
  4. API Endpoints

    • /health
    • /upload (with rate limiting)

Current Limitations

  1. Chunking Mechanism

    • No overlap between chunks
    • Potential for breaking words at chunk boundaries
    • No relationship tracking between chunks
  2. Document Management

    • No document-level UUID
    • No chunk sequence tracking
    • Limited metadata
  3. Prompt Structure

    • Basic prompt formatting
    • No structured JSON format
    • Limited context preservation between chunks


Proposed Refactoring Structure

1. Module Organization

```text
app/
├── __init__.py
├── main.py              # Application entry point
├── config/
│   ├── __init__.py
│   ├── settings.py      # Configuration settings
│   └── logging.py       # Logging configuration
├── core/
│   ├── __init__.py
│   ├── document.py      # Document processing
│   ├── chunking.py      # Chunk management with UUID
│   └── prompting.py     # Anthropic prompt generation
├── api/
│   ├── __init__.py
│   ├── routes.py        # API endpoints
│   └── middleware.py    # Rate limiting, error handling
├── utils/
│   ├── __init__.py
│   ├── file_handler.py  # File operations
│   └── validators.py    # Input validation
└── services/
    ├── __init__.py
    └── anthropic.py     # Anthropic API integration
```

2. Key Improvements

Document Processing

```python
# core/document.py
import time
import uuid
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DocumentMetadata:
    doc_uuid: str
    filename: str
    file_type: str
    total_chunks: int
    created_at: float

class Document:
    def __init__(self, content: str, filename: str):
        self.doc_uuid = str(uuid.uuid4())
        self.content = content
        self.metadata = DocumentMetadata(
            doc_uuid=self.doc_uuid,
            filename=filename,
            file_type=self._determine_file_type(filename),
            total_chunks=0,
            created_at=time.time(),
        )

    @staticmethod
    def _determine_file_type(filename: str) -> str:
        # Derive the file type from the extension, e.g. "report.txt" -> "txt"
        return Path(filename).suffix.lstrip(".").lower()
```

Chunk Management

```python
# core/chunking.py
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional

from core.document import Document

@dataclass
class ChunkMetadata:
    doc_uuid: str
    chunk_uuid: str
    chunk_sequence: int
    previous_uuid: Optional[str]
    next_uuid: Optional[str]

class ChunkManager:
    def __init__(self, chunk_size: int, overlap_percentage: float = 0.1):
        self.chunk_size = chunk_size
        self.overlap_size = int(chunk_size * overlap_percentage)

    def create_chunks(self, document: Document) -> List[Dict]:
        chunks = []
        previous_uuid = None
        chunk_sequence = 1

        for chunk_text in self._generate_overlapping_chunks(document.content):
            current_uuid = str(uuid.uuid4())
            chunk = self._create_chunk(
                document,
                chunk_text,
                current_uuid,
                previous_uuid,
                chunk_sequence,
            )
            chunks.append(chunk)
            previous_uuid = current_uuid
            chunk_sequence += 1

        # Note: each chunk's next_uuid must be back-filled in a second pass,
        # once the following chunk's UUID is known.
        return chunks
```
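The plan references a `_generate_overlapping_chunks` helper without defining it. One possible implementation, a sketch rather than the final design, slides a window of `chunk_size` characters forward by `chunk_size - overlap_size` so that consecutive chunks share the overlap region:

```python
from typing import Iterator

def generate_overlapping_chunks(content: str, chunk_size: int, overlap_size: int) -> Iterator[str]:
    # Slide a fixed-size window forward by (chunk_size - overlap_size),
    # so each chunk repeats the tail of the previous one.
    step = max(1, chunk_size - overlap_size)
    for start in range(0, len(content), step):
        yield content[start:start + chunk_size]
        if start + chunk_size >= len(content):
            break  # this window already covers the end of the text

chunks = list(generate_overlapping_chunks("a" * 25, chunk_size=10, overlap_size=2))
# step = 8 -> windows start at 0, 8, 16, with lengths 10, 10, 9
```

A character-based window is the simplest variant; a production version would likely snap boundaries to whitespace to avoid splitting words.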

Prompt Generation

```python
# core/prompting.py
from typing import Dict

class PromptGenerator:
    def create_analysis_prompt(self, chunk: Dict, user_query: str) -> Dict:
        meta = chunk["chunk_metadata"]
        return {
            "metadata": {
                "doc_uuid": meta["doc_uuid"],
                "chunk_uuid": meta["chunk_uuid"],
                "chunk_sequence": meta["chunk_sequence"],
                "previous_uuid": meta["previous_uuid"],
                "next_uuid": meta["next_uuid"],
            },
            "content": chunk["chunk_text"],
            "query": user_query,
            "instructions": {
                "context": f"This is chunk {meta['chunk_sequence']}",
                "tasks": [
                    "Analyze the content step by step",
                    "Consider the relationship with previous and next chunks",
                    "Provide structured insights in markdown format",
                ],
            },
        }
```
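Because the generator returns a plain dict, the payload can be serialized directly with `json.dumps` before being handed to the Anthropic service wrapper. A minimal sketch of the expected chunk record shape (the UUID values here are invented for illustration):

```python
import json

# Illustrative chunk record matching the shape PromptGenerator expects;
# the identifiers are placeholders, not real UUIDs.
chunk = {
    "chunk_metadata": {
        "doc_uuid": "doc-123",
        "chunk_uuid": "chunk-456",
        "chunk_sequence": 1,
        "previous_uuid": None,
        "next_uuid": "chunk-789",
    },
    "chunk_text": "First section of the document...",
}

payload = {
    "metadata": chunk["chunk_metadata"],
    "content": chunk["chunk_text"],
    "query": "Summarize the key points",
}
print(json.dumps(payload, indent=2))  # the JSON sent on to the Anthropic service wrapper
```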

3. API Endpoint Preservation

```python
# api/routes.py
from flask import jsonify, request

# `app`, `limiter`, `logger`, and the CHUNK_SIZE / OVERLAP_PERCENTAGE
# constants are provided by the application factory and config modules.

@app.route('/upload', methods=['POST'])
@limiter.limit("10 per minute")
def upload_file():
    """Maintains the same endpoint interface while using the new infrastructure."""
    try:
        file = request.files.get('file')
        if file is None:
            return jsonify({"error": "No file provided"}), 400
        prompt = request.form['prompt']
        context = request.form.get('context', '')

        # Validation remains the same
        validate_file_upload(file)

        # Process using the new structure
        document = Document(file.read().decode('utf-8'), file.filename)
        chunk_manager = ChunkManager(CHUNK_SIZE, OVERLAP_PERCENTAGE)
        chunks = chunk_manager.create_chunks(document)

        # Generate a prompt per chunk and collect the model responses
        prompt_generator = PromptGenerator()
        responses = []

        for chunk in chunks:
            prompt_json = prompt_generator.create_analysis_prompt(chunk, prompt)
            response = process_chunk_with_anthropic(prompt_json)
            responses.append(response)

        return jsonify({
            "status": "success",
            "responses": responses
        })

    except Exception as e:
        logger.exception("Error processing upload")
        return jsonify({
            "error": "Internal server error",
            "details": str(e)
        }), 500
```
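The route calls a `validate_file_upload` helper that the plan places in `utils/validators.py` but never defines. A minimal sketch, written here against a filename and byte count rather than the Flask `FileStorage` object, and with the allowed extensions and size cap as assumptions rather than decided values:

```python
# utils/validators.py (sketch)
ALLOWED_EXTENSIONS = {"txt", "md", "csv"}   # assumption: plain-text formats only
MAX_FILE_BYTES = 5 * 1024 * 1024            # assumption: 5 MB cap

def validate_file_upload(filename: str, size_bytes: int) -> None:
    # Reject files with no extension, a disallowed extension, or an oversized body.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError("File too large")

validate_file_upload("notes.md", 1024)  # passes silently
```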

Implementation Plan

  1. Phase 1: Core Infrastructure

    • Create new directory structure
    • Implement Document and ChunkManager classes
    • Set up configuration management
    • Establish logging infrastructure
  2. Phase 2: Service Layer

    • Implement PromptGenerator
    • Create Anthropic service wrapper
    • Set up file handling utilities
  3. Phase 3: API Layer

    • Migrate existing endpoints
    • Implement error handling
    • Add middleware
  4. Phase 4: Testing and Integration

    • Unit tests for new components
    • Integration tests for API endpoints
    • Performance testing with large documents
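Phase 1's configuration management could centralize the environment-variable loading in a small settings object. A sketch, with the variable names and defaults assumed from the constants used elsewhere in this plan:

```python
# config/settings.py (sketch)
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    chunk_size: int
    overlap_percentage: float
    upload_rate_limit: str

def load_settings() -> Settings:
    # Read each value from the environment, falling back to defaults.
    return Settings(
        chunk_size=int(os.getenv("CHUNK_SIZE", "2000")),
        overlap_percentage=float(os.getenv("OVERLAP_PERCENTAGE", "0.1")),
        upload_rate_limit=os.getenv("UPLOAD_RATE_LIMIT", "10 per minute"),
    )

settings = load_settings()
```

A frozen dataclass keeps configuration immutable after startup and gives every module one typed place to read it from.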

Key Benefits

  1. Improved Maintainability

    • Clear separation of concerns
    • Modular components
    • Better error handling
    • Comprehensive logging
  2. Enhanced Functionality

    • Robust chunk relationship tracking
    • Improved context preservation
    • Better document management
  3. Future-Proofing

    • Easy to extend with new features
    • Simple to modify prompt structure
    • Flexible document processing pipeline

Migration Strategy

  1. Create new structure alongside existing code
  2. Implement new features in parallel
  3. Gradually migrate functionality
  4. Maintain existing endpoints throughout
  5. Add comprehensive tests
  6. Deploy with feature flags if needed
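Step 6's feature flags can start as a single environment toggle that routes uploads through either pipeline; the flag name and the two handler stubs below are placeholders for illustration:

```python
import os

def use_new_pipeline() -> bool:
    # Feature flag: set USE_NEW_PIPELINE=1 to route uploads through the
    # refactored Document/ChunkManager path instead of the legacy code.
    return os.getenv("USE_NEW_PIPELINE", "0") == "1"

def handle_upload(payload: str) -> str:
    if use_new_pipeline():
        return f"new pipeline: {len(payload)} chars"   # placeholder for the refactored path
    return f"legacy pipeline: {len(payload)} chars"    # placeholder for the current app.py path
```

Checking the flag per request (rather than once at import time) lets the toggle be flipped without a redeploy.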

Would you like me to focus on implementing any specific part of this refactoring plan?