Analysis and Refactoring Plan for app.py
Current Structure Analysis
Core Components
- Configuration and Setup
- Environment variables loading
- Logging configuration
- Application constants
- Flask app initialization with CORS and rate limiting
- File Management
- File upload handling
- File type validation
- Temporary file cleanup
- Text Processing
- Basic chunking without overlap
- No UUID tracking between chunks
- Simple prompt structure
- API Endpoints
- /health
- /upload (with rate limiting)
Current Limitations
- Chunking Mechanism
- No overlap between chunks
- Words can be split mid-word at chunk boundaries
- No relationship tracking between chunks
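A minimal sketch of overlap-aware chunking that avoids breaking words (the function name and sizes here are illustrative, not taken from app.py):

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into overlapping chunks, breaking only at whitespace."""
    words = text.split()
    chunks, current, length = [], [], 0
    for word in words:
        if length + len(word) + 1 > chunk_size and current:
            chunks.append(" ".join(current))
            # Carry roughly `overlap` characters of trailing words forward
            carried, carried_len = [], 0
            for w in reversed(current):
                if carried_len + len(w) + 1 > overlap:
                    break
                carried.insert(0, w)
                carried_len += len(w) + 1
            current, length = carried, carried_len
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks end on whitespace and each new chunk re-includes the tail of the previous one, downstream prompts never see half a word and always get some shared context.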
- Document Management
- No document-level UUID
- No chunk sequence tracking
- Limited metadata
- Prompt Structure
- Basic prompt formatting
- No structured JSON format
- Limited context preservation between chunks
Proposed Refactoring Structure
1. Module Organization
app/
├── __init__.py
├── main.py              # Application entry point
├── config/
│   ├── __init__.py
│   ├── settings.py      # Configuration settings
│   └── logging.py       # Logging configuration
├── core/
│   ├── __init__.py
│   ├── document.py      # Document processing
│   ├── chunking.py      # Chunk management with UUIDs
│   └── prompting.py     # Anthropic prompt generation
├── api/
│   ├── __init__.py
│   ├── routes.py        # API endpoints
│   └── middleware.py    # Rate limiting, error handling
├── utils/
│   ├── __init__.py
│   ├── file_handler.py  # File operations
│   └── validators.py    # Input validation
└── services/
    ├── __init__.py
    └── anthropic.py     # Anthropic API integration
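A sketch of what config/settings.py could look like; the variable names and defaults are assumptions, to be replaced by whatever app.py currently hardcodes:

```python
# config/settings.py (sketch; names and defaults are illustrative)
import os

# Chunking parameters, overridable via the environment
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "4000"))
OVERLAP_PERCENTAGE = float(os.getenv("OVERLAP_PERCENTAGE", "0.1"))

# Upload constraints
MAX_FILE_SIZE_MB = int(os.getenv("MAX_FILE_SIZE_MB", "10"))
ALLOWED_EXTENSIONS = {"txt", "md", "csv", "json"}

# Anthropic API credentials come from the environment, never from code
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
```

Centralizing these values means the rest of the codebase imports constants instead of reading `os.environ` in scattered places.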
2. Key Improvements
Document Processing
# core/document.py
import time
import uuid
from dataclasses import dataclass

@dataclass
class DocumentMetadata:
    doc_uuid: str
    filename: str
    file_type: str
    total_chunks: int
    created_at: float

class Document:
    def __init__(self, content: str, filename: str):
        self.doc_uuid = str(uuid.uuid4())
        self.content = content
        self.metadata = DocumentMetadata(
            doc_uuid=self.doc_uuid,
            filename=filename,
            file_type=self._determine_file_type(filename),
            total_chunks=0,
            created_at=time.time(),
        )

    @staticmethod
    def _determine_file_type(filename: str) -> str:
        # Derive the type from the file extension; default to "txt"
        return filename.rsplit(".", 1)[-1].lower() if "." in filename else "txt"
Chunk Management
# core/chunking.py
import uuid
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional

from core.document import Document

@dataclass
class ChunkMetadata:
    doc_uuid: str
    chunk_uuid: str
    chunk_sequence: int
    previous_uuid: Optional[str]
    next_uuid: Optional[str]

class ChunkManager:
    def __init__(self, chunk_size: int, overlap_percentage: float = 0.1):
        self.chunk_size = chunk_size
        self.overlap_size = int(chunk_size * overlap_percentage)

    def create_chunks(self, document: Document) -> List[Dict]:
        chunks = []
        previous_uuid = None
        chunk_sequence = 1
        for chunk_text in self._generate_overlapping_chunks(document.content):
            current_uuid = str(uuid.uuid4())
            chunk = self._create_chunk(
                document,
                chunk_text,
                current_uuid,
                previous_uuid,
                chunk_sequence,
            )
            chunks.append(chunk)
            previous_uuid = current_uuid
            chunk_sequence += 1
        # Backfill next_uuid links now that every chunk UUID is known
        for current, nxt in zip(chunks, chunks[1:]):
            current["chunk_metadata"]["next_uuid"] = nxt["chunk_metadata"]["chunk_uuid"]
        return chunks

    def _generate_overlapping_chunks(self, content: str) -> Iterator[str]:
        step = max(self.chunk_size - self.overlap_size, 1)
        for start in range(0, len(content), step):
            yield content[start:start + self.chunk_size]

    def _create_chunk(self, document, chunk_text, current_uuid,
                      previous_uuid, chunk_sequence) -> Dict:
        return {
            "chunk_metadata": {
                "doc_uuid": document.doc_uuid,
                "chunk_uuid": current_uuid,
                "chunk_sequence": chunk_sequence,
                "previous_uuid": previous_uuid,
                "next_uuid": None,  # filled in by create_chunks
            },
            "chunk_text": chunk_text,
        }
Prompt Generation
# core/prompting.py
from typing import Dict

class PromptGenerator:
    def create_analysis_prompt(self, chunk: Dict, user_query: str) -> Dict:
        meta = chunk["chunk_metadata"]
        return {
            "metadata": {
                "doc_uuid": meta["doc_uuid"],
                "chunk_uuid": meta["chunk_uuid"],
                "chunk_sequence": meta["chunk_sequence"],
                "previous_uuid": meta["previous_uuid"],
                "next_uuid": meta["next_uuid"],
            },
            "content": chunk["chunk_text"],
            "query": user_query,
            "instructions": {
                "context": f"This is chunk {meta['chunk_sequence']}",
                "tasks": [
                    "Analyze the content step by step",
                    "Consider the relationship with previous and next chunks",
                    "Provide structured insights in markdown format",
                ],
            },
        }
3. API Endpoint Preservation
# api/routes.py
@app.route('/upload', methods=['POST'])
@limiter.limit("10 per minute")
def upload_file():
    """
    Maintains the same endpoint interface while using the new infrastructure.
    """
    try:
        file = request.files['file']
        prompt = request.form['prompt']
        context = request.form.get('context', '')

        # Validation remains the same
        validate_file_upload(file)

        # Process using the new structure
        document = Document(file.read().decode('utf-8'), file.filename)
        chunk_manager = ChunkManager(CHUNK_SIZE, OVERLAP_PERCENTAGE)
        chunks = chunk_manager.create_chunks(document)

        # Generate prompts and collect responses
        prompt_generator = PromptGenerator()
        responses = []
        for chunk in chunks:
            prompt_json = prompt_generator.create_analysis_prompt(chunk, prompt)
            response = process_chunk_with_anthropic(prompt_json)
            responses.append(response)

        return jsonify({
            "status": "success",
            "responses": responses,
        })
    except Exception as e:
        logger.exception("Error processing upload")
        return jsonify({
            "error": "Internal server error",
            "details": str(e),
        }), 500
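The `validate_file_upload` helper referenced above could live in utils/validators.py; this is a sketch with assumed limits, not the existing validation logic from app.py:

```python
# utils/validators.py (sketch; extension list and size limit are assumptions)
ALLOWED_EXTENSIONS = {"txt", "md", "csv", "json"}
MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB

class ValidationError(ValueError):
    """Raised when an uploaded file fails validation."""

def validate_file_upload(file) -> None:
    """Reject uploads with no name, a disallowed extension, or excessive size."""
    if file is None or not getattr(file, "filename", ""):
        raise ValidationError("No file provided")
    ext = file.filename.rsplit(".", 1)[-1].lower() if "." in file.filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValidationError(f"Unsupported file type: .{ext or '?'}")
    size = len(file.read())
    file.seek(0)  # rewind so later processing can re-read the stream
    if size > MAX_FILE_SIZE:
        raise ValidationError("File exceeds size limit")
```

Raising a dedicated `ValidationError` lets the API layer map validation failures to 400 responses instead of the generic 500 handler.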
Implementation Plan
- Phase 1: Core Infrastructure
- Create new directory structure
- Implement Document and ChunkManager classes
- Set up configuration management
- Establish logging infrastructure
- Phase 2: Service Layer
- Implement PromptGenerator
- Create Anthropic service wrapper
- Set up file handling utilities
- Phase 3: API Layer
- Migrate existing endpoints
- Implement error handling
- Add middleware
- Phase 4: Testing and Integration
- Unit tests for new components
- Integration tests for API endpoints
- Performance testing with large documents
Key Benefits
- Improved Maintainability
- Clear separation of concerns
- Modular components
- Better error handling
- Comprehensive logging
- Enhanced Functionality
- Robust chunk relationship tracking
- Improved context preservation
- Better document management
- Future-Proofing
- Easy to extend with new features
- Simple to modify prompt structure
- Flexible document processing pipeline
Migration Strategy
- Create new structure alongside existing code
- Implement new features in parallel
- Gradually migrate functionality
- Maintain existing endpoints throughout
- Add comprehensive tests
- Deploy with feature flags if needed
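The feature-flag step could be as simple as an environment switch between the two pipelines (the flag name and function names below are placeholders, not existing code):

```python
# Route traffic between the legacy and refactored pipelines with an env flag
import os

USE_NEW_PIPELINE = os.getenv("USE_NEW_PIPELINE", "false").lower() == "true"

def process_with_legacy_pipeline(content, filename, prompt):
    return "legacy"  # placeholder for the existing app.py code path

def process_with_new_pipeline(content, filename, prompt):
    return "new"  # placeholder for the refactored Document/ChunkManager path

def process_upload(content: str, filename: str, prompt: str):
    """Dispatch to whichever pipeline the flag selects."""
    if USE_NEW_PIPELINE:
        return process_with_new_pipeline(content, filename, prompt)
    return process_with_legacy_pipeline(content, filename, prompt)
```

This keeps the rollout reversible: flipping one environment variable reverts to the legacy path without a redeploy of old code.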
Would you like me to focus on implementing any specific part of this refactoring plan?