# Phase 10: Production Readiness & Deployment

**Objective:** Optimize for production, add monitoring, implement security best practices, and create deployment configurations.
## 1. Performance Optimization

### 1.1 Database Optimization

- **Add database indexes:**
  - Index on `id` (primary key, UUID)
  - Index on `name` (unique, human-readable identifier)
  - Index on `owner` (for RLS)
  - Index on `modified` (for sorting)
  - Index on foreign keys
  - Composite indexes for common queries
- **Implement query optimization** (see the sketch below):
  - Use `selectinload` for relationships
  - Avoid N+1 queries
  - Add query logging in dev mode
  - Analyze slow queries
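A minimal sketch of both points, assuming SQLAlchemy 2.0 declarative models; the `Invoice`/`InvoiceItem` tables and index names here are illustrative, not part of the framework:

```python
from datetime import datetime

from sqlalchemy import ForeignKey, Index, String, select
from sqlalchemy.orm import (DeclarativeBase, Mapped, mapped_column,
                            relationship, selectinload)

class Base(DeclarativeBase):
    pass

class Invoice(Base):
    __tablename__ = "invoice"

    id: Mapped[str] = mapped_column(String(36), primary_key=True)   # UUID
    name: Mapped[str] = mapped_column(String(140), unique=True)     # human-readable id
    owner: Mapped[str] = mapped_column(String(140), index=True)     # RLS filter column
    modified: Mapped[datetime] = mapped_column(index=True)          # default sort key
    items: Mapped[list["InvoiceItem"]] = relationship(back_populates="invoice")

    # Composite index for the most common list query: "my documents, newest first"
    __table_args__ = (Index("ix_invoice_owner_modified", "owner", "modified"),)

class InvoiceItem(Base):
    __tablename__ = "invoice_item"

    id: Mapped[str] = mapped_column(String(36), primary_key=True)
    parent_id: Mapped[str] = mapped_column(ForeignKey("invoice.id"), index=True)
    invoice: Mapped[Invoice] = relationship(back_populates="items")

# selectinload fetches the items for a whole page of invoices in ONE extra
# SELECT ... WHERE parent_id IN (...), instead of one query per invoice (N+1).
stmt = (
    select(Invoice)
    .options(selectinload(Invoice.items))
    .order_by(Invoice.modified.desc())
    .limit(20)
)
```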
### 1.2 Caching Strategy

- **Implement multi-level caching:**
  - L1: In-memory cache (per-request)
  - L2: Redis cache (shared)
  - Cache metadata (DocType schemas)
  - Cache permission checks
  - Cache frequently accessed documents
- **Add cache invalidation** (see the sketch below):
  - Invalidate on document update
  - Invalidate on permission change
  - Use cache tags for bulk invalidation
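A minimal sketch of the L2 layer with tag-based invalidation, using `redis.asyncio`; the key naming scheme and the `load_schema_from_db` loader are illustrative assumptions:

```python
import json

import redis.asyncio as redis

r = redis.Redis(decode_responses=True)

async def get_schema(doctype: str) -> dict | None:
    """L2 lookup for DocType metadata; fall through to the DB on a miss."""
    key = f"schema:{doctype}"
    cached = await r.get(key)
    if cached is not None:
        return json.loads(cached)
    schema = await load_schema_from_db(doctype)  # hypothetical loader
    if schema is not None:
        await r.set(key, json.dumps(schema), ex=300)
        # Record the key under a tag so the whole group can be invalidated at once
        await r.sadd("tag:schemas", key)
    return schema

async def invalidate_all_schemas() -> None:
    """Bulk invalidation via a cache tag: delete every key recorded under it."""
    keys = await r.smembers("tag:schemas")
    if keys:
        await r.delete(*keys)
    await r.delete("tag:schemas")
```

The L1 per-request layer can be a plain dict held in a `contextvars.ContextVar`, checked before Redis; it needs no invalidation because it dies with the request.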
### 1.3 Connection Pooling

- **Configure the SQLAlchemy pool** (see the sketch below):
  - Pool size: 20
  - Max overflow: 10
  - Pool timeout: 30s
  - Pool recycle: 3600s
- **Configure the Redis pool:**
  - Max connections: 50
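A sketch of wiring those numbers into SQLAlchemy and redis-py; the connection URLs are placeholders:

```python
import redis.asyncio as redis
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://app:secret@db:5432/app",  # placeholder URL
    pool_size=20,        # steady-state connections held open
    max_overflow=10,     # extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection
    pool_recycle=3600,   # recycle connections hourly to avoid stale sockets
)

redis_pool = redis.ConnectionPool.from_url(
    "redis://cache:6379/0", max_connections=50
)
redis_client = redis.Redis(connection_pool=redis_pool)
```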
### 1.4 Async Optimization

- Use `asyncio.gather()` for parallel operations (see the sketch below)
- Avoid blocking calls in async functions
- Use connection pools efficiently
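For example (the `fetch_*` helpers are hypothetical):

```python
import asyncio

async def load_dashboard(user: str) -> dict:
    # The three awaits below are independent I/O, so run them concurrently:
    # total latency becomes the slowest call rather than the sum of all three.
    docs, perms, jobs = await asyncio.gather(
        fetch_recent_docs(user),    # hypothetical async helpers
        fetch_permissions(user),
        fetch_job_counts(),
    )
    return {"docs": docs, "permissions": perms, "jobs": jobs}
```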
## 2. Security Hardening

### 2.1 Input Validation
- Validate all inputs with Pydantic
- Sanitize user-provided data
- Prevent SQL injection (use parameterized queries)
- Prevent XSS (API-only, no HTML rendering)
### 2.2 Authentication Security

- **Support JWT token validation:**
  - Verify signature
  - Check expiration
  - Validate issuer
- **Add rate limiting** (see the sketch below):
  - Limit login attempts
  - Limit API requests per user
  - Use Redis for rate-limit storage
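A minimal fixed-window limiter backed by Redis `INCR`/`EXPIRE`; the key naming and limits are illustrative:

```python
import redis.asyncio as redis

r = redis.Redis()

async def allow_request(user: str, limit: int = 100, window: int = 60) -> bool:
    """Fixed-window limiter: at most `limit` requests per `window` seconds."""
    key = f"ratelimit:{user}"
    count = await r.incr(key)        # atomic increment
    if count == 1:
        await r.expire(key, window)  # first hit starts the window
    return count <= limit
```

Middleware would translate a `False` return into HTTP 429; login attempts would use a much smaller limit keyed by username plus client IP.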
### 2.3 Authorization Security

- Enforce permissions on all endpoints
- Apply RLS to all queries
- Validate object-level permissions
- Log permission denials
### 2.4 Secrets Management

- Never commit secrets to Git
- Use environment variables
- Support secret managers:
  - AWS Secrets Manager
  - Google Secret Manager
  - HashiCorp Vault
### 2.5 HTTPS & CORS

- Enforce HTTPS in production
- Configure CORS properly:
  - Whitelist allowed origins
  - Restrict methods and headers
## 3. Monitoring & Observability

### 3.1 Logging

- **Implement structured logging** (see the sketch below):
  - Use JSON format
  - Include request ID
  - Include user context
  - Log levels: DEBUG, INFO, WARNING, ERROR
- **Log important events:**
  - API requests (with timing)
  - Database queries (in dev)
  - Permission checks
  - Errors and exceptions
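One way to get this with structlog (a common choice, not mandated by this plan); the field names are illustrative:

```python
import logging

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # pull in request-scoped context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),        # one JSON object per line
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)

log = structlog.get_logger()

# Bind request-scoped fields once (e.g., in middleware); every log line
# emitted during the request then carries them automatically.
structlog.contextvars.bind_contextvars(request_id="abc-123", user="admin@example.com")
log.info("request_completed", path="/api/v1/invoice", duration_ms=42)
```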
### 3.2 Metrics

- **Add Prometheus metrics** (see the sketch below):
  - Request count by endpoint
  - Request duration histogram
  - Database query duration
  - Cache hit/miss rate
  - Job queue length
- **Create a `/metrics` endpoint**
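A sketch using `prometheus_client`; the metric names and labels are illustrative:

```python
from prometheus_client import Counter, Histogram, make_asgi_app

REQUEST_COUNT = Counter(
    "http_requests_total", "Requests by endpoint and status",
    ["endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request duration", ["endpoint"],
)

# Inside request-handling middleware (surrounding app not shown):
#   with REQUEST_LATENCY.labels(endpoint=path).time():
#       response = await handler(request)
#   REQUEST_COUNT.labels(endpoint=path, status=response.status_code).inc()

# Mount this ASGI app at /metrics to expose the scrape endpoint.
metrics_app = make_asgi_app()
```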
### 3.3 Tracing

- Implement OpenTelemetry:
  - Trace HTTP requests
  - Trace database queries
  - Trace background jobs
  - Export to Jaeger or Zipkin
### 3.4 Health Checks

- **Create a `/health` endpoint** (see the sketch below):
  - Check database connection
  - Check Redis connection
  - Return 200 if healthy, 503 if not
- **Create a `/ready` endpoint:**
  - Check the app is ready to serve traffic
  - Check migrations are up to date
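A minimal sketch assuming a FastAPI app, with `engine` and `redis_client` constructed as in §1.3:

```python
from fastapi import FastAPI, Response
from sqlalchemy import text

app = FastAPI()

@app.get("/health")
async def health(response: Response) -> dict:
    checks: dict[str, str] = {}
    try:
        async with engine.connect() as conn:   # engine from §1.3
            await conn.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "down"
    try:
        await redis_client.ping()              # redis client from §1.3
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"

    healthy = all(v == "ok" for v in checks.values())
    response.status_code = 200 if healthy else 503
    return {"status": "healthy" if healthy else "unhealthy", "checks": checks}
```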
## 4. Error Handling & Resilience

### 4.1 Graceful Degradation

- Handle database failures gracefully
- Handle Redis failures (fall back to running without cache)
- Retry transient errors
### 4.2 Circuit Breaker

- Implement a circuit breaker for external services (see the sketch below):
  - Webhook calls
  - Email sending
  - Print service (Gotenberg)
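A minimal in-process breaker sketch (a maintained circuit-breaker library could be used instead); the thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; probe again after `reset_after`s."""

    def __init__(self, threshold: int = 5, reset_after: float = 60.0) -> None:
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_after:
            # half-open: let one probe through; a failure reopens immediately
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False  # open: fail fast without calling the service

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

Each external service gets its own breaker: call `allow()` before the request, `record(True/False)` after, and skip (or queue for retry) when the breaker is open.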
### 4.3 Timeouts

- Set timeouts for all operations (see the sketch below):
  - HTTP requests: 30s
  - Database queries: 10s
  - Background jobs: 300s
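Sketches of wiring those budgets in, assuming httpx, asyncpg, and Python 3.11+; the URL is a placeholder:

```python
import asyncio

import httpx
from sqlalchemy.ext.asyncio import create_async_engine

# Outbound HTTP: 30s budget covering connect, read, write, and pool acquisition.
http_client = httpx.AsyncClient(timeout=httpx.Timeout(30.0))

# Database statements: 10s cap, passed through to the asyncpg driver.
engine = create_async_engine(
    "postgresql+asyncpg://app:secret@db/app",  # placeholder URL
    connect_args={"command_timeout": 10},
)

async def run_job(job_fn) -> None:
    # Background jobs: hard 300s ceiling (asyncio.timeout requires Python 3.11+).
    async with asyncio.timeout(300):
        await job_fn()
```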
## 5. Deployment Configurations

### 5.1 Docker

- **Create a production `Dockerfile`:**

  ```dockerfile
  FROM python:3.12-slim
  # uv is not in the base image; copy the static binary from its official image
  COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
  WORKDIR /app
  COPY . .
  RUN uv pip install --system --no-cache .
  CMD ["m", "start", "--host", "0.0.0.0"]
  ```

- Use a multi-stage build for smaller images
- Use a non-root user
- Add a health check
### 5.2 Docker Compose

- **Create `docker-compose.yml`** (see the sketch below) with:
  - API service
  - Worker service
  - PostgreSQL
  - Redis
  - Gotenberg (for PDFs)
- Add volumes for persistence
- Add networks for isolation
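A minimal sketch of such a compose file; the credentials are placeholders and the `m worker` command assumes a worker entrypoint by that name:

```yaml
services:
  api:
    build: .
    command: ["m", "start", "--host", "0.0.0.0"]
    ports: ["8000:8000"]
    environment:
      DATABASE_URL: postgresql+asyncpg://app:secret@db:5432/app
      REDIS_URL: redis://redis:6379/0
    depends_on: [db, redis]
  worker:
    build: .
    command: ["m", "worker"]   # hypothetical worker entrypoint
    depends_on: [db, redis]
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: app
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: redis:7
  gotenberg:
    image: gotenberg/gotenberg:8
volumes:
  pgdata:
```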
### 5.3 Kubernetes

- **Create Kubernetes manifests:**
  - Deployment for the API
  - Deployment for the Worker
  - StatefulSet for PostgreSQL
  - StatefulSet for Redis
  - Service for the API
  - Ingress for routing
- **Add resource limits:**
  - CPU: 500m to 2000m
  - Memory: 512Mi to 2Gi
- **Add liveness and readiness probes** (see the sketch below)
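An illustrative API Deployment fragment showing the limits and probes; the image name and port are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: m-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: m-api
  template:
    metadata:
      labels:
        app: m-api
    spec:
      containers:
        - name: api
          image: registry.example.com/m-api:v1.0.0  # placeholder image
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 2000m
              memory: 2Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            periodSeconds: 5
```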
### 5.4 Cloud Platforms

- Create deployment guides for:
  - Google Cloud Run
  - AWS ECS/Fargate
  - Azure Container Apps
  - Heroku
  - Railway
  - Render
## 6. CI/CD Pipeline

### 6.1 GitLab CI

- Configure `.gitlab-ci.yml` (already exists in the repo):

  ```yaml
  stages:
    - test
    - build
    - deploy

  test:
    stage: test
    script:
      - uv sync
      - uv run pytest
      - uv run mypy --strict .
      - uv run ruff check .
    rules:
      - if: $CI_PIPELINE_SOURCE == "merge_request_event"

  build:
    stage: build
    script:
      - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
      - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    rules:
      - if: $CI_COMMIT_BRANCH == "main"

  deploy-staging:
    stage: deploy
    script:
      - kubectl set image deployment/app app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    environment: staging
    rules:
      - if: $CI_COMMIT_BRANCH == "main"

  deploy-production:
    stage: deploy
    script:
      - kubectl set image deployment/app app=$CI_REGISTRY_IMAGE:$CI_COMMIT_TAG
    environment: production
    rules:
      - if: $CI_COMMIT_TAG =~ /^v.*/
  ```
### 6.2 Pre-commit Hooks

- Set up pre-commit (see the sketch below):
  - Run the ruff formatter
  - Run mypy
  - Run tests
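A possible `.pre-commit-config.yaml`; the ruff hook repo is the real upstream, while the `rev` pin and the local-hook wiring are illustrative:

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9   # pin to the current release
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
  - repo: local
    hooks:
      - id: mypy
        name: mypy --strict
        entry: uv run mypy --strict .
        language: system
        pass_filenames: false
      - id: pytest
        name: pytest (fast suite)
        entry: uv run pytest -x -q
        language: system
        pass_filenames: false
```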
### 6.3 Release Versioning & Publishing

- **Adopt Semantic Versioning (SemVer):**
  - `MAJOR.MINOR.PATCH` format
  - Document the breaking-change policy
  - Use `0.x.y` for pre-1.0 releases
- **CHANGELOG management:**
  - Create `CHANGELOG.md` following the Keep a Changelog format
  - Sections: Added, Changed, Deprecated, Removed, Fixed, Security
  - Option: use git-cliff for automation
  - Link the CHANGELOG to the documentation site
- **Conventional Commits (optional but recommended):**
  - `feat:` → MINOR bump
  - `fix:` → PATCH bump
  - `feat!:` or `BREAKING CHANGE:` → MAJOR bump
  - Set up commitlint for enforcement
- **Release workflow** (add to `.gitlab-ci.yml`):

  ```yaml
  release:
    stage: deploy
    script:
      - uv build
      - uv publish
    variables:
      UV_PUBLISH_TOKEN: $PYPI_TOKEN
    rules:
      - if: $CI_COMMIT_TAG =~ /^v.*/

  release-notes:
    stage: deploy
    image: registry.gitlab.com/gitlab-org/release-cli:latest
    script:
      - echo "Creating release for $CI_COMMIT_TAG"
    release:
      tag_name: $CI_COMMIT_TAG
      description: ./CHANGELOG.md
    rules:
      - if: $CI_COMMIT_TAG =~ /^v.*/
  ```

- **Version sources:**
  - Single source of truth in `pyproject.toml`
  - Use dynamic versioning from git tags OR `__version__` in the package
  - Expose the version via the `m --version` CLI
- **Pre-release versions:**
  - Support alpha/beta/rc: `1.0.0-alpha.1`, `1.0.0-beta.2`, `1.0.0-rc.1`
  - Publish pre-releases to TestPyPI first
### 6.4 Git Branching Strategy

> [!NOTE]
> Framework M's hexagonal architecture (adapters are swappable, not forked) enables a simpler branching model than Frappe's `version-XX` branches. Breaking changes are isolated to Protocol interfaces, not scattered across the codebase.

- **Trunk-Based Development (recommended for M):**

  ```text
  main (default)
  │
  ├── feature/xyz      # Short-lived feature branches
  ├── fix/issue-123    # Bug-fix branches
  │
  └── Tags: v0.1.0, v0.2.0, v1.0.0, v1.0.1, v1.1.0 ...
  ```

- **Branch naming conventions:**
  - `main` — always deployable, latest development
  - `feature/<name>` — new features (merge to main via PR)
  - `fix/<issue-id>` — bug fixes (merge to main via PR)
  - `release/v<major>.<minor>` — created only when LTS support is needed
- **Release branches (only for LTS):**

  ```text
  main
  │
  ├─── v1.0.0 (tag)
  │
  └─── release/v1   ← Created when v2.0.0 is released
       │
       ├─── v1.0.1 (backport tag)
       └─── v1.0.2 (backport tag)
  ```

- **LTS (Long-Term Support) policy:**
  - Support the last 2 major versions (not 3 like Frappe — M is simpler)
  - Security fixes: backport to all supported versions
  - Bug fixes: backport to the latest minor of each supported major
  - Features: only on `main`, no backports
- **Backporting workflow:**
  - Fix on `main` first (always forward-fix)
  - Cherry-pick to the `release/vX` branch
  - Tag a patch release: `vX.Y.Z`
  - Tool: use `git cherry-pick -x` to track the source commit
- **Why simpler than Frappe?**
  - Hexagonal architecture = adapters can be versioned independently
  - No DB schema in core = migrations are app-specific, not framework-specific
  - Protocol stability = breaking changes are explicit, not hidden
  - Apps can pin the framework version = no forced upgrade cascades
## 7. Database Migration Strategy

> [!IMPORTANT]
> Migrations are generated by Alembic (Phase 02). This section defines when and how to apply them safely.

### 7.1 Change Classification

| Category | Examples | Auto-Apply (Dev)? | Production Strategy |
|---|---|---|---|
| Safe | Add nullable column, add table, add index | ✅ Yes | Apply directly |
| Review | Add non-nullable column, change type (compatible) | ⚠️ Generate only | Review → Staging → Prod |
| Dangerous | Drop column, rename column, change type (incompatible) | ❌ No | Manual migration with data handling |
### 7.2 Safe Changes (Auto-Migration in Dev Mode)

These changes are automatically applied in development (`m start --dev`):

- **Add table:** new DocType → new table
- **Add nullable column:** new optional field → `ALTER TABLE ADD COLUMN ... NULL`
- **Add index:** `Meta.indexes` → `CREATE INDEX`
- **Add foreign key:** Link field → `ALTER TABLE ADD CONSTRAINT`

```bash
# Dev mode: auto-applies safe migrations
m start --dev   # Detects changes, generates & applies migration

# Production: never auto-apply
m start         # Requires explicit `m migrate` first
```
### 7.3 Review-Required Changes

These changes generate migrations but require human review:

- **Add a non-nullable column without a default:**
  - The migration will fail if the table has existing rows
  - Solution: add with a default → backfill → remove the default
- **Change a column type (compatible):**
  - `str` → `Text` (widening): usually safe
  - `int` → `bigint` (widening): usually safe
  - Review for edge cases

```text
# CLI shows a warning for review-required changes
$ m migrate:create add_phone_field
⚠️  Review required: 'phone' is non-nullable without default
    Existing rows will cause migration to fail.
    Recommendation:
      1. Add field with default: phone: str = ""
      2. Run migration
      3. Backfill data
      4. Remove default if desired
```
### 7.4 Dangerous Changes (Manual Handling Required)

These changes require explicit data migration code:

**Drop Column**

```python
# Migration: drop_old_field.py
import sqlalchemy as sa
from alembic import op

def upgrade():
    # Step 1: Ensure no code references the column
    # Step 2: Drop the column
    op.drop_column('invoice', 'old_field')

def downgrade():
    # Re-add the column (data is lost!)
    op.add_column('invoice', sa.Column('old_field', sa.String(255)))
```

**Rename Column**

```python
# Migration: rename_customer_to_party.py
from alembic import op

def upgrade():
    # Rename preserves data
    op.alter_column('invoice', 'customer', new_column_name='party')

def downgrade():
    op.alter_column('invoice', 'party', new_column_name='customer')
```

**Type Change (Incompatible)**

```python
# Migration: age_str_to_int.py
import sqlalchemy as sa
from alembic import op
from sqlalchemy import text

def upgrade():
    # Step 1: Add the new column
    op.add_column('person', sa.Column('age_int', sa.Integer(), nullable=True))

    # Step 2: Migrate data with a transformation
    connection = op.get_bind()
    connection.execute(text("""
        UPDATE person
        SET age_int = CASE
            WHEN age ~ '^[0-9]+$' THEN age::integer
            ELSE NULL
        END
    """))

    # Step 3: Drop the old column
    op.drop_column('person', 'age')

    # Step 4: Rename the new column
    op.alter_column('person', 'age_int', new_column_name='age')

def downgrade():
    # Reverse transformation (lossy: non-numeric originals are gone)
    op.alter_column('person', 'age', type_=sa.String(10),
                    postgresql_using='age::varchar')
```
### 7.5 Data Migration Patterns

**Pattern A: Backfill in Migration (Small Tables)**

```python
import sqlalchemy as sa
from alembic import op
from sqlalchemy import text

def upgrade():
    # For tables under ~100k rows: inline backfill
    op.add_column('user', sa.Column('full_name', sa.String(255)))
    connection = op.get_bind()
    # "user" is a reserved word in PostgreSQL, so quote the table name
    connection.execute(text("""
        UPDATE "user" SET full_name = first_name || ' ' || last_name
    """))
```

**Pattern B: Background Job (Large Tables)**

```python
# Migration: add_computed_field.py
import sqlalchemy as sa
from alembic import op

def upgrade():
    # Step 1: Add a nullable column
    op.add_column('invoice', sa.Column('tax_amount', sa.Numeric(), nullable=True))
    # Step 2: Document: run `m job:run backfill_tax_amounts` after the migration

# Job: jobs/backfill_tax_amounts.py
@job
async def backfill_tax_amounts(batch_size: int = 1000):
    """Backfill tax_amount for existing invoices."""
    async with uow_factory() as uow:
        invoices = await repo.list(
            filters=[Filter("tax_amount", "is", None)],
            limit=batch_size,
        )
        for inv in invoices.items:
            inv.tax_amount = inv.total * inv.tax_rate
            await repo.save(uow.session, inv)
        await uow.commit()
```
**Pattern C: Two-Phase Migration (Zero Downtime)**

- **Phase 1 (deploy v1.1):**
  - Add the new column (nullable)
  - Code writes to BOTH old and new columns
  - A background job backfills the new column
- **Phase 2 (deploy v1.2):**
  - Code reads from the new column only
  - Drop the old column in a migration
### 7.6 Zero-Downtime Migrations

- Never drop columns in the same deploy as code changes
- Always add columns as nullable first
- Always backfill data before making a column non-nullable
- Always deploy code changes before dropping columns

Deploy timeline:

```text
v1.0: old_field exists, code uses old_field
  │
v1.1: new_field added (nullable), code writes to BOTH
  │   ← Run backfill job
v1.2: code reads from new_field only
  │
v1.3: drop old_field migration
```
### 7.7 Migration Workflow (Production)

```bash
# 1. Generate the migration (dev environment)
m migrate:create add_user_phone_field

# 2. Review the generated migration
cat alembic/versions/xxxx_add_user_phone_field.py

# 3. Test on local/staging
m migrate
m migrate:rollback   # Verify rollback works

# 4. Commit the migration file to Git
git add alembic/versions/xxxx_add_user_phone_field.py
git commit -m "feat: add phone field to user"

# 5. Deploy to staging
#    CI/CD runs: m migrate

# 6. Verify staging works

# 7. Deploy to production
#    CI/CD runs: m migrate
```
### 7.8 CLI Commands Reference

| Command | Description |
|---|---|
| `m migrate` | Apply all pending migrations |
| `m migrate:status` | Show current migration state |
| `m migrate:create <name>` | Generate a new migration |
| `m migrate:rollback` | Roll back the last migration |
| `m migrate:history` | Show migration history |
| `m migrate:heads` | Show latest migration(s) |
## 8. Backup & Recovery

### 8.1 Database Backups

- Automated daily backups (see the sketch below)
- Store backups in S3 or equivalent
- Retention policy (30 days)
- Test the restore process
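An illustrative nightly backup script; the bucket name is a placeholder and `DATABASE_URL` is assumed to be set in the environment:

```bash
#!/usr/bin/env bash
# Nightly backup sketch: dump in custom format, ship to S3.
# Schedule via cron or a Kubernetes CronJob.
set -euo pipefail

STAMP=$(date +%F)
DUMP="/tmp/app-${STAMP}.dump"

pg_dump "$DATABASE_URL" --format=custom --file="$DUMP"
aws s3 cp "$DUMP" "s3://example-backups/db/app-${STAMP}.dump"
rm -f "$DUMP"

# 30-day retention is simplest as an S3 lifecycle rule on the bucket.
# Test restores regularly, e.g.: pg_restore --dbname="$RESTORE_URL" app-YYYY-MM-DD.dump
```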
### 8.2 Disaster Recovery

- Document recovery procedures
- Test recovery regularly
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 24 hours
## 9. Scaling Strategy

### 9.1 Horizontal Scaling
- API servers: Stateless, scale horizontally
- Workers: Scale based on queue length
- Database: Use read replicas for read-heavy workloads
### 9.2 Load Balancing

- Use load balancer (Nginx, ALB)
- Health check endpoints
- Session affinity not required (stateless)
## 10. Documentation

### 10.1 Deployment Guide
- Write a step-by-step deployment guide
- Include an environment-variables reference
- Include a troubleshooting section
### 10.2 Operations Runbook

- Document common operations:
  - Scaling up/down
  - Running migrations
  - Backup and restore
  - Monitoring and alerts
## 11. Testing

### 11.1 Load Testing

- Use Locust or k6 for load testing (see the sketch below)
- **Test scenarios:**
  - 1,000 concurrent users
  - 10,000 requests/minute
  - Sustained load for 1 hour
- **Measure:**
  - Response times (p50, p95, p99)
  - Error rate
  - Resource usage
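A minimal Locust file matching the 1,000-user scenario; the endpoint paths and host are illustrative:

```python
# locustfile.py
from locust import HttpUser, between, task

class ApiUser(HttpUser):
    wait_time = between(1, 3)  # think time between tasks, in seconds

    @task(3)
    def list_invoices(self) -> None:
        self.client.get("/api/v1/invoice?limit=20")

    @task(1)
    def read_invoice(self) -> None:
        self.client.get("/api/v1/invoice/INV-00001")

# Run the 1,000-user scenario for an hour:
#   locust -f locustfile.py --headless --users 1000 --spawn-rate 50 \
#          --run-time 1h --host https://staging.example.com
```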
### 11.2 Security Testing

- Run security scans:
  - OWASP ZAP
  - Dependency vulnerability scan
  - Container image scan
## Validation Checklist

Final checks before production:
- All tests pass (unit, integration, E2E)
- Load testing shows acceptable performance
- Security scans show no critical issues
- Monitoring and alerts are configured
- Backups are automated and tested
- Documentation is complete
- Deployment pipeline works
- Health checks return 200
- Logs are structured and useful
- Secrets are not in code
## Anti-Patterns to Avoid

❌ **Don't:** Deploy without testing
✅ **Do:** Test thoroughly in staging

❌ **Don't:** Ignore monitoring
✅ **Do:** Set up alerts for critical metrics

❌ **Don't:** Hardcode configuration
✅ **Do:** Use environment variables

❌ **Don't:** Run as root in containers
✅ **Do:** Use a non-root user

❌ **Don't:** Skip backups
✅ **Do:** Automate and test backups regularly