Phase 10: Production Readiness & Deployment

Objective: Optimize for production, add monitoring, implement security best practices, and create deployment configurations.


1. Performance Optimization

1.1 Database Optimization

  • Add database indexes:

    • Index on id (primary key, UUID)
    • Index on name (unique, human-readable identifier)
    • Index on owner (for RLS)
    • Index on modified (for sorting)
    • Index on foreign keys
    • Composite indexes for common queries
  • Implement query optimization:

    • Use selectinload() for relationships
    • Avoid N+1 queries
    • Add query logging in dev mode
    • Analyze slow queries

1.2 Caching Strategy

  • Implement multi-level caching:

    • L1: In-memory cache (per-request)
    • L2: Redis cache (shared)
    • Cache metadata (DocType schemas)
    • Cache permission checks
    • Cache frequently accessed documents
  • Add cache invalidation:

    • Invalidate on document update
    • Invalidate on permission change
    • Use cache tags for bulk invalidation

1.3 Connection Pooling

  • Configure SQLAlchemy pool:

    • Pool size: 20
    • Max overflow: 10
    • Pool timeout: 30s
    • Pool recycle: 3600s
  • Configure Redis pool:

    • Max connections: 50
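
As a sketch, the SQLAlchemy pool numbers above map onto engine keyword arguments (`pool_pre_ping` is an extra recommendation, not from the checklist; the DSN in the comment is a placeholder):

```python
# Pool settings from the checklist above, as SQLAlchemy engine kwargs.
ENGINE_KWARGS = {
    "pool_size": 20,        # persistent connections held open
    "max_overflow": 10,     # extra connections allowed under burst load
    "pool_timeout": 30,     # seconds to wait for a free connection
    "pool_recycle": 3600,   # recycle connections older than 1 hour
    "pool_pre_ping": True,  # assumption: drop stale connections before use
}

# Typical usage (requires sqlalchemy plus an async driver such as asyncpg):
# from sqlalchemy.ext.asyncio import create_async_engine
# engine = create_async_engine("postgresql+asyncpg://...", **ENGINE_KWARGS)
```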

1.4 Async Optimization

  • Use asyncio.gather() for parallel operations
  • Avoid blocking calls in async functions
  • Use connection pools efficiently
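
A sketch of the gather pattern (the fetch function is a stand-in for an async repository call):

```python
import asyncio

async def fetch_doc(name: str) -> dict:
    # Placeholder for an async repository/database call
    await asyncio.sleep(0.01)
    return {"name": name}

async def load_dashboard() -> list[dict]:
    # Run independent I/O operations concurrently instead of sequentially
    return list(await asyncio.gather(
        fetch_doc("Invoice"),
        fetch_doc("Customer"),
        fetch_doc("Item"),
    ))

docs = asyncio.run(load_dashboard())
print([d["name"] for d in docs])  # ['Invoice', 'Customer', 'Item']
```

Three sequential awaits would take ~30ms here; gathered, the batch completes in ~10ms.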

2. Security Hardening

2.1 Input Validation

  • Validate all inputs with Pydantic
  • Sanitize user-provided data
  • Prevent SQL injection (use parameterized queries)
  • Prevent XSS (API-only, no HTML rendering)

2.2 Authentication Security

  • Support JWT token validation:

    • Verify signature
    • Check expiration
    • Validate issuer
  • Add rate limiting:

    • Limit login attempts
    • Limit API requests per user
    • Use Redis for rate limit storage

2.3 Authorization Security

  • Enforce permissions on all endpoints
  • Apply RLS to all queries
  • Validate object-level permissions
  • Log permission denials

2.4 Secrets Management

  • Never commit secrets to Git
  • Use environment variables
  • Support secret managers:
    • AWS Secrets Manager
    • Google Secret Manager
    • HashiCorp Vault

2.5 HTTPS & CORS

  • Enforce HTTPS in production
  • Configure CORS properly:
    • Whitelist allowed origins
    • Restrict methods and headers

3. Monitoring & Observability

3.1 Logging

  • Implement structured logging:

    • Use JSON format
    • Include request ID
    • Include user context
    • Log levels: DEBUG, INFO, WARNING, ERROR
  • Log important events:

    • API requests (with timing)
    • Database queries (in dev)
    • Permission checks
    • Errors and exceptions
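
A structured-logging sketch using only the stdlib (field names like `request_id` are illustrative; structlog or python-json-logger would be the usual production choice):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Context fields injected via `extra=...` (hypothetical names)
            "request_id": getattr(record, "request_id", None),
            "user": getattr(record, "user", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("m.api")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("request complete", extra={"request_id": "abc123", "user": "alice"})
```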

3.2 Metrics

  • Add Prometheus metrics:

    • Request count by endpoint
    • Request duration histogram
    • Database query duration
    • Cache hit/miss rate
    • Job queue length
  • Create /metrics endpoint
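
In practice prometheus_client supplies these primitives; to show what the /metrics endpoint actually serves, a hand-rolled sketch of a labelled counter in the Prometheus text exposition format:

```python
class Counter:
    """Tiny stand-in for a Prometheus counter with labels (sketch only)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: dict[tuple[tuple[str, str], ...], float] = {}

    def inc(self, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0.0) + 1.0

    def expose(self) -> str:
        # Prometheus text format: HELP/TYPE lines, then one sample per label set
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_str}}} {value}")
        return "\n".join(lines)

requests_total = Counter("http_requests_total", "Requests by endpoint")
requests_total.inc(endpoint="/api/doc", method="GET")
requests_total.inc(endpoint="/api/doc", method="GET")
print(requests_total.expose())
```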

3.3 Tracing

  • Implement OpenTelemetry:
    • Trace HTTP requests
    • Trace database queries
    • Trace background jobs
    • Export to Jaeger or Zipkin

3.4 Health Checks

  • Create /health endpoint:

    • Check database connection
    • Check Redis connection
    • Return 200 if healthy, 503 if not
  • Create /ready endpoint:

    • Check if app is ready to serve traffic
    • Check migrations are up to date
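
Both endpoints reduce to running a set of async checks and mapping the outcome to 200/503. A framework-agnostic sketch (the check bodies are placeholders for a real `SELECT 1` and Redis `PING`):

```python
import asyncio
from typing import Awaitable, Callable

async def run_health_checks(
    checks: dict[str, Callable[[], Awaitable[bool]]],
) -> tuple[int, dict[str, str]]:
    """Run all checks concurrently; 200 if every one passes, 503 otherwise."""
    results = await asyncio.gather(*(c() for c in checks.values()),
                                   return_exceptions=True)
    report = {
        name: "ok" if result is True else "fail"
        for name, result in zip(checks, results)
    }
    status = 200 if all(v == "ok" for v in report.values()) else 503
    return status, report

async def check_database() -> bool:
    return True  # placeholder: SELECT 1 against the pool

async def check_redis() -> bool:
    return True  # placeholder: PING

status, report = asyncio.run(
    run_health_checks({"database": check_database, "redis": check_redis})
)
```

An HTTP handler for /health would simply return `report` as the body with `status` as the response code.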

4. Error Handling & Resilience

4.1 Graceful Degradation

  • Handle database failures gracefully
  • Handle Redis failures (fallback to no cache)
  • Retry transient errors
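
A retry-with-backoff sketch for the last bullet (the error types treated as transient are an assumption; tune them per adapter):

```python
import asyncio
import random

async def retry(func, attempts: int = 3, base_delay: float = 0.1):
    """Retry a zero-arg async callable on transient errors, backing off."""
    for attempt in range(attempts):
        try:
            return await func()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff with jitter: ~0.1s, ~0.2s, ~0.4s, ...
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            await asyncio.sleep(delay)
```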

4.2 Circuit Breaker

  • Implement circuit breaker for external services:
    • Webhook calls
    • Email sending
    • Print service (Gotenberg)

4.3 Timeouts

  • Set timeouts for all operations:
    • HTTP requests: 30s
    • Database queries: 10s
    • Background jobs: 300s
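
Enforcing such a budget around any awaitable is one `asyncio.wait_for` call (the timeout is shortened here so the demo runs instantly):

```python
import asyncio

async def slow_query() -> str:
    await asyncio.sleep(60)  # simulates a query that blows its budget
    return "rows"

async def main() -> str:
    try:
        # Database budget from the list above is 10s; demo uses 0.05s
        return await asyncio.wait_for(slow_query(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(main()))  # timed out
```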

5. Deployment Configurations

5.1 Docker

  • Create production Dockerfile:

    ```dockerfile
    FROM python:3.12-slim
    # uv is not in the base image; copy in the static binary (official uv pattern)
    COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
    WORKDIR /app
    COPY . .
    RUN uv pip install --system --no-cache-dir .
    CMD ["m", "start", "--host", "0.0.0.0"]
    ```
  • Multi-stage build for smaller images

  • Use non-root user

  • Add health check

5.2 Docker Compose

  • Create docker-compose.yml:

    • API service
    • Worker service
    • PostgreSQL
    • Redis
    • Gotenberg (for PDFs)
  • Add volumes for persistence

  • Add networks for isolation

5.3 Kubernetes

  • Create Kubernetes manifests:

    • Deployment for API
    • Deployment for Worker
    • StatefulSet for PostgreSQL
    • StatefulSet for Redis
    • Service for API
    • Ingress for routing
  • Add resource limits:

    • CPU: 500m - 2000m
    • Memory: 512Mi - 2Gi
  • Add liveness and readiness probes

5.4 Cloud Platforms

  • Create deployment guides for:
    • Google Cloud Run
    • AWS ECS/Fargate
    • Azure Container Apps
    • Heroku
    • Railway
    • Render

6. CI/CD Pipeline

6.1 GitLab CI

  • Configure .gitlab-ci.yml (already exists in repo):

    ```yaml
    stages:
      - test
      - build
      - deploy

    test:
      stage: test
      script:
        - uv sync
        - uv run pytest
        - uv run mypy --strict .
        - uv run ruff check .
      rules:
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"

    build:
      stage: build
      script:
        - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
        - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
      rules:
        - if: $CI_COMMIT_BRANCH == "main"

    deploy-staging:
      stage: deploy
      script:
        - kubectl set image deployment/app app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
      environment: staging
      rules:
        - if: $CI_COMMIT_BRANCH == "main"

    deploy-production:
      stage: deploy
      script:
        - kubectl set image deployment/app app=$CI_REGISTRY_IMAGE:$CI_COMMIT_TAG
      environment: production
      rules:
        - if: $CI_COMMIT_TAG =~ /^v.*/
    ```

6.2 Pre-commit Hooks

  • Setup pre-commit:
    • Run ruff formatter
    • Run mypy
    • Run tests

6.3 Release Versioning & Publishing

  • Adopt Semantic Versioning (SemVer):

    • MAJOR.MINOR.PATCH format
    • Document breaking change policy
    • Use 0.x.y for pre-1.0 releases
  • CHANGELOG management:

    • Create CHANGELOG.md following Keep a Changelog format
    • Sections: Added, Changed, Deprecated, Removed, Fixed, Security
    • Option: Use git-cliff for automation
    • Link CHANGELOG to documentation site
  • Conventional Commits (optional but recommended):

    • feat: → MINOR bump
    • fix: → PATCH bump
    • feat!: or BREAKING CHANGE: → MAJOR bump
    • Setup commitlint for enforcement
  • Release workflow (add to .gitlab-ci.yml):

    ```yaml
    release:
      stage: deploy
      script:
        - uv build
        - uv publish
      variables:
        UV_PUBLISH_TOKEN: $PYPI_TOKEN
      rules:
        - if: $CI_COMMIT_TAG =~ /^v.*/

    release-notes:
      stage: deploy
      image: registry.gitlab.com/gitlab-org/release-cli:latest
      script:
        - echo "Creating release for $CI_COMMIT_TAG"
      release:
        tag_name: $CI_COMMIT_TAG
        description: ./CHANGELOG.md
      rules:
        - if: $CI_COMMIT_TAG =~ /^v.*/
    ```
  • Version sources:

    • Single source of truth in pyproject.toml
    • Use dynamic versioning from git tags OR __version__ in package
    • Expose version via m --version CLI
  • Pre-release versions:

    • Support alpha/beta/rc: 1.0.0-alpha.1, 1.0.0-beta.2, 1.0.0-rc.1
    • Publish pre-releases to TestPyPI first
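
The Conventional Commit → SemVer mapping above can be sketched as a small helper (a sketch of the stated rules, not of any particular release tool):

```python
def bump(version: str, commit_subject: str, body: str = "") -> str:
    """Map a Conventional Commit to a SemVer bump per the rules above."""
    major, minor, patch = (int(p) for p in version.split("."))
    header = commit_subject.split(":", 1)[0]  # e.g. "feat", "fix(api)", "feat!"
    if header.endswith("!") or "BREAKING CHANGE:" in body:
        return f"{major + 1}.0.0"   # breaking change -> MAJOR
    if header.startswith("feat"):
        return f"{major}.{minor + 1}.0"          # feat: -> MINOR
    if header.startswith("fix"):
        return f"{major}.{minor}.{patch + 1}"    # fix: -> PATCH
    return version  # chore:, docs:, etc. trigger no release

assert bump("1.2.3", "feat: add webhooks") == "1.3.0"
assert bump("1.2.3", "fix: off-by-one in pagination") == "1.2.4"
assert bump("1.2.3", "feat!: drop legacy API") == "2.0.0"
```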

6.4 Git Branching Strategy

[!NOTE] Framework M's hexagonal architecture (adapters are swappable, not forked) enables a simpler branching model than Frappe's version-XX branches. Breaking changes are isolated to Protocol interfaces, not scattered across the codebase.

  • Trunk-Based Development (recommended for M):

    ```
    main (default)
    ├── feature/xyz        # Short-lived feature branches
    ├── fix/issue-123      # Bug fix branches
    └── Tags: v0.1.0, v0.2.0, v1.0.0, v1.0.1, v1.1.0 ...
    ```
  • Branch naming conventions:

    • main — Always deployable, latest development
    • feature/<name> — New features (merge to main via PR)
    • fix/<issue-id> — Bug fixes (merge to main via PR)
    • release/v<major>.<minor> — Created only when LTS support needed
  • Release branches (only for LTS):

    ```
    main
    ├── v1.0.0 (tag)
    └── release/v1          ← Created when v2.0.0 is released
        ├── v1.0.1 (backport tag)
        └── v1.0.2 (backport tag)
    ```
  • LTS (Long-Term Support) Policy:

    • Support last 2 major versions (not 3 like Frappe — M is simpler)
    • Security fixes: Backport to all supported versions
    • Bug fixes: Backport to latest minor of each supported major
    • Features: Only on main, no backports
  • Backporting workflow:

    • Fix on main first (always forward-fix)
    • Cherry-pick to release/vX branch
    • Tag patch release: vX.Y.Z
    • Tool: Use git cherry-pick -x to track source commit
  • Why simpler than Frappe?

    • Hexagonal architecture = Adapters can be versioned independently
    • No DB schema in core = Migrations are app-specific, not framework-specific
    • Protocol stability = Breaking changes are explicit, not hidden
    • Apps can pin framework version = No forced upgrade cascades

7. Database Migration Strategy

[!IMPORTANT] Migrations are generated by Alembic (Phase 02). This section defines when and how to apply them safely.

7.1 Change Classification

| Category | Examples | Auto-Apply (Dev)? | Production Strategy |
|---|---|---|---|
| Safe | Add nullable column, add table, add index | ✅ Yes | Apply directly |
| Review | Add non-nullable column, change type (compatible) | ⚠️ Generate only | Review → Staging → Prod |
| Dangerous | Drop column, rename column, change type (incompatible) | ❌ No | Manual migration with data handling |

7.2 Safe Changes (Auto-Migration in Dev Mode)

These changes are automatically applied in development (m start --dev):

  • Add table: New DocType → new table
  • Add nullable column: New optional field → ALTER TABLE ADD COLUMN ... NULL
  • Add index: Meta.indexes → CREATE INDEX
  • Add foreign key: Link field → ALTER TABLE ADD CONSTRAINT
```shell
# Dev mode: auto-applies safe migrations
m start --dev   # Detects changes, generates & applies migration

# Production: never auto-apply
m start         # Requires explicit `m migrate` first
```

7.3 Review-Required Changes

These changes generate migrations but require human review:

  • Add non-nullable column without default:

    • Migration will fail if table has existing rows
    • Solution: Add with default → backfill → remove default
  • Change column type (compatible):

    • str → Text (widening): Usually safe
    • int → bigint (widening): Usually safe
    • Review for edge cases
```
# CLI shows warning for review-required changes
$ m migrate:create add_phone_field

⚠️ Review required: 'phone' is non-nullable without default
   Existing rows will cause migration to fail.

Recommendation:
  1. Add field with default: phone: str = ""
  2. Run migration
  3. Backfill data
  4. Remove default if desired
```

7.4 Dangerous Changes (Manual Handling Required)

These changes require explicit data migration code:

Drop Column

```python
# Migration: drop_old_field.py
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Step 1: Ensure no code references the column
    # Step 2: Drop the column
    op.drop_column('invoice', 'old_field')

def downgrade():
    # Re-add column (data is lost!)
    op.add_column('invoice', sa.Column('old_field', sa.String(255)))
```

Rename Column

```python
# Migration: rename_customer_to_party.py
from alembic import op

def upgrade():
    # Rename preserves data
    op.alter_column('invoice', 'customer', new_column_name='party')

def downgrade():
    op.alter_column('invoice', 'party', new_column_name='customer')
```

Type Change (Incompatible)

```python
# Migration: age_str_to_int.py
from alembic import op
import sqlalchemy as sa
from sqlalchemy import text

def upgrade():
    # Step 1: Add new column
    op.add_column('person', sa.Column('age_int', sa.Integer(), nullable=True))

    # Step 2: Migrate data with transformation
    connection = op.get_bind()
    connection.execute(text("""
        UPDATE person
        SET age_int = CASE
            WHEN age ~ '^[0-9]+$' THEN age::integer
            ELSE NULL
        END
    """))

    # Step 3: Drop old column
    op.drop_column('person', 'age')

    # Step 4: Rename new column
    op.alter_column('person', 'age_int', new_column_name='age')

def downgrade():
    # Reverse transformation
    op.alter_column('person', 'age', type_=sa.String(10),
                    postgresql_using='age::varchar')
```

7.5 Data Migration Patterns

Pattern A: Backfill in Migration (Small Tables)

```python
from alembic import op
import sqlalchemy as sa
from sqlalchemy import text

def upgrade():
    # For tables under ~100k rows: inline backfill is fine
    op.add_column('user', sa.Column('full_name', sa.String(255)))
    connection = op.get_bind()
    # Note: `user` is a reserved word in PostgreSQL, hence the quoting
    connection.execute(text("""
        UPDATE "user" SET full_name = first_name || ' ' || last_name
    """))
```

Pattern B: Background Job (Large Tables)

```python
# Migration: add_computed_field.py
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Step 1: Add nullable column
    op.add_column('invoice', sa.Column('tax_amount', sa.Numeric(), nullable=True))
    # Step 2: Document: run `m job:run backfill_tax_amounts` after migration

# Job: jobs/backfill_tax_amounts.py
@job
async def backfill_tax_amounts(batch_size: int = 1000):
    """Backfill tax_amount for existing invoices."""
    async with uow_factory() as uow:
        invoices = await repo.list(
            filters=[Filter("tax_amount", "is", None)],
            limit=batch_size,
        )
        for inv in invoices.items:
            inv.tax_amount = inv.total * inv.tax_rate
            await repo.save(uow.session, inv)
        await uow.commit()
```

Pattern C: Two-Phase Migration (Zero Downtime)

```
Phase 1 (Deploy v1.1):
  - Add new column (nullable)
  - Code writes to BOTH old and new columns
  - Background job backfills new column

Phase 2 (Deploy v1.2):
  - Code reads from new column only
  - Drop old column in migration
```

7.6 Zero-Downtime Migrations

  • Never drop columns in the same deploy as code changes
  • Always add columns as nullable first
  • Always backfill data before making non-nullable
  • Always deploy code changes before dropping columns
Deploy Timeline:

```
v1.0: old_field exists, code uses old_field
v1.1: new_field added (nullable), code writes to BOTH
      │ ← Run backfill job
v1.2: code reads from new_field only
v1.3: drop old_field migration
```

7.7 Migration Workflow (Production)

```shell
# 1. Generate migration (dev environment)
m migrate:create add_user_phone_field

# 2. Review generated migration
cat alembic/versions/xxxx_add_user_phone_field.py

# 3. Test on local/staging
m migrate
m migrate:rollback   # Verify rollback works

# 4. Commit migration file to Git
git add alembic/versions/xxxx_add_user_phone_field.py
git commit -m "feat: add phone field to user"

# 5. Deploy to staging
# CI/CD runs: m migrate

# 6. Verify staging works

# 7. Deploy to production
# CI/CD runs: m migrate
```

7.8 CLI Commands Reference

| Command | Description |
|---|---|
| `m migrate` | Apply all pending migrations |
| `m migrate:status` | Show current migration state |
| `m migrate:create <name>` | Generate new migration |
| `m migrate:rollback` | Roll back the last migration |
| `m migrate:history` | Show migration history |
| `m migrate:heads` | Show latest migration(s) |

8. Backup & Recovery

8.1 Database Backups

  • Automated daily backups
  • Store backups in S3 or equivalent
  • Retention policy (30 days)
  • Test restore process

8.2 Disaster Recovery

  • Document recovery procedures
  • Test recovery regularly
  • RTO (Recovery Time Objective): 4 hours
  • RPO (Recovery Point Objective): 24 hours

9. Scaling Strategy

9.1 Horizontal Scaling

  • API servers: Stateless, scale horizontally
  • Workers: Scale based on queue length
  • Database: Use read replicas for read-heavy workloads

9.2 Load Balancing

  • Use load balancer (Nginx, ALB)
  • Health check endpoints
  • Session affinity not required (stateless)

10. Documentation

10.1 Deployment Guide

  • Write step-by-step deployment guide
  • Include environment variables reference
  • Include troubleshooting section

10.2 Operations Runbook

  • Document common operations:
    • Scaling up/down
    • Running migrations
    • Backup and restore
    • Monitoring and alerts

11. Testing

11.1 Load Testing

  • Use Locust or k6 for load testing

  • Test scenarios:

    • 1000 concurrent users
    • 10,000 requests/minute
    • Sustained load for 1 hour
  • Measure:

    • Response times (p50, p95, p99)
    • Error rate
    • Resource usage

11.2 Security Testing

  • Run security scans:
    • OWASP ZAP
    • Dependency vulnerability scan
    • Container image scan

Validation Checklist

Final checks before production:

  • All tests pass (unit, integration, E2E)
  • Load testing shows acceptable performance
  • Security scans show no critical issues
  • Monitoring and alerts are configured
  • Backups are automated and tested
  • Documentation is complete
  • Deployment pipeline works
  • Health checks return 200
  • Logs are structured and useful
  • Secrets are not in code

Anti-Patterns to Avoid

❌ Don't: Deploy without testing ✅ Do: Test thoroughly in staging

❌ Don't: Ignore monitoring ✅ Do: Set up alerts for critical metrics

❌ Don't: Hardcode configuration ✅ Do: Use environment variables

❌ Don't: Run as root in containers ✅ Do: Use non-root user

❌ Don't: Skip backups ✅ Do: Automate and test backups regularly
Don't: Skip backups ✅ Do: Automate and test backups regularly