RFC: DocType Discovery & Scanning

Status: Draft
Created: 2026-01-14
Package: framework-m-studio

Summary

Define how Studio discovers and scans DocTypes on the filesystem. The scan must be deterministic, fast enough for workspaces with thousands of DocTypes, and cache-free.

Requirements

Requirement	Rationale
Deterministic	Same filesystem state → same output order. Critical for diffs, testing, CI.
Fast	`<1s` for 1000 DocTypes. Developers expect instant feedback.
No caching	Fresh scan every time. Avoid stale data, cache invalidation bugs.
Recursive	Scan all nested directories under workspace root.

Discovery Algorithm

Step 1: Find Python Files (Fast Filesystem Scan)

# Use pathlib with glob - fast, stdlib, no external deps
from pathlib import Path

def find_python_files(root: Path) -> list[Path]:
    """Find all .py files, sorted for determinism."""
    # Exclude common non-doctype directories
    exclude_patterns = {
        "__pycache__", ".git", ".venv", "venv",
        "node_modules", ".pytest_cache", "dist", "build"
    }

    # Match on the path relative to root, so a workspace that itself
    # lives under a directory named e.g. "build" is not filtered out.
    files = [
        f for f in root.rglob("*.py")
        if not any(p in exclude_patterns for p in f.relative_to(root).parts)
    ]

    # Sort for determinism
    return sorted(files)
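
One caveat with the `rglob` approach above: it still descends into excluded directories such as `node_modules` before the filter discards their files. The sketch below shows a pruning alternative built on `os.walk`. The name `find_python_files_pruned` is hypothetical and not part of the proposed API, and whether pruning pays off would need measuring on a real workspace.

```python
import os
from pathlib import Path

EXCLUDED_DIRS = {
    "__pycache__", ".git", ".venv", "venv",
    "node_modules", ".pytest_cache", "dist", "build",
}

def find_python_files_pruned(root: Path) -> list[Path]:
    """Hypothetical variant: skip excluded directories instead of filtering afterwards."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        found.extend(Path(dirpath, name) for name in filenames if name.endswith(".py"))
    # Sort for determinism, as in find_python_files.
    return sorted(found)
```

Because the final `sorted` call is kept, the output order does not depend on the order in which `os.walk` visits directories.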

Step 2: Fast DocType Detection (AST, not Import)

Critical: Do NOT import modules to detect DocTypes. Importing executes module-level code, which is slow and can have side effects. Use AST scanning instead.

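import ast
import logging
from dataclasses import dataclass
from pathlib import Path

logger = logging.getLogger(__name__)

@dataclass
class DocTypeInfo:
    name: str
    file_path: Path
    line_number: int
    bases: list[str]

def scan_file_for_doctypes(file_path: Path) -> list[DocTypeInfo]:
    """Fast AST scan - no imports, no side effects."""
    try:
        source = file_path.read_text(encoding="utf-8")
        tree = ast.parse(source, filename=str(file_path))
    except (SyntaxError, UnicodeDecodeError, ValueError, OSError) as exc:
        # OSError covers permission-denied and otherwise unreadable files;
        # ValueError covers source that ast.parse rejects outright
        # (e.g. null bytes on some Python versions).
        logger.warning("Skipping %s: %s", file_path, exc)
        return []  # Skip invalid files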

    doctypes = []

    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Check if class inherits from DocType/BaseDocType
            base_names = [get_base_name(b) for b in node.bases]

            if is_doctype_class(base_names):
                doctypes.append(DocTypeInfo(
                    name=node.name,
                    file_path=file_path,
                    line_number=node.lineno,
                    bases=base_names,
                ))

    return doctypes

def get_base_name(base: ast.expr) -> str:
    """Extract base class name from AST node."""
    if isinstance(base, ast.Name):
        return base.id
    elif isinstance(base, ast.Attribute):
        return base.attr
    elif isinstance(base, ast.Subscript):
        # Handle Generic[T] style
        return get_base_name(base.value)
    return ""

def is_doctype_class(bases: list[str]) -> bool:
    """Check if any base indicates a DocType."""
    doctype_bases = {"DocType", "BaseDocType", "BaseChildDocType"}
    return bool(set(bases) & doctype_bases)
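
To make the name-based matching concrete, here is a small sketch of what `get_base_name` yields for a few base-class expressions. The sample source and the `framework_m.DocType` module path are made up for illustration, and `base_names_by_class` is a hypothetical helper.

```python
import ast

def get_base_name(base: ast.expr) -> str:
    """Same logic as above, repeated so the example is self-contained."""
    if isinstance(base, ast.Name):
        return base.id
    if isinstance(base, ast.Attribute):
        return base.attr
    if isinstance(base, ast.Subscript):
        return get_base_name(base.value)
    return ""

SOURCE = """
class A(DocType): ...
class B(framework_m.DocType): ...
class C(BaseDocType[T]): ...
class D(make_base()): ...
class E(A): ...
"""

def base_names_by_class(source: str) -> dict[str, list[str]]:
    """Map each class name to the simplified names of its bases."""
    tree = ast.parse(source)
    return {
        node.name: [get_base_name(b) for b in node.bases]
        for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
    }

print(base_names_by_class(SOURCE))
# → {'A': ['DocType'], 'B': ['DocType'], 'C': ['BaseDocType'], 'D': [''], 'E': ['A']}
```

Note that `E` inherits from `A`, which is itself a DocType, but its only base name is `A`. With the check as written, only classes that name `DocType`, `BaseDocType`, or `BaseChildDocType` directly in their bases are detected.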

Step 3: Parallel Scanning (Optional, for Large Codebases)

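import asyncio
from concurrent.futures import ProcessPoolExecutor

def scan_doctypes_sequential(files: list[Path]) -> list[DocTypeInfo]:
    """Scan files one after another in the current process."""
    doctypes = [dt for f in files for dt in scan_file_for_doctypes(f)]
    return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))

async def scan_doctypes_parallel(
    root: Path,
    max_workers: int = 4
) -> list[DocTypeInfo]:
    """Parallel scan for large codebases."""
    files = find_python_files(root)

    # For small counts, sequential is faster (no process overhead)
    if len(files) < 100:
        return scan_doctypes_sequential(files)

    # Parallel for large codebases. Submit the module-level function
    # directly: a lambda cannot be pickled for a process pool, and wrapping
    # the whole loop in one callable would run it in a single worker.
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = await asyncio.gather(*(
            loop.run_in_executor(executor, scan_file_for_doctypes, f)
            for f in files
        ))

    # Flatten and sort for determinism
    doctypes = [dt for file_results in results for dt in file_results]
    return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))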

Determinism Guarantees

Aspect	Strategy
File order	`sorted(files)` - alphabetical by full path
DocType order	`sorted(doctypes, key=(file_path, name))`
Output format	Dataclass with defined field order
Filesystem race	Accept: each file is read once, so the result reflects that file's content at the moment it was read, not an atomic snapshot of the whole workspace
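
As a small illustration of the DocType ordering rule, the sketch below shuffles a list of results (standing in for workers finishing in arbitrary order) and checks that sorting by `(file_path, name)` always produces the same sequence. The trimmed-down `DocTypeInfo` and the sample entries are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class DocTypeInfo:
    """Reduced version of the dataclass above, enough for the sort key."""
    name: str
    file_path: Path
    line_number: int

def sort_key(info: DocTypeInfo) -> tuple[str, str]:
    return (str(info.file_path), info.name)

found = [
    DocTypeInfo("InvoiceItem", Path("apps/accounting/doctypes/invoice.py"), 45),
    DocTypeInfo("Customer", Path("apps/crm/doctypes/customer.py"), 12),
    DocTypeInfo("Invoice", Path("apps/accounting/doctypes/invoice.py"), 8),
]

orderings = []
for seed in range(5):
    shuffled = found[:]
    random.Random(seed).shuffle(shuffled)  # simulate arbitrary arrival order
    orderings.append([d.name for d in sorted(shuffled, key=sort_key)])

assert all(o == orderings[0] for o in orderings)
print(orderings[0])  # → ['Invoice', 'InvoiceItem', 'Customer']
```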

Performance Targets

Metric	Target	Strategy
100 DocTypes	`<100ms`	Sequential scan
1000 DocTypes	`<500ms`	Sequential with fast glob
5000 DocTypes	`<1s`	Parallel with ProcessPoolExecutor
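
One way to check these targets locally is a throwaway benchmark over synthetic files. The sketch below is an assumption about how such a harness could look: the directory layout, file contents, and the helper names `make_synthetic_workspace` and `time_parse_all` are invented, and real DocType files are likely larger and slower to parse.

```python
import ast
import tempfile
import time
from pathlib import Path

def make_synthetic_workspace(root: Path, count: int) -> None:
    """Write `count` small modules, each defining one DocType-like class."""
    for i in range(count):
        app_dir = root / f"app_{i % 10}" / "doctypes"
        app_dir.mkdir(parents=True, exist_ok=True)
        (app_dir / f"doc_{i}.py").write_text(
            f"class Doc{i}(DocType):\n    title: str = ''\n", encoding="utf-8"
        )

def time_parse_all(root: Path) -> tuple[int, float]:
    """Return (files parsed, elapsed seconds) for a sequential ast.parse pass."""
    start = time.perf_counter()
    files = sorted(root.rglob("*.py"))
    for f in files:
        ast.parse(f.read_text(encoding="utf-8"), filename=str(f))
    return len(files), time.perf_counter() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_synthetic_workspace(Path(tmp), 1000)
        n, elapsed = time_parse_all(Path(tmp))
        print(f"parsed {n} files in {elapsed:.3f}s")
```

Timings will vary with disk, OS file cache, and Python version, so the numbers are only a rough guide against the table above.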

Why No Caching?

Problem with Caching	Consequence
Cache invalidation	Developer edits file, sees stale DocType
Filesystem watching	Complex, platform-specific, race conditions
Cache persistence	Where to store? Disk? Memory? Per-workspace?
Determinism	Cache hit vs miss could change output order

Solution: Just be fast enough that caching isn't needed.

API Surface

# Public API in framework-m-studio

class DocTypeScanner:
    """Scan workspace for DocType definitions."""

    def __init__(self, workspace_root: Path):
        self.root = workspace_root

    def scan(self) -> list[DocTypeInfo]:
        """
        Synchronous scan of all DocTypes.

        Returns:
            Sorted list of DocTypeInfo (deterministic order).
        """
        files = find_python_files(self.root)
        doctypes = []
        for f in files:
            doctypes.extend(scan_file_for_doctypes(f))
        return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))

    async def scan_async(self) -> list[DocTypeInfo]:
        """
        Async parallel scan for large workspaces.
        """
        return await scan_doctypes_parallel(self.root)

    def scan_file(self, file_path: Path) -> list[DocTypeInfo]:
        """
        Scan single file. For incremental use cases.
        """
        return scan_file_for_doctypes(file_path)

CLI Integration

# List all DocTypes (uses scanner)
$ m studio:list-doctypes

Name                 File                                    Line
─────────────────────────────────────────────────────────────────
Invoice              apps/accounting/doctypes/invoice.py     8
InvoiceItem          apps/accounting/doctypes/invoice.py     45
Customer             apps/crm/doctypes/customer.py           12
...

Found 1,247 DocTypes in 0.42s

# JSON output for scripting
$ m studio:list-doctypes --json
[
  {"name": "Invoice", "file": "apps/accounting/doctypes/invoice.py", "line": 8},
  ...
]
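
The JSON keys (`file`, `line`) differ from the `DocTypeInfo` field names (`file_path`, `line_number`), so some mapping step is implied. A minimal sketch of that mapping follows. The helper `to_json_record` is hypothetical, and emitting workspace-relative POSIX paths is an assumption based on the sample output.

```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DocTypeInfo:
    name: str
    file_path: Path
    line_number: int
    bases: list[str]

def to_json_record(info: DocTypeInfo, workspace_root: Path) -> dict:
    """Hypothetical helper: map DocTypeInfo to the keys shown in the CLI output."""
    try:
        rel = info.file_path.relative_to(workspace_root)
    except ValueError:
        rel = info.file_path  # file is outside the workspace; keep the path as-is
    # as_posix() keeps forward slashes on every platform.
    return {"name": info.name, "file": rel.as_posix(), "line": info.line_number}

root = Path("/workspace")  # made-up workspace root for the example
info = DocTypeInfo("Customer", root / "apps/crm/doctypes/customer.py", 12, ["DocType"])
record = to_json_record(info, root)
print(json.dumps([record], indent=2))
```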

Studio API Endpoint

# GET /studio/api/doctypes
@get("/studio/api/doctypes")
async def list_doctypes(workspace: WorkspaceContext) -> list[DocTypeInfo]:
    """List all DocTypes in workspace (fresh scan, no cache)."""
    scanner = DocTypeScanner(workspace.root)
    return await scanner.scan_async()

Edge Cases

Case	Handling
Syntax error in .py file	Skip file, log warning
Non-UTF8 file	Skip file
Circular imports	N/A - we use AST, not import
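Symlinks	Symlinked `.py` files are scanned; `rglob` does not descend into symlinked directories by default (Python 3.13+ accepts `recurse_symlinks=True` to change this)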
Permission denied	Skip file, log warning
Empty workspace	Return empty list
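
The skip-and-warn handling in the table can be sketched as a single guarded parse step. The helper name `parse_or_skip` and the logger name are assumptions for this example. The exception list mirrors the cases in the table, plus `ValueError` for source that `ast.parse` rejects outright.

```python
import ast
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("doctype_scanner")  # hypothetical logger name

def parse_or_skip(file_path: Path) -> Optional[ast.Module]:
    """Return the parsed module, or None (with a warning) if the file is unusable."""
    try:
        source = file_path.read_text(encoding="utf-8")
        return ast.parse(source, filename=str(file_path))
    except SyntaxError as exc:
        logger.warning("Skipping %s: syntax error: %s", file_path, exc)
    except UnicodeDecodeError as exc:
        logger.warning("Skipping %s: not valid UTF-8: %s", file_path, exc)
    except OSError as exc:
        # Includes PermissionError and FileNotFoundError.
        logger.warning("Skipping %s: cannot read: %s", file_path, exc)
    except ValueError as exc:
        logger.warning("Skipping %s: unparsable source: %s", file_path, exc)
    return None
```

A caller would simply ignore `None` results, so one bad file never aborts the whole scan.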

What Scanner Does NOT Do

❌ Import Python modules
❌ Execute any code
❌ Validate DocType correctness
❌ Cache results
❌ Watch for changes

Validation and detailed parsing happen in a separate phase (LibCST parser).
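
To illustrate the "no imports, no execution" property, the sketch below parses a module whose top level would raise if it were imported. Parsing only builds a syntax tree, so the class is still found and nothing runs. The sample source is invented for the example.

```python
import ast

SOURCE = '''
raise RuntimeError("module-level side effect")

class Customer(DocType):
    pass
'''

# ast.parse only builds a tree; the raise statement is never executed.
tree = ast.parse(SOURCE)
class_names = [n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]
print(class_names)  # → ['Customer']
```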

Testing

import time

# test_workspace and large_workspace are assumed to be pytest fixtures
# that provide workspace root paths.

def test_scanner_determinism(test_workspace):
    """Same filesystem → same output."""
    scanner = DocTypeScanner(test_workspace)

    result1 = scanner.scan()
    result2 = scanner.scan()

    assert result1 == result2

def test_scanner_performance(large_workspace):
    """At least 1000 DocTypes scanned in under 500 ms."""
    scanner = DocTypeScanner(large_workspace)

    start = time.perf_counter()
    result = scanner.scan()
    elapsed = time.perf_counter() - start

    assert elapsed < 0.5
    assert len(result) >= 1000

References

Phase 07: Studio
Python AST module
pathlib.rglob
