RFC: DocType Discovery & Scanning

  • Status: Draft
  • Created: 2026-01-14
  • Package: framework-m-studio

Summary

Define how Studio discovers and scans DocTypes from the filesystem. The scan must be deterministic, fast enough for thousands of DocTypes, and cache-free.

Requirements

Requirement     Rationale
─────────────────────────────────────────────────────────────────
Deterministic   Same filesystem state → same output order. Critical for diffs, testing, CI.
Fast            <1s for 1000 DocTypes. Developers expect instant feedback.
No caching      Fresh scan every time. Avoids stale data and cache-invalidation bugs.
Recursive       Scan all nested directories under the workspace root.

Discovery Algorithm

Step 1: Find Python Files (Fast Filesystem Scan)

# Use pathlib with glob - fast, stdlib, no external deps
from pathlib import Path

def find_python_files(root: Path) -> list[Path]:
    """Find all .py files, sorted for determinism."""
    files = list(root.rglob("*.py"))

    # Exclude common non-doctype directories
    exclude_patterns = {
        "__pycache__", ".git", ".venv", "venv",
        "node_modules", ".pytest_cache", "dist", "build",
    }

    files = [
        f for f in files
        if not any(p in f.parts for p in exclude_patterns)
    ]

    # Sort for determinism
    return sorted(files)
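One performance caveat: rglob still descends into the excluded directories (a large .venv or node_modules) and only filters afterward. If that cost shows up in profiles, pruning during traversal with os.walk avoids it entirely. An alternative sketch, not the canonical implementation:

import os

def find_python_files_pruned(root: Path) -> list[Path]:
    """Variant that prunes excluded directories during traversal."""
    exclude = {
        "__pycache__", ".git", ".venv", "venv",
        "node_modules", ".pytest_cache", "dist", "build",
    }
    files: list[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating dirnames in place stops os.walk from descending
        dirnames[:] = [d for d in dirnames if d not in exclude]
        files.extend(Path(dirpath) / f for f in filenames if f.endswith(".py"))
    return sorted(files)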

Step 2: Fast DocType Detection (AST, not Import)

Critical: Do NOT import modules to detect DocTypes. Use AST scanning.

import ast
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DocTypeInfo:
    name: str
    file_path: Path
    line_number: int
    bases: list[str]

def scan_file_for_doctypes(file_path: Path) -> list[DocTypeInfo]:
    """Fast AST scan - no imports, no side effects."""
    try:
        source = file_path.read_text(encoding="utf-8")
        tree = ast.parse(source, filename=str(file_path))
    except (SyntaxError, UnicodeDecodeError):
        return []  # Skip invalid files

    doctypes = []

    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Check if class inherits from DocType/BaseDocType
            base_names = [get_base_name(b) for b in node.bases]

            if is_doctype_class(base_names):
                doctypes.append(DocTypeInfo(
                    name=node.name,
                    file_path=file_path,
                    line_number=node.lineno,
                    bases=base_names,
                ))

    return doctypes

def get_base_name(base: ast.expr) -> str:
    """Extract base class name from AST node."""
    if isinstance(base, ast.Name):
        return base.id
    elif isinstance(base, ast.Attribute):
        return base.attr
    elif isinstance(base, ast.Subscript):
        # Handle Generic[T] style
        return get_base_name(base.value)
    return ""

def is_doctype_class(bases: list[str]) -> bool:
    """Check if any base indicates a DocType."""
    doctype_bases = {"DocType", "BaseDocType", "BaseChildDocType"}
    return bool(set(bases) & doctype_bases)
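To illustrate what the detector matches, here is a hypothetical smoke test that writes a sample module to a temporary file and scans it. The framework_m import line is only text to the scanner; it is never executed, so the module name is illustrative:

import tempfile

sample = """
from framework_m import DocType  # never executed by the scanner

class Customer(DocType):
    pass

class Helper:  # not a DocType - no matching base
    pass
"""

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(sample)

found = scan_file_for_doctypes(Path(f.name))
assert [d.name for d in found] == ["Customer"]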

Step 3: Parallel Scanning (Optional, for Large Codebases)

import asyncio
from concurrent.futures import ProcessPoolExecutor

def scan_doctypes_sequential(files: list[Path]) -> list[DocTypeInfo]:
    """Sequential scan - flatten results and sort for determinism."""
    doctypes = [dt for f in files for dt in scan_file_for_doctypes(f)]
    return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))

async def scan_doctypes_parallel(
    root: Path,
    max_workers: int = 4,
) -> list[DocTypeInfo]:
    """Parallel scan for large codebases."""
    files = find_python_files(root)

    # For small counts, sequential is faster (no process overhead)
    if len(files) < 100:
        return scan_doctypes_sequential(files)

    # Parallel for large codebases: one task per file, fanned out
    # across worker processes. (A single lambda wrapping the whole
    # loop would run in one worker - and lambdas don't pickle.)
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        tasks = [
            loop.run_in_executor(executor, scan_file_for_doctypes, f)
            for f in files
        ]
        results = await asyncio.gather(*tasks)

    # Flatten and sort for determinism
    doctypes = [dt for file_results in results for dt in file_results]
    return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))
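A ProcessPoolExecutor is used rather than threads because ast.parse is CPU-bound and holds the GIL, so threads would serialize the parsing anyway. Calling it, sketched (path and worker count illustrative):

doctypes = asyncio.run(scan_doctypes_parallel(Path("apps"), max_workers=8))
print(f"Found {len(doctypes)} DocTypes")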

Determinism Guarantees

Aspect            Strategy
─────────────────────────────────────────────────────────────────
File order        sorted(files) - alphabetical by full path
DocType order     sorted(doctypes, key=(file_path, name))
Output format     Dataclass with defined field order
Filesystem race   Accepted: if a file changes mid-scan, the result reflects a point-in-time snapshot

Performance Targets

Metric            Target    Strategy
─────────────────────────────────────────────────────────────────
100 DocTypes      <100ms    Sequential scan
1000 DocTypes     <500ms    Sequential with fast glob
5000 DocTypes     <1s       Parallel with ProcessPoolExecutor

Why No Caching?

Problem with Caching    Consequence
─────────────────────────────────────────────────────────────────
Cache invalidation      Developer edits a file, sees a stale DocType
Filesystem watching     Complex, platform-specific, race conditions
Cache persistence       Where to store? Disk? Memory? Per-workspace?
Determinism             Cache hit vs. miss could change output order

Solution: Just be fast enough that caching isn't needed.

API Surface

# Public API in framework-m-studio

class DocTypeScanner:
    """Scan workspace for DocType definitions."""

    def __init__(self, workspace_root: Path):
        self.root = workspace_root

    def scan(self) -> list[DocTypeInfo]:
        """
        Synchronous scan of all DocTypes.

        Returns:
            Sorted list of DocTypeInfo (deterministic order).
        """
        files = find_python_files(self.root)
        doctypes = []
        for f in files:
            doctypes.extend(scan_file_for_doctypes(f))
        return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))

    async def scan_async(self) -> list[DocTypeInfo]:
        """
        Async parallel scan for large workspaces.
        """
        return await scan_doctypes_parallel(self.root)

    def scan_file(self, file_path: Path) -> list[DocTypeInfo]:
        """
        Scan a single file. For incremental use cases.
        """
        return scan_file_for_doctypes(file_path)
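Typical consumption, sketched (workspace path illustrative):

scanner = DocTypeScanner(Path("/path/to/workspace"))
for dt in scanner.scan():
    print(f"{dt.name:20} {dt.file_path}:{dt.line_number}")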

CLI Integration

# List all DocTypes (uses scanner)
$ m studio:list-doctypes

Name File Line
─────────────────────────────────────────────────────────────────
Customer apps/crm/doctypes/customer.py 12
Invoice apps/accounting/doctypes/invoice.py 8
InvoiceItem apps/accounting/doctypes/invoice.py 45
...

Found 1,247 DocTypes in 0.42s

# JSON output for scripting
$ m studio:list-doctypes --json
[
{"name": "Customer", "file": "apps/crm/doctypes/customer.py", "line": 12},
...
]
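Neither dataclasses nor Path objects are directly JSON-serializable, so the --json path presumably converts explicitly. A minimal sketch (helper name hypothetical):

import json

def doctypes_to_json(doctypes: list[DocTypeInfo]) -> str:
    """Serialize scan results for --json output."""
    rows = [
        {"name": d.name, "file": str(d.file_path), "line": d.line_number}
        for d in doctypes
    ]
    return json.dumps(rows, indent=2)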

Studio API Endpoint

# GET /studio/api/doctypes
@get("/studio/api/doctypes")
async def list_doctypes(workspace: WorkspaceContext) -> list[DocTypeInfo]:
    """List all DocTypes in workspace (fresh scan, no cache)."""
    scanner = DocTypeScanner(workspace.root)
    return await scanner.scan_async()
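A hypothetical invocation against a local dev server (host and port assumed; the response shape is expected to mirror the CLI --json output):

# Fresh scan on every request - no cache to invalidate
$ curl -s http://localhost:8000/studio/api/doctypes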

Edge Cases

Case                        Handling
─────────────────────────────────────────────────────────────────
Syntax error in .py file    Skip file, log warning
Non-UTF8 file               Skip file
Circular imports            N/A - we use AST, not import
Symlinks                    Follow by default (pathlib behavior)
Permission denied           Skip file, log warning
Empty workspace             Return empty list
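Note that the Step 2 listing only catches SyntaxError and UnicodeDecodeError; the permission-denied row implies also handling OSError. A hedged wrapper sketch (logger name assumed):

import logging

logger = logging.getLogger("framework_m_studio.scanner")

def scan_file_safe(file_path: Path) -> list[DocTypeInfo]:
    """Wrapper that also skips unreadable files, per the edge-case table."""
    try:
        return scan_file_for_doctypes(file_path)
    except OSError as exc:  # e.g. PermissionError
        logger.warning("Skipping %s: %s", file_path, exc)
        return []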

What Scanner Does NOT Do

  • ❌ Import Python modules
  • ❌ Execute any code
  • ❌ Validate DocType correctness
  • ❌ Cache results
  • ❌ Watch for changes

Validation and detailed parsing happen in a separate phase (LibCST parser).

Testing

import time

def test_scanner_determinism(test_workspace: Path):
    """Same filesystem → same output."""
    scanner = DocTypeScanner(test_workspace)

    result1 = scanner.scan()
    result2 = scanner.scan()

    assert result1 == result2

def test_scanner_performance(large_workspace: Path):
    """1000 files in <500ms."""
    scanner = DocTypeScanner(large_workspace)

    start = time.perf_counter()
    result = scanner.scan()
    elapsed = time.perf_counter() - start

    assert elapsed < 0.5
    assert len(result) > 1000
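The test_workspace and large_workspace fixtures are not defined in this RFC; one possible pytest sketch for the latter, generating a synthetic workspace (module contents illustrative - the import line is never executed by the scanner):

import pytest

@pytest.fixture
def large_workspace(tmp_path_factory) -> Path:
    """Generate a synthetic workspace with >1000 DocType files."""
    root = tmp_path_factory.mktemp("workspace")
    for i in range(1200):
        module = root / f"doctype_{i:04}.py"
        module.write_text(
            f"from framework_m import DocType  # module name assumed\n\n"
            f"class Doc{i:04}(DocType):\n"
            f"    pass\n"
        )
    return root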
