RFC: DocType Discovery & Scanning
- Status: Draft
- Created: 2026-01-14
- Package: framework-m-studio
Summary
Define how Studio discovers and scans DocTypes from the filesystem. Must be deterministic, fast (1000s of DocTypes), and cache-free.
Requirements
| Requirement | Rationale |
|---|---|
| Deterministic | Same filesystem state → same output order. Critical for diffs, testing, CI. |
| Fast | <1s for 1000 DocTypes. Developers expect instant feedback. |
| No caching | Fresh scan every time. Avoid stale data, cache invalidation bugs. |
| Recursive | Scan all nested directories under workspace root. |
Discovery Algorithm
Step 1: Find Python Files (Fast Filesystem Scan)
# Use pathlib with glob - fast, stdlib, no external deps
from pathlib import Path

def find_python_files(root: Path) -> list[Path]:
    """Find all .py files, sorted for determinism."""
    # Exclude common non-doctype directories
    exclude_patterns = {
        "__pycache__", ".git", ".venv", "venv",
        "node_modules", ".pytest_cache", "dist", "build",
    }
    files = [
        f for f in root.rglob("*.py")
        # Match against the path relative to root, so a workspace that
        # itself lives under e.g. "build/" is not excluded wholesale
        if not exclude_patterns.intersection(f.relative_to(root).parts)
    ]
    # Sort for determinism
    return sorted(files)
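One cost of the `rglob` approach is that it still walks into excluded directories such as `node_modules` before the results are filtered out. The following is a minimal sketch of an alternative that prunes those directories during traversal using `os.walk`. The function name `find_python_files_pruned` and the constant `EXCLUDED_DIRS` are illustrative and not part of the proposed API; whether the extra code pays off would need measuring on a real workspace.

```python
import os
from pathlib import Path

# Same directory names as the exclusion set in Step 1.
EXCLUDED_DIRS = frozenset({
    "__pycache__", ".git", ".venv", "venv",
    "node_modules", ".pytest_cache", "dist", "build",
})


def find_python_files_pruned(root: Path) -> list[Path]:
    """Find .py files without descending into excluded directories."""
    found: list[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk to skip those subtrees.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        found.extend(
            Path(dirpath) / name for name in filenames if name.endswith(".py")
        )
    # os.walk order depends on the filesystem, so sort once at the end.
    return sorted(found)
```

Like `rglob`, `os.walk` does not follow symlinked directories unless asked to, so the two variants should agree on which files they visit.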
Step 2: Fast DocType Detection (AST, not Import)
Critical: Do NOT import modules to detect DocTypes. Use AST scanning.
import ast
import logging
from dataclasses import dataclass
from pathlib import Path

logger = logging.getLogger(__name__)

@dataclass
class DocTypeInfo:
    name: str
    file_path: Path
    line_number: int
    bases: list[str]

def scan_file_for_doctypes(file_path: Path) -> list[DocTypeInfo]:
    """Fast AST scan - no imports, no code execution."""
    try:
        source = file_path.read_text(encoding="utf-8")
        tree = ast.parse(source, filename=str(file_path))
    except (SyntaxError, ValueError, OSError) as exc:
        # UnicodeDecodeError is a ValueError; OSError covers permission denied
        logger.warning("Skipping %s: %s", file_path, exc)
        return []  # Skip invalid or unreadable files
    doctypes = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Check if class inherits from DocType/BaseDocType
            base_names = [get_base_name(b) for b in node.bases]
            if is_doctype_class(base_names):
                doctypes.append(DocTypeInfo(
                    name=node.name,
                    file_path=file_path,
                    line_number=node.lineno,
                    bases=base_names,
                ))
    return doctypes

def get_base_name(base: ast.expr) -> str:
    """Extract base class name from AST node."""
    if isinstance(base, ast.Name):
        return base.id
    elif isinstance(base, ast.Attribute):
        return base.attr
    elif isinstance(base, ast.Subscript):
        # Handle Generic[T] style
        return get_base_name(base.value)
    return ""

def is_doctype_class(bases: list[str]) -> bool:
    """Check if any base indicates a DocType."""
    doctype_bases = {"DocType", "BaseDocType", "BaseChildDocType"}
    return bool(set(bases) & doctype_bases)
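To make the detection rule concrete, here is a small self-contained sketch that applies the same base-name matching to an in-memory source string. The sample classes and the `framework_m` module name are hypothetical, and `direct_doctype_classes` is an illustrative helper rather than part of the proposal. It shows that plain, qualified, and subscripted bases all match, and that a class inheriting only indirectly (through another DocType in the workspace) does not, because the scan looks at each class's own base list only.

```python
import ast

DOCTYPE_BASES = {"DocType", "BaseDocType", "BaseChildDocType"}

# Hypothetical sample source; it is parsed, never imported or executed.
SAMPLE = '''
import framework_m as m

class Customer(DocType): ...
class Invoice(m.DocType): ...
class Ledger(BaseDocType[Entry]): ...
class VipCustomer(Customer): ...
class Helper: ...
'''


def base_name(node: ast.expr) -> str:
    """Reduce a base-class expression to its final identifier."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Subscript):
        return base_name(node.value)
    return ""


def direct_doctype_classes(source: str) -> list[str]:
    """Names of classes whose own base list mentions a DocType base."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
        and DOCTYPE_BASES & {base_name(b) for b in node.bases}
    ]


print(direct_doctype_classes(SAMPLE))
# ['Customer', 'Invoice', 'Ledger']
```

`VipCustomer` is absent from the result. Picking up such subclasses would need a second pass that resolves class names across files, which this RFC leaves out of scope for the scanner.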
Step 3: Parallel Scanning (Optional, for Large Codebases)
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def scan_doctypes_parallel(
    root: Path,
    max_workers: int = 4,
) -> list[DocTypeInfo]:
    """Parallel scan for large codebases."""
    files = find_python_files(root)
    if len(files) < 100:
        # For small counts, sequential is faster (no process overhead)
        results = [scan_file_for_doctypes(f) for f in files]
    else:
        # Parallel for large codebases: one task per file. The module-level
        # function is submitted directly because lambdas cannot be pickled.
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            results = await asyncio.gather(*(
                loop.run_in_executor(executor, scan_file_for_doctypes, f)
                for f in files
            ))
    # Flatten and sort for determinism
    doctypes = [dt for file_results in results for dt in file_results]
    return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))
Determinism Guarantees
| Aspect | Strategy |
|---|---|
| File order | `sorted(files)`: lexicographic by path components |
| DocType order | `sorted(doctypes, key=lambda d: (str(d.file_path), d.name))` |
| Output format | Dataclass with defined field order |
| Filesystem race | Accept: if file changes mid-scan, result reflects point-in-time |
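If output also needs to match across operating systems (an assumption beyond what this RFC states), sorting on `str(path)` may not be enough, because path separators differ between Windows and POSIX. The following sketch normalises each path to a workspace-relative, forward-slash string before sorting. `Found`, `normalise`, and `sort_key` are hypothetical names used only for illustration, and the line number is added as a final tiebreaker for the case where one file defines the same class name twice.

```python
from dataclasses import dataclass
from pathlib import PurePath, PurePosixPath, PureWindowsPath


@dataclass(frozen=True)
class Found:
    """Hypothetical stand-in for DocTypeInfo with only sort-relevant fields."""
    name: str
    rel_path: str  # workspace-relative, always forward slashes
    line_number: int


def normalise(path: PurePath, root: PurePath) -> str:
    """Workspace-relative path with forward slashes on every platform."""
    return path.relative_to(root).as_posix()


def sort_key(item: Found) -> tuple[str, str, int]:
    # Same (file, name) order as the RFC, plus line number as a tiebreaker.
    return (item.rel_path, item.name, item.line_number)


# The same file yields the same key regardless of path flavour.
win = normalise(PureWindowsPath(r"C:\ws\apps\crm\customer.py"), PureWindowsPath(r"C:\ws"))
posix = normalise(PurePosixPath("/ws/apps/crm/customer.py"), PurePosixPath("/ws"))
print(win == posix)
# True
```

Relative paths have the side benefit that results do not change when the same workspace is checked out in a different location.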
Performance Targets
| Metric | Target | Strategy |
|---|---|---|
| 100 DocTypes | <100ms | Sequential scan |
| 1000 DocTypes | <500ms | Sequential with fast glob |
| 5000 DocTypes | <1s | Parallel with ProcessPoolExecutor |
Why No Caching?
| Problem with Caching | Consequence |
|---|---|
| Cache invalidation | Developer edits file, sees stale DocType |
| Filesystem watching | Complex, platform-specific, race conditions |
| Cache persistence | Where to store? Disk? Memory? Per-workspace? |
| Determinism | Cache hit vs miss could change output order |
Solution: Just be fast enough that caching isn't needed.
API Surface
# Public API in framework-m-studio
class DocTypeScanner:
    """Scan workspace for DocType definitions."""

    def __init__(self, workspace_root: Path):
        self.root = workspace_root

    def scan(self) -> list[DocTypeInfo]:
        """
        Synchronous scan of all DocTypes.

        Returns:
            Sorted list of DocTypeInfo (deterministic order).
        """
        files = find_python_files(self.root)
        doctypes = []
        for f in files:
            doctypes.extend(scan_file_for_doctypes(f))
        return sorted(doctypes, key=lambda d: (str(d.file_path), d.name))

    async def scan_async(self) -> list[DocTypeInfo]:
        """
        Async parallel scan for large workspaces.
        """
        return await scan_doctypes_parallel(self.root)

    def scan_file(self, file_path: Path) -> list[DocTypeInfo]:
        """
        Scan single file. For incremental use cases.
        """
        return scan_file_for_doctypes(file_path)
CLI Integration
# List all DocTypes (uses scanner)
$ m studio:list-doctypes
Name          File                                   Line
─────────────────────────────────────────────────────────────────
Customer      apps/crm/doctypes/customer.py          12
Invoice       apps/accounting/doctypes/invoice.py    8
InvoiceItem   apps/accounting/doctypes/invoice.py    45
...
Found 1,247 DocTypes in 0.42s
# JSON output for scripting
$ m studio:list-doctypes --json
[
{"name": "Customer", "file": "apps/crm/doctypes/customer.py", "line": 12},
...
]
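The JSON shape above uses shorter keys (`file`, `line`) than the `DocTypeInfo` fields and shows workspace-relative paths, so some mapping step is implied. A minimal sketch of that mapping follows, assuming paths are reported relative to the workspace root and that `bases` is omitted from the output, as in the example. The helper name `to_json` is hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DocTypeInfo:
    name: str
    file_path: Path
    line_number: int
    bases: list[str]


def to_json(doctypes: list[DocTypeInfo], root: Path) -> str:
    """Serialise scan results to the JSON shape shown in the CLI example."""
    rows = [
        {
            "name": d.name,
            # Assumption: paths are reported relative to the workspace root.
            "file": d.file_path.relative_to(root).as_posix(),
            "line": d.line_number,
        }
        for d in doctypes
    ]
    return json.dumps(rows, indent=2)
```

Because the input list is already sorted by the scanner, the JSON output inherits the same deterministic order.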
Studio API Endpoint
# GET /studio/api/doctypes
@get("/studio/api/doctypes")
async def list_doctypes(workspace: WorkspaceContext) -> list[DocTypeInfo]:
    """List all DocTypes in workspace (fresh scan, no cache)."""
    scanner = DocTypeScanner(workspace.root)
    return await scanner.scan_async()
Edge Cases
| Case | Handling |
|---|---|
| Syntax error in .py file | Skip file, log warning |
| Non-UTF8 file | Skip file |
| Circular imports | N/A - we use AST, not import |
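| Symlinks | Symlinked files are scanned; `rglob` does not descend into symlinked directories by default (Python 3.13+ accepts `recurse_symlinks=True`) |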
| Permission denied | Skip file, log warning |
| Empty workspace | Return empty list |
What Scanner Does NOT Do
- ❌ Import Python modules
- ❌ Execute any code
- ❌ Validate DocType correctness
- ❌ Cache results
- ❌ Watch for changes
Validation and detailed parsing happen in a separate phase (LibCST parser).
Testing
import time

# test_workspace and large_workspace are assumed to be pytest fixtures
# that each return the Path of a prepared workspace.

def test_scanner_determinism(test_workspace):
    """Same filesystem → same output."""
    scanner = DocTypeScanner(test_workspace)
    result1 = scanner.scan()
    result2 = scanner.scan()
    assert result1 == result2

def test_scanner_performance(large_workspace):
    """1000 DocTypes in under 500 ms."""
    scanner = DocTypeScanner(large_workspace)
    start = time.perf_counter()
    result = scanner.scan()
    elapsed = time.perf_counter() - start
    assert elapsed < 0.5
    assert len(result) >= 1000
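The performance test needs a large workspace to exist on disk. One way to get it is to generate synthetic DocType modules into a temporary directory, as in the sketch below. `make_workspace`, the directory layout, and the `framework_m` import line are all hypothetical; the generated import is never executed because the scanner only parses the files.

```python
from pathlib import Path


def make_workspace(root: Path, n_files: int, per_file: int = 1) -> Path:
    """Write n_files modules, each defining per_file DocType subclasses."""
    for i in range(n_files):
        # Spread files over ten app directories to exercise recursion.
        app_dir = root / "apps" / f"app_{i % 10:02d}" / "doctypes"
        app_dir.mkdir(parents=True, exist_ok=True)
        classes = "\n\n".join(
            f"class Doc{i:05d}N{j}(DocType):\n    title: str = ''"
            for j in range(per_file)
        )
        (app_dir / f"doc_{i:05d}.py").write_text(
            # Hypothetical import path; parsed by the scanner, never imported.
            f"from framework_m import DocType\n\n\n{classes}\n",
            encoding="utf-8",
        )
    return root
```

Wrapped in a session-scoped pytest fixture built on `tmp_path_factory`, this could back `large_workspace` with, for example, `make_workspace(path, 1000)`. Timing assertions on shared CI runners can be noisy, so the 0.5 s threshold may need some headroom there.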
References
- Phase 07: Studio
- Python AST module
- pathlib.rglob