storybook/LSP_LEXER_AUDIT.md
Sienna Meridian Satterwhite 16deb5d237 release: Storybook v0.2.0 - Major syntax and features update
2026-02-13 21:52:03 +00:00


LSP Lexer Usage Audit

Date: 2026-02-12
Auditor: LSP Test Engineer
Purpose: Verify all LSP modules use the lexer exclusively (no ad-hoc parsing)


Executive Summary

Status: ⚠️ Mixed Compliance

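  • 3 modules use the lexer (completion.rs, semantic_tokens.rs, code_actions.rs)
  • 5 modules use ad-hoc parsing with no lexer (hover.rs, formatting.rs, diagnostics.rs, references.rs, rename.rs)
  • 3 modules are unverified or appear AST-based (definition.rs, inlay_hints.rs, symbols.rs)
  • Critical Risk: Hardcoded keyword strings in 4 modules (completion.rs, code_actions.rs, hover.rs, rename.rs)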

Risk Level: MEDIUM

  • Inconsistent behavior between LSP and compiler
  • Vulnerability to keyword changes (like recent behavior tree updates)
  • Maintenance burden from duplicate logic

Audit Results by Module

Compliant Modules (Using Lexer)

1. completion.rs - GOOD (with caveats)

Status: Uses Lexer properly for tokenization

Lexer Usage:

use crate::syntax::lexer::{Lexer, Token};
let lexer = Lexer::new(before);

Found at lines: 135, 142, 269, 277, 410, 417

Issue: ⚠️ Contains hardcoded keyword strings (8 occurrences)

  • Lines with hardcoded keywords: "character", "template", "behavior", etc.
  • Risk: Keywords could get out of sync with lexer

Recommendation: Extract keywords from lexer Token enum instead of hardcoding


2. semantic_tokens.rs - EXCELLENT

Status: Uses Lexer exclusively, no manual parsing

Lexer Usage:

use crate::syntax::lexer::{Lexer, Token};

Assessment: Best practice example - Uses lexer for all tokenization, no hardcoded keywords, no manual string manipulation


3. code_actions.rs - GOOD (with caveats)

Status: Uses Lexer for tokenization

Issues:

  • ⚠️ Contains 15 instances of manual string parsing (.split(), .chars(), .lines())
  • ⚠️ Has 4 hardcoded keyword strings
  • Concern: Mixed approach - uses lexer sometimes, manual parsing other times

Recommendation: Refactor to use lexer consistently throughout


Non-Compliant Modules (Ad-Hoc Parsing)

4. hover.rs - CRITICAL ISSUE

Status: NO lexer usage, extensive manual parsing

Problems:

A. Manual Character-by-Character Parsing

// Lines 51-77: extract_word_at_position()
let chars: Vec<char> = line.chars().collect();
// Custom word boundary detection
while start > 0 && is_word_char(chars[start - 1]) {
    start -= 1;
}

Risk: Custom tokenization logic differs from compiler's lexer

B. Hardcoded Keyword List (Lines 20-39)

match word.as_str() {
    "character" => "**character** - Defines a character entity...",
    "template" => "**template** - Defines a reusable field template...",
    "life_arc" => "**life_arc** - Defines a state machine...",
    "schedule" => "**schedule** - Defines a daily schedule...",
    "behavior" => "**behavior** - Defines a behavior tree...",
    "institution" => "**institution** - Defines an organization...",
    "relationship" => "**relationship** - Defines a multi-party...",
    "location" => "**location** - Defines a place...",
    "species" => "**species** - Defines a species...",
    "enum" => "**enum** - Defines an enumeration...",
    "use" => "**use** - Imports declarations...",
    "from" => "**from** - Applies templates...",
    "include" => "**include** - Includes another template...",
    "state" => "**state** - Defines a state...",
    "on" => "**on** - Defines a transition...",
    "strict" => "**strict** - Enforces that a template...",
    _ => return None,
}

Risk: HIGH

  • 16 hardcoded keywords that must stay in sync with lexer
  • If lexer adds/removes keywords, hover won't update automatically
  • Recent behavior tree changes could have broken this

Recommendation:

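// BEFORE (current):
let word = extract_word_at_position(line_text, character)?;

// AFTER (proposed):
// `find_token_at_position` is a proposed helper (see Shared Utilities), not an existing function.
let lexer = Lexer::new(line_text);
let token_at_pos = find_token_at_position(&lexer, character)?;
let content = match token_at_pos {
    Token::Character => "**character** - Defines...",
    Token::Template => "**template** - Defines...",
    // etc. - one arm per keyword variant of the lexer's Token enum
    _ => return None, // non-keyword tokens (identifiers, strings, ...) get no hover
};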
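
A minimal sketch of the proposed lookup, assuming the lexer can report a byte span for each token. The `Token` enum below is a reduced stand-in for the one in `crate::syntax::lexer`; the `(Token, Range<usize>)` pair layout, the `find_token_at_position` signature, and `keyword_hover` are all hypothetical. LSP positions default to UTF-16 code units, so the sketch also assumes the caller has already converted the cursor position to a byte offset.

```rust
use std::ops::Range;

// Hypothetical, reduced stand-in for the real `Token` enum.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Character,
    Template,
    Ident(String),
    StringLit(String),
    LBrace,
    RBrace,
    Colon,
}

/// Returns the token whose byte span contains `offset`, if any.
/// Assumes tokens arrive as `(Token, Range<usize>)` pairs in source order.
fn find_token_at_position(tokens: &[(Token, Range<usize>)], offset: usize) -> Option<&Token> {
    tokens
        .iter()
        .find(|(_, span)| span.contains(&offset))
        .map(|(tok, _)| tok)
}

/// Hover text for keyword tokens only; every other token yields no hover.
fn keyword_hover(token: &Token) -> Option<&'static str> {
    match token {
        Token::Character => Some("**character** - Defines a character entity..."),
        Token::Template => Some("**template** - Defines a reusable field template..."),
        _ => None,
    }
}
```

Because a string literal is a single token, a cursor inside `"character"` resolves to the string token rather than to the keyword, so no keyword hover is produced there.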

5. formatting.rs - MANUAL PARSING

Status: NO lexer usage, line-by-line text processing

Problems:

// Line 53: Manual line processing
for line in text.lines() {
    let trimmed = line.trim();
    // Custom logic for braces, colons, etc.
}

Risk: MEDIUM

  • Custom formatting logic may not respect language semantics
  • Could format code incorrectly if it doesn't understand context

Recommendation: Use lexer to tokenize, then format based on tokens

  • Preserves semantic understanding
  • Respects string literals, comments, etc.
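
As a rough illustration of token-driven formatting, the sketch below derives an indentation level for each line from brace tokens only. It assumes the lexer output has already been grouped by source line; `Tok` and `indent_levels` are hypothetical names, and `Tok::Other` stands in for every non-brace token, string literals included.

```rust
// Hypothetical token type: only braces matter for indentation here.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    LBrace,
    RBrace,
    Other(String), // identifiers, punctuation, string literals, ...
}

/// Computes the indentation depth of each line from its tokens.
/// `lines[i]` holds the tokens the lexer produced for source line `i`.
fn indent_levels(lines: &[Vec<Tok>]) -> Vec<usize> {
    let mut depth: usize = 0;
    let mut levels = Vec::with_capacity(lines.len());
    for toks in lines {
        // A line that begins with `}` is printed one level shallower.
        let starts_with_close = matches!(toks.first(), Some(Tok::RBrace));
        levels.push(if starts_with_close { depth.saturating_sub(1) } else { depth });
        // Update the running depth using brace tokens only.
        for tok in toks {
            match tok {
                Tok::LBrace => depth += 1,
                Tok::RBrace => depth = depth.saturating_sub(1),
                Tok::Other(_) => {}
            }
        }
    }
    levels
}
```

A string such as `"}"` reaches the formatter as one `Other` token, so it cannot change the depth the way it could with line-by-line character inspection.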

6. diagnostics.rs - MANUAL PARSING

Status: NO lexer usage, manual brace counting

Problems:

// Lines 34-37: Manual brace counting
for (line_num, line) in text.lines().enumerate() {
    let open_braces = line.chars().filter(|&c| c == '{').count();
    let close_braces = line.chars().filter(|&c| c == '}').count();
}

// Line 79: Character-by-character processing
for ch in text.chars() {
    // Custom logic
}

Risk: HIGH

  • Brace counting doesn't account for braces in strings: character Alice { name: "{" }
  • Doesn't respect comments: // This { is a comment
  • Could generate false diagnostics

Recommendation: Use lexer tokens to track brace pairs accurately
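
Token-based brace tracking could look roughly like this. `lex_braces` is a toy stand-in for the real lexer: its handling of backslash escapes and `//` comments is an assumption about Storybook syntax, and it does not handle block comments. `Tok` and `unmatched_braces` are hypothetical names.

```rust
// Hypothetical minimal token type; the real `Token` enum is assumed to
// expose brace variants along these lines.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    LBrace,
    RBrace,
}

/// Toy lexer (illustration only): yields brace tokens with byte offsets,
/// skipping string literals and `//` line comments.
fn lex_braces(src: &str) -> Vec<(Tok, usize)> {
    let bytes = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'"' => {
                // Advance to the closing quote, honoring backslash escapes.
                i += 1;
                while i < bytes.len() && bytes[i] != b'"' {
                    if bytes[i] == b'\\' {
                        i += 1;
                    }
                    i += 1;
                }
            }
            b'/' if bytes.get(i + 1) == Some(&b'/') => {
                // Advance to the end of the line.
                while i < bytes.len() && bytes[i] != b'\n' {
                    i += 1;
                }
            }
            b'{' => out.push((Tok::LBrace, i)),
            b'}' => out.push((Tok::RBrace, i)),
            _ => {}
        }
        i += 1;
    }
    out
}

/// Returns byte offsets of unmatched braces: a `}` with no opener,
/// followed by any `{` that is never closed.
fn unmatched_braces(src: &str) -> Vec<usize> {
    let mut open = Vec::new(); // offsets of currently unclosed `{`
    let mut errors = Vec::new();
    for (tok, offset) in lex_braces(src) {
        match tok {
            Tok::LBrace => open.push(offset),
            Tok::RBrace => {
                if open.pop().is_none() {
                    errors.push(offset);
                }
            }
        }
    }
    errors.extend(open);
    errors
}
```

Using a stack of opener offsets instead of per-line counts also lets a diagnostic point at the specific brace that is unmatched.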


7. references.rs - MANUAL STRING PARSING

Status: NO lexer usage

Problems:

// Manual string parsing for word boundaries

Risk: MEDIUM

  • May not correctly identify symbol boundaries
  • Could match partial words

Recommendation: Use lexer to identify identifiers
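
To illustrate why identifier tokens avoid partial-word matches, here is a small sketch. `identifier_spans` is a toy ASCII-only scanner standing in for the lexer's identifier tokens, and `find_references` is a hypothetical name; the real module would take spans from the lexer or AST instead.

```rust
use std::ops::Range;

/// Toy identifier scanner (illustration only): yields the byte span of each
/// ASCII identifier-like word and skips string literals.
fn identifier_spans(src: &str) -> Vec<Range<usize>> {
    let bytes = src.as_bytes();
    let mut spans = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if b == b'"' {
            // Skip the whole string literal, including its closing quote.
            i += 1;
            while i < bytes.len() && bytes[i] != b'"' {
                i += 1;
            }
            i += 1;
        } else if b.is_ascii_alphabetic() || b == b'_' {
            let start = i;
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            spans.push(start..i);
        } else {
            i += 1;
        }
    }
    spans
}

/// Returns the spans of identifiers whose text is exactly `name`.
fn find_references(src: &str, name: &str) -> Vec<Range<usize>> {
    identifier_spans(src)
        .into_iter()
        .filter(|span| &src[span.clone()] == name)
        .collect()
}
```

For the input `Alice knows Alicia; "Alice" Alice_2 Alice`, a plain substring search for `Alice` finds four occurrences, while the token-based version reports only the two standalone identifiers.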


8. rename.rs - HARDCODED KEYWORDS

Status: NO lexer usage, contains hardcoded strings

Problems:

  • 2 hardcoded keyword strings
  • Manual symbol identification

Risk: MEDIUM

  • May incorrectly rename keywords or miss valid renames

Recommendation: Use lexer to distinguish keywords from identifiers


9. definition.rs - ⚠️ UNCLEAR

Status: No obvious lexer usage, no obvious manual parsing

Assessment: Likely uses AST-based approach (acceptable)

  • May rely on symbols extracted from AST
  • Acceptable if using Document.ast for symbol lookup

Recommendation: Verify it's using AST, not string parsing


10. inlay_hints.rs - ⚠️ UNCLEAR

Status: No lexer usage detected, no obvious parsing

Assessment: Only minimally inspected so far

  • May be stub or use AST-based approach

Recommendation: Full inspection needed


11. symbols.rs - LIKELY COMPLIANT

Status: No manual parsing detected

Assessment: Appears to extract symbols from AST

  • Acceptable approach - AST is the canonical representation
  • No need for lexer if working from AST

Summary Table

| Module | Lexer Usage | Manual Parsing | Hardcoded Keywords | Risk Level |
|---|---|---|---|---|
| completion.rs | Yes | No | ⚠️ 8 keywords | MEDIUM |
| semantic_tokens.rs | Yes | No | None | LOW |
| code_actions.rs | Yes | ⚠️ 15 instances | ⚠️ 4 keywords | MEDIUM |
| hover.rs | No | ⚠️ Extensive | ⚠️ 16 keywords | HIGH |
| formatting.rs | No | ⚠️ Line-by-line | None | MEDIUM |
| diagnostics.rs | No | ⚠️ Char counting | None | HIGH |
| references.rs | No | ⚠️ Yes | None | MEDIUM |
| rename.rs | No | ⚠️ Yes | ⚠️ 2 keywords | MEDIUM |
| definition.rs | ⚠️ Unknown | ⚠️ Unknown | None | LOW |
| inlay_hints.rs | ⚠️ Unknown | ⚠️ Unknown | None | LOW |
| symbols.rs | N/A (AST) | No | None | LOW |

Critical Findings

1. hover.rs - Highest Risk

  • 16 hardcoded keywords
  • Custom tokenization logic
  • Impact: Recent behavior tree keyword changes may have broken hover
  • Fix Priority: HIGH

2. diagnostics.rs - High Risk

  • Manual brace counting fails in strings/comments
  • Could generate false errors
  • Fix Priority: HIGH

3. Hardcoded Keywords - Maintenance Burden

  • Total: 30 hardcoded keyword strings across 4 files
  • Risk: Keywords get out of sync with lexer
  • Recent Example: Behavior tree syntax changes could have broken these

Priority 1: hover.rs Refactoring

Estimated Time: 2-3 hours

Changes:

  1. Add lexer import: use crate::syntax::lexer::{Lexer, Token};
  2. Replace extract_word_at_position() with lexer-based token finding
  3. Replace keyword match with Token enum match
  4. Remove hardcoded keyword strings

Benefits:

  • Automatic sync with lexer changes
  • Consistent tokenization with compiler
  • Easier maintenance

Example Code:

pub fn get_hover_info(text: &str, line: usize, character: usize) -> Option<Hover> {
    let line_text = text.lines().nth(line)?;
    let lexer = Lexer::new(line_text);

    // Find token at character position (proposed helper, not yet implemented)
    let token_at_pos = find_token_at_position(&lexer, character)?;

    let content = match token_at_pos {
        Token::Character => "**character** - Defines a character entity...",
        Token::Template => "**template** - Defines a reusable field template...",
        // ... remaining keyword variants. Matching on the Token enum turns a
        // renamed or removed keyword into a compile error.
        _ => return None, // identifiers, literals, punctuation: no hover
    };

    Some(Hover { /* ... */ })
}

Priority 2: diagnostics.rs Refactoring

Estimated Time: 1-2 hours

Changes:

  1. Use lexer to tokenize text
  2. Count LBrace/RBrace tokens (not characters)
  3. Ignore braces in strings and comments automatically

Benefits:

  • Correct handling of braces in all contexts
  • No false positives

Priority 3: Extract Keyword Definitions

Estimated Time: 1 hour

Changes:

  1. Create src/syntax/keywords.rs module
  2. Define keyword list from lexer Token enum
  3. Use in completion, code_actions, rename

Benefits:

  • Single source of truth for keywords
  • Easy to keep in sync
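
One possible shape for such a module, as a sketch only. The `Token` enum here is a hypothetical subset of the lexer's real enum, and `keyword_text`, `is_keyword`, `ALL_TOKENS`, and `keywords` are illustrative names rather than existing APIs.

```rust
// Hypothetical subset of the lexer's `Token` enum; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Token {
    Character,
    Template,
    Behavior,
    Ident,
    LBrace,
    RBrace,
}

impl Token {
    /// Source text for keyword tokens, `None` for everything else.
    /// There is deliberately no wildcard arm: adding a variant to `Token`
    /// fails to compile until it is classified here.
    pub fn keyword_text(self) -> Option<&'static str> {
        match self {
            Token::Character => Some("character"),
            Token::Template => Some("template"),
            Token::Behavior => Some("behavior"),
            Token::Ident | Token::LBrace | Token::RBrace => None,
        }
    }

    /// Token-based keyword check, usable by rename to reject keyword targets.
    pub fn is_keyword(self) -> bool {
        self.keyword_text().is_some()
    }
}

/// Every variant, listed once. This list is maintained by hand in this
/// sketch and is not checked by the compiler.
pub const ALL_TOKENS: &[Token] = &[
    Token::Character,
    Token::Template,
    Token::Behavior,
    Token::Ident,
    Token::LBrace,
    Token::RBrace,
];

/// Keyword strings derived from the token list, e.g. for completion items.
pub fn keywords() -> Vec<&'static str> {
    ALL_TOKENS.iter().filter_map(|t| t.keyword_text()).collect()
}
```

The exhaustive `match` is what provides the compile-time guarantee. The hand-written `ALL_TOKENS` list is the weak point of this design and would need a test or a generated variant list to stay complete.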

Testing Recommendations

After fixes, add tests for:

  1. Hover on keywords in strings - Should NOT show hover

    character Alice { name: "character" }  // hovering "character" in string
    
  2. Diagnostics with braces in strings - Should NOT count

    character Alice { name: "{}" }  // should not report brace mismatch
    
  3. Lexer consistency tests - Verify LSP uses same tokens as compiler

    #[test]
    fn test_hover_uses_lexer_keywords() {
        // Ensure hover keywords match lexer Token enum
    }
    

Long-Term Recommendations

1. Establish Coding Standards

Document that LSP modules should:

  • Use lexer for all tokenization
  • Use AST for semantic analysis
  • Never do manual string parsing
  • Never hardcode keywords

2. Code Review Checklist

Add to PR reviews:

  • Does LSP code use lexer for tokenization?
  • Are there any hardcoded keyword strings?
  • Is there any manual .split(), .chars(), .find() parsing?

3. Shared Utilities

Create src/lsp/lexer_utils.rs with:

  • find_token_at_position() - Common utility
  • get_keyword_docs() - Centralized keyword documentation
  • is_keyword() - Token-based keyword checking

Impact Assessment

Current State Risks

Behavior Tree Changes (Recent):

  • Lexer/parser updated with new keywords
  • hover.rs still has old hardcoded list
  • ⚠️ Users may not see hover for new keywords

Future Keyword Changes:

  • Any new keywords require updates in 4 separate files
  • Easy to miss one, causing inconsistent behavior

Bug Examples:

// diagnostics.rs bug:
character Alice { description: "Contains } brace" }
// ^ Reports false "unmatched brace" error

// hover.rs bug:
character Alice { /* hover on "character" keyword here */ }
// ^ Works

"In a character block..."  // hover on "character" in string
// ^ Also shows hover (WRONG - it's just a word in a string)

Conclusion

Overall Assessment: NEEDS IMPROVEMENT

  • Compliance Rate: 27% (3/11 modules)
  • Risk Level: MEDIUM-HIGH
  • Recommended Action: Refactor hover.rs and diagnostics.rs as priority

Benefits of Fixing:

  1. Consistency with compiler behavior
  2. Automatic sync with language changes
  3. Reduced maintenance burden
  4. Fewer bugs from edge cases

Next Steps:

  1. Create GitHub issues for each Priority 1-2 fix
  2. Refactor hover.rs (highest risk)
  3. Refactor diagnostics.rs (high risk)
  4. Extract keyword definitions (prevents future issues)
  5. Add lexer usage to code review checklist

Appendix: Files Analyzed

Total LSP modules audited: 11
✅ Compliant (lexer-based): 3 (27%)
✅ Likely compliant (AST-based): 1 (9%)
❌ Non-compliant (ad-hoc parsing): 5 (45%)
⚠️ Unverified: 2 (18%)

Files examined:
- src/lsp/completion.rs
- src/lsp/hover.rs
- src/lsp/semantic_tokens.rs
- src/lsp/inlay_hints.rs
- src/lsp/definition.rs
- src/lsp/formatting.rs
- src/lsp/rename.rs
- src/lsp/references.rs
- src/lsp/code_actions.rs
- src/lsp/symbols.rs
- src/lsp/diagnostics.rs