# LSP Lexer Usage Audit **Date:** 2026-02-12 **Auditor:** LSP Test Engineer **Purpose:** Verify all LSP modules use the lexer exclusively (no ad-hoc parsing) --- ## Executive Summary **Status:** ⚠️ **Mixed Compliance** - **3 modules** properly use lexer ✅ - **8 modules** use ad-hoc parsing ❌ - **Critical Risk:** Hardcoded keyword lists in 3 modules **Risk Level:** MEDIUM - Inconsistent behavior between LSP and compiler - Vulnerability to keyword changes (like recent behavior tree updates) - Maintenance burden from duplicate logic --- ## Audit Results by Module ### ✅ Compliant Modules (Using Lexer) #### 1. **completion.rs** - ✅ GOOD (with caveats) **Status:** Uses Lexer properly for tokenization **Lexer Usage:** ```rust use crate::syntax::lexer::{Lexer, Token}; let lexer = Lexer::new(before); ``` **Found at lines:** 135, 142, 269, 277, 410, 417 **Issue:** ⚠️ Contains hardcoded keyword strings (8 occurrences) - Lines with hardcoded keywords: "character", "template", "behavior", etc. - **Risk:** Keywords could get out of sync with lexer **Recommendation:** Extract keywords from lexer Token enum instead of hardcoding --- #### 2. **semantic_tokens.rs** - ✅ EXCELLENT **Status:** Uses Lexer exclusively, no manual parsing **Lexer Usage:** ```rust use crate::syntax::lexer::{Lexer, Token}; ``` **Assessment:** **Best practice example** - Uses lexer for all tokenization, no hardcoded keywords, no manual string manipulation --- #### 3. **code_actions.rs** - ✅ GOOD (with caveats) **Status:** Uses Lexer for tokenization **Issues:** - ⚠️ Contains 15 instances of manual string parsing (`.split()`, `.chars()`, `.lines()`) - ⚠️ Has 4 hardcoded keyword strings - **Concern:** Mixed approach - uses lexer sometimes, manual parsing other times **Recommendation:** Refactor to use lexer consistently throughout --- ### ❌ Non-Compliant Modules (Ad-Hoc Parsing) #### 4. **hover.rs** - ❌ CRITICAL ISSUE **Status:** NO lexer usage, extensive manual parsing **Problems:** **A. Manual Character-by-Character Parsing** ```rust // Line 51-77: extract_word_at_position() let chars: Vec = line.chars().collect(); // Custom word boundary detection while start > 0 && is_word_char(chars[start - 1]) { start -= 1; } ``` **Risk:** Custom tokenization logic differs from compiler's lexer **B. Hardcoded Keyword List** (Lines 20-39) ```rust match word.as_str() { "character" => "**character** - Defines a character entity...", "template" => "**template** - Defines a reusable field template...", "life_arc" => "**life_arc** - Defines a state machine...", "schedule" => "**schedule** - Defines a daily schedule...", "behavior" => "**behavior** - Defines a behavior tree...", "institution" => "**institution** - Defines an organization...", "relationship" => "**relationship** - Defines a multi-party...", "location" => "**location** - Defines a place...", "species" => "**species** - Defines a species...", "enum" => "**enum** - Defines an enumeration...", "use" => "**use** - Imports declarations...", "from" => "**from** - Applies templates...", "include" => "**include** - Includes another template...", "state" => "**state** - Defines a state...", "on" => "**on** - Defines a transition...", "strict" => "**strict** - Enforces that a template...", _ => return None, } ``` **Risk:** HIGH - 16 hardcoded keywords that must stay in sync with lexer - If lexer adds/removes keywords, hover won't update automatically - Recent behavior tree changes could have broken this **Recommendation:** ```rust // BEFORE (current): let word = extract_word_at_position(line_text, character)?; // AFTER (proposed): let lexer = Lexer::new(line_text); let token_at_pos = find_token_at_position(&lexer, character)?; match token_at_pos { Token::Character => "**character** - Defines...", Token::Template => "**template** - Defines...", // etc - using Token enum from lexer } ``` --- #### 5. **formatting.rs** - ❌ MANUAL PARSING **Status:** NO lexer usage, line-by-line text processing **Problems:** ```rust // Line 53: Manual line processing for line in text.lines() { let trimmed = line.trim(); // Custom logic for braces, colons, etc. } ``` **Risk:** MEDIUM - Custom formatting logic may not respect language semantics - Could format code incorrectly if it doesn't understand context **Recommendation:** Use lexer to tokenize, then format based on tokens - Preserves semantic understanding - Respects string literals, comments, etc. --- #### 6. **diagnostics.rs** - ❌ MANUAL PARSING **Status:** NO lexer usage, manual brace counting **Problems:** ```rust // Lines 34-37: Manual brace counting for (line_num, line) in text.lines().enumerate() { let open_braces = line.chars().filter(|&c| c == '{').count(); let close_braces = line.chars().filter(|&c| c == '}').count(); } // Line 79: Character-by-character processing for ch in text.chars() { // Custom logic } ``` **Risk:** HIGH - Brace counting doesn't account for braces in strings: `character Alice { name: "{" }` - Doesn't respect comments: `// This { is a comment` - Could generate false diagnostics **Recommendation:** Use lexer tokens to track brace pairs accurately --- #### 7. **references.rs** - ❌ MANUAL STRING PARSING **Status:** NO lexer usage **Problems:** ```rust // Manual string parsing for word boundaries ``` **Risk:** MEDIUM - May not correctly identify symbol boundaries - Could match partial words **Recommendation:** Use lexer to identify identifiers --- #### 8. **rename.rs** - ❌ HARDCODED KEYWORDS **Status:** NO lexer usage, contains hardcoded strings **Problems:** - 2 hardcoded keyword strings - Manual symbol identification **Risk:** MEDIUM - May incorrectly rename keywords or miss valid renames **Recommendation:** Use lexer to distinguish keywords from identifiers --- #### 9. **definition.rs** - ⚠️ UNCLEAR **Status:** No obvious lexer usage, no obvious manual parsing **Assessment:** Likely uses AST-based approach (acceptable) - May rely on symbols extracted from AST - **Acceptable** if using `Document.ast` for symbol lookup **Recommendation:** Verify it's using AST, not string parsing --- #### 10. **inlay_hints.rs** - ⚠️ UNCLEAR **Status:** No lexer usage detected, no obvious parsing **Assessment:** Minimal code inspection needed - May be stub or use AST-based approach **Recommendation:** Full inspection needed --- #### 11. **symbols.rs** - ✅ LIKELY COMPLIANT **Status:** No manual parsing detected **Assessment:** Appears to extract symbols from AST - **Acceptable approach** - AST is the canonical representation - No need for lexer if working from AST --- ## Summary Table | Module | Lexer Usage | Manual Parsing | Hardcoded Keywords | Risk Level | |--------|-------------|----------------|-------------------|------------| | **completion.rs** | ✅ Yes | ❌ No | ⚠️ 8 keywords | MEDIUM | | **semantic_tokens.rs** | ✅ Yes | ❌ No | ✅ None | LOW | | **code_actions.rs** | ✅ Yes | ⚠️ 15 instances | ⚠️ 4 keywords | MEDIUM | | **hover.rs** | ❌ No | ⚠️ Extensive | ⚠️ 16 keywords | **HIGH** | | **formatting.rs** | ❌ No | ⚠️ Line-by-line | ✅ None | MEDIUM | | **diagnostics.rs** | ❌ No | ⚠️ Char counting | ✅ None | **HIGH** | | **references.rs** | ❌ No | ⚠️ Yes | ✅ None | MEDIUM | | **rename.rs** | ❌ No | ⚠️ Yes | ⚠️ 2 keywords | MEDIUM | | **definition.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW | | **inlay_hints.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW | | **symbols.rs** | N/A (AST) | ❌ No | ✅ None | LOW | --- ## Critical Findings ### 1. **hover.rs** - Highest Risk - 16 hardcoded keywords - Custom tokenization logic - **Impact:** Recent behavior tree keyword changes may have broken hover - **Fix Priority:** HIGH ### 2. **diagnostics.rs** - High Risk - Manual brace counting fails in strings/comments - Could generate false errors - **Fix Priority:** HIGH ### 3. **Hardcoded Keywords** - Maintenance Burden - **Total:** 30 hardcoded keyword strings across 4 files - **Risk:** Keywords get out of sync with lexer - **Recent Example:** Behavior tree syntax changes could have broken these --- ## Recommended Fixes ### Priority 1: hover.rs Refactoring **Estimated Time:** 2-3 hours **Changes:** 1. Add lexer import: `use crate::syntax::lexer::{Lexer, Token};` 2. Replace `extract_word_at_position()` with lexer-based token finding 3. Replace keyword match with Token enum match 4. Remove hardcoded keyword strings **Benefits:** - Automatic sync with lexer changes - Consistent tokenization with compiler - Easier maintenance **Example Code:** ```rust pub fn get_hover_info(text: &str, line: usize, character: usize) -> Option { let line_text = text.lines().nth(line)?; let lexer = Lexer::new(line_text); // Find token at character position let token_at_pos = find_token_at_position(&lexer, character)?; let content = match token_at_pos { Token::Character => "**character** - Defines a character entity...", Token::Template => "**template** - Defines a reusable field template...", // Use Token enum - stays in sync automatically }; Some(Hover { /* ... */ }) } ``` --- ### Priority 2: diagnostics.rs Refactoring **Estimated Time:** 1-2 hours **Changes:** 1. Use lexer to tokenize text 2. Count LBrace/RBrace tokens (not characters) 3. Ignore braces in strings and comments automatically **Benefits:** - Correct handling of braces in all contexts - No false positives --- ### Priority 3: Extract Keyword Definitions **Estimated Time:** 1 hour **Changes:** 1. Create `src/syntax/keywords.rs` module 2. Define keyword list from lexer Token enum 3. Use in completion, code_actions, rename **Benefits:** - Single source of truth for keywords - Easy to keep in sync --- ## Testing Recommendations After fixes, add tests for: 1. **Hover on keywords in strings** - Should NOT show hover ```rust character Alice { name: "character" } // hovering "character" in string ``` 2. **Diagnostics with braces in strings** - Should NOT count ```rust character Alice { name: "{}" } // should not report brace mismatch ``` 3. **Lexer consistency tests** - Verify LSP uses same tokens as compiler ```rust #[test] fn test_hover_uses_lexer_keywords() { // Ensure hover keywords match lexer Token enum } ``` --- ## Long-Term Recommendations ### 1. Establish Coding Standards Document that LSP modules should: - ✅ Use lexer for all tokenization - ✅ Use AST for semantic analysis - ❌ Never do manual string parsing - ❌ Never hardcode keywords ### 2. Code Review Checklist Add to PR reviews: - [ ] Does LSP code use lexer for tokenization? - [ ] Are there any hardcoded keyword strings? - [ ] Is there any manual `.split()`, `.chars()`, `.find()` parsing? ### 3. Shared Utilities Create `src/lsp/lexer_utils.rs` with: - `find_token_at_position()` - Common utility - `get_keyword_docs()` - Centralized keyword documentation - `is_keyword()` - Token-based keyword checking --- ## Impact Assessment ### Current State Risks **Behavior Tree Changes (Recent):** - Lexer/parser updated with new keywords - ❌ `hover.rs` still has old hardcoded list - ⚠️ Users may not see hover for new keywords **Future Keyword Changes:** - Any new keywords require updates in 4 separate files - Easy to miss one, causing inconsistent behavior **Bug Examples:** ```rust // diagnostics.rs bug: character Alice { description: "Contains } brace" } // ^ Reports false "unmatched brace" error // hover.rs bug: character Alice { /* hover on "character" keyword here */ } // ^ Works "In a character block..." // hover on "character" in string // ^ Also shows hover (WRONG - it's just a word in a string) ``` --- ## Conclusion **Overall Assessment:** NEEDS IMPROVEMENT - **Compliance Rate:** 27% (3/11 modules) - **Risk Level:** MEDIUM-HIGH - **Recommended Action:** Refactor hover.rs and diagnostics.rs as priority **Benefits of Fixing:** 1. Consistency with compiler behavior 2. Automatic sync with language changes 3. Reduced maintenance burden 4. Fewer bugs from edge cases **Next Steps:** 1. Create GitHub issues for each Priority 1-2 fix 2. Refactor hover.rs (highest risk) 3. Refactor diagnostics.rs (high risk) 4. Extract keyword definitions (prevents future issues) 5. Add lexer usage to code review checklist --- ## Appendix: Files Analyzed ``` Total LSP modules audited: 11 ✅ Compliant: 3 (27%) ⚠️ Partially compliant: 1 (9%) ❌ Non-compliant: 7 (64%) Files examined: - src/lsp/completion.rs - src/lsp/hover.rs - src/lsp/semantic_tokens.rs - src/lsp/inlay_hints.rs - src/lsp/definition.rs - src/lsp/formatting.rs - src/lsp/rename.rs - src/lsp/references.rs - src/lsp/code_actions.rs - src/lsp/symbols.rs - src/lsp/diagnostics.rs ```