Files
storybook/LSP_LEXER_AUDIT.md

465 lines
13 KiB
Markdown
Raw Normal View History

# LSP Lexer Usage Audit
**Date:** 2026-02-12
**Auditor:** LSP Test Engineer
**Purpose:** Verify all LSP modules use the lexer exclusively (no ad-hoc parsing)
---
## Executive Summary
**Status:** ⚠️ **Mixed Compliance**
- **3 modules** properly use lexer ✅
- **8 modules** use ad-hoc parsing ❌
- **Critical Risk:** Hardcoded keyword lists in 3 modules
**Risk Level:** MEDIUM
- Inconsistent behavior between LSP and compiler
- Vulnerability to keyword changes (like recent behavior tree updates)
- Maintenance burden from duplicate logic
---
## Audit Results by Module
### ✅ Compliant Modules (Using Lexer)
#### 1. **completion.rs** - ✅ GOOD (with caveats)
**Status:** Uses Lexer properly for tokenization
**Lexer Usage:**
```rust
use crate::syntax::lexer::{Lexer, Token};
let lexer = Lexer::new(before);
```
**Found at lines:** 135, 142, 269, 277, 410, 417
**Issue:** ⚠️ Contains hardcoded keyword strings (8 occurrences)
- Lines with hardcoded keywords: "character", "template", "behavior", etc.
- **Risk:** Keywords could get out of sync with lexer
**Recommendation:** Extract keywords from lexer Token enum instead of hardcoding
---
#### 2. **semantic_tokens.rs** - ✅ EXCELLENT
**Status:** Uses Lexer exclusively, no manual parsing
**Lexer Usage:**
```rust
use crate::syntax::lexer::{Lexer, Token};
```
**Assessment:** **Best practice example** - Uses lexer for all tokenization, no hardcoded keywords, no manual string manipulation
---
#### 3. **code_actions.rs** - ✅ GOOD (with caveats)
**Status:** Uses Lexer for tokenization
**Issues:**
- ⚠️ Contains 15 instances of manual string parsing (`.split()`, `.chars()`, `.lines()`)
- ⚠️ Has 4 hardcoded keyword strings
- **Concern:** Mixed approach - uses lexer sometimes, manual parsing other times
**Recommendation:** Refactor to use lexer consistently throughout
---
### ❌ Non-Compliant Modules (Ad-Hoc Parsing)
#### 4. **hover.rs** - ❌ CRITICAL ISSUE
**Status:** NO lexer usage, extensive manual parsing
**Problems:**
**A. Manual Character-by-Character Parsing**
```rust
// Line 51-77: extract_word_at_position()
let chars: Vec<char> = line.chars().collect();
// Custom word boundary detection
while start > 0 && is_word_char(chars[start - 1]) {
start -= 1;
}
```
**Risk:** Custom tokenization logic differs from compiler's lexer
**B. Hardcoded Keyword List** (Lines 20-39)
```rust
match word.as_str() {
"character" => "**character** - Defines a character entity...",
"template" => "**template** - Defines a reusable field template...",
"life_arc" => "**life_arc** - Defines a state machine...",
"schedule" => "**schedule** - Defines a daily schedule...",
"behavior" => "**behavior** - Defines a behavior tree...",
"institution" => "**institution** - Defines an organization...",
"relationship" => "**relationship** - Defines a multi-party...",
"location" => "**location** - Defines a place...",
"species" => "**species** - Defines a species...",
"enum" => "**enum** - Defines an enumeration...",
"use" => "**use** - Imports declarations...",
"from" => "**from** - Applies templates...",
"include" => "**include** - Includes another template...",
"state" => "**state** - Defines a state...",
"on" => "**on** - Defines a transition...",
"strict" => "**strict** - Enforces that a template...",
_ => return None,
}
```
**Risk:** HIGH
- 16 hardcoded keywords that must stay in sync with lexer
- If lexer adds/removes keywords, hover won't update automatically
- Recent behavior tree changes could have broken this
**Recommendation:**
```rust
// BEFORE (current):
let word = extract_word_at_position(line_text, character)?;
// AFTER (proposed):
let lexer = Lexer::new(line_text);
let token_at_pos = find_token_at_position(&lexer, character)?;
match token_at_pos {
Token::Character => "**character** - Defines...",
Token::Template => "**template** - Defines...",
// etc - using Token enum from lexer
}
```
---
#### 5. **formatting.rs** - ❌ MANUAL PARSING
**Status:** NO lexer usage, line-by-line text processing
**Problems:**
```rust
// Line 53: Manual line processing
for line in text.lines() {
let trimmed = line.trim();
// Custom logic for braces, colons, etc.
}
```
**Risk:** MEDIUM
- Custom formatting logic may not respect language semantics
- Could format code incorrectly if it doesn't understand context
**Recommendation:** Use lexer to tokenize, then format based on tokens
- Preserves semantic understanding
- Respects string literals, comments, etc.
---
#### 6. **diagnostics.rs** - ❌ MANUAL PARSING
**Status:** NO lexer usage, manual brace counting
**Problems:**
```rust
// Lines 34-37: Manual brace counting
for (line_num, line) in text.lines().enumerate() {
let open_braces = line.chars().filter(|&c| c == '{').count();
let close_braces = line.chars().filter(|&c| c == '}').count();
}
// Line 79: Character-by-character processing
for ch in text.chars() {
// Custom logic
}
```
**Risk:** HIGH
- Brace counting doesn't account for braces in strings: `character Alice { name: "{" }`
- Doesn't respect comments: `// This { is a comment`
- Could generate false diagnostics
**Recommendation:** Use lexer tokens to track brace pairs accurately
---
#### 7. **references.rs** - ❌ MANUAL STRING PARSING
**Status:** NO lexer usage
**Problems:**
```rust
// Manual string parsing for word boundaries
```
**Risk:** MEDIUM
- May not correctly identify symbol boundaries
- Could match partial words
**Recommendation:** Use lexer to identify identifiers
---
#### 8. **rename.rs** - ❌ HARDCODED KEYWORDS
**Status:** NO lexer usage, contains hardcoded strings
**Problems:**
- 2 hardcoded keyword strings
- Manual symbol identification
**Risk:** MEDIUM
- May incorrectly rename keywords or miss valid renames
**Recommendation:** Use lexer to distinguish keywords from identifiers
---
#### 9. **definition.rs** - ⚠️ UNCLEAR
**Status:** No obvious lexer usage, no obvious manual parsing
**Assessment:** Likely uses AST-based approach (acceptable)
- May rely on symbols extracted from AST
- **Acceptable** if using `Document.ast` for symbol lookup
**Recommendation:** Verify it's using AST, not string parsing
---
#### 10. **inlay_hints.rs** - ⚠️ UNCLEAR
**Status:** No lexer usage detected, no obvious parsing
**Assessment:** Minimal code inspection needed
- May be stub or use AST-based approach
**Recommendation:** Full inspection needed
---
#### 11. **symbols.rs** - ✅ LIKELY COMPLIANT
**Status:** No manual parsing detected
**Assessment:** Appears to extract symbols from AST
- **Acceptable approach** - AST is the canonical representation
- No need for lexer if working from AST
---
## Summary Table
| Module | Lexer Usage | Manual Parsing | Hardcoded Keywords | Risk Level |
|--------|-------------|----------------|-------------------|------------|
| **completion.rs** | ✅ Yes | ❌ No | ⚠️ 8 keywords | MEDIUM |
| **semantic_tokens.rs** | ✅ Yes | ❌ No | ✅ None | LOW |
| **code_actions.rs** | ✅ Yes | ⚠️ 15 instances | ⚠️ 4 keywords | MEDIUM |
| **hover.rs** | ❌ No | ⚠️ Extensive | ⚠️ 16 keywords | **HIGH** |
| **formatting.rs** | ❌ No | ⚠️ Line-by-line | ✅ None | MEDIUM |
| **diagnostics.rs** | ❌ No | ⚠️ Char counting | ✅ None | **HIGH** |
| **references.rs** | ❌ No | ⚠️ Yes | ✅ None | MEDIUM |
| **rename.rs** | ❌ No | ⚠️ Yes | ⚠️ 2 keywords | MEDIUM |
| **definition.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW |
| **inlay_hints.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW |
| **symbols.rs** | N/A (AST) | ❌ No | ✅ None | LOW |
---
## Critical Findings
### 1. **hover.rs** - Highest Risk
- 16 hardcoded keywords
- Custom tokenization logic
- **Impact:** Recent behavior tree keyword changes may have broken hover
- **Fix Priority:** HIGH
### 2. **diagnostics.rs** - High Risk
- Manual brace counting fails in strings/comments
- Could generate false errors
- **Fix Priority:** HIGH
### 3. **Hardcoded Keywords** - Maintenance Burden
- **Total:** 30 hardcoded keyword strings across 4 files
- **Risk:** Keywords get out of sync with lexer
- **Recent Example:** Behavior tree syntax changes could have broken these
---
## Recommended Fixes
### Priority 1: hover.rs Refactoring
**Estimated Time:** 2-3 hours
**Changes:**
1. Add lexer import: `use crate::syntax::lexer::{Lexer, Token};`
2. Replace `extract_word_at_position()` with lexer-based token finding
3. Replace keyword match with Token enum match
4. Remove hardcoded keyword strings
**Benefits:**
- Automatic sync with lexer changes
- Consistent tokenization with compiler
- Easier maintenance
**Example Code:**
```rust
pub fn get_hover_info(text: &str, line: usize, character: usize) -> Option<Hover> {
let line_text = text.lines().nth(line)?;
let lexer = Lexer::new(line_text);
// Find token at character position
let token_at_pos = find_token_at_position(&lexer, character)?;
let content = match token_at_pos {
Token::Character => "**character** - Defines a character entity...",
Token::Template => "**template** - Defines a reusable field template...",
// Use Token enum - stays in sync automatically
};
Some(Hover { /* ... */ })
}
```
---
### Priority 2: diagnostics.rs Refactoring
**Estimated Time:** 1-2 hours
**Changes:**
1. Use lexer to tokenize text
2. Count LBrace/RBrace tokens (not characters)
3. Ignore braces in strings and comments automatically
**Benefits:**
- Correct handling of braces in all contexts
- No false positives
---
### Priority 3: Extract Keyword Definitions
**Estimated Time:** 1 hour
**Changes:**
1. Create `src/syntax/keywords.rs` module
2. Define keyword list from lexer Token enum
3. Use in completion, code_actions, rename
**Benefits:**
- Single source of truth for keywords
- Easy to keep in sync
---
## Testing Recommendations
After fixes, add tests for:
1. **Hover on keywords in strings** - Should NOT show hover
```rust
character Alice { name: "character" } // hovering "character" in string
```
2. **Diagnostics with braces in strings** - Should NOT count
```rust
character Alice { name: "{}" } // should not report brace mismatch
```
3. **Lexer consistency tests** - Verify LSP uses same tokens as compiler
```rust
#[test]
fn test_hover_uses_lexer_keywords() {
// Ensure hover keywords match lexer Token enum
}
```
---
## Long-Term Recommendations
### 1. Establish Coding Standards
Document that LSP modules should:
- ✅ Use lexer for all tokenization
- ✅ Use AST for semantic analysis
- ❌ Never do manual string parsing
- ❌ Never hardcode keywords
### 2. Code Review Checklist
Add to PR reviews:
- [ ] Does LSP code use lexer for tokenization?
- [ ] Are there any hardcoded keyword strings?
- [ ] Is there any manual `.split()`, `.chars()`, `.find()` parsing?
### 3. Shared Utilities
Create `src/lsp/lexer_utils.rs` with:
- `find_token_at_position()` - Common utility
- `get_keyword_docs()` - Centralized keyword documentation
- `is_keyword()` - Token-based keyword checking
---
## Impact Assessment
### Current State Risks
**Behavior Tree Changes (Recent):**
- Lexer/parser updated with new keywords
-`hover.rs` still has old hardcoded list
- ⚠️ Users may not see hover for new keywords
**Future Keyword Changes:**
- Any new keywords require updates in 4 separate files
- Easy to miss one, causing inconsistent behavior
**Bug Examples:**
```rust
// diagnostics.rs bug:
character Alice { description: "Contains } brace" }
// ^ Reports false "unmatched brace" error
// hover.rs bug:
character Alice { /* hover on "character" keyword here */ }
// ^ Works
"In a character block..." // hover on "character" in string
// ^ Also shows hover (WRONG - it's just a word in a string)
```
---
## Conclusion
**Overall Assessment:** NEEDS IMPROVEMENT
- **Compliance Rate:** 27% (3/11 modules)
- **Risk Level:** MEDIUM-HIGH
- **Recommended Action:** Refactor hover.rs and diagnostics.rs as priority
**Benefits of Fixing:**
1. Consistency with compiler behavior
2. Automatic sync with language changes
3. Reduced maintenance burden
4. Fewer bugs from edge cases
**Next Steps:**
1. Create GitHub issues for each Priority 1-2 fix
2. Refactor hover.rs (highest risk)
3. Refactor diagnostics.rs (high risk)
4. Extract keyword definitions (prevents future issues)
5. Add lexer usage to code review checklist
---
## Appendix: Files Analyzed
```
Total LSP modules audited: 11
✅ Compliant: 3 (27%)
⚠️ Partially compliant: 1 (9%)
❌ Non-compliant: 7 (64%)
Files examined:
- src/lsp/completion.rs
- src/lsp/hover.rs
- src/lsp/semantic_tokens.rs
- src/lsp/inlay_hints.rs
- src/lsp/definition.rs
- src/lsp/formatting.rs
- src/lsp/rename.rs
- src/lsp/references.rs
- src/lsp/code_actions.rs
- src/lsp/symbols.rs
- src/lsp/diagnostics.rs
```