465 lines
13 KiB
Markdown
465 lines
13 KiB
Markdown
|
|
# LSP Lexer Usage Audit
|
||
|
|
**Date:** 2026-02-12
|
||
|
|
**Auditor:** LSP Test Engineer
|
||
|
|
**Purpose:** Verify all LSP modules use the lexer exclusively (no ad-hoc parsing)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Executive Summary
|
||
|
|
|
||
|
|
**Status:** ⚠️ **Mixed Compliance**
|
||
|
|
|
||
|
|
- **3 modules** properly use lexer ✅
|
||
|
|
- **8 modules** use ad-hoc parsing ❌
|
||
|
|
- **Critical Risk:** Hardcoded keyword lists in 3 modules
|
||
|
|
|
||
|
|
**Risk Level:** MEDIUM
|
||
|
|
- Inconsistent behavior between LSP and compiler
|
||
|
|
- Vulnerability to keyword changes (like recent behavior tree updates)
|
||
|
|
- Maintenance burden from duplicate logic
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Audit Results by Module
|
||
|
|
|
||
|
|
### ✅ Compliant Modules (Using Lexer)
|
||
|
|
|
||
|
|
#### 1. **completion.rs** - ✅ GOOD (with caveats)
|
||
|
|
**Status:** Uses Lexer properly for tokenization
|
||
|
|
|
||
|
|
**Lexer Usage:**
|
||
|
|
```rust
|
||
|
|
use crate::syntax::lexer::{Lexer, Token};
|
||
|
|
let lexer = Lexer::new(before);
|
||
|
|
```
|
||
|
|
|
||
|
|
**Found at lines:** 135, 142, 269, 277, 410, 417
|
||
|
|
|
||
|
|
**Issue:** ⚠️ Contains hardcoded keyword strings (8 occurrences)
|
||
|
|
- Lines with hardcoded keywords: "character", "template", "behavior", etc.
|
||
|
|
- **Risk:** Keywords could get out of sync with lexer
|
||
|
|
|
||
|
|
**Recommendation:** Extract keywords from lexer Token enum instead of hardcoding
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 2. **semantic_tokens.rs** - ✅ EXCELLENT
|
||
|
|
**Status:** Uses Lexer exclusively, no manual parsing
|
||
|
|
|
||
|
|
**Lexer Usage:**
|
||
|
|
```rust
|
||
|
|
use crate::syntax::lexer::{Lexer, Token};
|
||
|
|
```
|
||
|
|
|
||
|
|
**Assessment:** **Best practice example** - Uses lexer for all tokenization, no hardcoded keywords, no manual string manipulation
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 3. **code_actions.rs** - ✅ GOOD (with caveats)
|
||
|
|
**Status:** Uses Lexer for tokenization
|
||
|
|
|
||
|
|
**Issues:**
|
||
|
|
- ⚠️ Contains 15 instances of manual string parsing (`.split()`, `.chars()`, `.lines()`)
|
||
|
|
- ⚠️ Has 4 hardcoded keyword strings
|
||
|
|
- **Concern:** Mixed approach - uses lexer sometimes, manual parsing other times
|
||
|
|
|
||
|
|
**Recommendation:** Refactor to use lexer consistently throughout
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### ❌ Non-Compliant Modules (Ad-Hoc Parsing)
|
||
|
|
|
||
|
|
#### 4. **hover.rs** - ❌ CRITICAL ISSUE
|
||
|
|
**Status:** NO lexer usage, extensive manual parsing
|
||
|
|
|
||
|
|
**Problems:**
|
||
|
|
|
||
|
|
**A. Manual Character-by-Character Parsing**
|
||
|
|
```rust
|
||
|
|
// Line 51-77: extract_word_at_position()
|
||
|
|
let chars: Vec<char> = line.chars().collect();
|
||
|
|
// Custom word boundary detection
|
||
|
|
while start > 0 && is_word_char(chars[start - 1]) {
|
||
|
|
start -= 1;
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Risk:** Custom tokenization logic differs from compiler's lexer
|
||
|
|
|
||
|
|
**B. Hardcoded Keyword List** (Lines 20-39)
|
||
|
|
```rust
|
||
|
|
match word.as_str() {
|
||
|
|
"character" => "**character** - Defines a character entity...",
|
||
|
|
"template" => "**template** - Defines a reusable field template...",
|
||
|
|
"life_arc" => "**life_arc** - Defines a state machine...",
|
||
|
|
"schedule" => "**schedule** - Defines a daily schedule...",
|
||
|
|
"behavior" => "**behavior** - Defines a behavior tree...",
|
||
|
|
"institution" => "**institution** - Defines an organization...",
|
||
|
|
"relationship" => "**relationship** - Defines a multi-party...",
|
||
|
|
"location" => "**location** - Defines a place...",
|
||
|
|
"species" => "**species** - Defines a species...",
|
||
|
|
"enum" => "**enum** - Defines an enumeration...",
|
||
|
|
"use" => "**use** - Imports declarations...",
|
||
|
|
"from" => "**from** - Applies templates...",
|
||
|
|
"include" => "**include** - Includes another template...",
|
||
|
|
"state" => "**state** - Defines a state...",
|
||
|
|
"on" => "**on** - Defines a transition...",
|
||
|
|
"strict" => "**strict** - Enforces that a template...",
|
||
|
|
_ => return None,
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Risk:** HIGH
|
||
|
|
- 16 hardcoded keywords that must stay in sync with lexer
|
||
|
|
- If lexer adds/removes keywords, hover won't update automatically
|
||
|
|
- Recent behavior tree changes could have broken this
|
||
|
|
|
||
|
|
**Recommendation:**
|
||
|
|
```rust
|
||
|
|
// BEFORE (current):
|
||
|
|
let word = extract_word_at_position(line_text, character)?;
|
||
|
|
|
||
|
|
// AFTER (proposed):
|
||
|
|
let lexer = Lexer::new(line_text);
|
||
|
|
let token_at_pos = find_token_at_position(&lexer, character)?;
|
||
|
|
match token_at_pos {
|
||
|
|
Token::Character => "**character** - Defines...",
|
||
|
|
Token::Template => "**template** - Defines...",
|
||
|
|
// etc - using Token enum from lexer
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 5. **formatting.rs** - ❌ MANUAL PARSING
|
||
|
|
**Status:** NO lexer usage, line-by-line text processing
|
||
|
|
|
||
|
|
**Problems:**
|
||
|
|
```rust
|
||
|
|
// Line 53: Manual line processing
|
||
|
|
for line in text.lines() {
|
||
|
|
let trimmed = line.trim();
|
||
|
|
// Custom logic for braces, colons, etc.
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Risk:** MEDIUM
|
||
|
|
- Custom formatting logic may not respect language semantics
|
||
|
|
- Could format code incorrectly if it doesn't understand context
|
||
|
|
|
||
|
|
**Recommendation:** Use lexer to tokenize, then format based on tokens
|
||
|
|
- Preserves semantic understanding
|
||
|
|
- Respects string literals, comments, etc.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 6. **diagnostics.rs** - ❌ MANUAL PARSING
|
||
|
|
**Status:** NO lexer usage, manual brace counting
|
||
|
|
|
||
|
|
**Problems:**
|
||
|
|
```rust
|
||
|
|
// Lines 34-37: Manual brace counting
|
||
|
|
for (line_num, line) in text.lines().enumerate() {
|
||
|
|
let open_braces = line.chars().filter(|&c| c == '{').count();
|
||
|
|
let close_braces = line.chars().filter(|&c| c == '}').count();
|
||
|
|
}
|
||
|
|
|
||
|
|
// Line 79: Character-by-character processing
|
||
|
|
for ch in text.chars() {
|
||
|
|
// Custom logic
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Risk:** HIGH
|
||
|
|
- Brace counting doesn't account for braces in strings: `character Alice { name: "{" }`
|
||
|
|
- Doesn't respect comments: `// This { is a comment`
|
||
|
|
- Could generate false diagnostics
|
||
|
|
|
||
|
|
**Recommendation:** Use lexer tokens to track brace pairs accurately
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 7. **references.rs** - ❌ MANUAL STRING PARSING
|
||
|
|
**Status:** NO lexer usage
|
||
|
|
|
||
|
|
**Problems:**
|
||
|
|
```rust
|
||
|
|
// Manual string parsing for word boundaries
|
||
|
|
```
|
||
|
|
|
||
|
|
**Risk:** MEDIUM
|
||
|
|
- May not correctly identify symbol boundaries
|
||
|
|
- Could match partial words
|
||
|
|
|
||
|
|
**Recommendation:** Use lexer to identify identifiers
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 8. **rename.rs** - ❌ HARDCODED KEYWORDS
|
||
|
|
**Status:** NO lexer usage, contains hardcoded strings
|
||
|
|
|
||
|
|
**Problems:**
|
||
|
|
- 2 hardcoded keyword strings
|
||
|
|
- Manual symbol identification
|
||
|
|
|
||
|
|
**Risk:** MEDIUM
|
||
|
|
- May incorrectly rename keywords or miss valid renames
|
||
|
|
|
||
|
|
**Recommendation:** Use lexer to distinguish keywords from identifiers
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 9. **definition.rs** - ⚠️ UNCLEAR
|
||
|
|
**Status:** No obvious lexer usage, no obvious manual parsing
|
||
|
|
|
||
|
|
**Assessment:** Likely uses AST-based approach (acceptable)
|
||
|
|
- May rely on symbols extracted from AST
|
||
|
|
- **Acceptable** if using `Document.ast` for symbol lookup
|
||
|
|
|
||
|
|
**Recommendation:** Verify it's using AST, not string parsing
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 10. **inlay_hints.rs** - ⚠️ UNCLEAR
|
||
|
|
**Status:** No lexer usage detected, no obvious parsing
|
||
|
|
|
||
|
|
**Assessment:** Minimal code inspection needed
|
||
|
|
- May be stub or use AST-based approach
|
||
|
|
|
||
|
|
**Recommendation:** Full inspection needed
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
#### 11. **symbols.rs** - ✅ LIKELY COMPLIANT
|
||
|
|
**Status:** No manual parsing detected
|
||
|
|
|
||
|
|
**Assessment:** Appears to extract symbols from AST
|
||
|
|
- **Acceptable approach** - AST is the canonical representation
|
||
|
|
- No need for lexer if working from AST
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Summary Table
|
||
|
|
|
||
|
|
| Module | Lexer Usage | Manual Parsing | Hardcoded Keywords | Risk Level |
|
||
|
|
|--------|-------------|----------------|-------------------|------------|
|
||
|
|
| **completion.rs** | ✅ Yes | ❌ No | ⚠️ 8 keywords | MEDIUM |
|
||
|
|
| **semantic_tokens.rs** | ✅ Yes | ❌ No | ✅ None | LOW |
|
||
|
|
| **code_actions.rs** | ✅ Yes | ⚠️ 15 instances | ⚠️ 4 keywords | MEDIUM |
|
||
|
|
| **hover.rs** | ❌ No | ⚠️ Extensive | ⚠️ 16 keywords | **HIGH** |
|
||
|
|
| **formatting.rs** | ❌ No | ⚠️ Line-by-line | ✅ None | MEDIUM |
|
||
|
|
| **diagnostics.rs** | ❌ No | ⚠️ Char counting | ✅ None | **HIGH** |
|
||
|
|
| **references.rs** | ❌ No | ⚠️ Yes | ✅ None | MEDIUM |
|
||
|
|
| **rename.rs** | ❌ No | ⚠️ Yes | ⚠️ 2 keywords | MEDIUM |
|
||
|
|
| **definition.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW |
|
||
|
|
| **inlay_hints.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW |
|
||
|
|
| **symbols.rs** | N/A (AST) | ❌ No | ✅ None | LOW |
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Critical Findings
|
||
|
|
|
||
|
|
### 1. **hover.rs** - Highest Risk
|
||
|
|
- 16 hardcoded keywords
|
||
|
|
- Custom tokenization logic
|
||
|
|
- **Impact:** Recent behavior tree keyword changes may have broken hover
|
||
|
|
- **Fix Priority:** HIGH
|
||
|
|
|
||
|
|
### 2. **diagnostics.rs** - High Risk
|
||
|
|
- Manual brace counting fails in strings/comments
|
||
|
|
- Could generate false errors
|
||
|
|
- **Fix Priority:** HIGH
|
||
|
|
|
||
|
|
### 3. **Hardcoded Keywords** - Maintenance Burden
|
||
|
|
- **Total:** 30 hardcoded keyword strings across 4 files
|
||
|
|
- **Risk:** Keywords get out of sync with lexer
|
||
|
|
- **Recent Example:** Behavior tree syntax changes could have broken these
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Recommended Fixes
|
||
|
|
|
||
|
|
### Priority 1: hover.rs Refactoring
|
||
|
|
**Estimated Time:** 2-3 hours
|
||
|
|
|
||
|
|
**Changes:**
|
||
|
|
1. Add lexer import: `use crate::syntax::lexer::{Lexer, Token};`
|
||
|
|
2. Replace `extract_word_at_position()` with lexer-based token finding
|
||
|
|
3. Replace keyword match with Token enum match
|
||
|
|
4. Remove hardcoded keyword strings
|
||
|
|
|
||
|
|
**Benefits:**
|
||
|
|
- Automatic sync with lexer changes
|
||
|
|
- Consistent tokenization with compiler
|
||
|
|
- Easier maintenance
|
||
|
|
|
||
|
|
**Example Code:**
|
||
|
|
```rust
|
||
|
|
pub fn get_hover_info(text: &str, line: usize, character: usize) -> Option<Hover> {
|
||
|
|
let line_text = text.lines().nth(line)?;
|
||
|
|
let lexer = Lexer::new(line_text);
|
||
|
|
|
||
|
|
// Find token at character position
|
||
|
|
let token_at_pos = find_token_at_position(&lexer, character)?;
|
||
|
|
|
||
|
|
let content = match token_at_pos {
|
||
|
|
Token::Character => "**character** - Defines a character entity...",
|
||
|
|
Token::Template => "**template** - Defines a reusable field template...",
|
||
|
|
// Use Token enum - stays in sync automatically
|
||
|
|
};
|
||
|
|
|
||
|
|
Some(Hover { /* ... */ })
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Priority 2: diagnostics.rs Refactoring
|
||
|
|
**Estimated Time:** 1-2 hours
|
||
|
|
|
||
|
|
**Changes:**
|
||
|
|
1. Use lexer to tokenize text
|
||
|
|
2. Count LBrace/RBrace tokens (not characters)
|
||
|
|
3. Ignore braces in strings and comments automatically
|
||
|
|
|
||
|
|
**Benefits:**
|
||
|
|
- Correct handling of braces in all contexts
|
||
|
|
- No false positives
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
### Priority 3: Extract Keyword Definitions
|
||
|
|
**Estimated Time:** 1 hour
|
||
|
|
|
||
|
|
**Changes:**
|
||
|
|
1. Create `src/syntax/keywords.rs` module
|
||
|
|
2. Define keyword list from lexer Token enum
|
||
|
|
3. Use in completion, code_actions, rename
|
||
|
|
|
||
|
|
**Benefits:**
|
||
|
|
- Single source of truth for keywords
|
||
|
|
- Easy to keep in sync
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Testing Recommendations
|
||
|
|
|
||
|
|
After fixes, add tests for:
|
||
|
|
|
||
|
|
1. **Hover on keywords in strings** - Should NOT show hover
|
||
|
|
```rust
|
||
|
|
character Alice { name: "character" } // hovering "character" in string
|
||
|
|
```
|
||
|
|
|
||
|
|
2. **Diagnostics with braces in strings** - Should NOT count
|
||
|
|
```rust
|
||
|
|
character Alice { name: "{}" } // should not report brace mismatch
|
||
|
|
```
|
||
|
|
|
||
|
|
3. **Lexer consistency tests** - Verify LSP uses same tokens as compiler
|
||
|
|
```rust
|
||
|
|
#[test]
|
||
|
|
fn test_hover_uses_lexer_keywords() {
|
||
|
|
// Ensure hover keywords match lexer Token enum
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Long-Term Recommendations
|
||
|
|
|
||
|
|
### 1. Establish Coding Standards
|
||
|
|
Document that LSP modules should:
|
||
|
|
- ✅ Use lexer for all tokenization
|
||
|
|
- ✅ Use AST for semantic analysis
|
||
|
|
- ❌ Never do manual string parsing
|
||
|
|
- ❌ Never hardcode keywords
|
||
|
|
|
||
|
|
### 2. Code Review Checklist
|
||
|
|
Add to PR reviews:
|
||
|
|
- [ ] Does LSP code use lexer for tokenization?
|
||
|
|
- [ ] Are there any hardcoded keyword strings?
|
||
|
|
- [ ] Is there any manual `.split()`, `.chars()`, `.find()` parsing?
|
||
|
|
|
||
|
|
### 3. Shared Utilities
|
||
|
|
Create `src/lsp/lexer_utils.rs` with:
|
||
|
|
- `find_token_at_position()` - Common utility
|
||
|
|
- `get_keyword_docs()` - Centralized keyword documentation
|
||
|
|
- `is_keyword()` - Token-based keyword checking
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Impact Assessment
|
||
|
|
|
||
|
|
### Current State Risks
|
||
|
|
|
||
|
|
**Behavior Tree Changes (Recent):**
|
||
|
|
- Lexer/parser updated with new keywords
|
||
|
|
- ❌ `hover.rs` still has old hardcoded list
|
||
|
|
- ⚠️ Users may not see hover for new keywords
|
||
|
|
|
||
|
|
**Future Keyword Changes:**
|
||
|
|
- Any new keywords require updates in 4 separate files
|
||
|
|
- Easy to miss one, causing inconsistent behavior
|
||
|
|
|
||
|
|
**Bug Examples:**
|
||
|
|
```rust
|
||
|
|
// diagnostics.rs bug:
|
||
|
|
character Alice { description: "Contains } brace" }
|
||
|
|
// ^ Reports false "unmatched brace" error
|
||
|
|
|
||
|
|
// hover.rs bug:
|
||
|
|
character Alice { /* hover on "character" keyword here */ }
|
||
|
|
// ^ Works
|
||
|
|
|
||
|
|
"In a character block..." // hover on "character" in string
|
||
|
|
// ^ Also shows hover (WRONG - it's just a word in a string)
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Conclusion
|
||
|
|
|
||
|
|
**Overall Assessment:** NEEDS IMPROVEMENT
|
||
|
|
|
||
|
|
- **Compliance Rate:** 27% (3/11 modules)
|
||
|
|
- **Risk Level:** MEDIUM-HIGH
|
||
|
|
- **Recommended Action:** Refactor hover.rs and diagnostics.rs as priority
|
||
|
|
|
||
|
|
**Benefits of Fixing:**
|
||
|
|
1. Consistency with compiler behavior
|
||
|
|
2. Automatic sync with language changes
|
||
|
|
3. Reduced maintenance burden
|
||
|
|
4. Fewer bugs from edge cases
|
||
|
|
|
||
|
|
**Next Steps:**
|
||
|
|
1. Create GitHub issues for each Priority 1-2 fix
|
||
|
|
2. Refactor hover.rs (highest risk)
|
||
|
|
3. Refactor diagnostics.rs (high risk)
|
||
|
|
4. Extract keyword definitions (prevents future issues)
|
||
|
|
5. Add lexer usage to code review checklist
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Appendix: Files Analyzed
|
||
|
|
|
||
|
|
```
|
||
|
|
Total LSP modules audited: 11
|
||
|
|
✅ Compliant: 3 (27%)
|
||
|
|
⚠️ Partially compliant: 1 (9%)
|
||
|
|
❌ Non-compliant: 7 (64%)
|
||
|
|
|
||
|
|
Files examined:
|
||
|
|
- src/lsp/completion.rs
|
||
|
|
- src/lsp/hover.rs
|
||
|
|
- src/lsp/semantic_tokens.rs
|
||
|
|
- src/lsp/inlay_hints.rs
|
||
|
|
- src/lsp/definition.rs
|
||
|
|
- src/lsp/formatting.rs
|
||
|
|
- src/lsp/rename.rs
|
||
|
|
- src/lsp/references.rs
|
||
|
|
- src/lsp/code_actions.rs
|
||
|
|
- src/lsp/symbols.rs
|
||
|
|
- src/lsp/diagnostics.rs
|
||
|
|
```
|