# LSP Lexer Usage Audit
**Date:** 2026-02-12
**Auditor:** LSP Test Engineer
**Purpose:** Verify all LSP modules use the lexer exclusively (no ad-hoc parsing)

---

## Executive Summary

**Status:** ⚠️ **Mixed Compliance**

- **3 modules** properly use lexer ✅
- **8 modules** use ad-hoc parsing ❌
- **Critical Risk:** Hardcoded keyword lists in 3 modules

**Risk Level:** MEDIUM
- Inconsistent behavior between LSP and compiler
- Vulnerability to keyword changes (like recent behavior tree updates)
- Maintenance burden from duplicate logic

---

## Audit Results by Module

### ✅ Compliant Modules (Using Lexer)

#### 1. **completion.rs** - ✅ GOOD (with caveats)
**Status:** Uses Lexer properly for tokenization

**Lexer Usage:**
```rust
use crate::syntax::lexer::{Lexer, Token};
let lexer = Lexer::new(before);
```

**Found at lines:** 135, 142, 269, 277, 410, 417

**Issue:** ⚠️ Contains hardcoded keyword strings (8 occurrences)
- Lines with hardcoded keywords: "character", "template", "behavior", etc.
- **Risk:** Keywords could get out of sync with lexer

**Recommendation:** Extract keywords from lexer Token enum instead of hardcoding

---

#### 2. **semantic_tokens.rs** - ✅ EXCELLENT
**Status:** Uses Lexer exclusively, no manual parsing

**Lexer Usage:**
```rust
use crate::syntax::lexer::{Lexer, Token};
```

**Assessment:** **Best practice example** - Uses lexer for all tokenization, no hardcoded keywords, no manual string manipulation

---

#### 3. **code_actions.rs** - ✅ GOOD (with caveats)
**Status:** Uses Lexer for tokenization

**Issues:**
- ⚠️ Contains 15 instances of manual string parsing (`.split()`, `.chars()`, `.lines()`)
- ⚠️ Has 4 hardcoded keyword strings
- **Concern:** Mixed approach - uses lexer sometimes, manual parsing other times

**Recommendation:** Refactor to use lexer consistently throughout

---

### ❌ Non-Compliant Modules (Ad-Hoc Parsing)

#### 4. **hover.rs** - ❌ CRITICAL ISSUE
**Status:** NO lexer usage, extensive manual parsing

**Problems:**

**A. Manual Character-by-Character Parsing**
```rust
// Line 51-77: extract_word_at_position()
let chars: Vec<char> = line.chars().collect();
// Custom word boundary detection
while start > 0 && is_word_char(chars[start - 1]) {
    start -= 1;
}
```

**Risk:** Custom tokenization logic differs from compiler's lexer

**B. Hardcoded Keyword List** (Lines 20-39)
```rust
match word.as_str() {
    "character" => "**character** - Defines a character entity...",
    "template" => "**template** - Defines a reusable field template...",
    "life_arc" => "**life_arc** - Defines a state machine...",
    "schedule" => "**schedule** - Defines a daily schedule...",
    "behavior" => "**behavior** - Defines a behavior tree...",
    "institution" => "**institution** - Defines an organization...",
    "relationship" => "**relationship** - Defines a multi-party...",
    "location" => "**location** - Defines a place...",
    "species" => "**species** - Defines a species...",
    "enum" => "**enum** - Defines an enumeration...",
    "use" => "**use** - Imports declarations...",
    "from" => "**from** - Applies templates...",
    "include" => "**include** - Includes another template...",
    "state" => "**state** - Defines a state...",
    "on" => "**on** - Defines a transition...",
    "strict" => "**strict** - Enforces that a template...",
    _ => return None,
}
```

**Risk:** HIGH
- 16 hardcoded keywords that must stay in sync with lexer
- If lexer adds/removes keywords, hover won't update automatically
- Recent behavior tree changes could have broken this

**Recommendation:**
```rust
// BEFORE (current):
let word = extract_word_at_position(line_text, character)?;

// AFTER (proposed):
let lexer = Lexer::new(line_text);
let token_at_pos = find_token_at_position(&lexer, character)?;
match token_at_pos {
    Token::Character => "**character** - Defines...",
    Token::Template => "**template** - Defines...",
    // etc - using Token enum from lexer
}
```

---

#### 5. **formatting.rs** - ❌ MANUAL PARSING
**Status:** NO lexer usage, line-by-line text processing

**Problems:**
```rust
// Line 53: Manual line processing
for line in text.lines() {
    let trimmed = line.trim();
    // Custom logic for braces, colons, etc.
}
```

**Risk:** MEDIUM
- Custom formatting logic may not respect language semantics
- Could format code incorrectly if it doesn't understand context

**Recommendation:** Use lexer to tokenize, then format based on tokens
- Preserves semantic understanding
- Respects string literals, comments, etc.

---

#### 6. **diagnostics.rs** - ❌ MANUAL PARSING
**Status:** NO lexer usage, manual brace counting

**Problems:**
```rust
// Lines 34-37: Manual brace counting
for (line_num, line) in text.lines().enumerate() {
    let open_braces = line.chars().filter(|&c| c == '{').count();
    let close_braces = line.chars().filter(|&c| c == '}').count();
}

// Line 79: Character-by-character processing
for ch in text.chars() {
    // Custom logic
}
```

**Risk:** HIGH
- Brace counting doesn't account for braces in strings: `character Alice { name: "{" }`
- Doesn't respect comments: `// This { is a comment`
- Could generate false diagnostics

**Recommendation:** Use lexer tokens to track brace pairs accurately

---

#### 7. **references.rs** - ❌ MANUAL STRING PARSING
**Status:** NO lexer usage

**Problems:**
```rust
// Manual string parsing for word boundaries
```

**Risk:** MEDIUM
- May not correctly identify symbol boundaries
- Could match partial words

**Recommendation:** Use lexer to identify identifiers

---

#### 8. **rename.rs** - ❌ HARDCODED KEYWORDS
**Status:** NO lexer usage, contains hardcoded strings

**Problems:**
- 2 hardcoded keyword strings
- Manual symbol identification

**Risk:** MEDIUM
- May incorrectly rename keywords or miss valid renames

**Recommendation:** Use lexer to distinguish keywords from identifiers

---

#### 9. **definition.rs** - ⚠️ UNCLEAR
**Status:** No obvious lexer usage, no obvious manual parsing

**Assessment:** Likely uses AST-based approach (acceptable)
- May rely on symbols extracted from AST
- **Acceptable** if using `Document.ast` for symbol lookup

**Recommendation:** Verify it's using AST, not string parsing

---

#### 10. **inlay_hints.rs** - ⚠️ UNCLEAR
**Status:** No lexer usage detected, no obvious parsing

**Assessment:** Minimal code inspection needed
- May be stub or use AST-based approach

**Recommendation:** Full inspection needed

---

#### 11. **symbols.rs** - ✅ LIKELY COMPLIANT
**Status:** No manual parsing detected

**Assessment:** Appears to extract symbols from AST
- **Acceptable approach** - AST is the canonical representation
- No need for lexer if working from AST

---

## Summary Table

| Module | Lexer Usage | Manual Parsing | Hardcoded Keywords | Risk Level |
|--------|-------------|----------------|-------------------|------------|
| **completion.rs** | ✅ Yes | ❌ No | ⚠️ 8 keywords | MEDIUM |
| **semantic_tokens.rs** | ✅ Yes | ❌ No | ✅ None | LOW |
| **code_actions.rs** | ✅ Yes | ⚠️ 15 instances | ⚠️ 4 keywords | MEDIUM |
| **hover.rs** | ❌ No | ⚠️ Extensive | ⚠️ 16 keywords | **HIGH** |
| **formatting.rs** | ❌ No | ⚠️ Line-by-line | ✅ None | MEDIUM |
| **diagnostics.rs** | ❌ No | ⚠️ Char counting | ✅ None | **HIGH** |
| **references.rs** | ❌ No | ⚠️ Yes | ✅ None | MEDIUM |
| **rename.rs** | ❌ No | ⚠️ Yes | ⚠️ 2 keywords | MEDIUM |
| **definition.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW |
| **inlay_hints.rs** | ⚠️ Unknown | ⚠️ Unknown | ✅ None | LOW |
| **symbols.rs** | N/A (AST) | ❌ No | ✅ None | LOW |

---

## Critical Findings

### 1. **hover.rs** - Highest Risk
- 16 hardcoded keywords
- Custom tokenization logic
- **Impact:** Recent behavior tree keyword changes may have broken hover
- **Fix Priority:** HIGH

### 2. **diagnostics.rs** - High Risk
- Manual brace counting fails in strings/comments
- Could generate false errors
- **Fix Priority:** HIGH

### 3. **Hardcoded Keywords** - Maintenance Burden
- **Total:** 30 hardcoded keyword strings across 4 files
- **Risk:** Keywords get out of sync with lexer
- **Recent Example:** Behavior tree syntax changes could have broken these

---

## Recommended Fixes

### Priority 1: hover.rs Refactoring
**Estimated Time:** 2-3 hours

**Changes:**
1. Add lexer import: `use crate::syntax::lexer::{Lexer, Token};`
2. Replace `extract_word_at_position()` with lexer-based token finding
3. Replace keyword match with Token enum match
4. Remove hardcoded keyword strings

**Benefits:**
- Automatic sync with lexer changes
- Consistent tokenization with compiler
- Easier maintenance

**Example Code:**
```rust
pub fn get_hover_info(text: &str, line: usize, character: usize) -> Option<Hover> {
    let line_text = text.lines().nth(line)?;
    let lexer = Lexer::new(line_text);

    // Find token at character position
    let token_at_pos = find_token_at_position(&lexer, character)?;

    let content = match token_at_pos {
        Token::Character => "**character** - Defines a character entity...",
        Token::Template => "**template** - Defines a reusable field template...",
        // Use Token enum - stays in sync automatically
    };

    Some(Hover { /* ... */ })
}
```

---

### Priority 2: diagnostics.rs Refactoring
**Estimated Time:** 1-2 hours

**Changes:**
1. Use lexer to tokenize text
2. Count LBrace/RBrace tokens (not characters)
3. Ignore braces in strings and comments automatically

**Benefits:**
- Correct handling of braces in all contexts
- No false positives

---

### Priority 3: Extract Keyword Definitions
**Estimated Time:** 1 hour

**Changes:**
1. Create `src/syntax/keywords.rs` module
2. Define keyword list from lexer Token enum
3. Use in completion, code_actions, rename

**Benefits:**
- Single source of truth for keywords
- Easy to keep in sync

---

## Testing Recommendations

After fixes, add tests for:

1. **Hover on keywords in strings** - Should NOT show hover
   ```rust
   character Alice { name: "character" }  // hovering "character" in string
   ```

2. **Diagnostics with braces in strings** - Should NOT count
   ```rust
   character Alice { name: "{}" }  // should not report brace mismatch
   ```

3. **Lexer consistency tests** - Verify LSP uses same tokens as compiler
   ```rust
   #[test]
   fn test_hover_uses_lexer_keywords() {
       // Ensure hover keywords match lexer Token enum
   }
   ```

---

## Long-Term Recommendations

### 1. Establish Coding Standards
Document that LSP modules should:
- ✅ Use lexer for all tokenization
- ✅ Use AST for semantic analysis
- ❌ Never do manual string parsing
- ❌ Never hardcode keywords

### 2. Code Review Checklist
Add to PR reviews:
- [ ] Does LSP code use lexer for tokenization?
- [ ] Are there any hardcoded keyword strings?
- [ ] Is there any manual `.split()`, `.chars()`, `.find()` parsing?

### 3. Shared Utilities
Create `src/lsp/lexer_utils.rs` with:
- `find_token_at_position()` - Common utility
- `get_keyword_docs()` - Centralized keyword documentation
- `is_keyword()` - Token-based keyword checking

---

## Impact Assessment

### Current State Risks

**Behavior Tree Changes (Recent):**
- Lexer/parser updated with new keywords
- ❌ `hover.rs` still has old hardcoded list
- ⚠️ Users may not see hover for new keywords

**Future Keyword Changes:**
- Any new keywords require updates in 4 separate files
- Easy to miss one, causing inconsistent behavior

**Bug Examples:**
```rust
// diagnostics.rs bug:
character Alice { description: "Contains } brace" }
// ^ Reports false "unmatched brace" error

// hover.rs bug:
character Alice { /* hover on "character" keyword here */ }
// ^ Works

"In a character block..."  // hover on "character" in string
// ^ Also shows hover (WRONG - it's just a word in a string)
```

---

## Conclusion

**Overall Assessment:** NEEDS IMPROVEMENT

- **Compliance Rate:** 27% (3/11 modules)
- **Risk Level:** MEDIUM-HIGH
- **Recommended Action:** Refactor hover.rs and diagnostics.rs as priority

**Benefits of Fixing:**
1. Consistency with compiler behavior
2. Automatic sync with language changes
3. Reduced maintenance burden
4. Fewer bugs from edge cases

**Next Steps:**
1. Create GitHub issues for each Priority 1-2 fix
2. Refactor hover.rs (highest risk)
3. Refactor diagnostics.rs (high risk)
4. Extract keyword definitions (prevents future issues)
5. Add lexer usage to code review checklist

---

## Appendix: Files Analyzed

```
Total LSP modules audited: 11
✅ Compliant: 3 (27%)
⚠️ Partially compliant: 1 (9%)
❌ Non-compliant: 7 (64%)

Files examined:
- src/lsp/completion.rs
- src/lsp/hover.rs
- src/lsp/semantic_tokens.rs
- src/lsp/inlay_hints.rs
- src/lsp/definition.rs
- src/lsp/formatting.rs
- src/lsp/rename.rs
- src/lsp/references.rs
- src/lsp/code_actions.rs
- src/lsp/symbols.rs
- src/lsp/diagnostics.rs
```