feat: matrix, script, evaluator, and devtools integration tests

- matrix_utils: construct ruma events in tests, verify extract_body
(text/notice/emote/unsupported), extract_reply_to, extract_thread_id,
extract_edit, extract_image, make_reply_content, make_thread_reply
- script tool: full run_script against live Tuwunel + OpenSearch,
covering basic math, TypeScript transpilation, filesystem sandbox
read/write, error capture, output truncation, invalid args
- evaluator: DM/mention/silence short-circuits, LLM evaluation path
with Mistral API, reply-to-human suppression
- agent registry: list/get_id, prompt reuse, prompt-change recreation
- devtools: tool dispatch for list_repos, get_repo, list_issues,
get_file, list_branches, list_comments, list_orgs
- conversations: token tracking, multi-turn context recall, room
isolation