Audit LLM function-calling reliability across multiple models.
Models run via Together AI
Fixed at 120s for demo
Fixed at 256 for demo
Applies to all tests (overrides per-test prompts)
Maximum cosine distance allowed between the response and expected_response (0 = disabled). Lower = stricter.
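The threshold check described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes the response and expected_response have already been embedded as numeric vectors, and the names `cosine_distance` and `passes_threshold` are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat a zero vector as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def passes_threshold(response_vec, expected_vec, threshold):
    """Pass when the distance is within the threshold; 0 disables the check."""
    if threshold == 0:
        return True  # check disabled
    return cosine_distance(response_vec, expected_vec) <= threshold
```

With this reading, a smaller threshold accepts only responses whose embeddings lie closer to the expected one, which matches "Lower = stricter".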
Run each test N times with different seeds to detect inconsistent models
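One way to picture the repeated-run check is the sketch below. It assumes a hypothetical callable `run_test(seed)` that executes a single test with the given seed and returns whether it passed; how seeds are actually chosen and passed to the model is not specified here, so the `base_seed + i` scheme is an assumption.

```python
from typing import Callable, List

def run_repeated(run_test: Callable[[int], bool], n: int, base_seed: int = 0) -> List[bool]:
    """Run the same test n times, each with a distinct seed (assumed scheme)."""
    return [run_test(base_seed + i) for i in range(n)]

def is_consistent(results: List[bool]) -> bool:
    """A model is consistent on a test if every run gave the same outcome."""
    return len(set(results)) <= 1
```

A test that passes on some seeds and fails on others yields mixed results, so `is_consistent` returns `False` and the model can be flagged as inconsistent for that test.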
Toggle between pass/fail symbols and response times