evals
6 posts tagged evals.
- Multimodal models still change answers when you shuffle the evidence
- Self-distillation can make models better on the first try and worse on the fifth
- Agent Success Rate is the only number that matters when a new model drops
- Marketers are still vibe-checking prompts. Frontier devs run evals before lunch.
- Stop Vibe-Checking New Models. Build a 50-Prompt Eval Set Instead.
- The Frustration Index: A Cheap Eval Most Teams Skip