AI Engineer Hackathon
Statement 2: Evaluation and Reliability
There are many use cases for agents, varying in task, complexity, runtime, and many other variables.
- Can we develop an evaluation framework that's task-agnostic?
- How can we get agents to self-evaluate their work to reduce failures?
- What's the best way to ensure reliability and replicability in long-running agentic tasks?
- Is there a system we can build that automatically detects, corrects, and prevents failure cases?
https://github.com/willccbb/verifiers
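One way to picture the self-evaluation question is a loop that runs an agent, checks its output, and feeds the failure reason back in on retry. The sketch below is plain Python and does not use the API of the repository linked above; `Verdict`, `run_with_self_check`, `toy_agent`, and `toy_check` are hypothetical names invented for illustration. Because the agent and the checker are passed in as callables, the loop itself makes no assumption about the task.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class Verdict:
    """Result of one check: did the output pass, and if not, why."""
    passed: bool
    reason: str = ""


def run_with_self_check(
    agent: Callable[[str, str], T],
    check: Callable[[str, T], Verdict],
    task: str,
    max_attempts: int = 3,
) -> Tuple[T, List[Verdict]]:
    """Run the agent, check its output, and retry with the failure reason as feedback."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback = ""
    history: List[Verdict] = []
    for _ in range(max_attempts):
        output = agent(task, feedback)
        verdict = check(task, output)
        history.append(verdict)
        if verdict.passed:
            break
        feedback = verdict.reason  # passed back to the agent on the next attempt
    return output, history


# Toy stand-ins (assumptions): a real agent and checker would call a model.
def toy_agent(task: str, feedback: str) -> str:
    # Gives a wrong answer first, then corrects itself once it receives feedback.
    return "4" if feedback else "5"


def toy_check(task: str, output: str) -> Verdict:
    ok = output == "4"
    return Verdict(ok, "" if ok else "expected the sum of 2 and 2")


output, history = run_with_self_check(toy_agent, toy_check, "What is 2 + 2?")
print(output, [v.passed for v in history])
```

Keeping the full verdict history, rather than only the final output, leaves a record of each failed attempt that can later be used for error analysis.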
Error Analysis (make it easier to look at data)
Automate:
- Data annotation
- Categorization
- Analysis
- Improvements & Evals
- Iterate
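The categorization and analysis steps above could start as simply as the sketch below, which tags free-text failure notes with a category and counts how often each one occurs. The categories, keyword rules, and sample notes are all made up for illustration. A real pipeline would more likely use human annotation or a model-based classifier instead of keyword matching.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical keyword rules mapping a failure category to trigger phrases.
RULES: Dict[str, Tuple[str, ...]] = {
    "timeout": ("timed out", "deadline exceeded"),
    "tool_error": ("tool call failed", "invalid arguments"),
    "hallucination": ("cited a nonexistent", "fabricated"),
}


def categorize(note: str) -> str:
    """Return the first category whose keywords appear in the note."""
    text = note.lower()
    for category, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def summarize(notes: Iterable[str]) -> List[Tuple[str, int]]:
    """Count notes per category, most frequent first."""
    return Counter(categorize(note) for note in notes).most_common()


# Made-up sample notes, standing in for annotated agent traces.
notes = [
    "Step 7 timed out waiting for search",
    "Tool call failed: invalid arguments to calculator",
    "Run timed out after 300s",
    "Answer looked fine but formatting was off",
]
summary = summarize(notes)
for category, count in summary:
    print(f"{category}: {count}")
```

Sorting by frequency shows which failure category to address first, and the size of the "uncategorized" bucket indicates when the rules need another iteration.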
Content Creation:
Evaluation framework:
- multimodal input
- self-evaluation