AI Engineer Hackathon

🚧 Statement 2: Evaluation and Reliability

Agents have many use cases, which vary in task, complexity, runtime, and other variables.

  • Can we develop a task-agnostic evaluation framework?

  • How can we get agents to evaluate their own work to reduce failures?

  • What’s the best way to ensure reliability and replicability in long-running agentic tasks?

  • Is there a system we can build that automatically detects, corrects, and prevents failure cases?
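
One way to approach the first two questions is to keep the harness generic and push everything task-specific into a per-case check. The sketch below is a minimal illustration under that assumption, not a definitive design: `Case`, `Result`, `run_with_retry`, `evaluate`, `pass_rate`, and `toy_agent` are all hypothetical names, and the "self-evaluation" here is a plain verify-and-retry loop in which a failed check is fed back to the agent. In practice the check could itself be a model call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical agent signature: (task_input, feedback) -> output
Agent = Callable[[Any, Optional[str]], Any]


@dataclass
class Case:
    name: str
    task_input: Any
    check: Callable[[Any], bool]  # the only task-specific part


@dataclass
class Result:
    name: str
    passed: bool
    attempts: int


def run_with_retry(agent: Agent, case: Case, max_attempts: int = 3) -> Result:
    """Run the agent, verify its output, and retry with feedback on failure."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = agent(case.task_input, feedback)
        if case.check(output):
            return Result(case.name, True, attempt)
        feedback = f"attempt {attempt} failed verification"
    return Result(case.name, False, max_attempts)


def evaluate(agent: Agent, cases: list[Case], max_attempts: int = 3) -> list[Result]:
    """The harness never inspects task details; it only calls each case's check."""
    return [run_with_retry(agent, case, max_attempts) for case in cases]


def pass_rate(results: list[Result]) -> float:
    return sum(r.passed for r in results) / len(results) if results else 0.0


def toy_agent(task_input: Any, feedback: Optional[str]) -> Any:
    # Stand-in for a real agent: wrong on the first try, correct once it gets feedback.
    a, b = task_input
    return a + b if feedback else a - b


if __name__ == "__main__":
    results = evaluate(toy_agent, [Case("add", (2, 3), lambda out: out == 5)])
    print(results, pass_rate(results))
```

Recording `attempts` alongside `passed` gives a rough reliability signal: a case that passes only after retries is a candidate for closer inspection even though it counts toward the pass rate.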

https://github.com/willccbb/verifiers

Error Analysis (make it easier to look at data)

Automate:

  • Data annotation
  • Categorization
  • Analysis
  • Improvements & Evals
  • Iterate
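
The categorization and analysis steps above could start as something very small. The sketch below assumes failures arrive as free-text messages and uses keyword rules to bucket them; the category names and keywords in `RULES` are made up for illustration, and a real pipeline might replace `categorize` with a model-based annotator.

```python
from collections import Counter

# Hypothetical categories and keyword rules, for illustration only.
RULES = {
    "timeout": ("timed out", "timeout"),
    "tool_error": ("tool call failed", "invalid arguments"),
    "wrong_answer": ("assertion", "mismatch"),
}


def categorize(message: str) -> str:
    """Return the first category whose keywords appear in the message."""
    text = message.lower()
    for category, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def summarize(failures: list[str]) -> list[tuple[str, int]]:
    """Count failures per category, most frequent first."""
    return Counter(categorize(message) for message in failures).most_common()


if __name__ == "__main__":
    failures = [
        "Request timed out after 30s",
        "Tool call failed: invalid arguments",
        "Output mismatch on case 3",
        "Timeout",
        "???",
    ]
    for category, count in summarize(failures):
        print(f"{category}: {count}")
```

Keeping an explicit "uncategorized" bucket makes it easy to see which failures still need a human look, which is where new categories tend to come from on the next iteration.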

Content Creation:

Evaluation framework:

  • Multimodal input
  • Self-evaluation