Talk
Intermediate

Decoding The AI Black Box: An overwhelmed engineer's guide to LLM Evals

Approved

Your code takes a particular input. Returns a particular output. Deterministic, traceable.
Feeding in known inputs and checking the outputs is how we test your code. If an output is wrong, we know exactly where to look and how to fix it.
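
For contrast, here is a minimal sketch of that traditional workflow (slugify is a hypothetical function, shown only for illustration):

```python
# Traditional testing: a known input maps to exactly one correct output.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")

def test_slugify():
    # Deterministic: this assertion passes on every single run,
    # and a failure points straight at the broken code path.
    assert slugify("LLM Evals") == "llm-evals"
```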

LLMs, on the other hand, take prompts as input and return probabilistic, context-sensitive, and often unpredictable outputs. This key difference is why LLMs break the mold of traditional QA, and why conventional testing fails to evaluate AI systems.
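
You can see this for yourself with a minimal sketch, assuming a local Ollama server with the gemma3 model already pulled and the ollama Python client installed:

```python
import ollama

prompt = "Explain what an LLM eval is in one sentence."
messages = [{"role": "user", "content": prompt}]

# Ask the exact same question twice.
first = ollama.chat(model="gemma3", messages=messages)
second = ollama.chat(model="gemma3", messages=messages)

print(first["message"]["content"])
print(second["message"]["content"])
# With default sampling settings the two answers usually differ,
# so a plain string-equality assertion is the wrong tool for this job.
```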

That's where you need AI evaluations. Not just to test your systems, but to iterate, develop, and build new features on them. Evaluations are so much more than testing.

This practical session will be your first step into AI evals: figuring out the basics and learning systematic methods for testing, benchmarking, and validating AI models so you can build reliable, production-ready LLM applications and move beyond “vibe checks”.

For the talk, we will be using Evidently (https://github.com/evidentlyai/evidently) as our LLM eval framework, along with mentions of other eval frameworks like Phoenix, DeepEval, and more. All open-source. The models used are also open-source, mainly Gemma 3 or Llama, run locally with tools like Ollama or LM Studio (https://github.com/lmstudio-ai).
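
As a taste of the hands-on part, here is a minimal sketch using Evidently's 0.4.x-era Report/TextEvals API (the library has since been restructured, so treat the exact imports as assumptions and check the current docs):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import Sentiment, TextLength

# A tiny hypothetical dataset of model responses to evaluate.
df = pd.DataFrame({
    "question": ["What is an eval?", "Why not just unit tests?"],
    "response": [
        "An eval scores model outputs against defined criteria.",
        "Because LLM outputs are probabilistic, not deterministic.",
    ],
})

# Score every response with descriptors instead of eyeballing it.
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[Sentiment(), TextLength()]),
])
report.run(reference_data=None, current_data=df)
report.save_html("llm_evals_report.html")  # open in a browser to inspect
```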

You’ll leave with a mental model for AI evals, hands-on examples, and the confidence to start evaluating your own AI projects, no matter your background or the project you are undertaking.

Tutorial about using a FOSS project
Technology architecture
Engineering practice - productivity, debugging

Vipul Gupta
Senior Software Engineer, Balena
https://docs.mixster.dev

Approvability: 0%
Approvals: 0
Rejections: 0
Not Sure: 1

This isn't highlighting a FOSS project, it just seems to be AI tips and tricks.

Reviewer #1
Not Sure