Toward a shared eval infrastructure for Drupal AI: A proof of concept

Drupal's AI initiative has made enormous strides in bringing AI-powered features to site builders and content editors. But behind the scenes, there's a question nobody's really answered yet: how do we know if the AI is actually giving good advice, or completely making *&@# up? 👀

When an AI assistant helps a site builder configure a content type, answers a developer's question about the right way to structure their code, or explains how a Drupal feature works — is it right? Wrong? Confidently wrong in a way that causes subtle problems nobody notices until it bites them directly in the ass much later?

The discipline that answers this question is called AI evaluation (evals, for short). Think of it as the test suite for AI behaviour: not "does the code run," but "does the code the AI produces conform to modern best practices?" Several Drupal AI projects are starting to build this kind of evaluation independently, which is great! But without a shared format for recording and comparing results, we risk duplicating effort and losing the ability to learn from each other's findings.

George Kastanis (zorz) of Point Blank recently proposed a solution: a five-layer Drupal Eval Commons umbrella framework that gives these projects a shared contract without forcing anyone to rewrite their existing tooling.

I worked with Claude to build a proof of concept to find out what that actually means.

What does an "eval" actually look like?

If you've written PHPUnit tests, you already understand the basic idea. A unit test says: given this input, assert this output. An eval says the same thing, but for a language model: given this question, the answer should contain certain concepts and not contain certain mistakes.

Here's a real example from the AI Best Practices for Drupal project in evals/drupal-automated-testing/evals.json:

{
  "skill_name": "drupal-automated-testing",
  "skill_file": "skills/drupal-automated-testing/SKILL.md",
  "evals": [
    {
      "id": 1,
      "prompt": "Generate a Kernel test class for a Drupal module called my_module that tests a service named my_module.calculator. The service has a method add(int $a, int $b): int. Write the complete test file.",
      "expected_output": "A complete PHP Kernel test file using KernelTestBase with correct namespace, RunTestsInSeparateProcesses attribute, and explicit setUp with installConfig/installSchema.",
      "expectations": [
        "Uses KernelTestBase as the base class",
        "Does NOT extend UnitTestCase or BrowserTestBase",
        "Includes the RunTestsInSeparateProcesses attribute",
        "Generated PHP passes php -l syntax check"
      ],
      "must_contain_any": ["KernelTestBase", "RunTestsInSeparateProcesses"],
      "must_not_contain": ["extends UnitTestCase", "extends BrowserTestBase"],
      "check_php_lint": true
    },

We first feed in our Agent Skill (domain-specific knowledge) for writing automated tests in Drupal: skills/drupal-automated-testing/SKILL.md

We then give the model this prompt:

Generate a Kernel test class for a Drupal module called my_module that tests a service named my_module.calculator. The service has a method add(int $a, int $b): int. Write the complete test file.

Finally, we evaluate the result. A correct answer must include certain Drupal-specific patterns that AI models frequently get wrong (e.g. it uses KernelTestBase, not UnitTestCase). It must also produce syntactically valid PHP. Those are the assertions. We run the same eval case against multiple AI models and record which ones pass, how long each evaluation took, how much the run cost (estimated), etc.

That's it. No magic — just structured, repeatable testing.
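
To make the string-level assertions concrete, here is a minimal Python sketch of how they could be applied to a model's output. It is not the project's actual harness: the function name grade_output and the CheckResult class are hypothetical, and I'm assuming from the key names that must_contain_any passes when at least one listed string appears and must_not_contain fails when any listed string appears. The php -l lint step is left out because it needs a PHP binary.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Outcome of grading one model output against one eval case."""
    passed: bool
    failures: list[str] = field(default_factory=list)


def grade_output(case: dict, output: str) -> CheckResult:
    """Apply the string-level assertions from an eval case (hypothetical helper)."""
    failures = []

    # Assumed semantics: at least one of the listed strings has to appear.
    required = case.get("must_contain_any", [])
    if required and not any(s in output for s in required):
        failures.append(f"none of {required} found")

    # Assumed semantics: every listed string has to be absent.
    for banned in case.get("must_not_contain", []):
        if banned in output:
            failures.append(f"found banned string: {banned!r}")

    return CheckResult(passed=not failures, failures=failures)


# The two assertion lists are copied from the eval case shown earlier.
case = {
    "must_contain_any": ["KernelTestBase", "RunTestsInSeparateProcesses"],
    "must_not_contain": ["extends UnitTestCase", "extends BrowserTestBase"],
}

good = "class CalculatorTest extends KernelTestBase {}"
bad = "class CalculatorTest extends UnitTestCase {}"

good_result = grade_output(case, good)
bad_result = grade_output(case, bad)
print(good_result.passed)
print(bad_result.failures)
```

Returning the list of failures rather than a bare boolean is a design choice for this sketch: it keeps enough detail to fill in a per-check pass/fail breakdown later.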

The five-layer stack

The proposal defines five layers for evals, each independently useful:

Layer 1 — Cases, rubrics, judges. The eval cases themselves: the question, what a correct answer looks like, and reusable grading logic. Like a PHPUnit test file, but for AI outputs.

Layer 2 — Result envelope. A standard record for what happened when a model ran a case: which model, which version, what the score was, how long it took, how much it cost. Critically, this format is harness-agnostic — it doesn't matter whether you ran the eval with Inspect AI, promptfoo, or a shell script.

Layer 3 — Registry/storage/distribution. Where eval cases and results live, and how they reach projects. The proposal deliberately defers the design of this layer until real usage patterns emerge.

Layer 4 — Browser/community submission. Discovery and sharing — a way for the community to browse results, compare models, and submit new eval cases.

Layer 5 — Domain-specific bundles. Extension points that let modules add context specific to their use case (agent behaviour, RAG pipelines, content editing) without breaking the shared base format.

The key insight is that layers 1 and 2 can be built right now, independently of the storage debates that tend to stall these conversations.
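
As a rough illustration of what a Layer 2 record might carry, here is a small Python sketch. Every field name in it is my own invention for illustration, not the schema from the proposal or from any existing envelope format, and the values are made-up placeholders rather than real results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ResultEnvelope:
    """One record per model per case. All field names are hypothetical."""
    skill_name: str
    case_id: int
    model: str
    model_version: str
    passed: bool
    latency_seconds: float
    estimated_cost_usd: float
    harness: str  # plain metadata; the record's shape does not depend on it


def to_json_line(envelope: ResultEnvelope) -> str:
    """Serialize one envelope as a single line of JSON."""
    return json.dumps(asdict(envelope), sort_keys=True)


# Placeholder values only, not a real eval result.
record = ResultEnvelope(
    skill_name="drupal-automated-testing",
    case_id=1,
    model="example-model",
    model_version="example-version",
    passed=True,
    latency_seconds=1.5,
    estimated_cost_usd=0.001,
    harness="shell-script",
)

line = to_json_line(record)
print(line)
```

The harness field is just a string here, which is the point: a record produced by Inspect AI, promptfoo, or a shell script would have the same shape.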

What the proof of concept covers

Working in MR #37 on ai_best_practices, Claude and I built a proof of concept implementation that touches every layer:

Layer 1: Six skills — automated testing, configuration management, the render pipeline, and more — each have a set of eval cases with explicit pass/fail criteria. Run them against any model and you get a consistent score.

Layer 2: A converter script wraps the results of an eval run into the standard Every Eval Ever envelope format — one record per model per case, carrying latency, token usage, cost, and a detailed pass/fail breakdown. The same output format would work regardless of which eval tool produced the original results.

Layer 3 (provisional): Envelopes are stored as files on HuggingFace Datasets — a free hosting platform widely used in the AI/ML community, similar in spirit to what Packagist is for PHP packages. HuggingFace automatically converts them to a format that makes the raw data browsable and downloadable by anyone, no special tooling required.

Layer 4: A live dashboard reads the dataset and renders a filterable comparison table — pass/fail results across models and skills, with latency and cost highlighted in green-to-red heat maps, and a bar chart of pass rates by model. It's a proof of concept, but it's built on real data.

Connecting evals to production observability: One of the more interesting ideas in the proposal is that eval data and production runtime data are fundamentally the same shape. Both record: which model was called, with what input, producing what output, in how long, at what cost. We wired up the proof of concept to emit this data in OpenTelemetry (OTel) format — the same industry-standard tracing infrastructure many organizations already use to monitor their web applications — so that eval results and live production traffic appear in the same dashboard.
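
Once results are stored as one record per model per case, the pass-rate chart described under Layer 4 comes down to a small aggregation. Here is a hedged Python sketch of that step. It is not the dashboard's actual code, the "model" and "passed" keys are assumed field names, and the three records are made-up placeholders.

```python
from collections import defaultdict


def pass_rate_by_model(records: list[dict]) -> dict[str, float]:
    """Return the fraction of passing records for each model (hypothetical helper)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in records:
        totals[r["model"]] += 1
        if r["passed"]:
            passes[r["model"]] += 1
    return {model: passes[model] / totals[model] for model in totals}


# Placeholder records, not real eval results.
records = [
    {"model": "model-a", "passed": True},
    {"model": "model-a", "passed": False},
    {"model": "model-b", "passed": True},
]

rates = pass_rate_by_model(records)
print(rates)
```

If production traces were recorded with the same keys, the same function would work on them unchanged, which is the practical upside of the two kinds of data sharing a shape.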

Try it yourself

The best way to understand what's been built is to follow the same progression that made it click for me: read an eval case, run it, see the result envelope, then see it visualized. There's a step-by-step walkthrough in the repository that takes you through exactly that journey.

Or if you just want to browse:

Live dashboard: huggingface.co/spaces/webchick/eval-dashboard-poc — real results from Claude Haiku and Mistral Large across six Drupal evals
Raw data: huggingface.co/datasets/webchick/eval-results-poc — the result envelopes, browsable directly
Code: MR #37 on ai_best_practices — eval cases, result converter, dashboard, and OTel demo

Help shape the proposal

George Kastanis (zorz) is looking for community input on three specific questions:

Does the five-layer structure match the problem well enough to use as the umbrella?
Should Layer 1 proceed independently of the Layer 3 storage/entity debate?
Should the storage/entity-location question remain a Layer 3 decision rather than a blocker on Layer 1?

Head over to the Eval Commons proposal to read the full context and add your voice. The proof of concept above is meant to make the proposal more concrete, not to pre-answer those questions — let us know what you think!
