Eval-Driven Development

Eval-Driven Development (EDD) is a critical development strategy for applications powered by Large Language Models (LLMs). This practice places continuous and rigorous evaluation at the heart of the development lifecycle.

Junjo accelerates EDD and complex workflow development by letting you iterate on your LLM prompts against many test inputs and immediately see how prompt changes affect the evaluation results.

[Animated demo: eval-driven pytest execution and results]

The demo above shows a simple pytest execution that reports pass / fail rates for a set of test inputs evaluated against a Junjo node.

Powered by pytest

  • Evaluate / Judge the output of your Junjo workflows and nodes with LLMs

  • Test individual nodes

  • Test entire workflows

  • Automate testing with CI / CD pipelines

  • Run on-demand as you iterate on your workflows

  • It just uses pytest!

  • Use tools like pytest-harvest to gather and track test results (see the sketch just after this list)

  • No proprietary tools or testing platforms are required - everything happens directly in your codebase
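
As a rough illustration of the pytest-harvest pattern mentioned above, the sketch below records each eval's outcome and latency via the plugin's results_bag fixture. The run_eval helper and the topic list are hypothetical placeholders; a real eval would execute your node and an LLM judge.

```python
import time

import pytest

TOPICS = ["ducks", "compilers", "coffee"]  # hypothetical eval inputs


def run_eval(topic: str) -> bool:
    """Placeholder for 'execute the node, then judge its output with an LLM'."""
    return True


@pytest.mark.parametrize("topic", TOPICS)
def test_joke_eval(topic, results_bag):
    # `results_bag` is provided by the pytest-harvest plugin; anything assigned
    # to it is collected per test and can be aggregated after the run (for
    # example via pytest-harvest's session_results_df fixture) to track pass
    # rates and latency over time.
    start = time.perf_counter()
    passed = run_eval(topic)
    results_bag.latency_s = time.perf_counter() - start
    results_bag.passed = passed
    assert passed
```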

A pytest test can initialize an input state for the node, then analyze the resulting state after the node applies its set_state updates.
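
A minimal, self-contained sketch of that pattern is below. The state, store, and node classes here are stand-ins so the example runs on its own; they are not Junjo's actual classes or method names. In practice you would import your real Junjo node and state model instead.

```python
from dataclasses import dataclass

import pytest


@dataclass
class JokeState:
    topic: str
    joke: str | None = None


class FakeStore:
    """Stand-in store; a real project would use its Junjo store."""

    def __init__(self, initial_state: JokeState):
        self._state = initial_state

    async def set_state(self, **updates):
        for key, value in updates.items():
            setattr(self._state, key, value)

    async def get_state(self) -> JokeState:
        return self._state


class CreateJokeNode:
    """Stand-in node; the real node would call an LLM here."""

    async def service(self, store: FakeStore):
        state = await store.get_state()
        # A canned joke keeps the sketch runnable without network access.
        await store.set_state(joke=f"Why did the {state.topic} cross the road?")


@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_create_joke_node_updates_state():
    # 1. Initialize the input state the node expects.
    store = FakeStore(JokeState(topic="duck"))

    # 2. Execute the node; it applies its changes through set_state.
    await CreateJokeNode().service(store)

    # 3. Analyze the resulting state after the node's set_state updates.
    state = await store.get_state()
    assert state.joke, "the node should have written a joke into state"
```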

Library Example

Check out src/base/sample_workflow/sample_subflow/nodes/create_joke_node/test for an example eval system, set up to evaluate the generated joke.

  • GitHub link to the test example

  • It uses a combination of asserts and live LLM evaluations (a condensed sketch of this pattern follows the list)

  • This example uses Gemini to evaluate the results of the create_joke_node against several test inputs inside test_cases.py

  • The eval has a prompt inside test_prompt.py

  • test_node.py executes the pytest test

  • The live node.py LLM call is executed to generate the result and state update for evaluation

  • Test failures include the reasons the prompt failed to generate output that passed the evaluation; see test_schema.py.
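
Below is a condensed sketch of that asserts-plus-LLM-judge pattern. The schema, prompt, test cases, and node runner are simplified stand-ins for the repo's test_schema.py, test_prompt.py, test_cases.py, and node.py (not the actual code), and the judge call follows the google-genai SDK's structured-output pattern; adapt it to your own client setup.

```python
import pytest
from google import genai  # google-genai SDK; expects GEMINI_API_KEY in the environment
from pydantic import BaseModel

TEST_CASES = ["ducks", "compilers", "coffee"]  # stand-in for test_cases.py


# Stand-in for test_schema.py: a structured verdict so failures carry a reason.
class EvalVerdict(BaseModel):
    passed: bool
    reason: str


# Stand-in for test_prompt.py.
JUDGE_PROMPT = (
    "You are judging the output of a joke-writing prompt.\n"
    "Topic: {topic}\n"
    "Joke: {joke}\n"
    "Pass only if the joke is coherent and on topic. Explain your reasoning."
)


def run_create_joke_node(topic: str) -> str:
    """Placeholder for executing the real create_joke_node LLM call."""
    return f"Why did the {topic} cross the road? To get to the other side."


def judge(topic: str, joke: str) -> EvalVerdict:
    # Structured output: the response is parsed into the pydantic schema above.
    client = genai.Client()
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=JUDGE_PROMPT.format(topic=topic, joke=joke),
        config={
            "response_mime_type": "application/json",
            "response_schema": EvalVerdict,
        },
    )
    return response.parsed


@pytest.mark.parametrize("topic", TEST_CASES)
def test_create_joke_node_eval(topic):
    joke = run_create_joke_node(topic)

    # Cheap deterministic asserts first...
    assert joke and len(joke) < 500

    # ...then a live LLM judge whose structured verdict explains any failure.
    verdict = judge(topic, joke)
    assert verdict.passed, f"Eval failed for '{topic}': {verdict.reason}"
```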

For mission-critical workflows, this setup can orchestrate hundreds or thousands of test inputs against a prompt to ensure it handles every use case well.

Testing Model Changes

This setup is also a great way to evaluate whether switching LLM models raises or lowers eval pass / fail rates, or changes how quickly the evals complete.
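
One way to do that with plain pytest is to parametrize the same eval inputs over a list of candidate models and time each run, as in the sketch below. The model names, generate_joke, and judge_joke are placeholders; the real versions would configure your node's model and reuse the LLM judge from the earlier sketch.

```python
import time
from dataclasses import dataclass

import pytest

MODELS = ["gemini-2.0-flash", "gemini-2.5-pro"]  # hypothetical candidate models
TOPICS = ["ducks", "compilers"]                  # same eval inputs for every model


@dataclass
class Verdict:
    passed: bool
    reason: str = ""


def generate_joke(topic: str, model: str) -> str:
    """Placeholder for running create_joke_node configured to use `model`."""
    return f"[{model}] Why did the {topic} cross the road?"


def judge_joke(topic: str, joke: str) -> Verdict:
    """Placeholder for the live LLM judge shown in the earlier sketch."""
    return Verdict(passed=topic in joke, reason="joke was off-topic")


@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("topic", TOPICS)
def test_joke_eval_across_models(model, topic):
    start = time.perf_counter()
    joke = generate_joke(topic, model=model)
    elapsed = time.perf_counter() - start

    verdict = judge_joke(topic, joke)
    # Comparing pass rates and latency across the full model x topic grid shows
    # whether a model change helps or hurts (run with `pytest -s` to see these).
    print(f"{model} | {topic} | passed={verdict.passed} | {elapsed:.3f}s")
    assert verdict.passed, verdict.reason
```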