Batch evaluation runs the same scenario pattern across many inputs and
aggregates the results into a pass/fail summary. Use it to evaluate a dataset of
test cases, measure regression coverage, or compare outputs across prompt
variants.
To get started, we'll implement the core batch loop. The key insight is that
asyncio.gather schedules all scenarios concurrently, so when the underlying
calls are awaitable, the total runtime scales with the slowest single call
rather than with the number of test cases. This is critical when each
interaction involves an LLM.
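To make the runtime claim concrete, the sketch below replaces the LLM call with asyncio.sleep. The fake_llm_call helper and the DELAYS values are hypothetical stand-ins, not part of giskard. It only illustrates that awaiting the calls through asyncio.gather takes roughly as long as the slowest one, while awaiting them one by one takes roughly the sum. This assumes each call actually yields to the event loop; a blocking synchronous function gains nothing from gather.

```python
import asyncio
import time

DELAYS = [0.05, 0.10, 0.20]  # hypothetical per-call latencies, in seconds


async def fake_llm_call(delay: float) -> float:
    """Stand-in for an awaitable LLM request (not a giskard API)."""
    await asyncio.sleep(delay)
    return delay


async def run_sequential() -> float:
    """Await each call in turn; elapsed time is roughly sum(DELAYS)."""
    start = time.perf_counter()
    for delay in DELAYS:
        await fake_llm_call(delay)
    return time.perf_counter() - start


async def run_concurrent() -> float:
    """Await all calls together; elapsed time is roughly max(DELAYS)."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_llm_call(delay) for delay in DELAYS))
    return time.perf_counter() - start


sequential = asyncio.run(run_sequential())
concurrent = asyncio.run(run_concurrent())
print(f"sequential: {sequential:.2f}s, concurrent: {concurrent:.2f}s")
```

Exact timings vary by machine, so the sketch prints them rather than asserting specific values.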
Define your test cases as a list of (input, expected) pairs, create a scenario
for each pair, run them all concurrently with asyncio.gather, then summarise:
import asyncio
from giskard.checks import Scenario, StringMatching

test_cases = [
    ("How long do we retain KYC records?", "5 years"),
    ("Can we share customer data with third parties?", "only with consent"),
    ("Is medical advice allowed in the chatbot?", "no"),
]


def my_qa_system(question: str) -> str:
    # Your QA system
    return "..."


async def run_batch():
    scenarios = [
        (
            question,
            Scenario(f"qa_{i}")
            .interact(
                inputs=question,
                outputs=lambda inputs, q=question: my_qa_system(q),
            )
            .check(
                StringMatching(
                    name="contains_expected",
                    keyword=expected,
                    text_key="trace.last.outputs",
                )
            ),
        )
        for i, (question, expected) in enumerate(test_cases)
    ]
    results = await asyncio.gather(*(s.run() for _, s in scenarios))
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    print(f"Passed: {passed}/{total} ({passed / total * 100:.1f}%)")
    for result in results:
        result.print_report()
    return results


_ = asyncio.run(run_batch())
Output
Passed: 0/3 (0.0%)
──────────────────── FAILED ────────────────────
contains_expected  FAIL  The answer does not contain the keyword '5 years'
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'How long do we retain KYC records?'
Outputs: '...'
──────────────────── 1 step in 17ms | runs: 1/1 ────────────────────
──────────────────── FAILED ────────────────────
contains_expected  FAIL  The answer does not contain the keyword 'only with consent'
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Can we share customer data with third parties?'
Outputs: '...'
──────────────────── 1 step in 4ms | runs: 1/1 ────────────────────
──────────────────── FAILED ────────────────────
contains_expected  FAIL  The answer does not contain the keyword 'no'
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Is medical advice allowed in the chatbot?'
Outputs: '...'
──────────────────── 1 step in 4ms | runs: 1/1 ────────────────────
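Because asyncio.gather returns results in the same order as the awaitables it was given, the questions can be zipped back onto the results to list which inputs failed. The sketch below assumes only that each result exposes a boolean passed attribute, as used in the batch loop. FakeResult, summarise, and the min_pass_rate parameter are hypothetical names introduced for illustration, not giskard APIs.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Stand-in exposing only the `passed` attribute used by the batch loop."""

    passed: bool


def summarise(questions, results, min_pass_rate=1.0):
    """Return (pass_rate, failed_questions, ok) for a finished batch.

    `results` must be in the same order as `questions`.
    """
    if len(questions) != len(results):
        raise ValueError("questions and results must have the same length")
    failed = [q for q, r in zip(questions, results) if not r.passed]
    # Guard against an empty batch instead of dividing by zero.
    pass_rate = (len(results) - len(failed)) / len(results) if results else 0.0
    return pass_rate, failed, pass_rate >= min_pass_rate


questions = [
    "How long do we retain KYC records?",
    "Can we share customer data with third parties?",
    "Is medical advice allowed in the chatbot?",
]
# Hypothetical outcomes, chosen only to exercise the helper.
results = [FakeResult(True), FakeResult(False), FakeResult(True)]

rate, failed, ok = summarise(questions, results, min_pass_rate=0.9)
print(f"Pass rate: {rate:.0%}")
for question in failed:
    print("FAILED:", question)
```

In a script run by CI, the ok flag could decide the process exit code, so the job fails when the pass rate drops below the chosen threshold.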
The asyncio.gather approach above gives you aggregate pass/fail counts, but a
CI pipeline benefits from individual failure markers. Next, we'll convert the
same test cases into a parametrized pytest function so each input gets its own
entry in the test report.
To get per-test failure reporting in CI, use @pytest.mark.parametrize:
import pytest
from giskard.checks import Scenario, StringMatching

QA_CASES = [
    ("How long do we retain KYC records?", "5 years"),
    ("Can we share customer data with third parties?", "only with consent"),
    ("Is medical advice allowed in the chatbot?", "no"),
]
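A parametrized test over cases like these might look like the following sketch. It is self-contained and so does not use giskard: run_case and CANNED_ANSWERS are hypothetical stand-ins for building and running one scenario, and in a real suite the coroutine would presumably run the scenario and return its passed flag. Wrapping the coroutine in asyncio.run is one way to keep the test function synchronous; an async pytest plugin would be another.

```python
import asyncio

import pytest

QA_CASES = [
    ("How long do we retain KYC records?", "5 years"),
    ("Can we share customer data with third parties?", "only with consent"),
    ("Is medical advice allowed in the chatbot?", "no"),
]

# Hypothetical canned answers so the sketch runs without a real QA system.
CANNED_ANSWERS = {
    "How long do we retain KYC records?": "KYC records are retained for 5 years.",
    "Can we share customer data with third parties?": "Sharing is allowed only with consent.",
    "Is medical advice allowed in the chatbot?": "There is no medical advice in the chatbot.",
}


async def run_case(question: str, expected: str) -> bool:
    """Stand-in for running one scenario; True if the keyword is present."""
    answer = CANNED_ANSWERS[question]
    return expected in answer


@pytest.mark.parametrize(("question", "expected"), QA_CASES)
def test_qa_case(question: str, expected: str) -> None:
    # One test report entry is generated per (question, expected) pair.
    assert asyncio.run(run_case(question, expected))
```

Under pytest, each pair shows up as its own test ID, so a single failing input is reported individually instead of being folded into an aggregate count.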
With the basic batch loop established, we can now swap in an LLMJudge check.
The generator is configured once before the loop; every scenario created inside
it reuses that single configuration, so you aren't reinitializing a client on
every iteration.
LLM-based checks work in batch too. Set a default generator once before the
loop:
import asyncio
from giskard.agents.generators import Generator
from giskard.checks import Scenario, LLMJudge, set_default_generator
Output
──────────────────── FAILED ────────────────────
factual_consistency  FAIL  The summary is incomplete and does not contain enough information to determine
factual consistency with the original statement, which specifies that employees must complete security training
annually.
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'The new policy requires all employees to complete security training annually.'
Outputs: 'Summary of: The new policy requires all employees to...'
──────────────────── 1 step in 2676ms | runs: 1/1 ────────────────────
──────────────────── FAILED ────────────────────
factual_consistency  FAIL  The summary is incomplete and cuts off, making it impossible to confirm its factual
consistency with the original statement.
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'The quarterly report shows a 12% increase in revenue compared to last year.'
Outputs: 'Summary of: The quarterly report shows a 12% increas...'
──────────────────── 1 step in 1899ms | runs: 1/1 ────────────────────
──────────────────── FAILED ────────────────────
factual_consistency  FAIL  The summary is incomplete and does not specify the return window or the receipt
requirement, thus it cannot be confirmed as factually consistent with the original statement.
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Our refund policy allows returns within 30 days of purchase with a receipt.'
Outputs: 'Summary of: Our refund policy allows returns within ...'
──────────────────── 1 step in 1448ms | runs: 1/1 ────────────────────
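The configure-once idea behind the default generator can be illustrated with a small, library-free sketch. Everything here (FakeGenerator, FakeJudge, set_default, the model name) is hypothetical and is not giskard's implementation. It only shows the general pattern of a module-level default that objects created later in a loop pick up, so the client is constructed a single time.

```python
from typing import Optional


class FakeGenerator:
    """Hypothetical stand-in for an LLM client; counts how often it is built."""

    instances_created = 0

    def __init__(self, model: str) -> None:
        FakeGenerator.instances_created += 1
        self.model = model


_default_generator: Optional[FakeGenerator] = None


def set_default(generator: FakeGenerator) -> None:
    """Register the generator that later checks fall back to."""
    global _default_generator
    _default_generator = generator


class FakeJudge:
    """Hypothetical check that uses the default unless one is passed in."""

    def __init__(self, name: str, generator: Optional[FakeGenerator] = None) -> None:
        self.name = name
        self.generator = generator if generator is not None else _default_generator
        if self.generator is None:
            raise RuntimeError("no generator configured")


# Configure once, before the loop.
set_default(FakeGenerator(model="example-model"))

# Every check created in the loop shares the same generator instance.
judges = [FakeJudge(f"factual_consistency_{i}") for i in range(3)]

print(FakeGenerator.instances_created)  # 1
print(all(j.generator is judges[0].generator for j in judges))  # True
```

Passing an explicit generator to one check still works in this sketch, which keeps the default convenient without making it mandatory.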
Beyond pass/fail, you can collect numeric data from each result to compute
statistics across the whole batch. This is useful for monitoring response
quality trends over time rather than just asserting a binary threshold.
If your checks emit numeric metrics, collect them to compute aggregates:
import asyncio
from giskard.checks import Scenario, FnCheck

test_inputs = [
    "This is a short response.",
    "This is a slightly longer response with more words in it.",
    "Short.",
]


def my_model(text: str) -> str:
    return text  # Echo for demonstration


async def run_with_metrics():
    scenarios = [
        Scenario(f"length_{i}")
        .interact(
            inputs=inp,
            outputs=lambda inputs, x=inp: my_model(x),
        )
        .check(
            FnCheck(
                fn=lambda trace: len(trace.last.outputs.split()) >= 3,
                name="min_word_count",
                success_message="Meets minimum word count",
                failure_message="Response too short",
            )
        )
        for i, inp in enumerate(test_inputs)
    ]
    results = await asyncio.gather(*(s.run() for s in scenarios))
    word_counts = [len(r.final_trace.last.outputs.split()) for r in results]
    print(f"Average word count: {sum(word_counts) / len(word_counts):.1f}")
    print(f"Passed: {sum(1 for r in results if r.passed)}/{len(results)}")
    for r in results:
        r.print_report()


asyncio.run(run_with_metrics())
Output
Average word count: 5.7
Passed: 2/3
──────────────────── PASSED ────────────────────
min_word_count  PASS
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'This is a short response.'
Outputs: 'This is a short response.'
──────────────────── 1 step in 1ms | runs: 1/1 ────────────────────
──────────────────── PASSED ────────────────────
min_word_count  PASS
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'This is a slightly longer response with more words in it.'
Outputs: 'This is a slightly longer response with more words in it.'
──────────────────── 1 step in 0ms | runs: 1/1 ────────────────────
──────────────────── FAILED ────────────────────
min_word_count  FAIL  Response too short
──────────────────── Trace ────────────────────
──────────────────── Interaction 1 ────────────────────
Inputs: 'Short.'
Outputs: 'Short.'
──────────────────── 1 step in 0ms | runs: 1/1 ────────────────────
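Once the per-result values are in a plain list, the standard library's statistics module can produce more than an average. The sketch below works directly on the three echo outputs shown above rather than on giskard result objects. With real results, the list would be built from r.final_trace.last.outputs as in the batch function. The summary dictionary and its keys are an illustrative choice, not a giskard feature.

```python
import statistics

# The echoed outputs from the example batch above.
responses = [
    "This is a short response.",
    "This is a slightly longer response with more words in it.",
    "Short.",
]

word_counts = [len(text.split()) for text in responses]

summary = {
    "mean": statistics.mean(word_counts),
    "median": statistics.median(word_counts),
    "min": min(word_counts),
    "max": max(word_counts),
    # Population standard deviation: the batch is treated as the whole set.
    "stdev": statistics.pstdev(word_counts),
}

print(word_counts)  # [5, 11, 1]
print(f"mean={summary['mean']:.1f} median={summary['median']}")  # mean=5.7 median=5
print(f"min={summary['min']} max={summary['max']}")  # min=1 max=11
```

Tracking the median alongside the mean helps when one unusually long or short response would otherwise dominate the average.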