Evaluate a target system on a given dataset.

| Name | Type | Description |
|---|---|---|
| target* | Union[TARGET_T, Runnable, EXPERIMENT_T, Tuple[EXPERIMENT_T, EXPERIMENT_T]] | The target system or experiment(s) to evaluate. Can be a function that takes inputs and returns outputs, a Runnable, an existing experiment, or a two-tuple of experiments to compare. |
| data | DATA_T | The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. Default: None |
| evaluators | Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]] | A list of evaluators to run on each example. The evaluator signature depends on the target type. Default: None |
| summary_evaluators | Optional[Sequence[SUMMARY_EVALUATOR_T]] | A list of summary evaluators to run on the entire dataset. Should not be specified when comparing two existing experiments. Default: None |
| metadata | Optional[dict] | Metadata to attach to the experiment. Default: None |
| experiment_prefix | Optional[str] | A prefix for your experiment name. Default: None |
| description | Optional[str] | A free-form text description of the experiment. Default: None |
| max_concurrency | Optional[int] | The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, no concurrency. Default: 0 |
| blocking | bool | Whether to block until the evaluation is complete. Default: True |
| num_repetitions | int | The number of times to run the evaluation. Each item in the dataset is run and evaluated this many times. Default: 1 |
| experiment | Optional[EXPERIMENT_T] | An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if target is an existing experiment or a two-tuple of experiments. Default: None |
| upload_results | bool | Whether to upload the results to LangSmith. Default: True |
| error_handling | str | How to handle individual run errors: 'log' traces the runs with the error message as part of the experiment; 'ignore' does not count the run as part of the experiment at all. Default: 'log' |
| **kwargs | Any | Additional keyword arguments to pass to the evaluator. Default: {} |
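Assuming this is the LangSmith Python SDK's `evaluate` entry point, a typical call might look like the sketch below. The dataset name, target function, and evaluator are hypothetical placeholders; the `evaluate` call itself is guarded so it only fires when LangSmith credentials are configured, and the evaluator can be exercised locally first.

```python
# Sketch of a typical evaluate() call, assuming the LangSmith Python SDK.
# The dataset name, target, and evaluator are hypothetical examples.
import os

def target(inputs: dict) -> dict:
    # The system under test: takes an example's inputs, returns outputs.
    return {"answer": inputs["question"].upper()}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # A per-example evaluator; the exact signature depends on the target type.
    return outputs["answer"] == reference_outputs["answer"]

# Evaluators are plain functions, so they can be sanity-checked locally
# before any results are uploaded.
sanity = exact_match(target({"question": "hi"}), {"answer": "HI"})

# Only call the API when credentials are available (illustrative guard).
if os.environ.get("LANGSMITH_API_KEY"):
    from langsmith import evaluate

    evaluate(
        target,
        data="my-dataset",            # dataset name (hypothetical)
        evaluators=[exact_match],
        experiment_prefix="baseline",
        max_concurrency=4,            # cap parallel runs; 0 means no concurrency
        error_handling="log",         # trace failed runs with their error message
    )
```

Passing `num_repetitions` greater than 1 would run and score every dataset item that many times, which is useful for measuring variance in a nondeterministic target.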