Tau-Eval package

Top-level package for tau-eval.

class tau_eval.Experiment(models: list[Anonymizer], metrics: list[str | Callable], tasks: list[Task | CustomTask], config: ExperimentConfig = ExperimentConfig(exp_name='experiment', classifier_name='answerdotai/ModernBERT-base', train_task_models=False, train_with_generations=False, device='cuda', classifier_args={}))

Bases: object

classmethod from_json(filepath: str)

Loads experiment results from a JSON file.

run(output_dir='results.json')

summary(output_dir=None, to_rich=False)

class tau_eval.ExperimentConfig(exp_name: str = 'experiment', classifier_name: str = 'answerdotai/ModernBERT-base', train_task_models: bool = False, train_with_generations: bool = False, device: str | None = 'cuda', classifier_args: dict = <factory>)

Bases: object

Configuration for an evaluation experiment.

classifier_args: dict
classifier_name: str = 'answerdotai/ModernBERT-base'
device: str | None = 'cuda'
exp_name: str = 'experiment'
train_task_models: bool = False
train_with_generations: bool = False
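
The `<factory>` default shown for `classifier_args` is how a dataclass signature renders a `default_factory`, which suggests each config instance gets its own dictionary rather than sharing one. The sketch below illustrates that behavior with a stand-in dataclass that mirrors the documented fields. `ConfigSketch` and the `num_train_epochs` key are hypothetical names used for illustration; they are not part of tau-eval.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in mirroring the documented ExperimentConfig fields."""

    exp_name: str = "experiment"
    classifier_name: str = "answerdotai/ModernBERT-base"
    train_task_models: bool = False
    train_with_generations: bool = False
    device: Optional[str] = "cuda"
    # default_factory builds a fresh dict per instance instead of sharing one.
    classifier_args: dict = field(default_factory=dict)


a = ConfigSketch(exp_name="run_a", device=None)
b = ConfigSketch(exp_name="run_b")
a.classifier_args["num_train_epochs"] = 3  # hypothetical key, illustration only

print(a.classifier_args)  # {'num_train_epochs': 3}
print(b.classifier_args)  # {}
```

Mutating one instance's `classifier_args` leaves the other untouched, which is the point of using a factory for a mutable default.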

Submodules

class tau_eval.experiment.Experiment(models: list[Anonymizer], metrics: list[str | Callable], tasks: list[Task | CustomTask], config: ExperimentConfig = ExperimentConfig(exp_name='experiment', classifier_name='answerdotai/ModernBERT-base', train_task_models=False, train_with_generations=False, device='cuda', classifier_args={}))

Bases: object

classmethod from_json(filepath: str)

Loads experiment results from a JSON file.

run(output_dir='results.json')

summary(output_dir=None, to_rich=False)

class tau_eval.experiment.ExperimentConfig(exp_name: str = 'experiment', classifier_name: str = 'answerdotai/ModernBERT-base', train_task_models: bool = False, train_with_generations: bool = False, device: str | None = 'cuda', classifier_args: dict = <factory>)

Bases: object

Configuration for an evaluation experiment.

classifier_args: dict
classifier_name: str = 'answerdotai/ModernBERT-base'
device: str | None = 'cuda'
exp_name: str = 'experiment'
train_task_models: bool = False
train_with_generations: bool = False

tau_eval.experiment.rich_display_dataframe(df, title='Dataframe') → None

Display a dataframe as a table using the rich library.

Parameters:
  • df (pd.DataFrame) – Dataframe to display.

  • title (str, optional) – Title of the table. Defaults to “Dataframe”.

Raises:

NotRenderableError – if dataframe cannot be rendered

Returns:

rich table

Return type:

rich.table.Table

tau_eval.utils.evaluate_system_output(inputs: list[str], outputs: list[str], metrics: list[str | Callable[[str | list[str], str | list[str]], dict[str, float]]] = ['rouge', 'meteor', 'luar']) → dict

Evaluate a system output with automatic metrics.
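
The `metrics` argument accepts callables as well as metric names. According to the type hint, a callable takes two arguments (each a string or a list of strings) and returns a `dict[str, float]`. The sketch below shows one function with that shape. `length_ratio` and its output key are hypothetical, and the assumption that the first argument holds the inputs and the second the outputs is not confirmed by this page.

```python
from typing import Union

TextOrTexts = Union[str, list[str]]


def length_ratio(inputs: TextOrTexts, outputs: TextOrTexts) -> dict[str, float]:
    """Hypothetical metric: mean output/input length ratio in whitespace tokens."""
    # Normalize single strings to one-element lists.
    if isinstance(inputs, str):
        inputs = [inputs]
    if isinstance(outputs, str):
        outputs = [outputs]
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")

    ratios = []
    for src, out in zip(inputs, outputs):
        n_src = len(src.split())
        # An empty input gets ratio 0.0 to avoid division by zero.
        ratios.append(len(out.split()) / n_src if n_src else 0.0)
    mean = sum(ratios) / len(ratios) if ratios else 0.0
    return {"length_ratio": mean}


scores = length_ratio(
    ["Alice met Bob in Paris.", "Call me at noon."],
    ["[PERSON] met [PERSON] in [CITY].", "Call me."],
)
print(scores)  # {'length_ratio': 0.75}
```

Under the type hint above, a function like this could sit in the `metrics` list next to string names such as 'rouge'.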

tau_eval.utils.run_models_on_custom_task(models, task: CustomTask, metrics)

tau_eval.utils.run_models_on_task(models, task, metrics, classifier_name='answerdotai/ModernBERT-base', do_train=False, do_train_adversarial=False, device='cuda', export_generated_texts=True)

tau_eval.visualization.compute_correlation(data, metric1_name: str, metric2_name: str, method: str = 'kendall', aggregate_across_tasks: bool = False, task_name: str | None = None) → dict

Computes correlation between two metrics for models.

Parameters:
  • data – Data structure containing metrics for models across tasks

  • metric1_name – Name of the first metric

  • metric2_name – Name of the second metric

  • method – Correlation method (‘kendall’, ‘pearson’, ‘spearman’)

  • aggregate_across_tasks – If True, compute single correlation across all tasks

  • task_name – Specific task name to compute correlation (ignored if aggregate_across_tasks=True)

Returns:

dict of {task_name: (correlation, p_value)} for all tasks

Return type:

dict
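
To make the default `method='kendall'` concrete, the sketch below computes Kendall's tau-a for two metrics over a handful of models in plain Python. It illustrates the statistic only: it is not tau-eval's implementation, it applies no tie correction and computes no p-value, and the model names and scores are made up.

```python
from itertools import combinations


def kendall_tau_a(xs: list[float], ys: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # The sign of the product tells whether both metrics order the pair alike.
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores for four hypothetical models on one task.
scores = {
    "model_a": {"sbert": 0.91, "test_f1": 0.80},
    "model_b": {"sbert": 0.85, "test_f1": 0.74},
    "model_c": {"sbert": 0.78, "test_f1": 0.76},
    "model_d": {"sbert": 0.60, "test_f1": 0.55},
}
sbert = [m["sbert"] for m in scores.values()]
f1 = [m["test_f1"] for m in scores.values()]
print(kendall_tau_a(sbert, f1))
```

Here five of the six model pairs are ordered the same way by both metrics and one pair is reversed, so tau-a is (5 - 1) / 6, roughly 0.67.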

tau_eval.visualization.get_all_dataset_names(data)

Extracts all dataset names from the Experiment data.

tau_eval.visualization.get_all_model_method_names(data)

Extracts all unique model names across all datasets.

tau_eval.visualization.get_all_numeric_metric_names(data)

Extracts all unique numeric metric names present in model/method results or in ‘original_metrics’.
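
This page does not specify the layout of the results data. As a rough illustration of what collecting numeric metric names involves, the sketch below assumes a hypothetical nesting of dataset, then model (or ‘original_metrics’), then metric. The layout, the sample values, and the `numeric_metric_names` helper are all assumptions and may not match what tau-eval produces.

```python
from numbers import Real

# Hypothetical results layout, assumed for illustration only:
# {dataset: {"original_metrics": {...}, model_name: {metric: value, ...}}}
data = {
    "dataset_a": {
        "original_metrics": {"test_accuracy": 0.90},
        "model_x": {"sbert": 0.82, "test_accuracy": 0.85, "notes": "ok"},
    },
    "dataset_b": {
        "model_x": {"rougeL": 0.41, "trained": True},
    },
}


def numeric_metric_names(data: dict) -> list[str]:
    """Collect sorted names of metrics whose values are real numbers."""
    names = set()
    for per_dataset in data.values():
        for metrics in per_dataset.values():
            if not isinstance(metrics, dict):
                continue
            for name, value in metrics.items():
                # bool is a subclass of int, so exclude it explicitly.
                if isinstance(value, Real) and not isinstance(value, bool):
                    names.add(name)
    return sorted(names)


print(numeric_metric_names(data))  # ['rougeL', 'sbert', 'test_accuracy']
```

String and boolean entries are skipped, so only values that could be plotted end up in the result.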

tau_eval.visualization.plot_all_metrics_for_model_on_dataset(data, dataset_name, model_name)

Plots all numeric metrics for a specific model on a specific dataset.

Parameters:
  • data – The parsed JSON data from Experiment.

  • dataset_name – The name of the dataset.

  • model_name – The name of the model (e.g., ‘google/gemini-flash-1.5-8b’). Use “Original Model” to see metrics from original texts.

tau_eval.visualization.plot_metric_comparison_across_datasets(data, metric_name, specific_models=None, show_original=True)

Compares a specific metric for selected models across all datasets.

Parameters:
  • data – The parsed JSON data from Experiment.

  • metric_name – The metric to compare (e.g., ‘bertscore_f1’, ‘test_accuracy’).

  • specific_models (optional) – A list of model names to include. If None, includes all found models.

  • show_original – If True, includes ‘Original Model’ performance for the metric.

tau_eval.visualization.plot_metric_distribution(data, metric_name, chart_type='hist')

Plots the distribution of a specific metric across all models/methods and datasets.

Parameters:
  • data – The parsed JSON data from Experiment.

  • metric_name – The metric whose distribution is to be plotted.

  • chart_type – Type of chart: ‘hist’ for histogram, ‘box’ for box plot.

tau_eval.visualization.plot_radar_model_comparison(data, metric_name, model_list, ordered_dataset_keys)

Generates a single radar plot to compare specified model series across datasets for a given metric.

Parameters:
  • data – The parsed JSON data from Experiment.

  • metric_name – The metric to plot (e.g., ‘sbert’, ‘test_f1’).

  • model_list – A list of model/method names to include.

  • ordered_dataset_keys – List of dataset names, determining the order of axes on the radar.

tau_eval.visualization.plot_trade_off_metric(data, x_metric_name, y_metric_name, task_list=None, model_list=None)

Creates a scatter plot of a task performance metric vs. an anonymization/utility metric, with a legend for different model/method groups.

Parameters:
  • data – The parsed JSON data from Experiment.

  • x_metric_name – e.g., ‘test_f1’, ‘test_accuracy’.

  • y_metric_name – e.g., ‘bertscore_f1’, ‘rougeL’, ‘sbert’.

  • task_list (optional) – Filter for specific tasks.

  • model_list (optional) – Filter for specific models.