tau_eval.tasks package

class tau_eval.tasks.CustomTask(dataset: datasets.arrow_dataset.Dataset = None, name: str = '', s1: str = 'text', s2: str = '')

Bases: object

dataset: Dataset = None

evaluate(new_texts: list[str]) → dict

name: str = ''

s1: str = 'text'

s2: str = ''

class tau_eval.tasks.DeIdentification(dataset: datasets.arrow_dataset.Dataset = None, name: str = '', s1: str = 'text', s2: str = '', max_rows: int = None)

Bases: CustomTask

dataset: Dataset = None

evaluate(new_texts: list[str]) → dict

max_rows: int = None

name: str = ''

class tau_eval.tasks.IMDBAuthorshipClassification(n_authors: int = 10, min_docs_per_author: int = 1000, random_seed: int = 0, **kwargs)

Bases: Classification

A classification task for authorship attribution using the IMDb-62 dataset.

Inherits from tasknet.Classification and automatically processes the dataset to select authors with sufficient documents for classification.

property author_info: dict

Get information about selected authors and their document counts.

Submodules

class tau_eval.tasks.customtask.CustomTask(dataset: datasets.arrow_dataset.Dataset = None, name: str = '', s1: str = 'text', s2: str = '')

Bases: object

dataset: Dataset = None

evaluate(new_texts: list[str]) → dict

name: str = ''

s1: str = 'text'

s2: str = ''

class tau_eval.tasks.deidentification.DeIdentification(dataset: datasets.arrow_dataset.Dataset = None, name: str = '', s1: str = 'text', s2: str = '', max_rows: int = None)

Bases: CustomTask

dataset: Dataset = None

evaluate(new_texts: list[str]) → dict

max_rows: int = None

name: str = ''

tau_eval.tasks.deidentification.dataset_task_preprocessing(dataset_name: str, dataset_size: int = 2500) → Dataset

tau_eval.tasks.deidentification.extract_non_o_words(tokens, tags)

Extract words associated with non-“O” tags by merging tokens.

Parameters:

tokens (list of str) – The list of tokens.
tags (list of str) – The list of tags corresponding to each token.

Returns:

A dictionary whose keys are the tags (e.g., “B-NAME”, “B-EMAIL”) and whose values are the merged words for each tag.

Return type:

dict
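The merging described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name merge_non_o_tokens is hypothetical, and the sketch assumes BIO-style tags (an “I-X” token continues the preceding “B-X” entity), joins merged tokens with a single space, and stores each tag's merged words in a list. The actual return shape of extract_non_o_words may differ.

```python
def merge_non_o_tokens(tokens, tags):
    """Hypothetical sketch: merge the tokens of each non-"O" entity.

    Assumes BIO tagging; the result is keyed by the tag of the
    entity's first token (e.g. "B-NAME").
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    merged = {}         # first tag of an entity -> list of merged words
    current_key = None  # tag that opened the entity being built
    current_words = []  # tokens collected for that entity

    for token, tag in zip(tokens, tags):
        # An "I-X" tag extends the open entity only if the types match.
        continues = (
            current_key is not None
            and tag.startswith("I-")
            and tag[2:] == current_key[2:]
        )
        if continues:
            current_words.append(token)
            continue

        # Any other tag closes the open entity, if there is one.
        if current_key is not None:
            merged.setdefault(current_key, []).append(" ".join(current_words))
            current_key, current_words = None, []

        # A non-"O" tag opens a new entity.
        if tag != "O":
            current_key, current_words = tag, [token]

    # Close an entity that runs to the end of the sequence.
    if current_key is not None:
        merged.setdefault(current_key, []).append(" ".join(current_words))

    return merged


example = merge_non_o_tokens(
    ["Contact", "John", "Smith", "at", "john@example.com"],
    ["O", "B-NAME", "I-NAME", "O", "B-EMAIL"],
)
```

Under these assumptions, example maps “B-NAME” to a list holding the single merged word “John Smith” and “B-EMAIL” to a list holding the address.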

class tau_eval.tasks.imdb_authorship_classification.IMDBAuthorshipClassification(n_authors: int = 10, min_docs_per_author: int = 1000, random_seed: int = 0, **kwargs)

Bases: Classification

A classification task for authorship attribution using the IMDb-62 dataset.

Inherits from tasknet.Classification and automatically processes the dataset to select authors with sufficient documents for classification.

property author_info: dict

Get information about selected authors and their document counts.
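As a rough illustration of how n_authors, min_docs_per_author and random_seed could interact, the sketch below filters authors by document count and then draws a seeded random sample. The helper select_authors is hypothetical and is not part of tau_eval; the class may select authors differently (for example, by taking the most prolific ones), and the returned mapping of author to document count is only a guess at the kind of information author_info exposes.

```python
import random
from collections import Counter


def select_authors(author_ids, n_authors=10, min_docs_per_author=1000, random_seed=0):
    """Hypothetical sketch: pick n_authors with enough documents.

    author_ids holds one author label per document.
    Returns a dict mapping each selected author to its document count.
    """
    counts = Counter(author_ids)

    # Keep only authors with enough documents; sort for a stable order
    # so the seeded sample is reproducible.
    eligible = sorted(a for a, c in counts.items() if c >= min_docs_per_author)
    if len(eligible) < n_authors:
        raise ValueError(
            f"only {len(eligible)} authors have at least "
            f"{min_docs_per_author} documents; {n_authors} requested"
        )

    # Assumption: selection is a seeded random sample of eligible authors.
    rng = random.Random(random_seed)
    chosen = rng.sample(eligible, n_authors)
    return {author: counts[author] for author in sorted(chosen)}


docs = ["a"] * 5 + ["b"] * 3 + ["c"] * 1
info = select_authors(docs, n_authors=2, min_docs_per_author=3)
```

With this toy input only authors “a” and “b” meet the threshold, so both are selected regardless of the seed.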