Datasets

Datasets are the backbone of the library. They hold samples, model outputs, metrics, and export-ready views.

BaseASRDataset

from plantain2asr.dataloaders.base import BaseASRDataset

Main responsibilities:

Most-used methods:

Method	What it does
`filter(fn)`	Returns a filtered dataset view
`take(n)`	Returns the first `n` samples
`run_model(model)`	Runs a model directly without writing pipeline syntax
`evaluate_metric(metric)`	Computes one metric directly
`to_pandas()`	Returns one row per `(sample, model)`
`iter_results_rows()`	Iterates flattened result rows
`save_csv(path)`	Exports flattened rows to CSV
`save_excel(path)`	Exports flattened rows to XLSX
`summarize_by_model()`	Builds aggregate metrics by model
`load_model_results(name, path)`	Loads precomputed JSONL inference

Pipeline form:

dataset >> model
dataset >> normalizer
dataset >> metric

from plantain2asr.dataloaders.types import AudioSample

Field	Type	Description
`id`	`str`	Unique sample identifier
`audio_path`	`str`	Audio file path
`text`	`str`	Reference transcript
`duration`	`float \\| None`	Duration in seconds
`meta`	`dict`	Arbitrary metadata
`asr_results`	`dict`	Per-model hypotheses and metrics

from plantain2asr import GolosDataset

ds = GolosDataset("data/golos")

Parameter	Type	Default	Description
`root_dir`	`str`	required	Storage directory
`limit`	`int \\| None`	`None`	Optional sample cap
`auto_download`	`bool`	`True`	Download automatically if missing

Typical metadata: meta["subset"] is "crowd" or "farfield".

from plantain2asr import DagrusDataset

ds = DagrusDataset("data/dagrus")

Parameter	Type	Default	Description
`root_dir`	`str`	required	Corpus root
`limit`	`int \\| None`	`None`	Optional sample cap

from plantain2asr import NeMoDataset

ds = NeMoDataset("data/my_corpus")

Parameter	Type	Default	Description
`root_dir`	`str`	required	Directory with `manifest.jsonl`
`limit`	`int \\| None`	`None`	Optional sample cap

If you want a ready-made research workflow, use Experiment on top of a dataset instead of orchestrating individual dataset calls by hand.