Experiment Constructor
This page walks you through assembling a complete ASR experiment step by step -- from choosing your data to ready-to-run code you can copy and execute.
If you already know what you need, skip the explanations and jump straight to the interactive builder at the bottom.
Step 1. What data will you measure on?
Before picking models and metrics, you need to decide what data you will evaluate on.
Built-in datasets
plantain2asr ships with loaders for several Russian speech corpora.
Each loader parses the corpus structure automatically and provides a uniform AudioSample interface.
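The uniform sample shape can be sketched as a plain dataclass. This is illustrative only: the fields `id`, `audio_path`, `text`, and `meta` are the ones used in the examples on this page, not necessarily the library's full definition.

```python
from dataclasses import dataclass, field

@dataclass
class AudioSampleSketch:
    """Illustrative shape of a sample as it appears in this guide."""
    id: str
    audio_path: str
    text: str                                   # reference transcript
    meta: dict = field(default_factory=dict)    # e.g. {"subset": "crowd"}

s = AudioSampleSketch(id="s1", audio_path="audio/001.wav", text="привет")
```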
Golos
An open-source corpus by Sber. ~1 200 hours of Russian speech. Two subsets:
- crowd -- crowdsourced recordings (clean, diverse speakers)
- farfield -- far-field microphone recordings (noisier, more realistic)
| Size | ~1 200 h |
| Audio format | WAV / OGG |
| Download | github.com/sberdevices/golos |
| Loader | GolosDataset("data/golos") |
| Auto-download | yes (auto_download=True) |
from plantain2asr import GolosDataset
ds = GolosDataset("data/golos")
crowd = ds.filter(lambda s: s.meta["subset"] == "crowd")
DaGRuS
A conversational Russian speech corpus with detailed annotations: laughter, noise, unclear words, fillers.
| Size | ~60 h |
| Key feature | Conversational speech, event annotations |
| Download | available on request from corpus authors |
| Loader | DagrusDataset("data/dagrus") |
Normalization for DaGRuS
Use DagrusNormalizer() -- it knows how to strip corpus-specific annotations
([laugh], [noise], {word*}) and normalize colloquial forms.
from plantain2asr import DagrusDataset, DagrusNormalizer
ds = DagrusDataset("data/dagrus")
norm = ds >> DagrusNormalizer()
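The kind of cleanup DagrusNormalizer performs can be illustrated with a small regex sketch. The annotation patterns come from the list above; the real normalizer does more (e.g. colloquial forms), and `strip_dagrus_markup` is a hypothetical helper, not part of the library.

```python
import re

def strip_dagrus_markup(text: str) -> str:
    """Remove [laugh]/[noise]-style event tags and {word*} unclear-word braces."""
    text = re.sub(r"\[[^\]]*\]", " ", text)         # [laugh], [noise], ...
    text = re.sub(r"\{([^}*]*)\*?\}", r"\1", text)  # {word*} -> word
    return re.sub(r"\s+", " ", text).strip()        # tidy leftover whitespace

print(strip_dagrus_markup("ну [laugh] {вроде*} да [noise]"))  # "ну вроде да"
```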
RuDevices
A corpus of recordings from various devices (laptops, phones, smart speakers).
| Loader | RuDevicesDataset("data/rudevices") |
| Key feature | Different devices and recording conditions |
Using your own data
If your data is not covered by the built-in loaders, there are two paths.
Path 1: NeMo-format JSONL
If you have audio files and a JSONL manifest, use NeMoDataset:
{"audio_filepath": "audio/001.wav", "text": "hello world", "duration": 2.1}
{"audio_filepath": "audio/002.wav", "text": "how are you", "duration": 1.8}
from plantain2asr import NeMoDataset
ds = NeMoDataset(root_dir="data/my_corpus", manifest_path="data/my_corpus/manifest.jsonl")
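The manifest is plain JSON Lines: one JSON object per audio file. A stdlib sketch of parsing such a manifest, independent of plantain2asr (`read_manifest` is a hypothetical helper shown only to make the format concrete):

```python
import json

def read_manifest(lines):
    """Parse NeMo-style JSONL manifest lines into a list of dicts."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        # Every entry must at least point to an audio file and its transcript.
        assert "audio_filepath" in entry and "text" in entry
        samples.append(entry)
    return samples

manifest = [
    '{"audio_filepath": "audio/001.wav", "text": "hello world", "duration": 2.1}',
    '{"audio_filepath": "audio/002.wav", "text": "how are you", "duration": 1.8}',
]
samples = read_manifest(manifest)
```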
Path 2: custom loader class
Subclass BaseASRDataset and return a list of AudioSample:
from plantain2asr.dataloaders.base import BaseASRDataset
from plantain2asr.dataloaders.types import AudioSample
class MyDataset(BaseASRDataset):
    def __init__(self, root_dir):
        super().__init__()
        self.name = "my-dataset"
        self._samples = [
            AudioSample(id="s1", audio_path=f"{root_dir}/001.wav", text="reference text"),
        ]
More details: Extending -> Custom Model
Step 2. Which metrics do you need?
Metrics quantify how closely a model's transcriptions match the reference text.
Core metrics
| Metric | What it measures | When to use |
|---|---|---|
| WER (Word Error Rate) | Fraction of erroneous words. Counts insertions, deletions, and substitutions at the word level. | Universal primary metric. Always include. |
| CER (Character Error Rate) | Same idea, but at the character level. | When spelling accuracy matters, not just words. |
| MER (Match Error Rate) | Normalized variant of WER accounting for both string lengths. | More stable on short utterances. |
| Accuracy | 1 - MER. The fraction of correctly recognized content. | When you want an intuitive "percent correct" number. |
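To make the WER definition concrete, here is a minimal sketch (not the library's implementation) that counts substitutions, deletions, and insertions via word-level edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the cat sit"))  # one substitution out of three words
```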
Additional metrics
| Metric | What it measures |
|---|---|
| WIL | Word Information Lost |
| WIP | Word Information Preserved |
| IDR | Insertion / Deletion Ratio |
| LengthRatio | Hypothesis length divided by reference length |
| BERTScore | Semantic similarity via BERT embeddings (requires analysis extra) |
| POSAnalysis | POS-tag error analysis (requires analysis extra) |
What should I choose?
Recommendation
For a first evaluation, use Metrics.composite() -- it computes WER, CER, MER,
WIL, WIP, Accuracy, IDR, and LengthRatio in a single pass.
If you only need a single metric, apply just that metric in place of the composite.
Step 3. Which models to compare?
plantain2asr supports several ASR model families. They all share the same interface:
dataset >> Models.XXX().
Local models
| Model | Description | Device | pip extra | When to choose |
|---|---|---|---|---|
| GigaAM v3 | Large Sber model, e2e-RNNT architecture. Best Russian quality. | CUDA / MPS / CPU | gigaam | When quality matters and you have a GPU |
| GigaAM v2 | Previous GigaAM generation. | CUDA / MPS / CPU | gigaam | For comparison with v3 |
| Whisper | OpenAI model, large-v3. Strong multilingual baseline. | CUDA / MPS / CPU | whisper | Universal baseline |
| T-One | T-Bank model on ONNX Runtime. Fast inference. | CUDA / CPU | tone + T-One source archive | When speed matters |
| Vosk | Lightweight offline model on Kaldi. CPU only. | CPU | vosk | No GPU, need offline |
| Canary | NVIDIA NeMo Canary. Heavy, requires GPU. | CUDA | canary | Research comparisons |
Cloud models
| Model | Description | Extra | When to choose |
|---|---|---|---|
| SaluteSpeech | Sber cloud API. | none | Cloud-based recognition |
Installation
Each model requires its own set of dependencies. Install only what you need:
pip install plantain2asr[gigaam]
pip install plantain2asr[whisper]
pip install plantain2asr[vosk]
pip install plantain2asr[tone]
pip install "tone @ https://github.com/voicekit-team/T-one/archive/3c5b6c015038173840e62cea99e10cdb1c759116.tar.gz"
You can also install the full CPU/GPU stack (all model extras) at once.
Running models
from plantain2asr import Models
ds >> Models.GigaAM_v3()
ds >> Models.Whisper()
ds >> Models.Vosk(model_path="path/to/vosk-model")
Results are cached: re-running skips already processed samples.
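The caching idea can be sketched as follows. This is a simplified illustration of skip-on-rerun behavior, not plantain2asr's actual cache; `run_model` and `fake_transcribe` are hypothetical names.

```python
def run_model(samples, transcribe, cache):
    """Transcribe only samples whose id is not already in the cache."""
    for sample_id, audio in samples:
        if sample_id in cache:
            continue  # already processed on a previous run
        cache[sample_id] = transcribe(audio)
    return cache

calls = []
def fake_transcribe(audio):
    calls.append(audio)       # record that real work happened
    return audio.upper()

cache = {}
run_model([("s1", "privet"), ("s2", "poka")], fake_transcribe, cache)
run_model([("s1", "privet"), ("s2", "poka"), ("s3", "da")], fake_transcribe, cache)
# On the second run only the new sample "s3" is transcribed.
```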
Step 4. Text normalization
Before computing metrics, you need to bring references and hypotheses to a common form: remove punctuation, normalize case, handle corpus-specific markup.
| Normalizer | What it does | When to use |
|---|---|---|
| SimpleNormalizer() | Lowercase, strip punctuation, ё -> е, collapse whitespace | Most corpora |
| DagrusNormalizer() | Everything SimpleNormalizer does + strips DaGRuS markup + normalizes colloquial forms | DaGRuS corpus |
| No normalization | Metrics are computed on raw text | Only if texts are already normalized |
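The effect of SimpleNormalizer can be sketched in plain Python. This mirrors the steps listed in the table above (lowercase, strip punctuation, ё -> е, collapse whitespace); it is not the library's code.

```python
import re

def simple_normalize(text: str) -> str:
    """Lowercase, strip punctuation, map ё -> е, collapse whitespace."""
    text = text.lower().replace("ё", "е")
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

print(simple_normalize("Ёжик, привет!  Как дела?"))  # "ежик привет как дела"
```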
Step 5. Assemble the >> chain
Now that you have chosen data, models, normalizer, and metrics, assemble them
into a pipeline using the >> operator:
from plantain2asr import GolosDataset, Models, SimpleNormalizer, Metrics
ds = GolosDataset("data/golos")
# step 1: run models
ds >> Models.GigaAM_v3()
ds >> Models.Whisper()
# step 2: normalize
norm = ds >> SimpleNormalizer()
# step 3: compute metrics
norm >> Metrics.composite()
# step 4: view results
df = norm.to_pandas()
print(df.groupby("model")[["WER", "CER"]].mean().sort_values("WER"))
Each >> creates a new results layer on top of the dataset.
You can branch (.filter()), subsample (.take(n)), and recombine at any point.
Experiment convenience wrapper
If you don't need manual control, Experiment wraps the same >> steps:
from plantain2asr import Experiment, GolosDataset, Models, SimpleNormalizer
experiment = Experiment(
    dataset=GolosDataset("data/golos"),
    models=[Models.GigaAM_v3(), Models.Whisper()],
    normalizer=SimpleNormalizer(),
)
experiment.compare_on_corpus(metrics=["WER", "CER", "Accuracy"])
| Method | What it does |
|---|---|
| compare_on_corpus() | Model comparison with metric table |
| prepare_thesis_tables() | CSV tables for thesis/paper |
| export_appendix_bundle() | Full package: tables + report + benchmark |
| benchmark_models() | Latency, throughput, RTF measurements |
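RTF (real-time factor) is processing time divided by audio duration; values below 1.0 mean faster than real time. A minimal sketch of the formula (my formulation, not the library's internals):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the model transcribes faster than real time."""
    return processing_seconds / audio_seconds

# 10 s of audio transcribed in 2 s -> RTF 0.2
print(real_time_factor(2.0, 10.0))
```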
Interactive Builder
Pick your components below, and the builder will show you ready-to-use code, the install command, and a list of output artifacts.
What's next?
- Quick Start -- a canonical runnable workflow from start to finish
- API Reference -> Datasets -- full dataset method documentation
- API Reference -> Metrics -- all available metrics
- Extending -- how to add your own normalizer, model, or metric