Experiment Constructor

This page walks you through assembling a complete ASR experiment step by step -- from choosing your data to ready-to-run code you can copy and execute.

If you already know what you need, skip the explanations and jump straight to the interactive builder at the bottom.


Step 1. What data will you measure on?

Before picking models and metrics, you need to decide what data you will evaluate on.

Built-in datasets

plantain2asr ships with loaders for several Russian speech corpora. Each loader parses the corpus structure automatically and provides a uniform AudioSample interface.

Golos

An open-source corpus by Sber. ~1 200 hours of Russian speech. Two subsets:

  • crowd -- crowdsourced recordings (clean, diverse speakers)
  • farfield -- far-field microphone recordings (noisier, more realistic)

  • Size: ~1 200 h
  • Audio format: WAV / OGG
  • Download: github.com/sberdevices/golos
  • Loader: GolosDataset("data/golos")
  • Auto-download: yes (auto_download=True)
from plantain2asr import GolosDataset

ds = GolosDataset("data/golos")
crowd = ds.filter(lambda s: s.meta["subset"] == "crowd")

DaGRuS

A conversational Russian speech corpus with detailed annotations: laughter, noise, unclear words, fillers.

  • Size: ~60 h
  • Key feature: conversational speech, event annotations
  • Download: available on request from the corpus authors
  • Loader: DagrusDataset("data/dagrus")

Normalization for DaGRuS

Use DagrusNormalizer() -- it knows how to strip corpus-specific annotations ([laugh], [noise], {word*}) and normalize colloquial forms.

from plantain2asr import DagrusDataset, DagrusNormalizer

ds = DagrusDataset("data/dagrus")
norm = ds >> DagrusNormalizer()
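For intuition, the annotation stripping can be sketched in plain Python with two regular expressions. This is a rough approximation under assumed markup rules, not the library's actual implementation -- the exact behavior belongs to DagrusNormalizer():

```python
import re

def strip_dagrus_markup(text: str) -> str:
    """Sketch: drop event tags like [laugh] / [noise] and
    unwrap {word*} annotations, keeping the word itself."""
    text = re.sub(r"\[[^\]]+\]", " ", text)         # [laugh] -> removed
    text = re.sub(r"\{([^}*]+)\*?\}", r"\1", text)  # {word*} -> word
    return " ".join(text.split())                   # collapse whitespace

print(strip_dagrus_markup("[laugh] привет {мир*}"))  # -> "привет мир"
```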

RuDevices

A corpus of recordings from various devices (laptops, phones, smart speakers).

  • Loader: RuDevicesDataset("data/rudevices")
  • Key feature: different devices and recording conditions

from plantain2asr import RuDevicesDataset

ds = RuDevicesDataset("data/rudevices")

Using your own data

If your data is not covered by the built-in loaders, there are two paths.

Path 1: NeMo-format JSONL

If you have audio files and a JSONL manifest, use NeMoDataset:

{"audio_filepath": "audio/001.wav", "text": "hello world", "duration": 2.1}
{"audio_filepath": "audio/002.wav", "text": "how are you", "duration": 1.8}

from plantain2asr import NeMoDataset

ds = NeMoDataset(root_dir="data/my_corpus", manifest_path="data/my_corpus/manifest.jsonl")
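If you need to build such a manifest programmatically, a few lines of standard-library Python are enough (the file names and texts here are illustrative):

```python
import json

samples = [
    {"audio_filepath": "audio/001.wav", "text": "hello world", "duration": 2.1},
    {"audio_filepath": "audio/002.wav", "text": "how are you", "duration": 1.8},
]

# one JSON object per line, non-ASCII text written as-is
with open("manifest.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```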

Path 2: custom loader class

Subclass BaseASRDataset and return a list of AudioSample:

from plantain2asr.dataloaders.base import BaseASRDataset
from plantain2asr.dataloaders.types import AudioSample

class MyDataset(BaseASRDataset):
    def __init__(self, root_dir):
        super().__init__()
        self.name = "my-dataset"
        self._samples = [
            AudioSample(id="s1", audio_path=f"{root_dir}/001.wav", text="reference text"),
        ]

More details: Extending -> Custom Model


Step 2. Which metrics do you need?

Metrics quantify how well a model recognizes speech.

Core metrics

  • WER (Word Error Rate) -- fraction of erroneous words; counts insertions, deletions, and substitutions at the word level. The universal primary metric -- always include it.
  • CER (Character Error Rate) -- the same idea at the character level. Use when spelling accuracy matters, not just words.
  • MER (Match Error Rate) -- a normalized variant of WER that accounts for both string lengths. More stable on short utterances.
  • Accuracy -- 1 - MER, the fraction of correctly recognized content. Use when you want an intuitive "percent correct" number.
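To make WER concrete, here is a minimal self-contained computation -- plantain2asr computes this for you, so the sketch is for intuition only. WER is the word-level edit distance (insertions + deletions + substitutions) divided by the number of reference words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("мама мыла раму", "мама мыла рамы"))  # 1 substitution out of 3 words -> 0.333...
```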

Additional metrics

  • WIL -- Word Information Lost
  • WIP -- Word Information Preserved
  • IDR -- Insertion / Deletion Ratio
  • LengthRatio -- hypothesis length divided by reference length
  • BERTScore -- semantic similarity via BERT embeddings (requires the analysis extra)
  • POSAnalysis -- POS-tag error analysis (requires the analysis extra)

What should I choose?

Recommendation

For a first evaluation, use Metrics.composite() -- it computes WER, CER, MER, WIL, WIP, Accuracy, IDR, and LengthRatio in a single pass.

from plantain2asr import Metrics

norm >> Metrics.composite()

If you only need one metric:

norm >> Metrics.WER()

Step 3. Which models to compare?

plantain2asr supports several ASR model families. They all share the same interface: dataset >> Models.XXX().

Local models

  • GigaAM v3 -- large Sber model, e2e-RNNT architecture; best Russian quality. Device: CUDA / MPS / CPU. Extra: gigaam. Choose when quality matters and you have a GPU.
  • GigaAM v2 -- previous GigaAM generation. Device: CUDA / MPS / CPU. Extra: gigaam. For comparison with v3.
  • Whisper -- OpenAI model, large-v3; strong multilingual baseline. Device: CUDA / MPS / CPU. Extra: whisper. A universal baseline.
  • T-One -- T-Bank model on ONNX Runtime; fast inference. Device: CUDA / CPU. Extra: tone + the T-One source archive. Choose when speed matters.
  • Vosk -- lightweight offline model on Kaldi. Device: CPU only. Extra: vosk. Choose when you have no GPU and need offline recognition.
  • Canary -- NVIDIA NeMo Canary; heavy, requires a GPU. Device: CUDA. Extra: canary. For research comparisons.

Cloud models

  • SaluteSpeech -- Sber cloud API. Extra: none. For cloud-based recognition.

Installation

Each model requires its own set of dependencies. Install only what you need:

pip install plantain2asr[gigaam]
pip install plantain2asr[whisper]
pip install plantain2asr[vosk]
pip install plantain2asr[tone]
pip install "tone @ https://github.com/voicekit-team/T-one/archive/3c5b6c015038173840e62cea99e10cdb1c759116.tar.gz"

Or the full CPU/GPU stack at once:

pip install plantain2asr[asr-cpu]
pip install plantain2asr[asr-gpu]

Running models

from plantain2asr import Models

ds >> Models.GigaAM_v3()
ds >> Models.Whisper()
ds >> Models.Vosk(model_path="path/to/vosk-model")

Results are cached: re-running skips already processed samples.


Step 4. Text normalization

Before computing metrics, you need to bring references and hypotheses to a common form: remove punctuation, normalize case, handle corpus-specific markup.

  • SimpleNormalizer() -- lowercase, strip punctuation, ё -> е, collapse whitespace. Use for most corpora.
  • DagrusNormalizer() -- everything SimpleNormalizer does, plus stripping DaGRuS markup and normalizing colloquial forms. Use for the DaGRuS corpus.
  • No normalization -- metrics are computed on raw text. Only if your texts are already normalized.

from plantain2asr import SimpleNormalizer

norm = ds >> SimpleNormalizer()
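Roughly, the transformation SimpleNormalizer applies can be pictured like this -- an approximation for intuition, not the library's actual code:

```python
import re

def simple_normalize(text: str) -> str:
    text = text.lower().replace("ё", "е")  # lowercase, ё -> е
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation
    return " ".join(text.split())          # collapse whitespace

print(simple_normalize("Ёлка, привет -- МИР!"))  # -> "елка привет мир"
```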

Step 5. Assemble the >> chain

Now that you have chosen data, models, normalizer, and metrics, assemble them into a pipeline using the >> operator:

from plantain2asr import GolosDataset, Models, SimpleNormalizer, Metrics

ds = GolosDataset("data/golos")

# step 1: run models
ds >> Models.GigaAM_v3()
ds >> Models.Whisper()

# step 2: normalize
norm = ds >> SimpleNormalizer()

# step 3: compute metrics
norm >> Metrics.composite()

# step 4: view results
df = norm.to_pandas()
print(df.groupby("model")[["WER", "CER"]].mean().sort_values("WER"))

Each >> creates a new results layer on top of the dataset. You can branch (.filter()), subsample (.take(n)), and recombine at any point.

Experiment convenience wrapper

If you don't need manual control, Experiment wraps the same >> steps:

from plantain2asr import Experiment, GolosDataset, Models, SimpleNormalizer

experiment = Experiment(
    dataset=GolosDataset("data/golos"),
    models=[Models.GigaAM_v3(), Models.Whisper()],
    normalizer=SimpleNormalizer(),
)

experiment.compare_on_corpus(metrics=["WER", "CER", "Accuracy"])

  • compare_on_corpus() -- model comparison with a metric table
  • prepare_thesis_tables() -- CSV tables for a thesis or paper
  • export_appendix_bundle() -- full package: tables + report + benchmark
  • benchmark_models() -- latency, throughput, and RTF measurements
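RTF (real-time factor) is processing time divided by audio duration, so RTF < 1 means faster than real time. A quick sketch of the arithmetic (the numbers are made up):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / audio duration; lower is faster."""
    return processing_seconds / audio_seconds

# e.g. 12 s of compute for a 60 s clip: 5x faster than real time
print(real_time_factor(12.0, 60.0))  # -> 0.2
```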

Interactive Builder

Pick your components below, and the builder will show you ready-to-use code, the install command, and a list of output artifacts.

1 What result do you need?
2 Which dataset?
3 Which models?
4 Normalizer
5 Metrics
6 Additional outputs



What's next?