Command line tool¶
There are two main issues that may arise with results reported in the literature:
- Even though the same public datasets are used, the actual evaluation protocol may differ slightly from one paper to another.
- The implementation of the reported evaluation metrics may also differ.
The first objective of the pyannote.metrics library is to address these two problems and provide a convenient way for researchers to evaluate their approaches in a reproducible and comparable manner.
Here is an example of how the provided command line interface can be used to do so:
$ pyannote-metrics diarization --subset=development Etape.SpeakerDiarization.TV hypothesis.rttm
| Diarization (collar = 0 ms) | error (%) | purity (%) | coverage (%) | total (s) | correct (s) | correct (%) | fa. (s) | fa. (%) | miss. (s) | miss. (%) | conf. (s) | conf. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BFMTV_BFMStory_2011-03-17_175900 | 14.64 | 94.74 | 90.00 | 2582.08 | 2300.22 | 89.08 | 96.16 | 3.72 | 80.14 | 3.10 | 201.72 | 7.81 |
| LCP_CaVousRegarde_2011-02-17_204700 | 17.80 | 89.13 | 86.90 | 3280.72 | 2848.42 | 86.82 | 151.78 | 4.63 | 208.29 | 6.35 | 224.01 | 6.83 |
| LCP_EntreLesLignes_2011-03-18_192900 | 23.46 | 79.52 | 79.03 | 1704.97 | 1337.80 | 78.46 | 32.89 | 1.93 | 157.14 | 9.22 | 210.03 | 12.32 |
| LCP_EntreLesLignes_2011-03-25_192900 | 26.75 | 76.97 | 75.86 | 1704.13 | 1292.83 | 75.86 | 44.61 | 2.62 | 158.38 | 9.29 | 252.92 | 14.84 |
| LCP_PileEtFace_2011-03-17_192900 | 10.73 | 93.33 | 92.30 | 1611.49 | 1487.32 | 92.30 | 48.73 | 3.02 | 55.49 | 3.44 | 68.67 | 4.26 |
| LCP_TopQuestions_2011-03-23_213900 | 18.28 | 98.25 | 94.20 | 727.26 | 668.65 | 91.94 | 74.36 | 10.22 | 16.41 | 2.26 | 42.20 | 5.80 |
| LCP_TopQuestions_2011-04-05_213900 | 27.97 | 97.95 | 79.81 | 818.03 | 638.68 | 78.08 | 49.45 | 6.04 | 17.46 | 2.13 | 161.89 | 19.79 |
| TV8_LaPlaceDuVillage_2011-03-14_172834 | 21.43 | 92.89 | 89.64 | 996.12 | 892.04 | 89.55 | 109.36 | 10.98 | 11.80 | 1.18 | 92.28 | 9.26 |
| TV8_LaPlaceDuVillage_2011-03-21_201334 | 66.23 | 77.24 | 70.64 | 1296.86 | 691.76 | 53.34 | 253.80 | 19.57 | 29.16 | 2.25 | 575.95 | 44.41 |
| TOTAL | 23.27 | 88.18 | 84.55 | 14721.65 | 12157.71 | 82.58 | 861.14 | 5.85 | 734.28 | 4.99 | 1829.67 | 12.43 |
Tasks¶
Not only can the pyannote-metrics command line tool be used to compute the diarization error rate using the NIST implementation, it can also evaluate the four sub-modules typically found in most speaker diarization systems.
In practice, the first positional argument (e.g. diarization, above) is a flag indicating which task should be evaluated.
Apart from the diarization flag, which is used for evaluating speaker diarization results, the other available flags are detection (speech activity detection), segmentation (speaker change detection), and identification (supervised speaker identification).
Depending on the task, a different set of evaluation metrics is computed.
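For instance, a speech activity detection output can be scored against the same protocol with the detection flag; in the command below, sad_hypothesis.rttm is only a placeholder name for such an output, not a file from the original example.

# sad_hypothesis.rttm is a placeholder name for a speech activity detection output (RTTM format)
$ pyannote-metrics detection --subset=development Etape.SpeakerDiarization.TV sad_hypothesis.rttm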
Datasets and protocols¶
pyannote.metrics provides an easy way to ensure that the same protocol (i.e. manual groundtruth and training/development/test split) is used for evaluation.
Internally, it relies on a collection of Python packages that all derive from the main pyannote.database package, which provides a convenient API to define training/development/test splits along with groundtruth annotations.
In the example above, the development set of the TV evaluation protocol of the ETAPE dataset is used.
Results are reported for each file in the selected subset and aggregated into one final metric value.
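For instance, evaluating the same hypothesis on the test set should only be a matter of changing the --subset option (assuming the chosen protocol defines a test split):

# assuming the Etape.SpeakerDiarization.TV protocol also defines a test split
$ pyannote-metrics diarization --subset=test Etape.SpeakerDiarization.TV hypothesis.rttm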
As of March 2017, pyannote.database packages exist for the ETAPE corpus, the REPERE corpus, and the AMI corpus.
As more people contribute new pyannote.database packages, they will be added to the pyannote ecosystem.
File formats¶
Hypothesis files must use the [Rich Transcription Time Marked](https://web.archive.org/web/20170119114252/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) (RTTM) format.
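As a rough illustration only, a diarization hypothesis contains one SPEAKER line per speech turn, carrying the file identifier, channel, onset and duration (in seconds), and a speaker label, with unused fields marked <NA>; the exact field layout is defined in the specification linked above, and the onsets, durations, and speaker labels below are invented:

SPEAKER BFMTV_BFMStory_2011-03-17_175900 1 3.17 9.51 <NA> <NA> speaker_A <NA> <NA>
SPEAKER BFMTV_BFMStory_2011-03-17_175900 1 12.68 5.22 <NA> <NA> speaker_B <NA> <NA>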