Principles¶
pyannote.metrics provides a set of classes to compare the output of speaker diarization systems (hereafter called hypothesis) to manual annotations (reference). Let us first instantiate a sample reference and hypothesis.
In [1]: from pyannote.core import Segment, Timeline, Annotation
In [2]: reference = Annotation(uri='file1')
...: reference[Segment(0, 10)] = 'A'
...: reference[Segment(12, 20)] = 'B'
...: reference[Segment(24, 27)] = 'A'
...: reference[Segment(30, 40)] = 'C'
...:
In [3]: hypothesis = Annotation(uri='file1')
...: hypothesis[Segment(2, 13)] = 'a'
...: hypothesis[Segment(13, 14)] = 'd'
...: hypothesis[Segment(14, 20)] = 'b'
...: hypothesis[Segment(22, 38)] = 'c'
...: hypothesis[Segment(38, 40)] = 'd'
...:
This tells us that, according to the manual annotation, speaker A speaks during time ranges [0s, 10s] and [24s, 27s].
Note
Overlapping segments are supported. See the pyannote.core documentation for more details.
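If needed, these annotations can also be inspected programmatically. The sketch below (not part of the original example) uses two pyannote.core accessors to list the speakers and retrieve the segments assigned to speaker A:

# list of speaker labels in the reference: ['A', 'B', 'C']
print(reference.labels())

# timeline of segments labelled 'A': [0s, 10s] and [24s, 27s]
print(reference.label_timeline('A'))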
pyannote.metrics follows an object-oriented paradigm. Most evaluation metrics (e.g. DiarizationErrorRate below) inherit from BaseMetric. As such, they share a common set of methods. For instance, once instantiated, they can be called directly to compute the value of the evaluation metric.
In [4]: from pyannote.metrics.diarization import DiarizationErrorRate
In [5]: metric = DiarizationErrorRate()
In [6]: metric(reference, hypothesis)
Out[6]: 0.5161290322580645
Accumulation & reporting¶
The same metric instance can be used to evaluate multiple files.
In [7]: other_reference = Annotation(uri='file2')
...: other_reference[Segment(0, 5)] = 'A'
...: other_reference[Segment(6, 10)] = 'B'
...: other_reference[Segment(12, 13)] = 'B'
...: other_reference[Segment(15, 20)] = 'A'
...:
In [8]: other_hypothesis = Annotation(uri='file2')
...: other_hypothesis[Segment(1, 6)] = 'a'
...: other_hypothesis[Segment(6, 7)] = 'b'
...: other_hypothesis[Segment(7, 10)] = 'c'
...: other_hypothesis[Segment(11, 19)] = 'b'
...: other_hypothesis[Segment(19, 20)] = 'a'
...:
In [9]: metric = DiarizationErrorRate()
In [10]: metric(reference, hypothesis)
Out[10]: 0.5161290322580645
In [11]: metric(other_reference, other_hypothesis)
Out[11]: 0.7333333333333333
You do not need to keep track of the result of each call yourself: this is done automatically. For instance, once you have evaluated all files, you can use the overridden __abs__() operator to get the accumulated value:
In [12]: abs(metric)
Out[12]: 0.5869565217391305
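In practice, accumulation typically happens in a loop over a whole test set. The following sketch (not part of the original example; the make helper and the two dictionaries are purely illustrative) reproduces the two files above:

from pyannote.core import Segment, Annotation
from pyannote.metrics.diarization import DiarizationErrorRate

def make(uri, turns):
    # illustrative helper: build an Annotation from (start, end, label) tuples
    annotation = Annotation(uri=uri)
    for start, end, label in turns:
        annotation[Segment(start, end)] = label
    return annotation

groundtruths = {
    'file1': make('file1', [(0, 10, 'A'), (12, 20, 'B'), (24, 27, 'A'), (30, 40, 'C')]),
    'file2': make('file2', [(0, 5, 'A'), (6, 10, 'B'), (12, 13, 'B'), (15, 20, 'A')]),
}
hypotheses = {
    'file1': make('file1', [(2, 13, 'a'), (13, 14, 'd'), (14, 20, 'b'), (22, 38, 'c'), (38, 40, 'd')]),
    'file2': make('file2', [(1, 6, 'a'), (6, 7, 'b'), (7, 10, 'c'), (11, 19, 'b'), (19, 20, 'a')]),
}

metric = DiarizationErrorRate()
for uri, groundtruth in groundtruths.items():
    metric(groundtruth, hypotheses[uri])  # each call accumulates internally

print(abs(metric))  # accumulated value over both files (0.587, as above)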
report() provides a convenient summary of the result:
In [13]: report = metric.report(display=True)
item   diarization error rate %  total  correct       %  false alarm       %  missed detection      %  confusion       %
file1                     51.61  31.00    22.00   70.97         7.00   22.58              2.00   6.45       7.00   22.58
file2                     73.33  15.00     8.00   53.33         4.00   26.67              1.00   6.67       6.00   40.00
TOTAL                     58.70  46.00    30.00   65.22        11.00   23.91              3.00   6.52      13.00   28.26
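report() returns this summary as a pandas DataFrame, so the same results can be post-processed or written to disk; for instance (a minimal sketch, with a hypothetical output path):

df = metric.report()          # pandas DataFrame: one row per file, plus TOTAL
df.to_csv('diarization.csv')  # hypothetical output path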
The internal accumulator can be reset using the reset() method:
In [14]: metric.reset()
Evaluation map¶
Though audio files can always be processed entirely (from beginning to end), there are cases where reference annotations are only available for some regions of the audio files. All metrics support the provision of an evaluation map (passed as the uem argument below) that indicates which parts of the audio file should be evaluated.
In [15]: uem = Timeline([Segment(0, 10), Segment(15, 20)])
In [16]: metric(reference, hypothesis, uem=uem)
Out[16]: 0.13333333333333333
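When no evaluation map is distributed with a corpus, one common fallback (a sketch, not something pyannote.metrics requires) is to score the whole extent of the reference annotation:

# evaluation map covering the whole extent of the reference ([0s, 40s] here)
whole_file = Timeline([reference.get_timeline().extent()])

# use a fresh metric instance so the accumulator above is left untouched
DiarizationErrorRate()(reference, hypothesis, uem=whole_file)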
Components¶
Most metrics are computed as the combination of several components. For instance, the diarization error rate is the combination of false alarm (non-speech regions classified as speech), missed detection (speech regions classified as non-speech) and confusion between speakers.
Using detailed=True will return the value of each component:
In [17]: metric(reference, hypothesis, detailed=True)
Out[17]:
{'correct': 22.0,
'confusion': 7.0,
'missed detection': 2.0,
'total': 31.0,
'false alarm': 7.0,
'diarization error rate': 0.5161290322580645}
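In other words, the diarization error rate is the sum of the three error components divided by the total: for this file, (7 + 2 + 7) / 31 ≈ 0.516, consistent with the value returned above.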
The accumulated value of each component can also be obtained using the overridden __getitem__() operator:
In [18]: metric(other_reference, other_hypothesis)
Out[18]: 0.7333333333333333
In [19]: metric['confusion']
Out[19]: 13.0
In [20]: metric[:]
Out[20]:
{'correct': 43.0,
'confusion': 13.0,
'missed detection': 5.0,
'total': 61.0,
'false alarm': 11.0}
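Combining these accumulated components with the same formula gives (11 + 5 + 13) / 61 ≈ 0.475, which is the value abs(metric) would report at this point.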
Define your own metric¶
It is possible (and encouraged) to develop and contribute new evaluation metrics.
All you have to do is inherit from BaseMetric and implement a few methods: metric_name, metric_components, compute_components, and compute_metric:
from pyannote.metrics.base import BaseMetric


def is_male(speaker_name):
    # black magic that returns True if speaker is a man, False otherwise
    pass


class MyMetric(BaseMetric):
    # This dummy metric computes the ratio between male and female speakers.
    # It does not actually use the reference annotation...

    @classmethod
    def metric_name(cls):
        # Return human-readable name of the metric
        return 'male / female ratio'

    @classmethod
    def metric_components(cls):
        # Return component names from which the metric is computed
        return ['male', 'female']

    def compute_components(self, reference, hypothesis, **kwargs):
        # Actually compute the value of each component
        components = {'male': 0., 'female': 0.}
        for segment, _, speaker_name in hypothesis.itertracks(yield_label=True):
            if is_male(speaker_name):
                components['male'] += segment.duration
            else:
                components['female'] += segment.duration
        return components

    def compute_metric(self, components):
        # Actually compute the metric based on the component values
        return components['male'] / components['female']
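Once defined, such a metric behaves like any built-in one: it can be called on (reference, hypothesis) pairs, accumulated over several files, and reported. A minimal usage sketch (assuming is_male is actually implemented rather than left as a stub):

metric = MyMetric()
metric(reference, hypothesis)  # male / female ratio for this file
abs(metric)                    # accumulated ratio over all evaluated files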
See pyannote.metrics.base.BaseMetric for more details.