Principles¶
pyannote.metrics provides a set of classes to compare the output of speaker diarization systems (hereafter called hypothesis) to manual annotations (reference). Let us first instantiate a sample reference and hypothesis.
In [1]: from pyannote.core import Segment, Timeline, Annotation
In [2]: reference = Annotation(uri='file1')
...: reference[Segment(0, 10)] = 'A'
...: reference[Segment(12, 20)] = 'B'
...: reference[Segment(24, 27)] = 'A'
...: reference[Segment(30, 40)] = 'C'
...:
In [3]: hypothesis = Annotation(uri='file1')
...: hypothesis[Segment(2, 13)] = 'a'
...: hypothesis[Segment(13, 14)] = 'd'
...: hypothesis[Segment(14, 20)] = 'b'
...: hypothesis[Segment(22, 38)] = 'c'
...: hypothesis[Segment(38, 40)] = 'd'
...:
This tells us that, according to the manual annotation, speaker A speaks during time ranges [0s, 10s] and [24s, 27s].
Note
Overlapping segments are supported. See the pyannote.core documentation for more details.
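If needed, these annotations can also be inspected programmatically. The sketch below (not part of the original example) uses two pyannote.core accessors to list the speakers and retrieve the segments assigned to speaker A:

# list of speaker labels in the reference: ['A', 'B', 'C']
print(reference.labels())

# timeline of segments labelled 'A': [0s, 10s] and [24s, 27s]
print(reference.label_timeline('A'))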
pyannote.metrics follows an object-oriented paradigm. Most evaluation metrics (e.g. DiarizationErrorRate below) inherit from BaseMetric. As such, they share a common set of methods. For instance, once instantiated, they can be called directly to compute the value of the evaluation metric.
In [4]: from pyannote.metrics.diarization import DiarizationErrorRate
In [5]: metric = DiarizationErrorRate()
In [6]: metric(reference, hypothesis)
Out[6]: 0.5161290322580645
Accumulation & reporting¶
The same metric instance can be used to evaluate multiple files.
In [7]: other_reference = Annotation(uri='file2')
...: other_reference[Segment(0, 5)] = 'A'
...: other_reference[Segment(6, 10)] = 'B'
...: other_reference[Segment(12, 13)] = 'B'
...: other_reference[Segment(15, 20)] = 'A'
...:
In [8]: other_hypothesis = Annotation(uri='file2')
...: other_hypothesis[Segment(1, 6)] = 'a'
...: other_hypothesis[Segment(6, 7)] = 'b'
...: other_hypothesis[Segment(7, 10)] = 'c'
...: other_hypothesis[Segment(11, 19)] = 'b'
...: other_hypothesis[Segment(19, 20)] = 'a'
...:
In [9]: metric = DiarizationErrorRate()
In [10]: metric(reference, hypothesis)
Out[10]: 0.5161290322580645
In [11]: metric(other_reference, other_hypothesis)
Out[11]: 0.7333333333333333
You do not need to keep track of the result of each call yourself: this is done automatically. For instance, once you have evaluated all files, you can use the overridden __abs__() operator to get the accumulated value:
In [12]: abs(metric)
Out[12]: 0.5869565217391305
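In practice, accumulation typically happens in a loop over a whole test set. The following sketch (not part of the original example; the make helper and the two dictionaries are purely illustrative) reproduces the two files above:

from pyannote.core import Segment, Annotation
from pyannote.metrics.diarization import DiarizationErrorRate

def make(uri, turns):
    # illustrative helper: build an Annotation from (start, end, label) tuples
    annotation = Annotation(uri=uri)
    for start, end, label in turns:
        annotation[Segment(start, end)] = label
    return annotation

groundtruths = {
    'file1': make('file1', [(0, 10, 'A'), (12, 20, 'B'), (24, 27, 'A'), (30, 40, 'C')]),
    'file2': make('file2', [(0, 5, 'A'), (6, 10, 'B'), (12, 13, 'B'), (15, 20, 'A')]),
}
hypotheses = {
    'file1': make('file1', [(2, 13, 'a'), (13, 14, 'd'), (14, 20, 'b'), (22, 38, 'c'), (38, 40, 'd')]),
    'file2': make('file2', [(1, 6, 'a'), (6, 7, 'b'), (7, 10, 'c'), (11, 19, 'b'), (19, 20, 'a')]),
}

metric = DiarizationErrorRate()
for uri, groundtruth in groundtruths.items():
    metric(groundtruth, hypotheses[uri])  # each call accumulates internally

print(abs(metric))  # accumulated value over both files (0.587, as above)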
report() provides a convenient summary of the result:
In [13]: report = metric.report(display=True)
item   diarization error rate %  total  correct       %  false alarm       %  missed detection      %  confusion       %
file1                     51.61  31.00    22.00   70.97         7.00   22.58              2.00   6.45       7.00   22.58
file2                     73.33  15.00     8.00   53.33         4.00   26.67              1.00   6.67       6.00   40.00
TOTAL                     58.70  46.00    30.00   65.22        11.00   23.91              3.00   6.52      13.00   28.26
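report() returns this summary as a pandas DataFrame, so the same results can be post-processed or written to disk; for instance (a minimal sketch, with a hypothetical output path):

df = metric.report()          # pandas DataFrame: one row per file, plus TOTAL
df.to_csv('diarization.csv')  # hypothetical output path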
The internal accumulator can be reset using the reset() method:
In [14]: metric.reset()
Evaluation map¶
Though audio files can always be processed entirely (from beginning to end), there are cases where reference annotations are only available for some regions of the audio files. All metrics support the provision of an evaluation map (passed as the uem argument below) that indicates which parts of the audio file should be evaluated.
In [15]: uem = Timeline([Segment(0, 10), Segment(15, 20)])
In [16]: metric(reference, hypothesis, uem=uem)
Out[16]: 0.13333333333333333
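When no evaluation map is distributed with a corpus, one common fallback (a sketch, not something pyannote.metrics requires) is to score the whole extent of the reference annotation:

# evaluation map covering the whole extent of the reference ([0s, 40s] here)
whole_file = Timeline([reference.get_timeline().extent()])

# use a fresh metric instance so the accumulator above is left untouched
DiarizationErrorRate()(reference, hypothesis, uem=whole_file)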
Components¶
Most metrics are computed as the combination of several components. For instance, the diarization error rate is the combination of false alarm (non-speech regions classified as speech), missed detection (speech regions classified as non-speech) and confusion between speakers.
Using detailed=True will return the value of each component:
In [17]: metric(reference, hypothesis, detailed=True)
Out[17]:
{'correct': 22.0,
'confusion': 7.0,
'missed detection': 2.0,
'total': 31.0,
'false alarm': 7.0,
'diarization error rate': 0.5161290322580645}
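In other words, the diarization error rate is the sum of the three error components divided by the total: for this file, (7 + 2 + 7) / 31 ≈ 0.516, consistent with the value returned above.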
The accumulated value of each component can also be obtained using the overridden __getitem__() operator:
In [18]: metric(other_reference, other_hypothesis)
Out[18]: 0.7333333333333333
In [19]: metric['confusion']
Out[19]: 13.0
In [20]: metric[:]
Out[20]:
{'correct': 43.0,
'confusion': 13.0,
'missed detection': 5.0,
'total': 61.0,
'false alarm': 11.0}
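Combining these accumulated components with the same formula gives (11 + 5 + 13) / 61 ≈ 0.475, which is the value abs(metric) would report at this point.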
Define your own metric¶
It is possible (and encouraged) to develop and contribute new evaluation metrics.
All you have to do is inherit from BaseMetric and implement a few methods: metric_name, metric_components, compute_components, and compute_metric:
from pyannote.metrics.base import BaseMetric


def is_male(speaker_name):
    # black magic that returns True if speaker is a man, False otherwise
    pass


class MyMetric(BaseMetric):
    # This dummy metric computes the ratio between male and female speakers.
    # It does not actually use the reference annotation...

    @classmethod
    def metric_name(cls):
        # Return human-readable name of the metric
        return 'male / female ratio'

    @classmethod
    def metric_components(cls):
        # Return component names from which the metric is computed
        return ['male', 'female']

    def compute_components(self, reference, hypothesis, **kwargs):
        # Actually compute the value of each component
        components = {'male': 0., 'female': 0.}
        for segment, _, speaker_name in hypothesis.itertracks(yield_label=True):
            if is_male(speaker_name):
                components['male'] += segment.duration
            else:
                components['female'] += segment.duration
        return components

    def compute_metric(self, components):
        # Actually compute the metric based on the component values
        return components['male'] / components['female']
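Once defined, such a metric behaves like any built-in one: it can be called on (reference, hypothesis) pairs, accumulated over several files, and reported. A minimal usage sketch (assuming is_male is actually implemented rather than left as a stub):

metric = MyMetric()
metric(reference, hypothesis)  # male / female ratio for this file
abs(metric)                    # accumulated ratio over all evaluated files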
See pyannote.metrics.base.BaseMetric for more details.