Annotation: Evaluation

Firetail’s annotation evaluation module

Assessment of annotations against gold standards

For an in-depth understanding of wildlife behavior, the translation of raw tag (acceleration) data into time-resolved categorical data is extremely helpful. We refer to this process as the annotation of burst data.

In general, the quality of these annotations, regardless of their origin, cannot be assessed by intrinsic measures. We have to embed these annotations in extrinsic contexts.

Possible sources of annotation

Manual annotation

It is possible to visually interpret acceleration patterns in the context of other available sensor and location data (height, speed, GPS) and manually assign categories to specific time windows in your dataset.

This procedure requires extensive experience, but often yields reliable results. Given the requirements for manpower and prior knowledge, it is typically suited for short-term observations only.

For an overview of concepts, see Annotation: Acc and Multi-Axial Data

FireSOM

To combine the power of machine learning pattern recognition with your expert knowledge, Firetail ships with FireSOM, a tool that enables you to detect acceleration patterns and predict categories on a large scale.

Yet, FireSOM results depend on the current context, animal, parameter settings and the assignment of categories and therefore do not provide a quality estimate. External resources or references are therefore required.

External resources

In some settings, additional time-resolved data may be available that helps to understand annotation quality (while providing a source of annotation itself). With Firetail, you can import external predictions or recorded events using the annotation exchange format.

Video annotation

If available, one of the most reliable sources of annotation is categorical assignments derived from captured video footage; see video-based gold standards. This data will often provide the gold standard for quality estimates, but may be hard to obtain for wildlife in many cases.

Sound annotation

While sound data is not yet directly supported in Firetail, the annotation exchange format can be used to import segmented sound files and evaluate them in the context of location and sensor data.

What to compare?

The following are typical use cases for an evaluation:

Resource A               Resource B               Intended application                        Note
gold standard            automated prediction     evaluation of annotation quality            most common setup
gold standard            manual annotation        evaluation of annotation quality
algorithmic prediction   algorithmic prediction   compare parameter settings or algorithms    compare to published procedures
manual annotation        manual annotation        inter-annotator agreement                   compare multiple sources of human annotation

How does Firetail compare two resources?

Firetail can link either

  • two categories OR
  • two layers (possibly containing multiple categories)

and report recall, precision, and F-measure for the annotations in these categories/layers by computing the length of the overlapping segments and comparing it to the segments that are exclusive to each of the categories.

For linked categories/layers, Firetail reports the overlap statistics when hovering over the tabular values.

Measures

We refer to the set of categories as $C$. Let $(c_1, c_2) \in C \times C$ be two linked categories.

Each category $c \in C$ consists of $n_c$ segments $s_i$ for $i=1 \dots n_c$. The beginning of a segment $s$ is $B(s)$, its end is denoted $E(s)$.

The overlap of two segments $s_1 \in c_1$ and $s_2 \in c_2$ is a segment $s_o$ such that $B(s_o) = \max(B(s_1), B(s_2))$ and $E(s_o) = \min(E(s_1), E(s_2))$.

No overlap exists if $E(s_o) \le B(s_o)$, i.e., one of the segments starts after the other has ended.
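
As a minimal illustration (not Firetail's code), the overlap rule translates to a few lines of Python; the function name and the (begin, end) tuple representation are chosen for this sketch only:

```python
# Sketch only: overlap of two segments given as (begin, end) pairs.
def overlap(s1, s2):
    """Return the overlapping sub-segment of s1 and s2, or None if there is none."""
    b = max(s1[0], s2[0])  # B(s_o) = max(B(s_1), B(s_2))
    e = min(s1[1], s2[1])  # E(s_o) = min(E(s_1), E(s_2))
    return (b, e) if e > b else None

print(overlap((0.0, 10.0), (5.0, 20.0)))   # (5.0, 10.0)
print(overlap((0.0, 10.0), (12.0, 20.0)))  # None: the second segment starts after the first has ended
```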

Induced non-overlapping sub-segments are called unique to $c_i$ if they do not overlap with any segment in the linked category $c_j$. Similarly, the sub-segments unique to $c_j$ are shortened or split by the segments in $c_i$.

In the example below, $s_3$ shortens the segment that is unique to $c_2$. The intersection of $s_3$ and $s_2$ produces another overlap instead (not shown).

category 1    : [-----s_1------]     [----s_3----]
category 2    :     [--------s_2----------]
overlap 1 vs 2:     [---s_o----]
unique to 1   : [---]
unique to 2   :                [-----]

The set of all overlapping sub-segments of $c_1$ and $c_2$ is called $O(c_1, c_2)$.

By summing the durations of all these sub-segments, we obtain $$ov(c_1, c_2) = \sum_{s \in O(c_1, c_2)}{E(s)-B(s)}$$.

The joint length of segments in a category $c$ is the coverage $$cov(c) = \sum_{s \in c}{E(s)-B(s)}$$.

Now let’s assume one of the categories serves as the reference $r \in C$, while the other is the prediction $p \in C$. We then obtain the precision of a prediction

$$ prc(r,p) = {{ov(r, p)} \over {cov(p)}} $$

the recall of a prediction

$$ rec(r,p) = {{ov(r, p)} \over {cov(r)}} $$

and their harmonic mean, the F-measure

$$ F(r,p) = {2 \over {1 \over prc(r,p)} + {1 \over rec(r,p)}} $$.

For convenience, we report all measures as percentages.
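
To make the definitions concrete, the following Python sketch (not Firetail's implementation) computes $ov$, $cov$, precision, recall, and F-measure for two categories given as lists of (begin, end) pairs; all function names and the toy data are illustrative:

```python
# Sketch only: overlap-based evaluation of a prediction category against a
# reference category. Segments are (begin, end) pairs in a common time unit.

def coverage(segments):
    """cov(c): joint length of all segments in a category."""
    return sum(e - b for b, e in segments)

def total_overlap(reference, prediction):
    """ov(r, p): summed length of all overlapping sub-segments."""
    total = 0.0
    for b1, e1 in reference:
        for b2, e2 in prediction:
            b, e = max(b1, b2), min(e1, e2)  # candidate overlap s_o
            if e > b:
                total += e - b
    return total

def evaluate(reference, prediction):
    """Return precision, recall and F-measure as percentages."""
    ov = total_overlap(reference, prediction)
    cov_r, cov_p = coverage(reference), coverage(prediction)
    prc = ov / cov_p if cov_p else 0.0
    rec = ov / cov_r if cov_r else 0.0
    f = 2.0 / (1.0 / prc + 1.0 / rec) if prc and rec else 0.0
    return 100.0 * prc, 100.0 * rec, 100.0 * f

# Toy data: a measured reference vs. a predicted category
reference  = [(0.0, 10.0), (20.0, 30.0)]
prediction = [(2.0, 12.0), (20.0, 25.0)]
print(evaluate(reference, prediction))  # approximately (86.7, 65.0, 74.3)
```

The doubly nested loop is simply the most direct way to accumulate $ov(r, p)$; on large annotation sets an interval sweep over the sorted segment boundaries yields the same result more efficiently.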

When computing the overlap of layers instead of categories, all categories that reside on the same layer are treated as if they belonged to the same (virtual) category.

Caveat: Overlapping annotation segments within the same category (or layer) can lead to undefined results, as the overlap semantics discussed above no longer apply. Typically, two kinds of behavior that happen simultaneously should be handled as separate categories.
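
Under the same assumptions, layer-level evaluation can be sketched by pooling all segments of a layer into one virtual category before applying the measures above (the layer contents below are made up, and `evaluate()` refers to the sketch in the previous section):

```python
# Sketch only: treat all categories residing on one layer as a single
# virtual category.
reference_layer = {
    "flight-measured":  [(0.0, 10.0)],
    "resting-measured": [(20.0, 30.0)],
}
virtual_reference = [seg for segs in reference_layer.values() for seg in segs]
# virtual_reference can now be passed to evaluate() like a single category,
# provided its segments do not overlap each other (see the caveat above).
```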

Starting the evaluation module

To start the evaluation module, prepare a project that contains predicted annotations and, if available, a gold standard, both loaded in the burst viewport. Then right-click in the burst viewport and choose Evaluate these Annotations from the context menu.

Linking reference and prediction

To link a reference to a prediction, first choose either categories or layers.

Note: closing the module or switching from categories to layers will discard the linked entities!

Then choose a reference category that should be linked to a prediction category (like ‘flight-measured’ vs ‘flight-predicted’). Confirm by pressing Link. A new (empty) row will appear to indicate the pairwise comparison.

Select Compute Evaluation to compute the evaluation metrics. A pie chart for each Reference and Prediction shows the distribution of categories.

The average statistics will be shown in the bottom line of the analysis window.

Use Ctrl + left-click on a row in the table to remove the pair from the analysis.

Restrict analysis to a selection

If a selection in the burst viewport was active before starting the module, the computation is restricted to this time window. Segments must be fully covered by the selection; partially cut annotations are excluded from the analysis.
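
A minimal sketch of this rule, assuming segments and the selection are given as (begin, end) pairs (names are illustrative):

```python
# Sketch only: keep segments that are fully covered by the selection window.
def restrict_to_selection(segments, sel_begin, sel_end):
    return [(b, e) for b, e in segments if b >= sel_begin and e <= sel_end]

# (0, 5) is only partially inside the selection and is therefore dropped.
print(restrict_to_selection([(0, 5), (4, 12), (15, 18)], 3, 20))  # [(4, 12), (15, 18)]
```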

Exporting the results

Right-clicking on the table copies the current values (make sure to recompute if required) as tabular text data that can be imported into Excel or LibreOffice.

Right-clicking on either of the pie charts copies the category fractions in the same way.