Localisation of Signs

One of the main difficulties with using subtitles as weak labels is that there is no guarantee that the corresponding video window contains an occurrence of the sign or, if it does, that it contains only one. An example of the correlations typical in sign/subtitle alignment is shown in Figure 4.

Figure 4 - An example of subtitle/sign alignment over a short sequence. The correlations have been colour coded to show corresponding words and signs. Note that there is no guarantee that the order is the same in British Sign Language and English, nor that every word appearing once in the subtitles will occur exactly once in the sign gloss.

This makes it difficult to choose the minimum support and confidence parameters for Apriori mining. Instead, we run the mining with several different combinations of parameters and combine the results. Since examples are often taken from the same video, this is done in situ so that any overlaps are accounted for.
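As a rough illustration of combining results across parameter settings, the sketch below runs a miner once per (support, confidence) pair and accumulates the returned peak positions into a single histogram. The `mine` callable, the `toy_mine` stand-in, the threshold values, and the `bin_width` parameter are all assumptions made for this example; they are not the mining implementation used in this work.

```python
from collections import Counter
from itertools import product


def combine_peak_responses(mine, supports, confidences, bin_width=1):
    """Run `mine` for every parameter pair and histogram the peak frames.

    `mine(min_support, min_confidence)` is a hypothetical callable that
    returns an iterable of frame indices at which a peak response occurred.
    """
    histogram = Counter()
    for min_support, min_confidence in product(supports, confidences):
        for frame in mine(min_support, min_confidence):
            # Quantise the frame index into a histogram bin.
            histogram[frame // bin_width] += 1
    return histogram


def toy_mine(min_support, min_confidence):
    """Stand-in miner: a stricter support threshold drops the weakest peak."""
    frames = [100, 101, 250]
    if min_support > 0.5:
        frames = frames[:2]
    return frames


combined = combine_peak_responses(toy_mine, [0.3, 0.6], [0.7, 0.9])
```

Positions that survive under many parameter settings accumulate high counts, so no single choice of thresholds has to be right.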

Figure 5 - The iterative process used to home in on a sign. In the first instance, candidate positions are gathered from subtitle data. A buffer is added, mining is run, and a histogram of the peak responses is compiled. This histogram is then assessed by mean shift to find the most likely points at which the sign will occur. These points then go through the process again with a smaller buffer, which is reduced on each pass, until the sign is localised after 6 iterations.

Mean shift offers a good way to combine the peak responses. Usually the histogram displays small clusters of values. We could quantise more heavily, but this runs the risk of merging two sign examples. We could also take the peak values, but this risks inaccuracy if the histogram bin boundaries fall awkwardly or if the clusters of responses are not symmetrical about the peak. Mean shift takes into account all the bins within a given radius and finds the most likely centre given all the bin values.
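A minimal sketch of one-dimensional mean shift over a histogram is given below, assuming a flat (uniform) kernel and count-weighted bin centres. The function name, the `radius`, `tol` and `max_iter` parameters, the rule for merging nearby modes, and the example counts are illustrative assumptions rather than the exact configuration used here.

```python
import numpy as np


def mean_shift_1d(centres, counts, radius, tol=1e-3, max_iter=100):
    """Find modes of a 1-D histogram with flat-kernel mean shift.

    Each non-empty bin seeds a search; the estimate repeatedly moves to the
    count-weighted mean of all bins within `radius` until it stops moving.
    """
    centres = np.asarray(centres, dtype=float)
    counts = np.asarray(counts, dtype=float)
    modes = []
    for start in centres[counts > 0]:
        x = start
        for _ in range(max_iter):
            in_window = np.abs(centres - x) <= radius
            weights = counts[in_window]
            new_x = (centres[in_window] * weights).sum() / weights.sum()
            converged = abs(new_x - x) < tol
            x = new_x
            if converged:
                break
        # Seeds that converge to roughly the same place count as one mode.
        if not any(abs(x - m) <= radius for m in modes):
            modes.append(float(x))
    return sorted(modes)


# Two asymmetric clusters of peak responses, with peak bins at 2 and 8.
counts = [0, 1, 3, 2, 0, 0, 0, 1, 4, 0]
modes = mean_shift_1d(range(10), counts, radius=2)
```

In this toy histogram the estimated centres fall at roughly 2.17 and 7.8 rather than on the peak bins 2 and 8, because each cluster is asymmetric about its peak.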

Figure 6 - Mean shift takes the histogram of peak responses (shown in blue) and finds the most likely places at which a sign will occur (the yellow dots). In this example, the yellow dots do not necessarily coincide with the peak responses.