Image Matching Methods Explained
This page will explain the different methods for image matching in more depth.
Template matching
Template matching is a method of detecting and locating a query image of interest (the template) in a larger frame. As shown in the visualization below, the template (the HeadSpin logo, outlined in red for visibility) is slid across the frame, and a correlation score is computed at each position.
The heatmap of the scores calculated for each window is shown below, where the lighter (whiter) the spot, the higher the similarity between the template and the window. The location of the highest-scoring window is used to determine whether the object exists in the frame and where (in the frame) it is located. The highest score here, labelled with the red bounding box, is located in the top left corner, where the logo indeed appears.
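For illustration, the sketch below shows how this could be done with OpenCV. It is a minimal example, not the implementation behind the analysis API; the file names and the 0.8 acceptance threshold are assumptions.

```python
import cv2

# Illustrative file names (assumptions); any grayscale images will do.
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Slide the template across the frame and compute a normalized
# correlation score for every window position.
scores = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)

# The brightest spot in the score map is the best candidate location.
_, max_score, _, max_loc = cv2.minMaxLoc(scores)

h, w = template.shape
if max_score >= 0.8:  # accept/reject threshold (assumed value)
    top_left = max_loc
    bottom_right = (top_left[0] + w, top_left[1] + h)
    print("Match found at", top_left, bottom_right, "with score", max_score)
else:
    print("No confident match; best score was", max_score)
```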
However, template matching is not robust to changes in scale or rotation. Shown below is a visualization of sliding a query image that's smaller than how it appears in the frame.
Although the threshold parameter can be tuned to accept or reject a confidence score, selecting the window with the highest score can still produce an incorrect match. Shown below are heatmaps generated from sliding a query image that is sufficiently smaller (left) or larger (right) than the target of interest (the HeadSpin logo). Template matching fails to correctly find and locate the query image if it appears in the frame at a different scale.
Please refer to the template match section in the Session Analysis API documentation to see how the template matching analysis API can be accessed.
Multiscale template matching
Multiscale template matching works exactly like regular template matching, except that the query image (template) is resized to a pre-defined set of scales and each resized template is matched against the frame. The location of the highest-scoring template across all scales is taken as the final result.
For demonstration purposes, only 3 example template scales are shown above. Since it is not possible to iterate over every possible scale of the query image, ideally the object appears at a scale that is sufficiently close to one of the pre-defined levels. Because rescaling the template compresses or approximates pixel information, a good match found at a scale different from the original image will generally have a lower confidence score than a match found at the exact scale of the image. In other words, if a query image appears at its exact size in the frame, the confidence score from regular matching would be higher than the confidence score from multiscale template matching with 0.9 and 1.1 as the scaling factors.
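As a rough sketch of the idea (not the exact set of scales used by the analysis), the loop below resizes the template to a handful of assumed scaling factors and keeps the best-scoring result across all of them.

```python
import cv2

frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

scales = [0.5, 0.75, 1.0, 1.25, 1.5]  # assumed, illustrative scaling factors
best_score, best_loc, best_scale = -1.0, None, None

for scale in scales:
    # Resize the template to the current scale and run regular template matching.
    resized = cv2.resize(template, None, fx=scale, fy=scale)
    if resized.shape[0] > frame.shape[0] or resized.shape[1] > frame.shape[1]:
        continue  # skip scales at which the template no longer fits in the frame
    scores = cv2.matchTemplate(frame, resized, cv2.TM_CCOEFF_NORMED)
    _, max_score, _, max_loc = cv2.minMaxLoc(scores)
    if max_score > best_score:
        best_score, best_loc, best_scale = max_score, max_loc, scale

print("Best score", best_score, "at", best_loc, "with scale", best_scale)
```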
Whereas regular template matching fails, multiscale template matching does a much better job of matching the smaller and larger query images to the frame.
In practice, there is a trade-off between the computation time of matching over many different scales and the probability of missing the template because it exists at a scale that does not get computed. Multiscale matching can generally be expected to be 2-3x slower than regular template matching in the worst case, although the actual runtime depends on the following factors:
- how many duplicated frames there are (e.g., if the time interval of interest in the session video remains on a static or loading screen with little to no moving components, then there will likely be many duplicate frames, and the analysis will run faster)
- how many scaling factors are used (the more scaling factors, the more resizings of the template get matched, which makes the analysis slower).
Please refer to the multiscale template match section in the Session Analysis API to see how the multiscale template matching analysis API can be accessed.
Keypoint matching
Keypoint matching is a classical technique in computer vision used to identify and match an object that exists in two images (provided it has a sufficiently similar appearance in both). Keypoint matching can work even if the object has undergone changes in translation, rotation, scaling, illumination, and/or perspective. The technique is applied to image match analysis by matching a query image to frames in the video -- however, unlike in template matching, where the entire grid of pixels gets compared, the comparison is performed on compact sets of keypoints and features that describe the images.
The standard keypoint matching pipeline consists of the following steps:
- Keypoint detection: Identifying interesting points in an image
- Feature description: Describing each interesting point
- Feature matching: Finding similar pairs of keypoints
1. Keypoint detection
Keypoints are "interesting" features found in an image. Usually, this means parts of the the image where there is a large change in pixel values. Examples below show keypoints (green dots) found in logos and icons.
Keypoints are usually detected around corners or edges that contrast sharply with the surrounding pixels, as opposed to "flat" areas where such contrast cannot be found. Consequently, icons and images that have low resolution or do not have enough areas with contrasting pixel values will yield only a few keypoints. Among the many keypoint detection algorithms that exist, the BRISK algorithm is used in our keypoint matching analysis, as it performs both keypoint detection and description (explained in the next section) and is efficient at both steps.
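As a minimal sketch of this step with OpenCV's BRISK implementation (the file name is an assumption):

```python
import cv2

image = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)

# BRISK handles both detection (this step) and description (next step).
brisk = cv2.BRISK_create()
keypoints = brisk.detect(image, None)

# Each keypoint carries a location, size, and orientation; low-detail
# images typically yield only a handful of them.
print(len(keypoints), "keypoints detected")
visualized = cv2.drawKeypoints(image, keypoints, None, color=(0, 255, 0))
cv2.imwrite("keypoints.png", visualized)
```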
2. Feature description
Once the interesting points of an image are identified, the next step is to assign each point a signature that best describes it. These signatures, also called descriptors, should not only be descriptive of each keypoint but also efficiently condense the information about the pixels surrounding the keypoint into a compact list of numbers. One way to create this condensed representation is to encode the information into one binary string per keypoint.
A simple visualization is provided above for explanatory purposes, where each row is the binary representation of one keypoint detected in an image. Note, however, that from inspecting these binary features alone, it is not obvious what kind of information is encoded, since the representation is deliberately made to be compact. Among the many existing algorithms for creating feature representations from keypoints, the BRISK descriptor is used in our keypoint matching analysis because of its robustness to scale and rotation while also being relatively computationally efficient (due to its binary representation).
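Continuing the sketch above, BRISK computes a 512-bit binary descriptor for each detected keypoint; unpacking the bytes shows the kind of binary string visualized above.

```python
import cv2
import numpy as np

image = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
brisk = cv2.BRISK_create()

# Detect keypoints and compute one binary descriptor per keypoint.
# Each descriptor is stored as a row of 64 bytes (512 bits).
keypoints, descriptors = brisk.detectAndCompute(image, None)

# Unpack the first descriptor into individual bits to see its binary string.
first_bits = np.unpackbits(descriptors[0])
print("descriptor matrix shape:", descriptors.shape)
print("first keypoint's binary descriptor:", first_bits)
```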
3. Feature matching
Having created an informative yet compact descriptor that describes each keypoint, we can now compare the keypoints in the query image against those in the frame -- similar descriptors between a pair of points indicate a good match whereas descriptors dissimilar to each other indicate a bad match. Based on the similarity between descriptors, good matches are kept, and bad matches are discarded. When many good matches exist, this indicates that many similar features are found and matched between the query image and the frame.
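A minimal sketch of this step, assuming BRISK descriptors and a brute-force matcher with the Hamming distance; the distance cutoff of 64 bits (out of 512) is an illustrative assumption, not the threshold used by the analysis.

```python
import cv2

query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

brisk = cv2.BRISK_create()
kp_q, des_q = brisk.detectAndCompute(query, None)
kp_f, des_f = brisk.detectAndCompute(frame, None)

# Binary descriptors are compared with the Hamming distance, i.e. the
# number of bit positions in which the two descriptors differ.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des_q, des_f)

# Keep pairs whose descriptors differ in only a small number of positions;
# the rest are discarded as bad matches.
good = [m for m in matches if m.distance < 64]
print(len(good), "good matches out of", len(matches))
```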
Shown above is a visualization of a feature matching example, where the query image is placed in the top left and the frame to the right. The frame contains the query image rotated 90 degrees clockwise. Because keypoint descriptors condense information so compactly, it is common for not every pair of matched points to be "correct", and bad matches (false positives) get picked up in the matching process. Green matches are determined to be good matches by the algorithm (the binary descriptors differ only in a small number of positions), whereas red matches are determined to be bad matches (the binary descriptors differ in many positions) and are therefore discarded. By inspection, most of the green matches seem reasonable, and there do not appear to be any unreasonable outliers. This example represents "the best case scenario" that can be achieved with keypoint matching.
In such "best case scenarios" where most keypoints are correctly mapped from the query image onto the frame, the spatial relationships between keypoints in the original image are preserved to a good degree when they are mapped to their destinations in the frame. So, with this correspondence information available, a region can be constructed which localizes the final result, as labelled by the red box indicating where the final object was found.
The example above showcases a less-than-ideal scenario. Clearly the HeadSpin graphic does not appear in the frame, yet false positive matches still remain even after most of the bad matches (in red) are filtered out. Generally in such cases, however, the spurious matches are either scattered around spatially or disproportionately concentrated around some region. With these false positive matches, the spatial relationships between the keypoints in the query image and the frame are largely no longer preserved.
Please refer to the keypoint match section in the Session Analysis API to see how the keypoint matching analysis API can be accessed.
Tips on picking an image matching method
There are scenarios where one method outperforms the others, and vice versa. Demonstrated below is one scenario where keypoint matching does not perform as well as template matching.
Keypoints are not as reliable in scenarios where the image of interest is small and lacks interesting features. Here, although the icon is present in the frame, very few keypoints are detected from the small, simple icon, and the resulting "good" matches generated by the keypoint algorithm are inadequate. Performing a template match works much better in this example. In general, when the image of interest is small or lacking in detail, it is better to default to template matching (regular or multiscale).
In addition, image match results are most predictable with regular template matching. If the query image appears in the session video at exactly the same size and orientation, it can reliably be found with template matching, whereas there are no such guarantees with keypoint matching.
As seen with the visualization of the descriptors above, debugging why keypoint matching may or may not have worked as expected is more difficult, especially if the query image is not sufficiently detailed. But if it is, keypoint matching is more flexible than template matching, as it can handle not only changes in rotation and scale but also changes in background. One example where keypoint matching would outperform template matching is finding a logo in gaming content, where the query image is overlaid on a background that is animated or cycling through various scenes and colors.