
Announcing Norfair 2.0: An Open-source Real-time Multi-object Tracking Library

This blog post was originally published at Tryolabs’ website. It is reprinted here with the permission of Tryolabs.

We are excited to announce the release of Norfair 2.0, available on GitHub and PyPI as of now. This is the biggest upgrade to Tryolabs’ open-source multi-object tracking library since its first release two years ago.

For those unfamiliar with the term, multi-object tracking (abbreviated as MOT) is essentially the process of assigning a unique identifier to every object in a video and keeping it consistent over time. It is generally performed by first running the video frames through a detector that identifies the objects in the scene (like YOLO, Faster R-CNN, or another deep learning model), and then linking those detections across frames.

Norfair was born out of the need to solve the tracking problem in a straightforward way, with a simple setup and no extraneous dependencies, so that tracking capabilities could be added to any pre-existing codebase. Initially, it was adapted from SORT, a simple mechanism that tracks objects by estimating their trajectories and velocities. But it has matured!

Object detection.

Object detection and recognition.
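To make that simplicity concrete, here is a minimal sketch of what adding Norfair to an existing detection loop can look like. The my_detector function is a placeholder for whatever model you already have, the threshold value is arbitrary, and here we simply track the center point of each bounding box.

import numpy as np
from norfair import Detection, Tracker, Video

def my_detector(frame):
    # Placeholder: run your own detector (YOLO, Faster R-CNN, ...) and
    # return a list of (x1, y1, x2, y2) bounding boxes.
    return []

video = Video(input_path="video.mp4")
tracker = Tracker(distance_function="frobenius", distance_threshold=100)

for frame in video:
    boxes = my_detector(frame)
    # Wrap each detection in a norfair.Detection; here we track box centers.
    detections = [
        Detection(points=np.array([[(x1 + x2) / 2, (y1 + y2) / 2]]))
        for (x1, y1, x2, y2) in boxes
    ]
    tracked_objects = tracker.update(detections=detections)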

Over the last two years, Norfair has steadily evolved in terms of features, and the community of users and contributors has continued to grow. Its simplicity of use has made it the go-to choice for adding tracking on top of many different detectors, and it has been the cornerstone of many of Tryolabs’ projects related to video analytics.

For its second birthday 🎂, we have decided to release version 2.0. Even though 1.0 was released not long ago, there are a lot of new and exciting features that merit the release of a new major version. On top of many bug fixes and improved documentation, we added support for estimating camera motion to allow tracking under a moving frame of reference, n-dimensional tracking (yes, Norfair can now track 3D objects!), and re-identification with appearance embeddings for the most challenging scenarios with occlusion.

Buckle up, as we will go over all these features and how they work!

Support for 3D and n-dimensional tracking

In the past few years, 3D has become more ubiquitous thanks to LIDAR and detection models that output cuboids instead of 2D bounding boxes. This trend will only accelerate in the future, so we had to do something about it.

There was no fundamental blocker to adding 3D object — and more generally, n-dimensional object — tracking support to Norfair. The Kalman filter can estimate objects’ trajectories no matter the number of dimensions, so we generalized the rest of Norfair to support it.
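From the user’s point of view, tracking in higher dimensions is just a matter of passing points with more coordinates. A minimal sketch, assuming a detector that outputs cuboid centers as (x, y, z); in practice you can also track several points per object, such as all the cuboid corners:

import numpy as np
from norfair import Detection, Tracker

# The same Tracker is used as in 2D; the Kalman filter adapts to the
# dimensionality of the points it receives.
tracker = Tracker(distance_function="frobenius", distance_threshold=1.0)

# Hypothetical 3D detections for one frame: cuboid centers as (x, y, z).
centers_3d = [(0.10, 0.25, 1.40), (0.80, 0.30, 2.10)]
detections = [Detection(points=np.array([center])) for center in centers_3d]

tracked_objects = tracker.update(detections=detections)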

The new 3D support in Norfair opens the door to new applications in augmented reality and non-video realms.

We made a demo that uses Google’s MediaPipe Objectron, a mobile real-time 3D object detection solution for everyday objects.

3D tracking with Norfair and MediaPipe’s Objectron.

Camera motion estimation

What do video analytics applications on self-driving cars and soccer matches have in common? The camera is not static. This makes object tracking more challenging because not only do the objects move, but so does your frame of reference. But don’t worry, we’ve got you covered!

We added a new optional component that can be used to estimate the movement of the camera. Internally, this component will map camera movements and estimate coordinate transformations to construct a fixed reference frame, over which the tracking of the objects will occur. You then pass these coordinate transformations to the Tracker, and it will use them internally.

This will significantly improve tracking results in some scenarios by compensating for the camera motion. Let’s look at an example:

Improvements by estimating camera motion using optical flow.

On the left, we ran the usual tracking without camera motion estimation. By the end of the video, the person gets assigned id 5, meaning that the tracker lost them 4 times. The reason the tracker performs this badly is simple: the camera movement makes the apparent movement of the object too erratic for the tracker to accurately estimate its future position.

On the right, we estimate the motion of the camera and stabilize the detections. The detections can then be processed in a fixed frame of reference, which makes the objects much easier to track.

In code, this looks really simple:

from norfair import Tracker, Video
from norfair.camera_motion import MotionEstimator

motion_estimator = MotionEstimator()  # initialize motion estimator
video = Video(input_path="video.mp4")
tracker = Tracker(
    distance_function="frobenius",
    distance_threshold=200,
)

for frame in video:
    detections = run_model(frame)
    coord_transformations = motion_estimator.update(frame)  # estimate the movement in this frame
    tracked_objects = tracker.update(
        detections=detections,
        coord_transformations=coord_transformations,  # pass the estimation to the tracker
    )

The second benefit is that the position of objects in this fixed reference frame is now available to the user.

For instance, if you want to know if a person entered a certain area or the path they walked through, now you can do it on a fixed set of coordinates, independent of how the camera moved. In the following example, we draw the paths of the players in a fixed frame of reference. Note how it’s consistent no matter the panning movement of the camera:

Tracking soccer players under a moving camera.
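For instance, once you have an object’s position in the fixed frame of reference (check the documentation of your Norfair version for the exact way to read it from a tracked object), checking whether it entered a certain area is a plain point-in-polygon test. A sketch using OpenCV, with a hypothetical restricted_area polygon defined in fixed-frame coordinates:

import cv2
import numpy as np

# Hypothetical region of interest, defined once in the fixed frame of reference.
restricted_area = np.array(
    [[100, 400], [500, 400], [500, 700], [100, 700]], dtype=np.float32
)

def entered_area(fixed_frame_position):
    # fixed_frame_position: the object's (x, y) estimate in the fixed reference frame.
    x, y = fixed_frame_position
    # pointPolygonTest returns > 0 inside, 0 on the edge, < 0 outside.
    return cv2.pointPolygonTest(restricted_area, (float(x), float(y)), False) >= 0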

Motion estimation: behind the scenes

If you are wondering how this works, the solution involves some classical computer vision. On each frame, we sample corner points and calculate the optical flow associated with each one. Optical flow is just a way of calculating the movement vector of a point from one frame to another.

Armed with these vectors, we simply estimate a transformation that undoes the movement for each frame. Furthermore, by concatenating these transformations we can take points in each frame and calculate their coordinates in an absolute frame of reference corresponding to the first frame of the video.

Estimating optical flow for more accurate tracking under a moving camera.


Sampled points with arrows help visualize the optical flow.
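If you want to see the moving parts, here is a rough sketch of that classical pipeline using OpenCV. This is not Norfair’s exact implementation, just the idea: sample corners, follow them with sparse optical flow, and fit a transformation that undoes the camera movement between two consecutive frames.

import cv2

def estimate_frame_transform(prev_gray, curr_gray):
    # Sample well-textured corner points in the previous (grayscale) frame.
    prev_pts = cv2.goodFeaturesToTrack(
        prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=15
    )
    if prev_pts is None:
        return None
    # Sparse optical flow: where did each sampled point move to in the current frame?
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, prev_pts, None)
    valid = status.ravel() == 1
    # Fit a transformation (rotation, translation, scale) that maps current-frame
    # coordinates back to the previous frame, compensating the camera motion.
    transform, _inliers = cv2.estimateAffinePartial2D(curr_pts[valid], prev_pts[valid])
    return transform  # 2x3 matrix, or None if the estimation failed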

The sampled points are key for this method to work. This is why we provide a number of parameters for the user to customize how many points are sampled, how close together they can be, and even which regions of the frame to avoid. This last point is important since it allows users to mask out the tracked objects and the detections, forcing the points to be sampled from the background of the scene.
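In Norfair, this tuning happens through the MotionEstimator arguments and the mask you can pass when updating it. The parameter and argument names below (max_points, min_distance and mask) are our reading of the Norfair 2.0 API, so double-check them against the documentation for your installed version; the mask itself is just a binary image where 0 marks regions to ignore, such as the detected objects.

import numpy as np
from norfair.camera_motion import MotionEstimator

# Assumed parameter names; verify against the Norfair documentation.
motion_estimator = MotionEstimator(
    max_points=300,   # how many points to sample per frame
    min_distance=10,  # how close together sampled points are allowed to be
)

def background_mask(frame, detections):
    # Binary mask: 1 where points may be sampled, 0 over the detected objects,
    # so motion estimation relies only on the background of the scene.
    mask = np.ones(frame.shape[:2], dtype=np.uint8)
    for det in detections:
        # Assuming two-point bounding-box detections: [[x1, y1], [x2, y2]].
        (x1, y1), (x2, y2) = det.points.astype(int)
        mask[y1:y2, x1:x2] = 0
    return mask

# Inside your per-frame loop:
# coord_transformations = motion_estimator.update(frame, mask=background_mask(frame, detections))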

See this feature in action in the new Hugging Face Space. If you want to apply it to your own project, check out our demo on GitHub.

Re-identification (ReID)

Re-identification of objects in complex scenarios, such as those with heavy occlusion, requires considering not only positional information but also appearance. We had implemented some forms of ReID with previous versions of Norfair, attaching embeddings to each detection and considering them in the distance function. This is still possible, but we found it was not enough.

In this new iteration, instead of relying on a combination of position information and embeddings to track the objects all the time, we added a new matching stage to deal with ReID. This new stage focuses on matching recently lost objects with those new objects that are about to be initialized. This has its own set of rules that can be tuned by the user and is independent of the regular tracking stage we’ve worked with until now.

This new stage of ReID matching has the benefit of delaying the decision until you have gathered as much evidence as possible about the new object. This is important since partially occluded objects tend to have poor-quality embeddings, and you don’t want those to interfere with your matching.

Understanding the object lifecycle

  1. Every time a tracked object is successfully matched with a detection, its hit_counter is increased by one. Matches occur when the customizable distance function falls below a threshold.
  2. Conversely, for every frame with no match, one is subtracted from the counter.
  3. When the hit_counter reaches initialization_delay, the object is considered active. It will continue accumulating hits (+1) with each new match until it reaches hit_counter_max, where the counter is capped.
  4. If the hit_counter drops to 0, the tracked object is considered inactive and ReID kicks in: the object waits up to reid_hit_counter_max frames to be matched with one of the initializing objects. If no match occurs within that window, the object is deleted. If a match does occur, the older object is activated again and the younger one is merged into it and then deleted.
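In code, these knobs map to arguments of the Tracker. A rough sketch follows: initialization_delay, hit_counter_max and reid_hit_counter_max are the counters described above, while reid_distance_function and reid_distance_threshold are our assumption for how the appearance distance is plugged in, so verify the exact names in the documentation.

from norfair import Tracker

def embedding_distance(candidate_object, older_object):
    # Placeholder appearance distance between two tracked objects; in practice,
    # compare the embeddings attached to their detections (e.g. 1 - cosine similarity).
    return 1.0

tracker = Tracker(
    distance_function="frobenius",              # positional distance for the regular stage
    distance_threshold=100,
    initialization_delay=10,                    # matches needed before an object becomes active
    hit_counter_max=30,                         # cap on the hit counter for active objects
    reid_distance_function=embedding_distance,  # assumed name: appearance distance for the ReID stage
    reid_distance_threshold=0.5,                # assumed name: threshold for a ReID match
    reid_hit_counter_max=300,                   # frames an inactive object waits to be re-identified
)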

Putting it all together: a demo of ReID

In Norfair, the user is free to define how to compare objects for ReID. Any type of embedding can be used. Depending on the application, you might want to use the embeddings from your detector or the ones coming from another model specific to ReID.

To provide a straightforward example of this feature that shows the power of embeddings in a simple case, we took inspiration from the synthetic example in motpy and adapted it. It’s basically a simulation video where boxes move around. When boxes get occluded by other boxes, we only keep the detection from the topmost box, so the occluded boxes get no detections. For this demo we went with the simplest embedding we could think of: the color histogram inside the bounding box. On the left, we are only using positional information, while on the right the ReID stage is being used:

Demo of ReID with appearance embeddings and Norfair.

This demo can be found on GitHub. Take a look at it to get a better idea of how to use the new ReID stage.
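As a rough idea of what such a color-histogram embedding and its comparison can look like (this is the general idea, not the exact demo code):

import cv2
import numpy as np

def histogram_embedding(frame, box):
    # box: (x1, y1, x2, y2) bounding box in pixel coordinates.
    x1, y1, x2, y2 = map(int, box)
    crop = frame[y1:y2, x1:x2]
    # Per-channel color histogram of the crop, normalized so it does not depend on the crop size.
    hist = cv2.calcHist([crop], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def embedding_distance(emb_a, emb_b):
    # Histogram correlation lies in [-1, 1]; turn it into a distance where 0 means identical.
    return 1.0 - cv2.compareHist(emb_a, emb_b, cv2.HISTCMP_CORREL)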

In a more complex case, we suggest looking at models that generate embeddings specific to the type of ReID that you wish to perform (person, vehicle, or other).

Other news

Other things relevant to Norfair 2.0:

  • New site with official documentation which we plan to keep improving.
  • Norfair now ships with the most common distance functions, so you don’t need to write your own if you are dealing with standard cases.
  • Norfair now uses our optimized Kalman filter implementation, which is significantly faster than filterpy.
  • Regarding the demos:
    • All demos are now Dockerized to make them easier to reproduce. Moreover, we added a demo in our Hugging Face Space.
    • We added several new ones on YOLOv7, MMDetection, moving camera, 3D tracking, and ReID. More to come!
  • Many other bug fixes and improvements.

Closing

We want to thank all the people and organizations who are actively using and contributing to Norfair. Seeing you choose Norfair in your projects, pushing its limits, opening GitHub issues, and contributing back has made the tool much better, and working on it much more gratifying. Community feedback has also been paramount for prioritizing our roadmap.

We hope the new features excite you as much as they excite us. More to come!
