The Challenge of Supporting AV at Scale

This blog post was originally published by Mobileye, an Intel company. It is reprinted here with the permission of Intel.

At the Consumer Electronics Show in January, we presented an unedited 25-minute-long video of a Mobileye self-driving car navigating the busy streets of Jerusalem. The video was published, first and foremost, for the sake of promoting transparency. We wanted to demonstrate the exceptional capabilities of our technology, but more importantly, to show the world how autonomous vehicles (AVs) operate so that society will come to trust them.

Continuing this effort, today we are introducing a new 40-minute unedited video of a drive covering a small section of the 160 miles of Jerusalem streets we use for our AV development. We chose to follow the drive with a drone to properly provide context for the decision-making logic of the robotic agent, and the only intervention during the drive was to replace the drone’s battery after 20 minutes or so. We have also added narration to explain where and how our technology handles the wide variety of situations encountered during the drive. The full-length clip is inserted below and a number of short sections from the drive are highlighted at the end of this editorial.

The 40-minute unedited ride in Mobileye’s autonomous vehicle

The clip also provides an opportunity to articulate our approach to AV development, which as far as we know is unique among the various actors in the AV industry. The problem we aim to solve is scale. The true promise of AVs can only materialize at scale: first as a means for ride-sharing via robo-shuttles and later as passenger cars offered to consumers. The challenges of supporting AVs at scale center on cost, proliferation of HD maps, and safety. The point I would like to make here is that safety must dictate the software and hardware architecture in ways that are not obvious.

Back in 2017, we published our safety concept, which was based on two observations. The first is that accidents caused by lapses of judgment in the decision-making process (e.g., when merging into traffic) can be effectively eliminated by clarifying, in a formal manner, what it means to be “careful” when planning maneuvers, and this ultimately defines the balance between safety and usefulness. Our Responsibility-Sensitive Safety (RSS) model establishes metric parameters around the assumptions that drivers make (like “right of way given, not taken”) in order to make safe decisions. These parameters are established in cooperation with regulatory bodies and standards organizations. The RSS model then assumes the worst case of what other road users will do, within the boundaries of the agreed-upon assumptions, thus removing the need to make predictions about the behavior of other road users. The RSS theory proves that if AVs follow the assumptions and actions prescribed by the theory, then the decision-making brain of the AV will never cause an accident. Since then, RSS has been promoted across the globe and has gained great traction. For instance, in late 2019 the IEEE formed a new working group, chaired by Intel, to develop a standard, IEEE 2846, for AV decision making. Members of the group represent, by and large, the AV industry as a whole. To me, this is a greatly reassuring sign that through industry-wide cooperation we can achieve a critical milestone that will elevate the entire industry and allow us to move forward.

The second observation in the paper we published had a profound effect on our system architecture. Assuming that the decision-making process of the robotic driver is taken care of (through RSS for example), we are still left with the possibility that a glitch in the perception system will cause an accident. The perception system is based on cameras, radars and lidars with software that interprets the sensors’ raw data into an “environmental model” — especially of the position and speed of other road users. There is always a chance, even if infinitesimally small, that the perception system will miss the existence, or miscalculate measurements, of a pertinent object (whether a road user or an inanimate obstacle) and cause an accident.

To appreciate what we are dealing with, let’s do a simple “back of the envelope” calculation. The number of miles driven in the US is about 3.2 trillion annually and the number of accidents with injuries is about 6 million. Assuming an average speed of 10 mph, we have a mean-time-between-failures (MTBF) of about 50,000 hours of driving. Assume we design our AV with an MTBF of 10x, 100x, or 1000x better than human MTBF (note that we have ruled out being “as good as humans”; we know we must be better). Consider deploying 100,000 robotic cars to serve as robo-shuttles at scale. This number is consistent with figures raised by players in the ride-hailing space as necessary to support a few dozen cities. Assume each robo-shuttle drives on average five hours a day, so the fleet accumulates 500,000 hours of driving per day. So, with a 10x MTBF design we should expect an accident every day, with a 100x design an accident roughly every week, and with a 1000x design an accident roughly every quarter. From the societal point of view, if all cars on the road were 10x better on MTBF it would be a huge achievement, but from the perspective of an operator of a fleet, an accident occurring every day is an unbearable result both financially and publicly. Clearly, a lower bound of 1000x on MTBF is a must (and even then, an accident per quarter is still nerve-wracking) if AV-at-Scale is the goal. An MTBF of 1000x translates to 50 million hours of driving, which is roughly 500 million miles. Even collecting this amount of data for the purpose of validating the MTBF claim is unwieldy, not to mention developing a perception system that can satisfy such an MTBF to begin with.
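
For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation above. The input figures are the ones quoted in the text; the function and constant names are my own and purely illustrative. Note that the unrounded human MTBF comes out near 53,000 hours, which the text rounds to 50,000, and that the exact intervals for the 100x and 1000x designs are 10 and 100 days, which the text rounds to a week and a quarter.

```python
# Back-of-the-envelope fleet accident interval, using the figures quoted above.
# All names here are illustrative; only the numbers come from the text.

US_MILES_PER_YEAR = 3.2e12            # miles driven in the US annually
US_INJURY_ACCIDENTS_PER_YEAR = 6e6    # accidents with injuries annually
AVG_SPEED_MPH = 10                    # assumed average speed
FLEET_SIZE = 100_000                  # robo-shuttles deployed
HOURS_PER_CAR_PER_DAY = 5             # average daily driving per shuttle
HUMAN_MTBF_ROUNDED = 50_000           # the text's rounded human MTBF, in hours


def human_mtbf_hours(miles=US_MILES_PER_YEAR,
                     accidents=US_INJURY_ACCIDENTS_PER_YEAR,
                     speed_mph=AVG_SPEED_MPH):
    """Hours of driving per injury accident for human drivers."""
    hours_driven = miles / speed_mph
    return hours_driven / accidents


def days_between_fleet_accidents(mtbf_hours,
                                 fleet_size=FLEET_SIZE,
                                 hours_per_day=HOURS_PER_CAR_PER_DAY):
    """Expected days between accidents across the whole fleet."""
    fleet_hours_per_day = fleet_size * hours_per_day
    return mtbf_hours / fleet_hours_per_day


print(f"Unrounded human MTBF: {human_mtbf_hours():,.0f} hours")
for factor in (10, 100, 1000):
    mtbf = factor * HUMAN_MTBF_ROUNDED
    days = days_between_fleet_accidents(mtbf)
    miles = mtbf * AVG_SPEED_MPH
    print(f"{factor}x design: MTBF {mtbf:,} hours (~{miles:,} miles), "
          f"one fleet accident every {days:g} days")
```

Changing `FLEET_SIZE` or `HOURS_PER_CAR_PER_DAY` shows how quickly the required MTBF grows with the size of the deployment.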

All the above is the context for our choice of system architecture. Achieving such an ambitious MTBF for the perception system necessitates introducing redundancies, specifically system redundancies, as opposed to sensor redundancies within the system. This is like having both iOS and Android smartphones in my pocket and asking myself: What is the probability that they both crash simultaneously? In all likelihood it is the product of the probabilities that each device crashes on its own. Likewise, in the AV world, if we build a complete end-to-end AV capability based only on cameras, and then build a completely independent capability using radars/lidars, we will have two separate redundant sub-systems. Just like with the two smartphones, the probability of both systems experiencing perception failure at the same time drops dramatically. This is very different from the approach of other players in the AV space, who have been focusing on sensor fusion. It is much more difficult to build a camera-only AV than to build an AV fusing all sensors’ data simultaneously. Cameras are notoriously difficult to handle because the access to depth (range) is indirect and is based on cues such as perspective, shading, motion and geometry. The details of how we built the camera-only AV system are articulated in the talk I gave at CES (starting at the 12-minute mark).
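
The smartphone analogy can be made concrete with a small Python sketch. It assumes the two sub-systems fail independently, which is exactly the property that separate, end-to-end redundant systems are meant to approximate. The failure probabilities used here are hypothetical values chosen only for illustration, not measured figures, and the function name is my own.

```python
# Joint failure probability of two redundant sub-systems.
# Assumption: the sub-systems fail independently of each other.


def joint_failure_probability(p_first: float, p_second: float) -> float:
    """Probability that both sub-systems fail in the same interval."""
    for p in (p_first, p_second):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    # Under independence, the joint probability is the product.
    return p_first * p_second


# Hypothetical per-hour failure probabilities, for illustration only.
p_camera_only = 1e-4
p_radar_lidar_only = 1e-4

p_both = joint_failure_probability(p_camera_only, p_radar_lidar_only)
print(f"Camera-only sub-system fails:  {p_camera_only:.0e} per hour")
print(f"Radar/lidar sub-system fails:  {p_radar_lidar_only:.0e} per hour")
print(f"Both fail in the same hour:    {p_both:.0e}")
```

If the two sub-systems shared a common cause of failure, the product would understate the real risk, which is why the independence of the two sub-systems matters so much.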

This brings me back to the clip we are introducing today. It shows the performance of our camera-only subsystem. There is neither radar nor lidar in the car you see in the clip. The car is powered by eight long-range cameras and four parking cameras that feed into a compute system based on just two EyeQ5s. The car needs to balance agility with safety and does so using the RSS framework. The streets of Jerusalem are notoriously challenging, as other road users tend to be very assertive, which places significant demands on the decision-making module of the robotic driver.

We will continue sharing progress and insights from our journey towards AV-at-Scale. Stay tuned for more updates.

Professor Amnon Shashua
Senior Vice President, Intel Corporation
President and Chief Executive Officer, Mobileye, an Intel company