<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Software for Robots]]></title><description><![CDATA[Real-Time Control, Networking, Operating Systems, Languages]]></description><link>https://ennerf.github.io</link><image><url>https://raw.githubusercontent.com/ennerf/ennerf.github.io/master/images/cover-image.jpg</url><title>Software for Robots</title><link>https://ennerf.github.io</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 25 Jan 2018 14:20:46 GMT</lastBuildDate><atom:link href="https://ennerf.github.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Efficient Data Acquisition in MATLAB: Streaming HD Video in Real-Time]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>The acquisition and processing of a video stream can be very computationally expensive. Typical image processing applications split the work across multiple threads, with one thread acquiring the images and another running the actual algorithms. In MATLAB we can get multi-threading by interfacing with other languages, but there is a significant cost associated with exchanging data across the resulting language barrier. In this blog post, we compare different approaches for getting data through MATLAB&#8217;s Java interface, and we show how to acquire high-resolution video streams in real-time and with low overhead.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_motivation">Motivation</h2>
<div class="sectionbody">
<div class="paragraph">
<p>For our booth at ICRA 2014, we put together a demo system in MATLAB that used stereo vision for tracking colored bean bags, and a robot arm to pick them up. We used two IP cameras that streamed <a href="https://de.wikipedia.org/wiki/H.264">H.264</a> video over <a href="https://en.wikipedia.org/wiki/Real_Time_Streaming_Protocol">RTSP</a>. While the image processing and robot control parts worked as expected, it proved to be a challenge to acquire images from both video streams fast enough to be useful.</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="http://www.mathworks.com/hardware-support/ip-camera.html">IP Camera Support</a> did not exist at the time and only supports <a href="https://en.wikipedia.org/wiki/Motion_JPEG">MJPEG</a> over <a href="https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol">HTTP</a></p>
</li>
<li>
<p><a href="http://www.mathworks.com/hardware-support/matlab-webcam.html">USB Webcam Support</a> only supports USB cameras</p>
</li>
<li>
<p><a href="http://www.mathworks.com/help/matlab/ref/imread.html">imread</a> and <a href="http://www.mathworks.com/help/matlab/ref/webread.html">webread</a> are limited to HTTP and too slow for real-time</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Since we did not want to switch to another language, we decided to develop a small library for acquiring video streams. The project was later open sourced as <a href="http://www.github.com/HebiRobotics/HebiCam">HebiCam</a>.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_technical_background">Technical Background</h2>
<div class="sectionbody">
<div class="paragraph">
<p>In order to save bandwidth, most IP cameras compress video before sending it over the network. Since the resulting decoding step can be computationally expensive, it is common practice to move the acquisition to a separate thread in order to reduce the load on the main processing thread.</p>
</div>
<div class="paragraph">
<p>Unfortunately, doing this in MATLAB requires some workarounds due to the language&#8217;s single threaded nature, i.e., background threads need to run in another language. Out of the box, there are two supported interfaces: <a href="https://www.mathworks.com/help/matlab/matlab_external/introducing-mex-files.html">MEX</a> for calling C/C++ code, and the <a href="https://www.mathworks.com/help/matlab/matlab_external/product-overview.html">Java Interface</a> for calling Java code.</p>
</div>
<div class="paragraph">
<p>While both interfaces have strengths and weaknesses, practically all use cases can be solved using either one. For this project, we chose the Java interface in order to simplify cross-platform development and the deployment of binaries. The diagram below shows an overview of the resulting system.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://ennerf.github.io/images/streaming/stereocam-matlab.svg" alt="stereocam matlab.svg" width="80%">
</div>
<div class="title">Figure 1. System overview for a stereo vision setup</div>
</div>
<div class="paragraph">
<p>Starting background threads and getting the video stream into Java was relatively straightforward. We used the <a href="https://github.com/bytedeco/javacv">JavaCV</a> library, which is a Java wrapper around <a href="https://opencv.org/">OpenCV</a> and <a href="https://www.ffmpeg.org/">FFMpeg</a> that includes pre-compiled native binaries for all major platforms. However, passing the acquired image data from Java into MATLAB turned out to be more challenging.</p>
</div>
<div class="paragraph">
<p>The Java interface automatically converts between Java and MATLAB types by following a set of <a href="https://www.mathworks.com/help/matlab/matlab_external/handling-data-returned-from-java-methods.html">rules</a>. This makes it much simpler to develop for than the MEX interface, but it does cause additional overhead when calling Java functions. Most of the time this overhead is negligible. However, for certain types of data, such as large and multi-dimensional matrices, the default rules are very inefficient and can become prohibitively expensive. For example, a <code>1080x1920x3</code> MATLAB image matrix gets translated to a <code>byte[1080][1920][3]</code> in Java, which means that there is a separate array object for every single pixel in the image.</p>
</div>
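<div class="paragraph">
<p>To put this in perspective, a quick back-of-the-envelope calculation (plain Java; the class name is just for illustration) shows how many individual array objects such a mapping allocates:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">// Rough sketch (not part of HebiCam): counts the Java array objects
// allocated by the default byte[1080][1920][3] mapping of a 1080p image.
public class ArrayObjectCount {
    public static long countArrays(int height, int width) {
        // 1 outer array + one byte[width][] per row + one byte[3] per pixel
        return 1L + height + (long) height * width;
    }
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>For a 1080p color image this comes out to more than two million array objects per frame, each with its own object header and the associated garbage collection pressure.</p>
</div>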
<div class="paragraph">
<p>As an additional complication, MATLAB stores image data in a different memory layout than most other libraries (e.g. OpenCV&#8217;s <code>Mat</code> or Java&#8217;s <code>BufferedImage</code>). While pixels are commonly stored interleaved in row-major order (<code>[height][width][channels]</code>), MATLAB stores images transposed and in column-major order (<code>[channels][width][height]</code>). For example, if the Red-Green-Blue pixels of a <code>BufferedImage</code> are laid out as <code>[RGB][RGB][RGB]&#8230;&#8203;</code>, the same image is laid out as <code>[RRR&#8230;&#8203;][GGG&#8230;&#8203;][BBB&#8230;&#8203;]</code> in MATLAB. Depending on the resolution, this conversion can become fairly expensive.</p>
</div>
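<div class="paragraph">
<p>The reordering itself can be sketched in a few nested loops (a hypothetical helper shown for clarity; as noted later, we ended up using OpenCV functions for this step):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">// Sketch (hypothetical helper, not HebiCam's actual code): reorders an
// interleaved row-major RGB buffer into MATLAB's planar, transposed,
// column-major layout.
public class LayoutConverter {
    // src is [row][col][RGB] interleaved; dst is [RRR...][GGG...][BBB...],
    // with each channel plane stored column by column.
    public static byte[] toMatlabLayout(byte[] src, int h, int w) {
        byte[] dst = new byte[src.length];
        for (int c = 0; c != 3; c++) {
            for (int col = 0; col != w; col++) {
                for (int row = 0; row != h; row++) {
                    dst[c * w * h + col * h + row] = src[(row * w + col) * 3 + c];
                }
            }
        }
        return dst;
    }
}</code></pre>
</div>
</div>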
<div class="paragraph">
<p>In order to process images at a frame rate of 30 fps in real-time, the total time budget of the main MATLAB thread is 33ms per cycle. Thus, the acquisition overhead imposed on the main thread needs to be sufficiently low, i.e., a low number of milliseconds, to leave enough time for the actual processing.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_data_translation">Data Translation</h2>
<div class="sectionbody">
<div class="paragraph">
<p>We benchmarked five different ways to get image data from Java into MATLAB and compared their respective overhead on the main MATLAB thread. We omitted overhead incurred by background threads because it had no effect on the time budget available for image processing.</p>
</div>
<div class="paragraph">
<p>The full benchmark code is available <a href="https://github.com/HebiRobotics/HebiCam/tree/benchmark">here</a>.</p>
</div>
<div class="paragraph">
<p><strong>1. Default 3D Array</strong></p>
</div>
<div class="paragraph">
<p>By default MATLAB image matrices convert to <code>byte[height][width][channels]</code> Java arrays. However, when converting back to MATLAB there are some additional problems:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><code>byte</code> gets converted to <code>int8</code> instead of <code>uint8</code>, resulting in an invalid image matrix</p>
</li>
<li>
<p>changing the type back to <code>uint8</code> is somewhat messy because the <code>uint8(matrix)</code> cast sets all negative values to zero, and the alternative <code>typecast(matrix, 'uint8')</code> only works on vectors</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Thus, converting the data to a valid image matrix still requires several operations.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">% (1) Get matrix from byte[height][width][channels]
data = getRawFormat3d(this.javaConverter);
[height,width,channels] = size(data);

% (2) Reshape matrix to vector
vector = reshape(data, width * height * channels, 1);

% (3) Cast int8 data to uint8
vector = typecast(vector, 'uint8');

% (4) Reshape vector back to original shape
image = reshape(vector, height, width, channels);</code></pre>
</div>
</div>
<div class="paragraph">
<p><strong>2. Compressed 1D Array</strong></p>
</div>
<div class="paragraph">
<p>A common approach to move image data across distributed components (e.g. <a href="http://www.ros.org/">ROS</a>) is to encode the individual images using <a href="https://en.wikipedia.org/wiki/Motion_JPEG">MJPEG</a> compression. Doing this within a single process is obviously wasteful, but we included it because it is common practice in many distributed systems. Since MATLAB did not offer a way to decompress JPEG images in memory, we needed to save the compressed data to a file located on a RAM disk.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">% (1) Get compressed data from byte[]
data = getJpegData(this.javaConverter);

% (2) Save as jpeg file
fileID = fopen('tmp.jpg','w+');
fwrite(fileID, data, 'int8');
fclose(fileID);

% (3) Read jpeg file
image = imread('tmp.jpg');</code></pre>
</div>
</div>
<div class="paragraph">
<p><strong>3. Java Layout as 1D Pixel Array</strong></p>
</div>
<div class="paragraph">
<p>Another approach is to copy the pixel array of Java&#8217;s <code>BufferedImage</code> and to reshape the memory using MATLAB. This is also the accepted answer for <a href="https://mathworks.com/matlabcentral/answers/100155-how-can-i-convert-a-java-image-object-into-a-matlab-image-matrix#answer_109503">How can I convert a Java Image object to a MATLAB image matrix?</a>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">% (1) Get data from byte[] and cast to correct type
data = getJavaPixelFormat1d(this.javaConverter);
data = typecast(data, 'uint8');
[h,w,c] = size(this.matlabImage); % get dim info

% (2) Reshape matrix for indexing
pixelsData = reshape(data, 3, w, h);

% (3) Transpose and convert from row major to col major format (RGB case)
image = cat(3, ...
    transpose(reshape(pixelsData(3, :, :), w, h)), ...
    transpose(reshape(pixelsData(2, :, :), w, h)), ...
    transpose(reshape(pixelsData(1, :, :), w, h)));</code></pre>
</div>
</div>
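<div class="paragraph">
<p>On the Java side, the backing pixel array of a <code>BufferedImage</code> can be obtained without a per-pixel copy through its raster (standard <code>java.awt.image</code> API; the wrapper class here is illustrative):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

// Illustrative helper: exposes the single contiguous byte[] that backs
// a byte-based BufferedImage (e.g. TYPE_3BYTE_BGR).
public class PixelArrayAccess {
    public static byte[] getPixels(BufferedImage img) {
        return ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
    }
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>Note that <code>TYPE_3BYTE_BGR</code> images store their channels in BGR order, which is why the MATLAB snippet above indexes the color planes in reverse.</p>
</div>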
<div class="paragraph">
<p><strong>4. MATLAB Layout as 1D Pixel Array</strong></p>
</div>
<div class="paragraph">
<p>The fourth approach also copies a single pixel array, but this time the pixels are already stored in the MATLAB convention.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">% (1) Get data from byte[] and cast to correct type
data = getMatlabPixelFormat1d(this.javaConverter);
[h,w,c] = size(this.matlabImage);  % get dim info
vector = typecast(data, 'uint8');

% (2) Interpret pre-laid out memory as matrix
image = reshape(vector,h,w,c);</code></pre>
</div>
</div>
<div class="paragraph">
<p>Note that the most efficient way we found for converting the memory layout on the Java side was to use OpenCV&#8217;s <code>split</code> and <code>transpose</code> functions. The code can be found in <a href="https://github.com/HebiRobotics/HebiCam/blob/master/src/main/java/us/hebi/matlab/streaming/MatlabImageConverterBGR.java">MatlabImageConverterBGR</a> and <a href="https://github.com/HebiRobotics/HebiCam/blob/master/src/main/java/us/hebi/matlab/streaming/MatlabImageConverterGrayscale.java">MatlabImageConverterGrayscale</a>.</p>
</div>
<div class="paragraph">
<p><strong>5. MATLAB Layout as Shared Memory</strong></p>
</div>
<div class="paragraph">
<p>The fifth approach is the same as the fourth with the difference that the Java translation layer is bypassed entirely by using shared memory via <code><a href="https://mathworks.com/help/matlab/ref/memmapfile.html">memmapfile</a></code>. Shared memory is typically used for inter-process communication, but it can also be used within a single process. Running within the same process also simplifies synchronization since MATLAB can access Java locks.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">% (1) Lock memory
lock(this.javaObj);

% (2) Force a copy of the data
image = this.memFile.Data.pixels * 1;

% (3) Unlock memory
unlock(this.javaObj);</code></pre>
</div>
</div>
<div class="paragraph">
<p>Note that the code could be interrupted (ctrl+c) at any line, so the locking mechanism would need to be able to recover from bad states, or the unlocking would need to be guaranteed by using a destructor or <a href="https://mathworks.com/help/matlab/ref/oncleanup.html">onCleanup</a>.</p>
</div>
<div class="paragraph">
<p>The multiplication by one forces a copy of the data. This is necessary because under-the-hood <code>memmapfile</code> only returns a reference to the underlying memory.</p>
</div>
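<div class="paragraph">
<p>The Java side of this approach can be sketched with a memory-mapped file (simplified; the class and file handling are made up for illustration, and the actual HebiCam code adds locking and more bookkeeping):</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Simplified sketch: the Java side maps a file and writes pixel data into
// it; MATLAB maps the same file via memmapfile and reads the bytes without
// going through the Java/MATLAB conversion layer.
public class SharedPixelBuffer {
    private final MappedByteBuffer buffer;

    public SharedPixelBuffer(String path, int numBytes) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            // the mapping stays valid after the file handle is closed
            buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, numBytes);
        }
    }

    public synchronized void write(byte[] pixels) {
        buffer.position(0);
        buffer.put(pixels); // becomes visible to MATLAB's memmapfile view
    }
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>On the MATLAB side, the same file can then be mapped with, e.g., <code>memmapfile(path, 'Format', {'uint8', [h w c], 'pixels'})</code>, which matches the access pattern shown in the snippet above.</p>
</div>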
</div>
</div>
<div class="sect1">
<h2 id="_results">Results</h2>
<div class="sectionbody">
<div class="paragraph">
<p>All benchmarks were run in MATLAB 2017b on an <a href="https://www.intel.com/content/www/us/en/products/boards-kits/nuc/kits/nuc6i7kyk.html">Intel NUC6I7KYK</a>. The performance was measured using MATLAB&#8217;s <code><a href="https://mathworks.com/help/matlab/ref/timeit.html">timeit</a></code> function. The background color of each cell in the result tables represents a rough classification of the overhead on the main MATLAB thread.</p>
</div>
<table class="tableblock frame-topbot grid-all spread">
<caption class="title">Table 1. Color classification</caption>
<colgroup>
<col style="width: 33.3333%;">
<col style="width: 33.3333%;">
<col style="width: 33.3334%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top" style="background-color: white;">Color</th>
<th class="tableblock halign-left valign-top" style="background-color: white;">Overhead</th>
<th class="tableblock halign-left valign-top" style="background-color: white;">At 30 FPS</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>Green</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&lt;10%</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&lt;3.3 ms</p>
</div></div></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>Yellow</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&lt;50%</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&lt;16.5 ms</p>
</div></div></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>Orange</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&lt;100%</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&lt;33.3 ms</p>
</div></div></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>Red</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&gt;100%</p>
</div></div></td>
<td class="tableblock halign-left valign-top" style="background-color: white;"><div><div class="paragraph">
<p>&gt;33.3 ms</p>
</div></div></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>The two tables below show the results for converting color (RGB) images as well as grayscale images. All measurements are in milliseconds.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://ennerf.github.io/images/streaming/table_performance.svg" alt="table performance.svg" width="100%">
</div>
<div class="title">Figure 2. Conversion overhead on the MATLAB thread in [ms]</div>
</div>
<br>
<div class="paragraph">
<p>The results show that the default conversion, as well as jpeg compression, are essentially non-starters for color images. For grayscale images, the default conversion works significantly better due to the fact that the data is stored in a much more efficient 2D array (<code>byte[height][width]</code>), and that there is no need to re-order pixels by color. Unfortunately, we currently don&#8217;t have a good explanation for the ~10x cost increase (rather than ~4x) between 1080p and 4K grayscale. The behavior was the same across computers and various different memory settings.</p>
</div>
<div class="paragraph">
<p>When copying the backing array of a <code>BufferedImage</code> we can see another significant performance increase due to the data being stored in a single contiguous array. At this point much of the overhead comes from re-ordering pixels, so by doing the conversion beforehand, we can get another 2-3x improvement.</p>
</div>
<div class="paragraph">
<p>Lastly, although shared memory access combined with the locking overhead results in a slightly higher fixed cost, the copying itself is significantly cheaper, resulting in another 2-3x speedup for high-resolution images. Overall, going through shared memory scales very well and would even allow streaming of 4K color images from two cameras simultaneously.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_final_notes">Final Notes</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Our main takeaway was that although MATLAB&#8217;s Java interface can be inefficient for certain cases, there are simple workarounds that can remove most bottlenecks. The most important rule is to avoid converting to and from large multi-dimensional matrices whenever possible.</p>
</div>
<div class="paragraph">
<p>Another insight was that shared memory provides a very efficient way to transfer large amounts of data to and from MATLAB. We also found it useful for inter-process communication between multiple MATLAB instances. For example, one instance can track a target while another instance can use its output for real-time control. This is useful for avoiding coupling a fast control loop to the (usually lower) frame rate of a camera or sensor.</p>
</div>
<div class="paragraph">
<p>As for our initial motivation, after creating <a href="https://github.com/HebiRobotics/HebiCam">HebiCam</a> we were able to develop and reliably run the entire demo in MATLAB. The video below shows the setup using old-generation S-Series actuators.</p>
</div>
<div class="videoblock">
<div class="content">
<iframe src="https://www.youtube.com/embed/R0nQSxt8uic?rel=0" frameborder="0" allowfullscreen></iframe>
</div>
</div>
<link rel="stylesheet" href="https://cdn.rawgit.com/ennerf/ennerf.github.io/master/resources/highlight.js/9.9.0/styles/matlab.css">
</div>
</div>]]></description><link>https://ennerf.github.io/2017/10/14/Efficient-Data-Acquisition-in-MATLAB-Streaming-HD-Video-in-Real-Time.html</link><guid isPermaLink="true">https://ennerf.github.io/2017/10/14/Efficient-Data-Acquisition-in-MATLAB-Streaming-HD-Video-in-Real-Time.html</guid><category><![CDATA[MATLAB]]></category><category><![CDATA[ MATLAB-Java Interface]]></category><category><![CDATA[ shared memory]]></category><category><![CDATA[ computer vision]]></category><category><![CDATA[ OpenCV]]></category><category><![CDATA[ JavaCV]]></category><category><![CDATA[ FFMpeg]]></category><dc:creator><![CDATA[Florian Enner]]></dc:creator><pubDate>Sat, 14 Oct 2017 00:00:00 GMT</pubDate></item><item><title><![CDATA[Using MATLAB for hardware-in-the-loop prototyping #1 : Message Passing Systems]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>MATLAB&#174; is a programming language and environment designed for scientific computing. It is one of the best languages for developing robot control algorithms and is widely used in the research community. While it is often thought of as an offline programming language, there are several ways to interface with it to control robotic hardware 'in the loop'. As part of our own development, we surveyed a number of different projects that accomplish this by using a <a href="https://en.wikipedia.org/wiki/Message_passing">message passing</a> system and we compared the approaches they took. This post focuses on bindings for the following message passing frameworks: LCM, ROS, DDS, and ZeroMQ.</p>
</div>
<div class="paragraph">
<p>The main motivation for using MATLAB to prototype directly on real hardware is to dramatically accelerate the development cycle by reducing the time it takes to find out whether an algorithm can withstand ubiquitous real-world problems like noisy and poorly-calibrated sensors, imperfect actuator controls, and unmodeled robot dynamics. Additionally, a workflow that requires researchers to port prototype code to another language before being able to test on real hardware can often lead to weeks or months being lost in chasing down new technical bugs introduced by the port. Finally, programming in a language like C++ can pose a significant barrier to controls engineers who often have a strong electro-mechanical background but are not as strong in computer science or software engineering.</p>
</div>
<div class="paragraph">
<p>We have also noticed that over the past few years several other groups in the robotics community have experienced these problems and have started to develop ways to control hardware directly from MATLAB.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_the_need_for_external_languages">The Need for External Languages</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The main limitation when trying to use MATLAB to interface with hardware stems from the fact that its scripting language is fundamentally single threaded. It has been designed to allow non-programmers to do complex math operations without needing to worry about programming concepts like multi-threading or synchronization.  However, this poses a problem for real-time control of hardware because all communication is forced to happen synchronously in the main thread. For example, if a control loop runs at 100Hz and it takes a message ~8ms for a round-trip, the main thread ends up wasting 80% of the available time budget waiting for a response without doing any actual work.</p>
</div>
<div class="paragraph">
<p>A second hurdle is that while MATLAB is very efficient in the execution of math operations, it is not particularly well suited for byte manipulation. This makes it difficult to develop code that can efficiently create and parse binary message formats that the target hardware can understand. Thus, after having the main thread spend its time waiting for and parsing the incoming data, there may not be any time left for performing interesting math operations.</p>
</div>
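<div class="paragraph">
<p>As a concrete example, consider decoding a small binary feedback packet (the layout here is invented: a little-endian uint16 message id followed by three float32 positions). In Java this is a one-liner per field, while pure MATLAB would need <code>typecast</code> calls and index arithmetic for every field:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch with an invented packet layout: uint16 id + three float32 values.
public class FeedbackParser {
    public static float[] parsePositions(byte[] packet) {
        ByteBuffer buf = ByteBuffer.wrap(packet).order(ByteOrder.LITTLE_ENDIAN);
        int messageId = buf.getShort() & 0xFFFF; // header (validation omitted)
        float[] positions = new float[3];
        for (int i = 0; i != 3; i++) {
            positions[i] = buf.getFloat();
        }
        return positions;
    }
}</code></pre>
</div>
</div>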
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/matlab/comms-single-threaded.png"><img src="https://ennerf.github.io/images/matlab/comms-single-threaded.png" alt="comms single threaded.png" width="100%"></a>
</div>
<div class="title">Figure 1. Communications overhead in the main MATLAB thread</div>
</div>
<div class="paragraph">
<p>Pure MATLAB implementations can work for simple applications, such as interfacing with an Arduino to gather temperature data or blink an LED, but they are not feasible for controlling complex robotic systems (e.g. a humanoid) at high rates (e.g. 100Hz-1KHz). Fortunately, MATLAB does have the ability to interface with other programming languages, which allows users to create background threads that offload the communications aspect from the main thread.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/matlab/comms-multi-threaded.png"><img src="https://ennerf.github.io/images/matlab/comms-multi-threaded.png" alt="comms multi threaded.png" width="100%"></a>
</div>
<div class="title">Figure 2. Communications overhead offloaded to other threads</div>
</div>
<div class="paragraph">
<p>Out of the box MATLAB provides two interfaces to other languages:  <a href="https://www.mathworks.com/help/matlab/matlab_external/introducing-mex-files.html">MEX</a> for calling C/C++ code, and the <a href="https://www.mathworks.com/help/matlab/matlab_external/product-overview.html">Java Interface</a> for calling Java code. There are some differences between the two, but at the end of the day the choice effectively comes down to personal preference. Both provide enough capabilities for developing sophisticated interfaces and have orders of magnitude better performance than required.  There are additional interfaces to <a href="https://www.mathworks.com/help/matlab/calling-external-functions.html">other languages</a>, but those require additional setup steps.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_message_passing_frameworks">Message Passing Frameworks</h2>
<div class="sectionbody">
<div class="paragraph">
<p><a href="https://en.wikipedia.org/wiki/Message_passing">Message passing</a> frameworks such as <a href="http://www.ros.org/">Robot Operating System (ROS)</a> and <a href="https://lcm-proj.github.io/">Lightweight Communication and Marshalling (LCM)</a> have been widely adopted in the robotics research community. At the core they typically consist of two parts: a way to exchange data between processes (e.g. UDP/TCP), as well as a defined binary format for encoding and decoding the messages. They allow systems to be built with distributed components (e.g. processes) that run on different computers, different operating systems, and different programming languages.</p>
</div>
<div class="paragraph">
<p>The resulting systems are very extensible and provide convenient ways for prototyping. For example, a component communicating with a physical robot can be exchanged with a simulator without affecting the rest of the system. Similarly, a new walking controller could be implemented in MATLAB and communicate with external processes (e.g. robot comms) through the exchange of messages.  With ROS and LCM in particular, their flexibility, wide-spread adoption, and support for different languages make them a nice starting point for a MATLAB-hardware interface.</p>
</div>
<div class="sect2">
<h3 id="_lightweight_communication_and_marshalling_lcm">Lightweight Communication and Marshalling (LCM)</h3>
<div class="paragraph">
<p><a href="https://lcm-proj.github.io/tut_matlab.html">LCM</a> was developed in 2006 at <a href="http://www.mit.edu/">MIT</a> for their entry to DARPA&#8217;s Urban Challenge. In recent years it has become a popular alternative to ROS-messaging, and it was, as far as we know, the first message passing framework for robotics that supported MATLAB as a core language.</p>
</div>
<div class="paragraph">
<p>The snippet below shows what the MATLAB code for sending a command message might look like. The code creates a struct-like <em>message</em>, sets desired values, and publishes it on an appropriate channel.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% MATLAB code for sending an LCM message
% Setup
lc = lcm.lcm.LCM.getSingleton();

% Fill message
cmd = types.command();
cmd.position = [1 2 3];
cmd.velocity = [1 2 3];

% Publish
lc.publish('COMMAND_CHANNEL', cmd);</code></pre>
</div>
</div>
<div class="paragraph">
<p>Interestingly, the backing implementation of these bindings was done in pure Java and did not contain any actual MATLAB code. The exposed interface consisted of two Java classes as well as auto-generated message types.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>The <a href="https://github.com/lcm-proj/lcm/blob/master/lcm-java/lcm/lcm/LCM.java">LCM</a> class provides a way to publish messages and subscribe to channels</p>
</li>
<li>
<p>The generated Java messages handle the binary encoding and expose fields that MATLAB can access</p>
</li>
<li>
<p>The <a href="https://github.com/lcm-proj/lcm/blob/master/lcm-java/lcm/lcm/MessageAggregator.java">MessageAggregator</a> class provides a way to receive messages on a background thread and queue them for MATLAB.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Thus, even though the snippet looks similar to MATLAB code, all variables are actually Java objects. For example, the struct-like <em>command</em> type is a Java object that exposes public fields as shown in the snippet below. Users can access them the same way as fields of a standard MATLAB struct (or class properties) resulting in nice syntax. The types are automatically converted according to the <a href="https://mathworks.com/help/matlab/matlab_external/passing-data-to-java-methods.html">type mapping</a>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">/**
 * Java class that behaves like a MATLAB struct
 */
public final class command implements lcm.lcm.LCMEncodable
{
    public double[] position;
    public double[] velocity;
    // etc. ...
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>Receiving messages is done by subscribing an <em>aggregator</em> to one or more channels. The aggregator receives messages from a background thread and stores them in a queue that MATLAB can access in a synchronous manner using <em>aggregator.getNextMessage()</em>. Each message contains the raw bytes as well as some metadata for selecting an appropriate type for decoding.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% MATLAB code for receiving an LCM message
% Setup
lc = lcm.lcm.LCM.getSingleton();
aggregator = lcm.lcm.MessageAggregator();
lc.subscribe('FEEDBACK_CHANNEL', aggregator);

% Continuously check for new messages
timeoutMs = 1000;
while true

    % Receive raw message
    msg = aggregator.getNextMessage(timeoutMs);

    % Ignore timeouts
    if ~isempty(msg)

        % Select message type based on channel name
        if strcmp('FEEDBACK_CHANNEL', char(msg.channel))

            % Decode raw bytes to a usable type
            fbk = types.feedback(msg.data);

            % Use data
            position = fbk.position;
            velocity = fbk.velocity;

        end

    end
end</code></pre>
</div>
</div>
<div class="paragraph">
<p>The snippet below shows a simplified version of the backing Java code for the aggregator class. Since Java methods are limited to a single return value, the <em>getNextMessage</em> call returns a Java type that contains the received bytes as well as metadata to identify the type, i.e., the source channel name.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">/**
 * Java class for receiving messages in the background
 */
public class MessageAggregator implements LCMSubscriber {

    /**
     * Value type that combines multiple return arguments
     */
    public static class Message {

        final public byte[] data; // raw bytes
        final public String channel; // source channel name

        public Message(String channel_, byte[] data_) {
            data = data_;
            channel = channel_;
        }
    }

    /**
     * Method that gets called from MATLAB to receive new messages
     */
    public synchronized Message getNextMessage(long timeout_ms) {

        if (!messages.isEmpty()) {
            return messages.removeFirst();
        }

        if (timeout_ms == 0) { // non-blocking
            return null;
        }

        // Wait for new message until timeout ...
    }

}</code></pre>
</div>
</div>
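<div class="paragraph">
<p>The elided wait itself is conceptually just a monitor wait with a deadline. The self-contained sketch below shows how such a queue could be implemented. Note that this is our own approximation, not the actual LCM source; <em>addMessage</em> stands in for LCM&#8217;s background receive callback.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">import java.util.ArrayDeque;

/**
 * Sketch of a message queue with a blocking wait-with-timeout,
 * approximating the elided logic above (not the actual LCM source)
 */
public class MessageQueue {

    private final ArrayDeque messages = new ArrayDeque();

    /** Called from the background receive thread */
    public synchronized void addMessage(Object message) {
        messages.addLast(message);
        notifyAll(); // wake up a blocked getNextMessage() call
    }

    /** Called from MATLAB; returns null on timeout (0 = non-blocking) */
    public synchronized Object getNextMessage(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (messages.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining > 0) {
                try {
                    wait(remaining); // releases the monitor while waiting
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            } else {
                return null; // non-blocking call or timeout expired
            }
        }
        return messages.removeFirst();
    }
}</code></pre>
</div>
</div>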
<div class="paragraph">
<p>Note that the <em>getNextMessage</em> method requires a timeout argument. In general it is important for blocking Java methods to have a timeout in order to prevent the main thread from getting stuck permanently. Being in a Java call prohibits users from aborting the execution (ctrl-c), so timeouts should be reasonably short, i.e., in the low seconds. Otherwise this could cause the UI to become unresponsive and users may be forced to close MATLAB without being able to save their workspace. Passing in a timeout of zero serves as a non-blocking interface that immediately returns empty if no messages are available. This is often useful for working with multiple aggregators or for integrating asynchronous messages with unknown timing, such as user input.</p>
</div>
<div class="paragraph">
<p>Overall, we thought that this was a well-thought-out API and a great example of a minimum viable interface that works well in practice. By receiving messages on a background thread and by moving the encoding and decoding steps to the Java language, the main thread is able to spend most of its time actually working with the data. The minimalistic implementation is comparatively simple, and we would recommend it as a starting point for developing similar interfaces.</p>
</div>
<div class="paragraph">
<p>Some minor points for improvement that we found were:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>The decoding step <em>fbk = types.feedback(msg.data)</em> forces two unnecessary translations due to <em>msg.data</em> being a <em>byte[]</em>, which automatically gets converted to and from <em>int8</em>. This could result in a noticeable performance hit when receiving larger messages (e.g. images) and could be avoided by adding an overload that accepts a non-primitive type that does not get translated, e.g., <em>fbk = types.feedback(msg)</em>.</p>
</li>
<li>
<p>The Java classes did not implement <a href="https://mathworks.com/help/matlab/matlab_external/save-and-load-java-objects-to-mat-files.html">Serializable</a>, which could become bothersome when trying to save the workspace.</p>
</li>
<li>
<p>We would prefer to select the decoding type during the subscription step, e.g., <em>lc.subscribe('FEEDBACK_CHANNEL', aggregator, 'types.feedback')</em>, rather than requiring users to instantiate the type manually. This would clean up the parsing code a bit and allow for a less confusing error message if types are missing.</p>
</li>
</ul>
</div>
</div>
<div class="sect2">
<h3 id="_robot_operating_system_ros">Robot Operating System (ROS)</h3>
<div class="paragraph">
<p><a href="http://www.ros.org">ROS</a> is by far the most widespread messaging framework in the robotics research community and has been officially supported by MathWorks&#8217; <a href="https://www.mathworks.com/products/robotics.html">Robotics System Toolbox</a> since 2014. While the Simulink code generation uses ROS C++, the MATLAB implementation is built on the less common RosJava.</p>
</div>
<div class="paragraph">
<p>The API was designed such that each topic requires dedicated publishers and subscribers, which is different from LCM, where each subscriber may listen to multiple channels/topics. While this can result in more subscriber objects, specifying the expected type at initialization removes much of the boilerplate code necessary for dealing with message types.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% MATLAB code for publishing a ROS message
% Setup Publisher
chatpub = rospublisher('/chatter', 'std_msgs/String');

% Fill message
msg = rosmessage(chatpub);
msg.Data = 'Some test string';

% Publish
chatpub.send(msg);</code></pre>
</div>
</div>
<div class="paragraph">
<p>Subscribers support three different styles to access messages: blocking calls, non-blocking calls, and callbacks.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% MATLAB code for receiving a ROS message
% Setup Subscriber
laser = rossubscriber('/scan');

% (1) Blocking receive
scan = laser.receive(1); % timeout [s]

% (2) Non-blocking latest message (may not be new)
scan = laser.LatestMessage;

% (3) Callback
callback = @(msg) disp(msg);
subscriber = rossubscriber('/scan', callback);</code></pre>
</div>
</div>
<div class="paragraph">
<p>In contrast to LCM, all objects that are visible to users are actual MATLAB classes. Even though the implementation uses Java underneath, all exposed functionality is wrapped in MATLAB classes that hide the Java calls. For example, each message type is associated with a generated wrapper class. The code below shows a simplified example of a wrapper for a message that has a <em>Name</em> property.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% MATLAB code for wrapping a Java message type
classdef WrappedMessage

    properties (Dependent)
        % Exposed to users like a regular MATLAB property
        Name
    end

    properties (Access = protected)
        % The underlying Java message object (hidden from user)
        JavaMessage
    end

    methods

        function name = get.Name(obj)
            % value = msg.Name;
            name = char(obj.JavaMessage.getName);
        end

        function set.Name(obj, name)
            % msg.Name = value;
            validateattributes(name, {'char'}, {}, 'WrappedMessage', 'Name');
            obj.JavaMessage.setName(name); % Forward to Java method
        end

        function out = doSomething(obj)
            % msg.doSomething() and doSomething(msg)
            try
                out = obj.JavaMessage.doSomething(); % Forward to Java method
            catch javaException
                throw(WrappedException(javaException)); % Hide Java exception
            end
        end

    end
end</code></pre>
</div>
</div>
<div class="paragraph">
<p>Due to the implementation being closed-source, we were only able to look at the public toolbox files as well as the compiled Java bytecode. As far as we could tell they built a small Java library that wrapped RosJava functionality in order to provide an interface that is easier to call from MATLAB. Most of the actual logic seemed to be implemented in MATLAB code, but we also found several calls to various Java libraries for problems that would have been difficult to implement in pure MATLAB, e.g., listing networking interfaces or doing in-memory decompression of images.</p>
</div>
<div class="paragraph">
<p>Overall, we found that the ROS support toolbox looked very nice and was a great example of how seamlessly external languages can be integrated with MATLAB. We also really liked that they offered a way to load log files (rosbags).</p>
</div>
<div class="paragraph">
<p>One concern we had was that there did not seem to be a simple non-blocking way to check for new messages, e.g., a <em>hasNewMessage()</em> method or functionality equivalent to LCM&#8217;s <em>getNextMessage(0)</em>. We often found this useful for applications that combined data from multiple topics that arrived at different rates (e.g. sensor feedback and joystick input events). We checked whether this behavior could be emulated by specifying a very small timeout in the <em>receive</em> method (shown in the snippet below), but any value below 0.1s seemed to never successfully return.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% Trying to check whether a new message has arrived without blocking
try
    msg = sub.receive(0.1); % below 0.1s always threw an error
    % ... use message ...
catch ex
    % ignore
end</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_data_distribution_service_dds">Data Distribution Service (DDS)</h3>
<div class="paragraph">
<p>In 2014 MathWorks also added a <a href="https://www.mathworks.com/hardware-support/rti-dds.html">support package for DDS</a>, which is the messaging middleware that ROS 2.0 is based on. It supports MATLAB and Simulink, as well as code generation. Unfortunately, we did not have all the requirements to get it set up, and we could not find much information about the underlying implementation. After looking at some of the intro videos, we believe that the resulting code should look as follows.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% MATLAB code for sending and receiving DDS messages
% Setup
DDS.import('ShapeType.idl','matlab');
dp = DDS.DomainParticipant;

% Create message
myTopic = ShapeType;
myTopic.x = int32(23);
myTopic.y = int32(35);

% Send Message
dp.addWriter('ShapeType', 'Square');
dp.write(myTopic);

% Receive message
dp.addReader('ShapeType', 'Square');
readTopic = dp.read();</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_zeromq">ZeroMQ</h3>
<div class="paragraph">
<p>ZeroMQ is another asynchronous messaging library that is popular for building distributed systems. It only handles the messaging aspect, so users need to supply their own wire format. <a href="https://github.com/smcgill3/zeromq-matlab">ZeroMQ-matlab</a> is a MATLAB interface to ZeroMQ that was developed at UPenn between 2013 and 2015. We were not able to find much documentation, but as far as we could tell the resulting code should look similar to the following snippet.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-matlab" data-lang="matlab">%% MATLAB code for sending and receiving ZeroMQ data
% Setup
subscriber = zmq( 'subscribe', 'tcp', '127.0.0.1', 43210 );
publisher = zmq( 'publish', 'tcp', 43210 );

% Publish data
bytes = uint8(rand(100,1));
nbytes = zmq( 'send', publisher, bytes );

% Receive data
receiver = zmq('poll', 1000); % polls for next message
[recv_data, has_more] = zmq( 'receive', receiver );

disp(char(recv_data));</code></pre>
</div>
</div>
<div class="paragraph">
<p>It was implemented as a single MEX function that selects appropriate sub-functions based on a string argument. State was maintained by using socket IDs that were passed in by the user at every call. The code below shows a simplified snippet of the send action.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-c++" data-lang="c++">// Parsing the selected ZeroMQ action behind the MEX barrier
// Grab command String
if ( !(command = mxArrayToString(prhs[0])) )
	mexErrMsgTxt("Could not read command string. (1st argument)");

// Match command String with desired action (e.g. 'send')
if (strcasecmp(command, "send") == 0){
	// ... (argument validation)

	// retrieve arguments
	socket_id = *( (uint8_t*)mxGetData(prhs[1]) );
	size_t n_el = mxGetNumberOfElements(prhs[2]);
	size_t el_sz = mxGetElementSize(prhs[2]);
	size_t msglen = n_el*el_sz;

	// send data
	void* msg = (void*)mxGetData(prhs[2]);
	int nbytes = zmq_send( sockets[ socket_id ], msg, msglen, 0 );

	// ... check outcome and return
}
// ... other actions</code></pre>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_other_frameworks">Other Frameworks</h3>
<div class="paragraph">
<p>Below is a list of APIs for other frameworks that we looked at but could not cover in more detail.</p>
</div>
<table class="tableblock frame-all grid-all spread">
<colgroup>
<col style="width: 25%;">
<col style="width: 75%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">Project</th>
<th class="tableblock halign-left valign-top">Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p><a href="https://github.com/ragavsathish/RabbitMQ-Matlab-Client">RabbitMQ-Matlab-Client</a></p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>Simple Java wrapper for RabbitMQ with callbacks into MATLAB</p>
</div></div></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p><a href="https://sourceforge.net/projects/urbi/?source=typ_redirect">URBI</a> (<a href="http://agents.csse.uwa.edu.au/aibosig/resources/downloads/tutorial_liburbiMatlab_0.1.pdf">tutorial</a>)</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>Seems to be deprecated</p>
</div></div></td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_final_notes">Final Notes</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Contrary to the situation a few years ago, nowadays there exist interfaces for most of the common message passing frameworks that allow researchers to do at least basic hardware-in-the-loop prototyping directly from MATLAB. However, if none of the available options work for you and you are planning on developing your own, we recommend the following:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>If there is no clear pre-existing preference between C++ and Java, we recommend starting with a Java implementation. MEX interfaces require a lot of conversion code that Java interfaces handle automatically.</p>
</li>
<li>
<p>We would recommend starting with a minimalistic LCM-like implementation and then adding complexity as necessary.</p>
</li>
<li>
<p>While interfaces that only expose MATLAB code can provide a better and more consistent user experience (e.g. help documentation), there is a significant cost associated with maintaining all of the involved layers. We would recommend holding off on creating MATLAB wrappers until the API is relatively stable.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Finally, even though message passing systems are very widespread in the robotics community, they do have drawbacks and are not appropriate for every application. Future posts in this series will focus on some of the alternatives.</p>
</div>
<link rel="stylesheet" href="https://cdn.rawgit.com/ennerf/ennerf.github.io/master/resources/highlight.js/9.9.0/styles/matlab.css">
</div>
</div>]]></description><link>https://ennerf.github.io/2017/06/25/Using-MATLAB-for-hardware-in-the-loop-prototyping-1-Message-Passing-Systems.html</link><guid isPermaLink="true">https://ennerf.github.io/2017/06/25/Using-MATLAB-for-hardware-in-the-loop-prototyping-1-Message-Passing-Systems.html</guid><category><![CDATA[MATLAB]]></category><category><![CDATA[ ROS]]></category><category><![CDATA[ LCM]]></category><category><![CDATA[ DDS]]></category><category><![CDATA[ ZeroMQ]]></category><category><![CDATA[ MEX]]></category><category><![CDATA[ Java]]></category><dc:creator><![CDATA[Florian Enner]]></dc:creator><pubDate>Sun, 25 Jun 2017 00:00:00 GMT</pubDate></item><item><title><![CDATA[Analyzing the viability of Ethernet and UDP for robot control]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Ethernet is the most pervasive communication standard in the world. However, it is often dismissed for robotics applications because of its presumed non-deterministic behavior. In this article, we show that in practice Ethernet can be made to be extremely deterministic and provide a flexible and reliable solution for robot communication.</p>
</div>
<div class="paragraph">
<p>The network topologies and traffic patterns used to control robotic systems exhibit different characteristics from those studied in traditional networking research, which focuses on large, ad-hoc networks.
Below we present results from a number of tests and benchmarks, involving over 100 million transmitted packets. Over the course of all of our tests, no packets were dropped or received out of order.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_technical_background">Technical Background</h2>
<div class="sectionbody">
<div class="paragraph">
<p>One of the primary concerns that roboticists have when considering technologies for real-time control is the predictability of latency. The worst-case latency tends to be more important than the overall throughput, so the possibility of latency spikes and packet loss represents a significant red flag for any communication standard.</p>
</div>
<div class="paragraph">
<p>Much of the prevalent hesitance towards using Ethernet for real-time control originated in the early days of networking. Nodes used to communicate over a single shared medium that employed a control method with random elements for arbitrating access (<a href="https://en.wikipedia.org/wiki/Carrier_sense_multiple_access_with_collision_detection">CSMA/CD</a>). When two Frames collided during a transmission, the senders backed off for random timeouts and attempted to retransmit. After a number of failed attempts, frames could be dropped entirely. By connecting more nodes through <a href="https://en.wikipedia.org/wiki/Ethernet_hub">Hubs</a> the <a href="https://en.wikipedia.org/wiki/Collision_domain">Collision Domain</a> was extended further, resulting in more collisions and less predictable behavior.</p>
</div>
<div class="paragraph">
<p>In a process that started in <a href="https://en.wikipedia.org/wiki/Kalpana_(company)">1990</a>, Hubs have been fully replaced with <a href="https://en.wikipedia.org/wiki/Network_switch">Switches</a> that have dedicated full-duplex (separate lines for transmitting and receiving) connections for each port. This separates segments and isolates collision domains, which eliminates any collisions that were happening on the physical (wire) level. CSMA/CD is still supported for backwards compatibility and half-duplex connections, but it is largely obsolete.</p>
</div>
<div class="paragraph">
<p>Using dedicated lines introduces additional buffering and overhead for forwarding Frames to intended receivers. As of 2016, virtually all Switches implement the <a href="https://en.wikipedia.org/wiki/Store_and_forward">Store-and-Forward</a> switching architecture in which Switches fully receive packets, store them in an internal buffer, and then forward them to the appropriate receiver port. This adds a latency cost that scales linearly with the number of Switches that a packet has to go through.
In the alternative <a href="https://en.wikipedia.org/wiki/Cut-through_switching">Cut-through</a> approach, Switches can forward packets immediately after receiving the target address, potentially resulting in lower latency. While this is sometimes used in latency-sensitive domains, such as financial trading, it generally can&#8217;t be found in consumer grade hardware. It is more difficult to implement, only works well if both ports negotiate the same speed, and requires the receiving port to be idle. The benefits are also less significant for smaller packets due to the requirement to buffer enough data to evaluate the target address.</p>
</div>
<div class="paragraph">
<p>Another problem that many roboticists are often concerned about is  <a href="https://en.wikipedia.org/wiki/Out-of-order_delivery">Out-of-Order Delivery</a>, which means that a sequence of packets coming from a single source may be received in a different order. This is relevant for communicating over the Internet, but generally does not apply to local networks without redundant routes and load balancing. Depending on the driver implementation it can theoretically happen on a local network, but we have yet to observe such a case.</p>
</div>
<div class="paragraph">
<p>There are several competing networking standards that are built on Ethernet and can guarantee enough determinism to be used in industrial automation (<a href="https://en.wikipedia.org/wiki/Industrial_Ethernet">Industrial Ethernet</a>). They achieve this by enforcing tight control over the network layout and by limiting the components that can be connected. However, even cheap consumer grade network equipment can produce very good results if the network is controlled in a similar manner.</p>
</div>
<div class="paragraph">
<p>Note that this is not a new concept. We found several resources that discussed similar findings more than a decade ago, e.g., <a href="http://www.embedded.com/design/connectivity/4023291/Real-Time-Ethernet">Real-Time-Ethernet</a> (2001), <a href="https://www.researchgate.net/publication/4232548_Real-time_performance_measurements_using_UDP_on_Windows_and_Linux">Real-time performance measurements using UDP on Windows and Linux</a> (2005), <a href="http://literature.rockwellautomation.com/idc/groups/literature/documents/wp/enet-wp002_-en-p.pdf">Evaluating Industrial Ethernet</a> (2007), and  <a href="http://www.embedded.com/electronics-blogs/cole-bin/4406659/1/Deterministic-networking&#8212;&#8203;from-niches-to-the-mainstream-">Deterministic Networking: from niches to the mainstream</a> (2013).</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_benchmark_setup">Benchmark Setup</h2>
<div class="sectionbody">
<div class="paragraph">
<p>A common way to benchmark networks is to set up two computers and have a sender transmit a message to a receiver that echoes it back. That way the sender can measure the <a href="https://en.wikipedia.org/wiki/Round-trip_delay_time">round-trip time (RTT)</a> and gather statistics about the network. This generally works well, but large operating system stacks and device drivers can potentially add a lot of variation. In an attempt to reduce unwanted jitter, we decided to set up a benchmark using two embedded devices instead.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/io-board.jpg"><img src="https://ennerf.github.io/images/udp/io-board.jpg" alt="io board.jpg" width="100%"></a>
</div>
<div class="title">Figure 1. HEBI Robotics I/O Board</div>
</div>
<div class="paragraph">
<p>Our startup <a href="http://hebirobotics.com/">HEBI Robotics</a> builds a variety of building blocks that enable quick development of custom robotic systems. We mainly focus on actuators, but we&#8217;ve also developed other devices such as the I/O Board shown in the picture above. Each board has 48 pins that serve a variety of functions (analog and digital I/O, PWM, Encoder input, etc.) that can be accessed remotely via network. We normally use them in conjunction with our actuators to interface with external devices, such as a gripper or pneumatic valve, or to get various sensor input into MATLAB.</p>
</div>
<div class="paragraph">
<p>Each device contains a 168MHz ARM microcontroller (STM32f407) and a 100 Mbit/s network port, so we found them to be very convenient for doing network tests. We selected two I/O Boards to act as the sender and receiver nodes and developed custom firmware in order to isolate the network stack. The resulting firmware was based on <a href="http://www.chibios.org/">ChibiOS 2.6.8</a> and <a href="http://savannah.nongnu.org/projects/lwip/">lwIP 1.4.1</a>. The relevant code pieces can be found <a href="https://gist.github.com/ennerf/36a57d432bcff20a58efcdee10f91bd9">here</a>. The elapsed time was measured using a hardware counter with a resolution of 250ns.</p>
</div>
<div class="paragraph">
<p>Since there was no way to store multiple Gigabytes on these devices, we decided to log data remotely using a UDP service that receives measurement data and persists it to disk (see <a href="https://gist.github.com/ennerf/0ddc4396d15852d28e4eca4a8a923eb7">code</a>). In order to avoid stalls caused by disk I/O, the main socket handler wrote into a double-buffered structure that got persisted by a background thread. The synchronization between the threads was done using a <a href="http://stuff-gil-says.blogspot.com/2014/11/writerreaderphaser-story-about-new.html">WriterReaderPhaser</a>, which is a synchronization primitive that allows readers to flip buffers while keeping writers wait-free. We found this primitive to be very useful for persisting events that are represented by small amounts of data.</p>
</div>
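<div class="paragraph">
<p>The core idea of the double-buffered structure is the buffer flip. The sketch below illustrates it with plain locking for clarity; note that this is our own simplified illustration, whereas the actual implementation kept writers wait-free by using the WriterReaderPhaser.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">import java.util.ArrayList;

/**
 * Simplified illustration of double-buffered logging. This version
 * uses plain locking; the actual implementation used a wait-free
 * WriterReaderPhaser instead.
 */
public class DoubleBufferedLog {

    private ArrayList active = new ArrayList();   // filled by the socket handler
    private ArrayList inactive = new ArrayList(); // drained by the disk thread

    /** Called by the socket handler for every incoming measurement */
    public synchronized void record(long measurement) {
        active.add(measurement);
    }

    /**
     * Called periodically by the background thread. Returns the full
     * buffer, which must be persisted and cleared before the next flip.
     */
    public synchronized ArrayList flip() {
        ArrayList full = active;
        active = inactive;
        inactive = full;
        return full;
    }
}</code></pre>
</div>
</div>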
<div class="paragraph">
<p>The step by step flow was as follows:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Sender wakes up at a fixed rate, e.g., 100Hz</p>
</li>
<li>
<p>Sender increments sequence number</p>
</li>
<li>
<p>Sender measures time ("transmit timestamp") and sends packet to receiver</p>
</li>
<li>
<p>Receiver echoes packet back to sender</p>
</li>
<li>
<p>Sender receives packet and measures time ("receive timestamp")</p>
</li>
<li>
<p>Sender sends measurement to logging server</p>
</li>
<li>
<p>Logging server receives measurement and persists to disk</p>
</li>
</ol>
</div>
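<div class="paragraph">
<p>For reference, the sender side of this loop (steps 1 through 5) can be approximated on a desktop machine with standard UDP sockets. The sketch below is our own host-side equivalent for illustration; the actual firmware ran on ChibiOS/lwIP and used a hardware counter instead of <em>System.nanoTime()</em>.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

/**
 * Host-side approximation of the echo benchmark sender
 * (illustration only, not the embedded firmware)
 */
public class EchoBenchmark {

    /** Sends one packet and measures the time until the echo returns */
    public static long measureRoundTripNanos(DatagramSocket socket, InetAddress receiver,
                                             int port, byte[] payload) throws Exception {
        DatagramPacket request = new DatagramPacket(payload, payload.length, receiver, port);
        DatagramPacket response = new DatagramPacket(new byte[payload.length], payload.length);

        long transmitTime = System.nanoTime(); // "transmit timestamp"
        socket.send(request);                  // send packet to receiver
        socket.receive(response);              // blocks until the echo arrives
        long receiveTime = System.nanoTime();  // "receive timestamp"

        return receiveTime - transmitTime;     // round-trip time
    }
}</code></pre>
</div>
</div>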
<div class="paragraph">
<p>The resulting binary data was loaded into MATLAB&#174; for analysis and visualization. The code for reading the binary file can be found <a href="https://gist.github.com/ennerf/19b48406a066f6e946a0567a1a4de1ed">here</a>.</p>
</div>
<div class="paragraph">
<p>The round-trip time is the difference between the receive and transmit timestamps. We also recorded the sequence number of each packet and the IP address of the receiver node in order to detect packet loss and track ordering.</p>
</div>
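<div class="paragraph">
<p>The loss and ordering checks only need the logged sequence numbers. A post-processing sketch (our own illustration, not the actual analysis code) might look like:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">/**
 * Sketch of a post-processing check for packet loss and reordering
 * based on logged sequence numbers (illustration only)
 */
public class SequenceCheck {

    /** Counts missing sequence numbers (i.e. dropped packets) */
    public static int countDropped(long[] sequenceNumbers) {
        int dropped = 0;
        for (int i = 1; i != sequenceNumbers.length; i++) {
            long gap = sequenceNumbers[i] - sequenceNumbers[i - 1];
            if (gap > 1) {
                dropped += gap - 1; // gaps larger than one indicate losses
            }
        }
        return dropped;
    }

    /** Verifies that sequence numbers are strictly increasing */
    public static boolean isInOrder(long[] sequenceNumbers) {
        for (int i = 1; i != sequenceNumbers.length; i++) {
            if (sequenceNumbers[i] - sequenceNumbers[i - 1] > 0) continue;
            return false; // equal or decreasing: duplicate or out of order
        }
        return true;
    }
}</code></pre>
</div>
</div>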
</div>
</div>
<div class="sect1">
<h2 id="_udp_datagram_size">UDP datagram size</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Sending a UDP datagram over Ethernet incurs a minimum of 66 bytes of overhead for framing and headers. Additionally, Ethernet Frames have a minimum size of 84 bytes, which makes the minimum payload for a UDP Datagram 18 bytes. The rough structure is shown below. More detailed information can be found at <a href="https://en.wikipedia.org/wiki/Ethernet_frame">Ethernet II</a>,  <a href="https://en.wikipedia.org/wiki/IPv4">Internet Protocol (IPv4)</a>, and <a href="https://en.wikipedia.org/wiki/User_Datagram_Protocol">User Datagram Protocol (UDP)</a>.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/ethernet-ip-udp-header.png"><img src="https://ennerf.github.io/images/udp/ethernet-ip-udp-header.png" alt="ethernet ip udp header.png" width="100%"></a>
</div>
<div class="title">Figure 2. UDP / IPv4 / Ethernet II Frame Structure</div>
</div>
<div class="paragraph">
<p>Although this overhead may seem high for traditional automation applications with small payloads (&lt;10 bytes), it quickly amortizes when communicating with smarter devices. For example, each one of our <a href="http://hebirobotics.com/products/">X-Series</a> actuators contains more than 40 sensors (position, velocity, torque, 3-axis gyroscope, 3-axis accelerometer, several temperature sensors, etc.) that get combined into a single packet that uses between 185 and 215 bytes of payload. Typical feedback packets from an I/O Board are even larger and require about 300 bytes. When comparing overhead it is also important to consider the available bandwidth, e.g., sending 100 bytes over Gigabit Ethernet (or even 100 Mbit/s) tends to be faster than sending a single byte using traditional non-Ethernet based alternatives such as RS485 or CAN Bus.</p>
</div>
<div class="paragraph">
<p>For these benchmarks we chose to measure the round-trip time for a payload of 200 bytes. After including all overhead, the actual size on the wire is 266 bytes, 12 of which are the idle interframe gap. The theoretical time it takes to transmit the remaining 254 bytes over 100 Mbit/s and 1Gbit/s Ethernet is 20.3us and 2.03us respectively.</p>
</div>
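<div class="paragraph">
<p>These numbers can be reproduced with a short calculation. The sketch below (class and method names are ours) assumes that the 12 byte interframe gap counts toward the wire size but, being idle line time, not toward the transmission time:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">/**
 * Worked example of the framing overhead for a UDP payload
 */
public class WireTime {

    static final int PREAMBLE_SFD   = 8;  // preamble + start-of-frame delimiter
    static final int ETH_HEADER     = 14; // destination + source MAC, EtherType
    static final int IP_HEADER      = 20; // IPv4 without options
    static final int UDP_HEADER     = 8;
    static final int ETH_FCS        = 4;  // frame check sequence
    static final int INTERFRAME_GAP = 12; // mandatory idle time between frames

    /** Total bytes occupied on the wire, including the interframe gap */
    public static int wireBytes(int payload) {
        return payload + PREAMBLE_SFD + ETH_HEADER + IP_HEADER
                + UDP_HEADER + ETH_FCS + INTERFRAME_GAP;
    }

    /** Transmission time in microseconds, excluding the idle interframe gap */
    public static double transmitMicros(int payload, double bitsPerSecond) {
        int transmitted = wireBytes(payload) - INTERFRAME_GAP;
        return transmitted * 8 / bitsPerSecond * 1e6;
    }
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>For a 200 byte payload this yields 266 bytes on the wire and transmission times of roughly 20.3us at 100 Mbit/s and 2.03us at 1Gbit/s.</p>
</div>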
<div class="paragraph">
<p>Note that while the size is representative of a typical actuator feedback packet, the round-trip times in production may be faster because outgoing packets (commands) tend to be significantly smaller than response packets (feedback).</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_baseline_single_switch">Baseline - Single Switch</h2>
<div class="sectionbody">
<div class="paragraph">
<p>We can establish a baseline of the best-case round-trip time by having the sender and receiver nodes communicate with each other through a single Switch that does not see any external traffic. We did not set up a point-to-point connection without any Switches because the logging server needed to be on the same network and because we rarely see this case in practice.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/setup-baseline.png"><img src="https://ennerf.github.io/images/udp/setup-baseline.png" alt="setup baseline.png" width="100%"></a>
</div>
<div class="title">Figure 3. Baseline setup using single Switch</div>
</div>
<div class="paragraph">
<p>We set the frequency to 100Hz and logged data for ~24 hours. We chose this frequency because it is a common control rate for sending high-level trajectories, and because 10ms is a safe deadline in case there are large outliers. During normal operations we typically used rates between 100-200Hz for updating the setpoints of controllers that get executed on-board each device (e.g. position/velocity/torque), and rates of up to 1KHz when bypassing local controllers and remotely controlling the output (e.g. PWM). The network would technically support even higher rates, but there are usually other limitations that come into play at around 1KHz (e.g. OS scheduler and limited sensor polling rates).</p>
</div>
<div class="paragraph">
<p>First, we looked at the jitter of the underlying embedded real-time operating system (RTOS). The figure below shows the difference between an idealized signal that ticks every 10ms and the measured transmit timestamps. 99% are within the lowest measurement resolution (250ns), and the total observed range is slightly below 6us. Note that this is significantly better than the 150us base jitter range we observed on real-time Linux (see <a href="https://ennerf.github.io/2016/09/20/A-Practical-Look-at-Latency-in-Robotics-The-Importance-of-Metrics-and-Operating-Systems.html">The Importance of Metrics and Operating Systems</a>).</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/os-jitter-embedded.png"><img src="https://ennerf.github.io/images/udp/os-jitter-embedded.png" alt="os jitter embedded.png" width="100%"></a>
</div>
<div class="title">Figure 4. OS jitter of ChibiOS 2.6.8 on STM32F407 (24h)</div>
</div>
<div class="paragraph">
<p>The two figures below show the round-trip time for all packets and the corresponding percentile distribution. There were a total of 8.5 million messages. None of them were lost and none of them arrived out of order.</p>
</div>
<div id="img-rtt-24h" class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/rtt-baseline.png"><img src="https://ennerf.github.io/images/udp/rtt-baseline.png" alt="rtt baseline.png" width="100%"></a>
</div>
<div class="title">Figure 5. RTT for 200 byte payload (24h)</div>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/rtt-baseline-zoomed.png"><img src="https://ennerf.github.io/images/udp/rtt-baseline-zoomed.png" alt="rtt baseline zoomed.png" width="100%"></a>
</div>
<div class="title">Figure 6. Zoomed in view of RTT for 200 byte payload (10min)</div>
</div>
<div class="paragraph">
<p>90% of all packets arrived within 194us, with less than 1 microsecond of jitter. Roughly 80us of this time was spent on the wire, so using chips that support Gigabit (rather than 100Mbit) could lower the round-trip time to ~120us. Beyond the common case, there were three periodically recurring modes that resulted in additional latency.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Mode 1 occurs consistently every ~5.3 minutes and lasts for ~15.01 seconds. During this time it adds up to 4 us of latency.</p>
</li>
<li>
<p>Mode 2 occurs exactly once every 5 seconds and is always at 210 us.</p>
</li>
<li>
<p>Mode 3 occurs roughly once an hour and adds linearly increasing latency of up to 150 us to 10 packets.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>The zoomed-in view of a 10 minute time span highlights Modes 1 and 2. All three modes seemed to be tied to wall-clock time and independent of the packet rate and count. We were unable to find the root cause of these modes, but after several tests we strongly suspected that all of them were caused by the programmed firmware rather than by the Switch or the protocol itself.</p>
</div>
<div class="paragraph">
<p>Overall this initial data looked very promising for being able to use UDP for many real-time control tasks. With more tuning and a better implementation (e.g. lwIP with zero-copy and tuned options) it seems likely that the maximum jitter could be reduced to below 6us, and maybe even below 1us.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_switching_cost">Switching Cost</h2>
<div class="sectionbody">
<div class="paragraph">
<p>As mentioned in the background section, most modern Switches use the 'store-and-forward' approach, which requires the Switch to fully receive a packet before forwarding it. Therefore, the latency cost per Switch is the time a packet spends on the wire plus any switching overhead. The wire time is constant (2.03us at 1 Gbit/s or 20.3us at 100 Mbit/s for 266 bytes), but the overhead depends on the Switch implementation. It can be difficult to find good performance data for specific devices, so you may need to conduct your own benchmarks to evaluate hardware against your requirements.</p>
</div>
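<div class="paragraph">
<p>The wire time follows directly from the frame size and link speed. The sketch below shows the arithmetic; the 266 byte frame size is an assumption (~200 byte payload plus headers), and the result depends on which framing overheads (preamble, inter-frame gap) are counted, so the numbers differ slightly from the figures quoted above.</p>
</div>

```java
// On-wire (serialization) time of an Ethernet frame at a given link speed.
public class WireTime {

    public static double onWireMicros(int frameBytes, long bitsPerSecond) {
        return frameBytes * 8.0 * 1e6 / bitsPerSecond;
    }

    public static void main(String[] args) {
        // ~200 byte payload plus UDP/IP/Ethernet headers, assumed 266 frame bytes
        System.out.printf("100 Mbit/s: %.2f us%n", onWireMicros(266, 100_000_000L));   // 21.28 us
        System.out.printf("1 Gbit/s:   %.2f us%n", onWireMicros(266, 1_000_000_000L)); // 2.13 us
    }
}
```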
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/setup-switching-cost.png"><img src="https://ennerf.github.io/images/udp/setup-switching-cost.png" alt="setup switching cost.png" width="100%"></a>
</div>
<div class="title">Figure 7. Benchmark setup with additional Switch</div>
</div>
<div class="paragraph">
<p>For this benchmark we tested the following three Switches, adding each individually to the baseline setup as shown above:</p>
</div>
<div class="ulist">
<ul>
<li>
<p><a href="http://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8863MLL_FLL_RLL_DS.pdf">MICREL KSZ8863</a> (embedded in X-Series actuators)</p>
</li>
<li>
<p><a href="http://www.downloads.netgear.com/files/GDC/GS105/GS105_datasheet_04Sept03.pdf">NETGEAR ProSAFE GS105</a></p>
</li>
<li>
<p><a href="https://routerboard.com/RB750Gr2">MikroTik RB750Gr2 (RouterBOARD hEX)</a> (technically a Router, but disabling DHCP makes it act similar to a Switch)</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>In total there were about 1 million packets. Again, we did not observe any packet loss or out-of-order delivery.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/rtt-switch-comparison.png"><img src="https://ennerf.github.io/images/udp/rtt-switch-comparison.png" alt="rtt switch comparison.png" width="100%"></a>
</div>
<div class="title">Figure 8. Comparison of RTT through different Switches (35min)</div>
</div>
<div class="paragraph">
<p>The figure below shows a zoomed view of the time series highlighting the added jitter characteristics. Modes 1 and 3 do not seem to be affected by additional Switches. Mode 2 remains constant at 210 us and disappears for higher round-trip times, indicating an issue on the receive side of the sender.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/comparison-switch-latency.png"><img src="https://ennerf.github.io/images/udp/comparison-switch-latency.png" alt="comparison switch latency.png" width="100%"></a>
</div>
<div class="title">Figure 9. Zoomed in view of Switch comparison (10min)</div>
</div>
<div class="paragraph">
<p>The KSZ8863 and the RB750Gr2 add constant switching latencies of 2.9 us and 3.6 us respectively, on top of the additional wire times of 40.6 us and 4.06 us that their links add to the RTT. The added jitter seems to be negligible at well below 1us. Surprisingly, the GS105 seems to have problems with this use case, resulting in higher latency and more jitter than the KSZ8863 even though it was connected via Gigabit. More details are in the table below.</p>
</div>
<table class="tableblock frame-all grid-all spread">
<colgroup>
<col style="width: 50%;">
<col style="width: 16.6666%;">
<col style="width: 16.6666%;">
<col style="width: 16.6668%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">Switch</th>
<th class="tableblock halign-left valign-top">Connection</th>
<th class="tableblock halign-left valign-top">90%-ile RTT</th>
<th class="tableblock halign-left valign-top">Overhead (not-on-wire)</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>Baseline</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>2x 100 MBit/s</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>193.8 us</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>112.6 us</p>
</div></div></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>MICREL KSZ8863</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>100 Mbit/s</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>+43.5 us</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>2.9 us</p>
</div></div></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>NETGEAR ProSAFE GS105</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>1 Gbit/s</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>+51.0 us</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>47 us</p>
</div></div></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>MikroTik RB750Gr2 (RouterBOARD hEX)</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>1 Gbit/s</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>+7.7 us</p>
</div></div></td>
<td class="tableblock halign-left valign-top"><div><div class="paragraph">
<p>3.6 us</p>
</div></div></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>According to the <a href="http://www.downloads.netgear.com/files/GDC/GS105/GS105_datasheet_04Sept03.pdf">GS105 spec sheet</a>, the added network latency should be below 10us for 1 Gbit/s and 20us for 100 Mbit/s connections. We did additional tests and the GS105 did seem to perform according to spec when using exclusively 100 Mbit/s or 1 Gbit/s on all ports.</p>
</div>
<div class="paragraph">
<p>We also conducted another baseline test that replaced the GS105 with a RB750Gr2. While we found a consistent improvement of 0.5us, we did not consider this significant enough to rerun all tests.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_scaling_to_many_devices">Scaling to Many Devices</h2>
<div class="sectionbody">
<div class="paragraph">
<p>So far all tests were measuring the round-trip time between a sender node and a single target node. Since real robotic systems can contain many devices, e.g., one per axis or degree of freedom, we also looked at how UDP performs with multiple devices on the same network. In conversations with other roboticists we often found an expectation that there would be significant packet loss if multiple packets were to arrive at a Switch at the same time. The worst case would occur if all devices were connected to a single Switch as shown below.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/setup-bursting.png"><img src="https://ennerf.github.io/images/udp/setup-bursting.png" alt="setup bursting.png" width="100%"></a>
</div>
<div class="title">Figure 10. Multiple devices connected to a single Switch</div>
</div>
<div class="paragraph">
<p>In order to test the actual behavior we set up a test consisting of 40 HEBI Robotics I/O boards connected to a single 48-port Ethernet Switch (<a href="http://www.downloads.netgear.com/files/GDC/GS748Tv1/GS748T_ds_03Feb05.pdf">GS748T</a>). All devices were running the same (receiver) firmware as before, so sending a single broadcast message triggered 40 response packets that caused more than 10 KB of total traffic to arrive at the Switch, occasionally within less than 250 nanoseconds. These <a href="https://en.wikipedia.org/wiki/Micro-bursting_(networking)">Microbursts</a> were well beyond the sustainable bandwidth of Gigabit Ethernet. The setup shown below is representative of a high degree-of-freedom system, such as a full-body humanoid robot, without daisy-chaining.</p>
</div>
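<div class="paragraph">
<p>The burst size is easy to estimate. The sketch below assumes ~266 bytes per response frame (the payload used in the earlier benchmarks plus headers, an assumed figure); even a Gigabit uplink needs tens of microseconds to drain such a burst, so the Switch has to buffer most of it.</p>
</div>

```java
// Aggregate size of a response microburst and the time a single
// store-and-forward uplink port needs to drain it.
public class Microburst {

    public static int totalBytes(int devices, int frameBytes) {
        return devices * frameBytes;
    }

    public static double drainMicros(int devices, int frameBytes, long uplinkBps) {
        return totalBytes(devices, frameBytes) * 8.0 * 1e6 / uplinkBps;
    }

    public static void main(String[] args) {
        int devices = 40, frameBytes = 266; // assumed frame size
        System.out.println(totalBytes(devices, frameBytes) + " bytes"); // 10640 bytes
        System.out.printf("%.2f us%n", drainMicros(devices, frameBytes, 1_000_000_000L)); // 85.12 us
    }
}
```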
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/multiple-boards.jpg"><img src="https://ennerf.github.io/images/udp/multiple-boards.jpg" alt="multiple boards.jpg" width="100%"></a>
</div>
<div class="title">Figure 11. Network test setup with 40 HEBI Robotics I/O Boards</div>
</div>
<div class="paragraph">
<p>We would also like to mention that this setup heavily benefited from two side effects of using a standard Ethernet stack. First, there was no need for any manual addressing thanks to <a href="https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol">DHCP</a> and device-specific, globally unique MAC addresses. Second, we were able to re-program the firmware on all 40 devices simultaneously within 3-6 seconds because we had a bootloader with TCP/IP support. It would have been very tedious to set up such a system if any step had required manual intervention.</p>
</div>
<div class="paragraph">
<p>Since the combined responses resulted in more load than the sender device could comfortably handle, we replaced the sender I/O Board with a <a href="http://www.gigabyte.com/products/product-page.aspx?pid=4888#ov">Gigabyte Brix i7-4770R</a> desktop computer running Scientific Linux 6.6 with a real-time kernel. We set up the system as described in <a href="https://ennerf.github.io/2016/09/20/A-Practical-Look-at-Latency-in-Robotics-The-Importance-of-Metrics-and-Operating-Systems.html">The Importance of Metrics and Operating Systems</a> and disabled the firewall.</p>
</div>
<div class="paragraph">
<p>Running the benchmark at 100Hz for ~90 minutes resulted in more than 20 million measurements.</p>
</div>
<div class="paragraph">
<p>Again, we first looked at the jitter of the underlying operating system. The figure below shows the difference between an idealized signal that ticks every 10ms and the measured transmit timestamps. It shows that this setup suffers from more than an order of magnitude more jitter than the embedded RTOS. Note that the corresponding jHiccup control chart looks identical to the one in the OS blog post.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/os-jitter-linux-rt.png"><img src="https://ennerf.github.io/images/udp/os-jitter-linux-rt.png" alt="os jitter linux rt.png" width="100%"></a>
</div>
<div class="title">Figure 12. Operating system jitter of Scientific Linux 6.6 with MRG Realtime</div>
</div>
<div class="paragraph">
<p>The two figures below show the round-trip time for each measurement. It may be surprising, but there was again no packet loss or re-ordering of packets from any single source.</p>
</div>
<div class="paragraph">
<p>Rather than packets being dropped, what actually happened was that all packets were stored in the internal 1.6 MB buffer of the Switch, queued, and forwarded to the target port as fast as possible. Since the sender was connected via Gigabit, the packets arrived roughly every ~2us. The time axis in the chart is based on the transmit timestamp, so each cycle shows up as a vertical column in the graphs. We also conducted the same test at 1KHz and found identical results.</p>
</div>
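<div class="paragraph">
<p>This queueing behavior can be approximated with a simple drain model: if all N responses arrive at (nearly) the same instant, the i-th packet leaves the Switch roughly i wire times later. This is an idealized sketch, not a simulation of the actual Switch.</p>
</div>

```java
// Idealized store-and-forward queue: N equal-size packets arrive at once
// and drain back-to-back, so the i-th packet waits i extra wire times.
public class BurstQueue {

    public static double[] departureMicros(int packets, double wireMicrosPerPacket) {
        double[] t = new double[packets];
        for (int i = 0; i < packets; i++) {
            t[i] = (i + 1) * wireMicrosPerPacket; // packet 0 leaves after one wire time
        }
        return t;
    }

    public static void main(String[] args) {
        // 40 responses draining at ~2 us each over the Gigabit link to the sender
        double[] t = departureMicros(40, 2.0);
        System.out.printf("first: %.0f us, last: %.0f us%n", t[0], t[39]); // first: 2 us, last: 80 us
        // The ~80 us spread matches the vertical columns in the RTT charts.
    }
}
```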
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/rtt-linux-40x-zoomed.png"><img src="https://ennerf.github.io/images/udp/rtt-linux-40x-zoomed.png" alt="rtt linux 40x zoomed.png" width="100%"></a>
</div>
<div class="title">Figure 13. Zoomed in RTT for 40 devices</div>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/rtt-linux-40x.png"><img src="https://ennerf.github.io/images/udp/rtt-linux-40x.png" alt="rtt linux 40x.png" width="100%"></a>
</div>
<div class="title">Figure 14. RTT for 40 devices (90 min)</div>
</div>
<div class="paragraph">
<p>However, the amount of latency and jitter turned out to be worse than we anticipated. We expected most columns to start at around 180us and end at around 280us. While this was sometimes the case, the majority of columns started above 300 us. After some initial research we suspected that this delay was mostly caused by the Linux <a href="https://en.wikipedia.org/wiki/New_API">NAPI</a> using polling mode rather than interrupts, and by a low-cost network interface paired with suboptimal device drivers. While we expected the OS and driver stack to introduce additional latency and jitter, we were surprised by the order of magnitude.</p>
</div>
<div class="paragraph">
<p>The installed network interface and driver are shown below.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-shell" data-lang="shell">$ lspci | grep Ethernet
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)</code></pre>
</div>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-shell" data-lang="shell">$ sudo dmesg | grep "Ethernet driver"
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded</code></pre>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_conclusion">Conclusion</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Even consumer-grade Ethernet networks can exhibit very deterministic latency. In the more than 100 million packets that were sent for this blog post, we did not observe any packet loss or out-of-order delivery. Even when communicating with 40 smart devices, representing a total of 1,600 sensors, at a rate of 1KHz we found the network to be very reliable. While we still believe that large and dangerous industrial robots should be controlled using specialized industrial networking equipment, we feel that standard UDP is more than sufficient for most robotic applications.</p>
</div>
<div class="paragraph">
<p>We also found that most of the observed latency and jitter were caused by the underlying operating systems and their device drivers. To further illustrate this point we did additional comparisons of the baseline setup with the sender node running on different operating systems. The configurations were as follows:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>ChibiOS 2.6.8 with lwIP 1.4.1 on 168 MHz STM32F407</p>
</li>
<li>
<p>Windows 10 on Gigabyte Brix i7-4770R with Realtek NIC</p>
</li>
<li>
<p>Scientific Linux 6.6 with MRG Realtime on Gigabyte Brix i7-4770R with Realtek NIC</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>The two charts below show the round-trip time for each system communicating with a single I/O Board over a single Switch. Note that Linux and Windows were connected to the Switch via Gigabit and should have received datagrams ~40us before the embedded device.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/rtt-linux-1x-comparison-10h.png"><img src="https://ennerf.github.io/images/udp/rtt-linux-1x-comparison-10h.png" alt="rtt linux 1x comparison 10h.png" width="100%"></a>
</div>
<div class="title">Figure 15. Baseline RTT comparing RTOS vs RT-Linux vs Windows (10h)</div>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="https://ennerf.github.io/images/udp/rtt-linux-1x-comparison-10m.png"><img src="https://ennerf.github.io/images/udp/rtt-linux-1x-comparison-10m.png" alt="rtt linux 1x comparison 10m.png" width="100%"></a>
</div>
<div class="title">Figure 16. Zoomed in baseline RTT comparing RTOS vs RT-Linux vs Windows (10min)</div>
</div>
<div class="paragraph">
<p>We realize that there are many more interesting questions that were beyond the scope of this work. We are currently considering the following networking-related topics for future blog posts:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Comparison of device drivers and network interfaces from various vendors</p>
</li>
<li>
<p>Performance impact of uncontrolled traffic (e.g. streaming video)</p>
</li>
<li>
<p>Redundant routes and sudden disconnects</p>
</li>
<li>
<p>Controlling robots over wireless networks</p>
</li>
<li>
<p>Clock drift and time synchronization using IEEE 1588v2</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>If there are other topics that you think would be worth covering, please leave a note in the comment section. If you are working for a hardware vendor that specializes in low-latency networking equipment and would be willing to provide samples for evaluation, please contact us through our <a href="http://hebirobotics.com/contact/">website</a>.</p>
</div>
</div>
</div>]]></description><link>https://ennerf.github.io/2016/11/23/Analyzing-the-viability-of-Ethernet-and-UDP-for-robot-control.html</link><guid isPermaLink="true">https://ennerf.github.io/2016/11/23/Analyzing-the-viability-of-Ethernet-and-UDP-for-robot-control.html</guid><category><![CDATA[Latency]]></category><category><![CDATA[ Ethernet]]></category><category><![CDATA[ UDP]]></category><category><![CDATA[ real-time control]]></category><dc:creator><![CDATA[Florian Enner]]></dc:creator><pubDate>Wed, 23 Nov 2016 00:00:00 GMT</pubDate></item><item><title><![CDATA[A Practical Look at Latency in Robotics : The Importance of Metrics and Operating Systems]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>This is the first in a series of blog posts where I will try to share some of my own impressions and findings that have stemmed from several years of creating tools for robotics research.</p>
</div>
<div class="paragraph">
<p>Latency is an important practical concern in the robotics world that is often poorly understood. I feel that a better understanding of latency can help robotics researchers and engineers make design and architecture decisions that greatly streamline and accelerate the R&amp;D process. I&#8217;ve personally spent many hours looking for information on the latency characteristics of various robotic components, but have had difficulty finding anything that is clearly presented or backed by solid data. From what I&#8217;ve found, most benchmarks focus on the maximum throughput and either ignore the subject of latency or measure it incorrectly.</p>
</div>
<div class="paragraph">
<p>Because of this my first post is on the topic of latency and will cover two main topics:</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>The details of how you measure latency matters.</p>
</li>
<li>
<p>The OS that you use affects your latency.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>My own background is in academic research. I&#8217;ve spent several years as a staff software engineer at the Robotics Institute at Carnegie Mellon University and I am a co-founder of HEBI Robotics, a startup developing modular robotic components. The teams I&#8217;ve been part of have worked on many different types of robots, including <a href="https://youtu.be/heOXsEnGb20">collaborative manipulators</a>, <a href="https://www.youtube.com/watch?v=zaPtxre4tFc">wheeled robots</a>, <a href="https://www.youtube.com/watch?v=7Mh3kqxle1c">walking robots</a> and <a href="https://www.youtube.com/watch?v=DUgt3NwzN-c">snake robots</a>.</p>
</div>
<div class="paragraph">
<p>Over time, we have come to believe that robotics research can be greatly accelerated by designing systems that relax hard real-time requirements on the software that is exposed to users. One approach is to implement low-level control (motor control and safety features) in dedicated components that are decoupled from high-level control (position, velocity, torque, etc.). In many cases, this can enable users to leverage common consumer hardware and software tools that can accelerate development.</p>
</div>
<div class="paragraph">
<p>For example, the Biorobotics lab at Carnegie Mellon has researchers who tend to be mechanical or electrical engineers rather than computer scientists or software engineers. As such, they tend to be less familiar with Linux and C/C++ and much more comfortable with Windows/macOS and scripting languages like MATLAB. After our lab started providing cross-platform support and bindings for MATLAB (in ~2011), we saw a significant increase in research output that roughly doubled the lab&#8217;s paper publications related to snake robots. In particular, the lab has been able to develop and demonstrate complex new behaviors that would have been difficult to prototype beforehand (see <a href="https://youtu.be/NJ1FIsjt0yE">compliant control</a> or <a href="https://youtu.be/0CNQMiQnesc">inside pipes</a>).</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_measuring_latency">Measuring Latency</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Robots are controlled in <em>real-time</em>, which means that a command gets executed within a <em>deadline</em> (fixed period of time). There are <em>hard real-time</em> systems that must never exceed their deadline, and <em>soft real-time</em> systems that are able to occasionally exceed their deadline. Missing deadlines when performing motor control of a robot can result in unwanted motions and 'jerky' behavior.</p>
</div>
<div class="paragraph">
<p>Although there is a lot of information on the theoretical definition of these terms, it can be challenging to determine the maximum deadline (point at which a system&#8217;s performance starts to degrade or become unsafe) for practical applications. This is especially true for research institutions that build novel mechanisms and target cutting-edge applications. Many research groups end up assuming that everything needs to be hard real-time with very stringent deadlines. While this approach provides solid performance guarantees, it can also create a lot of unnecessary development overhead.</p>
</div>
<div class="paragraph">
<p>Many benchmarks and tools make the assumption that latency follows a <a href="https://en.wikipedia.org/wiki/Gaussian_function">Gaussian distribution</a> and report only the mean and the standard deviation. Unfortunately, latency tends to be very <a href="https://en.wikipedia.org/wiki/Multimodal_distribution">multi-modal</a>, and the most important part of the distribution when it comes to determinism is the 'outliers'. Even if a system&#8217;s latency behaves as expected in 99% of cases, the leftover 1% can be worse than all of the other 99% of measurements combined. Looking at only the mean and standard deviation completely fails to capture these more systemic issues. For example, I&#8217;ve seen many data sets where the worst observed case was more than 1000 standard deviations away from the mean. Such stutters are usually the main problem when working on real robotic systems.</p>
</div>
<div class="paragraph">
<p>Because of this, a more appropriate way to look at latency is via histograms and percentile plots, e.g., <em>"99.9% of measurements were below X ms"</em>. There are several good resources about recording latency out there that I recommend checking out, such as <a href="https://youtu.be/lJ8ydIuPFeU">How NOT to Measure Latency</a> or <a href="http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html">HdrHistogram: A better latency capture method</a>.</p>
</div>
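<div class="paragraph">
<p>To illustrate why percentiles capture outliers that mean and standard deviation hide, below is a toy nearest-rank percentile computation. It is only a stand-in for a proper tool; HdrHistogram (referenced above) records percentiles with bounded memory and far lower overhead.</p>
</div>

```java
import java.util.Arrays;

// Toy nearest-rank percentile: the smallest sample such that `percentile`
// percent of all samples are at or below it.
public class Percentiles {

    public static long valueAtPercentile(long[] samples, double percentile) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 999 samples at 1 ms plus a single 1000 ms stutter: the mean (~2 ms)
        // and standard deviation hide the outlier, the tail percentiles do not.
        long[] samples = new long[1000];
        Arrays.fill(samples, 1L);
        samples[999] = 1000L;
        System.out.println(valueAtPercentile(samples, 50.0));  // 1
        System.out.println(valueAtPercentile(samples, 99.9));
        System.out.println(valueAtPercentile(samples, 100.0)); // 1000
    }
}
```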
</div>
</div>
<div class="sect1">
<h2 id="_operating_systems">Operating Systems</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The operating system is at the base of everything. No matter how performant the high-level software stack is, the system is fundamentally bound by the capabilities of the OS, its scheduler, and the overall load on the system. Before you start optimizing your own software, you should make sure that your goal is actually achievable on the underlying platform.</p>
</div>
<div class="paragraph">
<p>There are trade-offs between always responding in a timely manner and overall performance, battery life, as well as several other concerns. Because of this, the major consumer operating systems don&#8217;t guarantee to meet hard deadlines and can theoretically have arbitrarily long pauses.</p>
</div>
<div class="paragraph">
<p>However, since using operating systems that users are familiar with can significantly ease development, it is worth evaluating their actual performance and capabilities. Even though there may not be any theoretical guarantees, the practical differences are often not noticeable.</p>
</div>
<div class="paragraph">
<p>Developing hard real-time systems has a lot of pitfalls and can require a lot of development effort. Requiring researchers to write hard real-time compliant code is not something that I would recommend.</p>
</div>
<div class="sect2">
<h3 id="_benchmark_setup">Benchmark Setup</h3>
<div class="paragraph">
<p><a href="https://www.azul.com">Azul Systems</a> sells products targeted at latency sensitive applications and they&#8217;ve created a variety of useful tools to measure latency. <a href="https://github.com/giltene/jHiccup">jHiccup</a> is a tool that measures and records system level latency spikes, which they call 'hiccups'. It measures the time for <em>sleep(1ms)</em> and records the delta to the fastest previously recorded sample. For example, if the fastest sample was 1ms, but it took 3ms to wake up, it will record a 2ms hiccup. Hiccups can be caused by a large number of reasons, including scheduling, paging, indexing, and many more. By running it on an otherwise idle system, we can get an idea of the behavior of the underlying platform. It can be started with the following command:</p>
</div>
<div class="listingblock">
<div class="content">
<pre># record logs each second for 48 hours
intervalMs=1000
runtimeMs=172800000
java -javaagent:jHiccup.jar="-d 0 -i ${intervalMs}" -cp jHiccup.jar org.jhiccup.Idle -t ${runtimeMs}</pre>
</div>
</div>
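<div class="paragraph">
<p>The core measurement idea is small enough to sketch. Note that this is a simplification of what jHiccup actually does: jHiccup measures against the fastest previously observed sample rather than the nominal sleep time, and records into an HdrHistogram, among other refinements.</p>
</div>

```java
// Minimal "hiccup meter": sleep for a fixed interval and record how much
// longer than expected the wakeup actually took.
public class HiccupMeter {

    public static long measureHiccupNanos(long sleepMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long elapsed = System.nanoTime() - start;
        // Anything beyond the nominal sleep time counts as a hiccup
        return Math.max(0L, elapsed - sleepMillis * 1_000_000L);
    }

    public static void main(String[] args) {
        long worst = 0L;
        for (int i = 0; i < 1000; i++) { // ~1 second of sampling
            worst = Math.max(worst, measureHiccupNanos(1));
        }
        System.out.printf("worst hiccup: %.3f ms%n", worst / 1e6);
    }
}
```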
<div class="paragraph">
<p>jHiccup uses <a href="https://github.com/HdrHistogram/HdrHistogram">HdrHistogram</a> to record samples and to generate the output log. There are a variety of tools and utilities for interacting and visualizing these logs. The graphs in this post were generated by my own <a href="https://github.com/ennerf/HdrHistogramVisualizer">HdrHistogramVisualizer</a>.</p>
</div>
<div class="paragraph">
<p>To run these tests, I setup two standard desktop computers, one for Mac tests and one for everything else.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Mac Mini 2014, i7-3720QM @ 2.6 GHz, 16 GB 1600 MHz DDR3</p>
</li>
<li>
<p>Gigabyte Brix BXi7-4770R, i7-4770R @ 3.2 GHz, 16 GB 1600 MHz DDR3</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Note that when doing latency tests on Windows it is important to be aware of the system timer. It has variable timer intervals that range from 0.5ms to 15.6ms. By calling <em>timeBeginPeriod</em> and <em>timeEndPeriod</em>, applications can notify the OS whenever they need a higher resolution. The timer interrupt is a global resource that gets set to the lowest interrupt interval requested by any application. For example, watching a video in Chrome requests a timer interrupt interval of 0.5ms. A lower period results in a more responsive system at the cost of overall throughput and battery life. <a href="https://vvvv.org/contribution/windows-system-timer-tool">System Timer Tool</a> is a little utility that lets you view the current state. jHiccup automatically requests a 1ms timer interval by calling Java&#8217;s <em>Thread.sleep()</em> with a value below 10ms.</p>
</div>
</div>
<div class="sect2">
<h3 id="_windows_mac_linux">Windows / Mac / Linux</h3>
<div class="paragraph">
<p>Let&#8217;s first look at the performance of consumer operating systems: Windows, Mac and Linux. Each test started from a clean install of the respective OS. The only two modifications to the stock installation were to disable sleep mode and to install JDK8 (update 101). I then started jHiccup, unplugged all external cables and let the computer sit 'idle' for &gt;24 hours. The actual OS versions were:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Windows 10 Enterprise, version 1511 (OS build: 10586.545)</p>
</li>
<li>
<p>OS X, version 10.9.5</p>
</li>
<li>
<p>Ubuntu 16.04 Desktop, kernel 4.4.0-31-generic</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Each image below contains two charts. The top section shows the worst hiccup that occurred within a given interval window, i.e., the first data point shows the worst hiccup within the first 3 minutes and the next data point shows the worst hiccup within the following 3 minutes. The bottom chart shows the percentiles of all measurements across the entire duration. Each 24 hour data set contains roughly 70-80 million samples.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="/images/os/osx-win-ubuntu_24h.png"><img src="https://ennerf.github.io/images/os/osx-win-ubuntu_24h.png" alt="Windows vs. Linux vs. Mac hiccups (24h)"></a>
</div>
<div class="title">Figure 1. Windows vs. Linux vs. Mac hiccups (24h)</div>
</div>
<div class="paragraph">
<p>These results show that Linux had fewer and lower outliers than Windows. Up to the 90th percentile all three systems respond relatively similarly, but there are significant differences at the higher percentiles. There also seems to have been a period of increased system activity on OSX after 7 hours. The chart below shows a zoomed-in view of a 10 minute time period starting at the 10 hour mark.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="/images/os/osx-win-ubuntu_10m.png"><img src="https://ennerf.github.io/images/os/osx-win-ubuntu_10m.png" alt="Windows vs. Linux vs. Mac hiccups (10 min)"></a>
</div>
<div class="title">Figure 2. Windows vs. Linux vs. Mac hiccups (10min)</div>
</div>
<div class="paragraph">
<p>Zoomed in, we can see that the Windows hiccups are actually very repeatable. 99.9% are below 2ms, but there are frequent spikes to relatively discrete values of up to 16ms. This also highlights the importance of looking at the details of the latency distribution. In other data sets that I&#8217;ve seen, it is rare for the worst case to be equal to the 99.99th percentile. It&#8217;s also interesting that the distribution for 10 minutes looks identical to the 24 hour chart. OSX shows similar behavior, but with lower spikes. Ubuntu 16.04 is comparatively quiet.</p>
</div>
<div class="paragraph">
<p>It&#8217;s debatable whether this makes any difference for robotic systems in practice. All of the systems I&#8217;ve worked with either had hard real-time requirements below 1ms, in which case none of these OSes would be sufficient, or they were soft real-time systems that could handle occasional hiccups of 25 or even 100 ms. I have yet to see one of our robotic systems perform perceivably worse on Windows than on Linux.</p>
</div>
</div>
<div class="sect2">
<h3 id="_real_time_linux">Real Time Linux</h3>
<div class="paragraph">
<p>Now that we have an understanding of how traditional systems perform without tuning, let&#8217;s take a look at the performance of Linux with a real-time kernel. The rt kernel (PREEMPT_RT patch) allows high priority tasks to preempt lower priority ones almost anywhere, which results in worse overall throughput, but more deterministic behavior with respect to latency.</p>
</div>
<div class="paragraph">
<p>I chose Scientific Linux 6 because of its support for <a href="https://access.redhat.com/products/red-hat-enterprise-mrg-realtime">Red Hat&#174; Enterprise MRG Realtime&#174;</a>. You can download the <a href="http://ftp.scientificlinux.org/linux/scientific/">ISO</a> and find instructions for installing MRG Realtime <a href="http://linux.web.cern.ch/linux/mrg/">here</a>. The version I tested was,</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Scientific Linux 6.6, kernel 3.10.0-327.rt56.194.el6rt.x86_64</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>Note that there is a huge number of tuning options that may improve the performance of your application. There are various tuning guides that can provide more information, e.g., Red Hat&#8217;s <a href="http://linux.web.cern.ch/linux/mrg/2.3/Red_Hat_Enterprise_MRG-2-Realtime_Tuning_Guide-en-US.pdf">MRG Realtime Tuning Guide</a>. I&#8217;m not very familiar with tuning systems at this level, so I&#8217;ve only applied the following small list of changes.</p>
</div>
<div class="ulist">
<ul>
<li>
<p><em>/boot/grub/menu.lst</em> &#8658; <em>transparent_hugepage=never</em></p>
</li>
<li>
<p><em>/etc/sysctl.conf</em> &#8658; <em>vm.swappiness=0</em></p>
</li>
<li>
<p><em>/etc/inittab</em> &#8658; <em>id:3:initdefault</em> (no GUI)</p>
</li>
<li>
<p><em>chkconfig --level 0123456 cpuspeed off</em></p>
</li>
</ul>
</div>
<div class="paragraph">
<p>The process priority was set to 98, which is the highest priority available for real-time threads. I&#8217;d advise consulting the documentation on
<a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Tuning_Guide/chap-Realtime-Specific_Tuning.html#Setting_scheduler_priorities">scheduler priorities</a> before deciding on priorities for tasks that actually use CPU time.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-shell" data-lang="shell"># find process id
pid=$(pgrep -f "[j]Hiccup.jar")

# show current priority
chrt -p $pid

# set priority
sudo chrt -p 98 $pid</code></pre>
</div>
</div>
<div class="paragraph">
<p>Below is a comparison of the two Linux variants.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="/images/os/ubuntu-scl_24h.png"><img src="https://ennerf.github.io/images/os/ubuntu-scl_24h.png" alt="Linux vs. RT Linux hiccups (24h)"></a>
</div>
<div class="title">Figure 3. Linux vs. RT Linux hiccups (24h)</div>
</div>
<div class="paragraph">
<p>Looking at the 24 hour chart (above) and the 10 minute chart (below), we can see that the worst case has gone down significantly. While Ubuntu 16.04 was barely visible when compared to Windows, it looks very noisy compared to the real-time variant. All measurements were within a 150us range, which is good enough for most applications.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="/images/os/ubuntu-scl_10m.png"><img src="https://ennerf.github.io/images/os/ubuntu-scl_10m.png" alt="Linux vs. RT Linux hiccups (10 min)"></a>
</div>
<div class="title">Figure 4. Linux vs. RT Linux hiccups (10 min)</div>
</div>
<div class="paragraph">
<p>I&#8217;ve also added the 24 hour chart for the real-time variant by itself to provide a better scale. Note that these values are getting close to the limit of what we can measure and record.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="/images/os/scl_24h.png"><img src="https://ennerf.github.io/images/os/scl_24h.png" alt="RT Linux hiccups (24h)"></a>
</div>
<div class="title">Figure 5. RT Linux hiccups (24h)</div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_summary">Summary</h2>
<div class="sectionbody">
<div class="paragraph">
<p>I&#8217;ve tried to provide a basic idea of the out-of-the-box performance of various off-the-shelf operating systems. In my experience the three major consumer OSes can be treated as roughly equal, i.e., either software will work well on all of them, or it won&#8217;t work correctly on any of them. If you do work on a problem that has hard deadlines, there are many different <a href="https://en.wikipedia.org/wiki/Comparison_of_real-time_operating_systems">RTOS</a> to choose from. Aside from the mentioned real-time Linux and the various embedded solutions, there are even real-time extensions for Windows, such as <a href="http://www.tenasys.com/overview-ifw">INtime</a> or <a href="http://kithara.com/en/products/realtime-suite">Kithara</a>.</p>
</div>
<div class="paragraph">
<p>We&#8217;ve had very good experiences with implementing the low-level control (PID loops, motor control, safety features, etc.) at the individual actuator level. That way all of the safety-critical and latency-sensitive pieces get handled by a dedicated RTOS and are independent of user code. The high-level controller (trajectories and multi-joint coordination) then only needs to update target setpoints (e.g. position/velocity/torque), which is far less sensitive to latency and doesn&#8217;t require hard real-time communications. This approach enables quick prototyping of high-level behaviors using 'non-deterministic' technologies, such as Windows, MATLAB and standard UDP messages.</p>
</div>
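<div class="paragraph">
<p>As a rough sketch of what such a high-level update can look like, the Java snippet below encodes a set-target message and sends it over UDP. The packet layout, address and port are entirely made up for illustration; they do not correspond to any actual actuator protocol.</p>
</div>

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.ByteBuffer;

public class SetTargetSender {

    // Encode a set-target message. The layout (joint id followed by
    // position/velocity/torque targets) is hypothetical.
    static byte[] encodeTarget(int jointId, double pos, double vel, double torque) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 3 * 8);
        buf.putInt(jointId);
        buf.putDouble(pos);     // [rad]
        buf.putDouble(vel);     // [rad/s]
        buf.putDouble(torque);  // [Nm]
        return buf.array();
    }

    public static void main(String[] args) throws Exception {
        byte[] msg = encodeTarget(1, 1.57, 0.0, 0.0);
        try (DatagramSocket socket = new DatagramSocket()) {
            // "10.10.10.2" and port 5000 stand in for an actuator's address.
            socket.send(new DatagramPacket(msg, msg.length,
                    InetAddress.getByName("10.10.10.2"), 5000));
        }
    }
}
```

<div class="paragraph">
<p>Because the actuator only receives setpoints, a dropped or delayed packet simply means it keeps tracking the previous target, which is what makes this split tolerant of non-deterministic transports.</p>
</div>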
<div class="paragraph">
<p>For example, the high-level control in <a href="https://youtu.be/zaPtxre4tFc">Teleop Taxi</a> was done over Wi-Fi from MATLAB running on Windows, while simultaneously streaming video from an Android phone in the back of the robot. By removing the requirement for a local control computer, it only took 20-30 lines of code (see  <a href="https://gist.github.com/ennerf/b349c56d320da1db89b298fd807f00e4">simplified</a>, <a href="https://gist.github.com/ennerf/7d59a9765da25ed7c02117da1805551c">full</a>) to run the entire demo. Actually using a local computer resulted in no perceivable benefit. While not every system can be controlled entirely through Wi-Fi, we&#8217;ve seen similar results even with more complex systems.</p>
</div>
<div class="sect2">
<h3 id="_latency_is_not_gaussian">Latency is not Gaussian</h3>
<div class="paragraph">
<p>Finally, I&#8217;d like to stress again that latency practically never follows a Gaussian distribution. For example, the maximum for OSX is more than 400 standard deviations away from the average. The table for these data sets is below.</p>
</div>
<table class="tableblock frame-all grid-all spread">
<colgroup>
<col style="width: 23.0769%;">
<col style="width: 15.3846%;">
<col style="width: 15.3846%;">
<col style="width: 15.3846%;">
<col style="width: 15.3846%;">
<col style="width: 15.3847%;">
</colgroup>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>Samples</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>Mean</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>StdDev</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>Max</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>(max-mean) /stddev</strong></p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>Windows 10</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">80,304,595</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.55 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.37</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">17.17 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">44.9</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>OSX 10.9.5</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">65,282,969</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.32 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.03</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">12.65 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">411</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>Ubuntu 16.04</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">78,039,162</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.10 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.01</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">3.03 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">293</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock"><strong>Scientific Linux 6.6-rt</strong></p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">79,753,643</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.08 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.01</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">0.15 ms</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">7</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>The figure below compares the data&#8217;s actual distribution for Windows to a theoretical Gaussian distribution. Rather than a classic 'bell curve', it shows several spikes that are spread apart at regular intervals. The distance between these spikes is almost exactly one millisecond, which matches the Windows timer interrupt interval that was set while gathering the data. Interestingly, the spikes above 2ms all seem to occur with roughly the same likelihood.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<a class="image" href="/images/os/windows-gaussian_distribution_24h.png"><img src="https://ennerf.github.io/images/os/windows-gaussian_distribution_24h.png" alt="Actual vs Gaussian Distribution for Windows"></a>
</div>
<div class="title">Figure 6. Actual Distribution compared to Gaussian-fit (Windows)</div>
</div>
<div class="paragraph">
<p>Using only mean and standard deviation for any sort of latency comparison can produce deceptive results. Aside from giving little to no information about the higher percentiles, there are many cases where systems with seemingly 'better' values exhibit worse actual performance.</p>
</div>
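<div class="paragraph">
<p>The effect behind the <em>(max-mean)/stddev</em> column is easy to reproduce. The sketch below uses made-up samples that loosely mimic the OSX row: a single large outlier among otherwise quiet readings lands roughly &#8730;n standard deviations from the mean, far beyond anything a Gaussian model would predict.</p>
</div>

```java
public class SigmaDistance {

    // Returns {mean, stddev, max} using the population standard deviation.
    static double[] stats(double[] xs) {
        double sum = 0, max = Double.NEGATIVE_INFINITY;
        for (double x : xs) { sum += x; max = Math.max(max, x); }
        double mean = sum / xs.length;
        double var = 0;
        for (double x : xs) { var += (x - mean) * (x - mean); }
        return new double[]{mean, Math.sqrt(var / xs.length), max};
    }

    public static void main(String[] args) {
        // Made-up data: 10,000 quiet readings of 0.32 ms plus a single
        // 12.65 ms outlier, loosely mimicking the OSX row above.
        double[] xs = new double[10_000];
        java.util.Arrays.fill(xs, 0.32);
        xs[0] = 12.65;
        double[] s = stats(xs);
        System.out.printf("(max-mean)/stddev = %.0f%n", (s[2] - s[0]) / s[1]);
    }
}
```

<div class="paragraph">
<p>Under a Gaussian assumption, the maximum of 10,000 samples would sit around 4 standard deviations out; here it sits around 100, which is why mean and standard deviation alone say so little about latency tails.</p>
</div>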
</div>
</div>
</div>]]></description><link>https://ennerf.github.io/2016/09/20/A-Practical-Look-at-Latency-in-Robotics-The-Importance-of-Metrics-and-Operating-Systems.html</link><guid isPermaLink="true">https://ennerf.github.io/2016/09/20/A-Practical-Look-at-Latency-in-Robotics-The-Importance-of-Metrics-and-Operating-Systems.html</guid><category><![CDATA[Latency]]></category><category><![CDATA[ Operating System]]></category><category><![CDATA[ Windows]]></category><category><![CDATA[ OSX]]></category><category><![CDATA[ Ubuntu]]></category><category><![CDATA[ Scientific Linux]]></category><category><![CDATA[ Sleep]]></category><category><![CDATA[ Real-Time]]></category><category><![CDATA[ HdrHistogram]]></category><dc:creator><![CDATA[Florian Enner]]></dc:creator><pubDate>Tue, 20 Sep 2016 00:00:00 GMT</pubDate></item></channel></rss>