
2018-05-09

Robot software should be modelled on cloud infrastructure

I was in the SF Bay Area recently, meeting with folks who fund companies and meeting with potential partners and customers of Capable Robot Components.

It was a great trip because of the conversations I had about robotic system design. Talking with new friends who have built, and are building, robotic systems helped me condense and clarify my thoughts and experiences here.

Over the last several years, I’ve been involved in the development of a wide variety of robotic systems, including:

  • Semi-autonomous robots which look for land mines and improvised explosive devices (IEDs)
  • Autonomous industrial floor scrubbers
  • Autonomous corn field fertilization robots
  • Autonomous robots and 3D sensors for condition assessment of sewers and other underground infrastructure
  • Humanoid robots for disaster response (specifically, the DARPA Robotics Challenge)

In all of these systems, technology is used to create functionality specific to the system’s application and environment. An important part of the development process, across hardware, software, and electronics, is a careful study of existing components to determine what can be bought and integrated and what needs to be built to enable the required functionality.

The founding of Capable Robot Components

It’s the similarity across these system developments which caused me to start Capable Robot Components. Looking back on these projects, I see duplication of effort where technology was built, and then built again, because the existing products and components did not meet domain, platform, or environmental requirements.

I see a future where capable robots are rapidly built from standardized components and I’m excited to help bring that future to our present.

Right now, I’m not going to dive deeply into the integrated hardware, electronics, and software I’m currently developing. Instead, I’m writing an echo of a repeated conversation I had out in the Bay Area.

Robot Design : From Monolith to Distributed

Historically, mobile robots were monoliths. On the hardware side, the necessary actuators and sensors were connected (first via analog control lines, then digital busses like RS-485 & CAN, and later USB and Ethernet) to a central computer or two. The software stack saw similar advancement, from monoliths like 3T to CARMEN, until the advent of ROS, the first peer-to-peer, distributed framework for building robots.

The driving principles of ROS (peer-to-peer, tools-based, multi-lingual, thin, Free & Open Source) have served it, and the larger community, exceptionally well. ROS has quickened the pace of robot system development due to the focus on modular software design and run-time configuration.

ROS brought underlying tooling that was good enough and flexible enough for real systems to be built upon. My personal opinion is that the key contributions of ROS were:

  • Easy standard and custom message description and serialization.
  • Data logging & playback tooling.
  • Transform Tree.
  • Standard ways of describing reconfigurable parameters and automatic generation of control interfaces for those parameters.
  • Focus on loose-coupling between software modules (through message topics), as sketched below. This allowed for the creation and distributed development of localization / mapping, computer vision, motion planning, and visualization packages.
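
Here is a minimal sketch of that loose coupling, assuming a ROS1 / rospy environment. The /battery_voltage topic and node names are illustrative only, and the publisher and subscriber would run as separate processes.

```python
# Minimal sketch of loose coupling via topics, assuming a ROS1 / rospy setup.
# Topic and node names are illustrative; the two nodes run as separate
# processes and share nothing beyond the topic name and message type.
import rospy
from std_msgs.msg import Float32

def battery_publisher():
    rospy.init_node('battery_monitor')
    pub = rospy.Publisher('/battery_voltage', Float32, queue_size=10)
    rate = rospy.Rate(1)  # publish at 1 Hz
    while not rospy.is_shutdown():
        pub.publish(Float32(data=24.1))  # value would come from hardware
        rate.sleep()

def on_voltage(msg):
    # Any number of nodes can subscribe without the publisher changing.
    rospy.loginfo("battery voltage: %.2f V", msg.data)

def battery_logger():
    rospy.init_node('battery_logger')
    rospy.Subscriber('/battery_voltage', Float32, on_voltage)
    rospy.spin()
```

Swapping in a different battery monitor, or adding a second subscriber for visualization, requires no change to the other node.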

ROS2 is an ambitious effort to transform ROS into a system more suitable for production development, rather than one focused on research. To this end, ROS2 is tackling:

  • Information security.
  • Deterministic process bring-up and monitoring.
  • Expansion of “peer” to include microcontrollers with the aim of allowing embedded processors to produce and subscribe to ROS2 topics without an intermediate bridge running on a Linux computer.
  • Going from a single master process for topic lookup to a master-less design.

I’ve been following ROS2 development, and more specifically the design thinking behind it, for the past few years. I find these pre-implementation discussions fascinating: they are a window into the “why” before the “what” and “how”. What surprises me is that, while I agree with all of the “why”, I find myself making different “what” choices and expanding the scope of what should be considered part of a robot software stack.

A new ecosystem of Distributed Software

The distributed software on a robot:

  • Discovers the locations and functionality of services (see the sketch after this list).
  • Logs data over time.
  • Has changing computational workloads over time as it moves between different environments and functions.
  • Must be secure against external attacks (like forged messages, unauthorized reconfiguration, unauthorized software updates, and inspection of protected data streams).
  • Has a single (yet distributed and robust) understanding of state.
  • Supports performance monitoring and debugging tools.
  • Has some form of configuration management, and supports either atomic software updates or versioned APIs to allow asynchronous updating of its various services.
  • Securely communicates (nominally over Ethernet) with local and remote devices (including computers and low-powered microcontrollers).
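
None of these needs is unique to robots. Taking the first item as an example, here is a hedged sketch of service registration and discovery built on a local Consul agent (Consul appears again in the component list below) and the python-consul client library; the camera-front service name, address, and port are hypothetical.

```python
# Hedged sketch of on-robot service discovery, assuming a local Consul agent
# and the python-consul client library. Service name, address, and port are
# illustrative only.
import consul

c = consul.Consul(host='127.0.0.1', port=8500)

# A camera driver registers itself and the port its API listens on.
c.agent.service.register(
    name='camera-front',
    service_id='camera-front-1',
    address='10.0.0.12',
    port=5005,
    tags=['sensor', 'camera'],
)

# A perception process looks up healthy instances instead of hard-coding IPs.
index, entries = c.health.service('camera-front', passing=True)
for entry in entries:
    svc = entry['Service']
    print('found %s at %s:%d' % (svc['Service'], svc['Address'], svc['Port']))
```

The perception process never hard-codes an address, and registering a Consul health check alongside the service would keep crashed drivers out of the passing=True results.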

To me, this sounds like the internet. To me, this sounds like cloud-native software design. So why is the robotics world reinventing the proverbial wheel? Why are we not adopting and adapting existing, scalable, and proven software for our domain?

ROS2 is moving in this direction by building on top of the Data Distribution Service (DDS) as middleware (the glue that binds together other pieces of software and allows them to communicate). In ROS2, DDS replaces ROS1’s custom serialization format, custom transport protocol, and custom central discovery mechanism. I am in full support of moving from custom software to existing standards, but DDS is a giant and monolithic beast. In many ways it is the opposite of the Unix Philosophy of “Do one thing and do it well” and of the modular and layered approach to software and standards that allowed modern Web Standards to be built.

I believe that the robotics community is better served by building on top of the massive investment being made in Cloud Native Computing, rather than the OMG standard for middleware (DDS) that has gone from v1.0 in 2004 to v1.4 in 2015 and that is based on CORBA and UML.

I’m not alone in shying away from large and complex middleware stacks. Tyler Treat, of Brave New Geek, is an engineer who has worked on a wide variety of distributed computing systems. He wrote Smart Endpoints, Dumb Pipes, and I draw your attention to these paragraphs:

People are easily seduced by “fat” middleware—systems with more features, more capabilities, more responsibilities—because they think it makes their lives easier, and it might at first. Pushing off more responsibility onto the infrastructure makes the application simpler, but it also makes the infrastructure more complex, more fragile, and more slow.

As you push more and more logic off onto more and more specialized middleware, you make it harder to move off it or change things. … Even with smart middleware, problems still leak out and you have to handle them at the edge—you’re now being taxed twice. This is essentially the end to end argument. Push responsibility to the edges, smart endpoints, dumb pipes, etc. It’s the idea that if you need business-level guarantees, build them into the business layer because the infrastructure doesn’t care about them.

DDS does not follow this philosophy, but many other pieces of infrastructure software do.

The beautiful thing about the modern cloud world is that the problems are so varied and so vast that no monolithic framework can be built fast enough, or with a full enough feature set, to gain significant momentum. That broad scope has resulted in bounded and modular software components, akin to the Unix philosophy of doing one thing and doing it well.

This, in turn, allows engineers the freedom to assemble the right mix of components for their computing, messaging, logging, and reporting needs. Sometimes a bit of glue is needed, or an extension to connect one piece of open source software to another. But the effort in those shims is minuscule in comparison to the value of building on top of these existing stacks.
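
To give a feel for the scale of such a shim, here is a hedged sketch (not production code) that forwards battery telemetry from a NATS subject into an InfluxDB measurement; both projects appear in the list below. It assumes the asyncio NATS client (nats-py) and the InfluxDB 1.x Python client, and the subject, database, and field names are illustrative.

```python
# Hedged sketch of a shim: forward telemetry published on a NATS subject into
# InfluxDB. Assumes the asyncio NATS client and the InfluxDB 1.x Python client;
# subject, database, and field names are illustrative, and the 'robot'
# database is assumed to already exist.
import asyncio
import json

from nats.aio.client import Client as NATS
from influxdb import InfluxDBClient

influx = InfluxDBClient(host='127.0.0.1', port=8086, database='robot')

async def main():
    nc = NATS()
    await nc.connect(servers=['nats://127.0.0.1:4222'])

    async def handler(msg):
        payload = json.loads(msg.data.decode())  # e.g. {"robot": "r2", "voltage": 24.1}
        influx.write_points([{
            'measurement': 'battery',
            'tags': {'robot': payload.get('robot', 'unknown')},
            'fields': {'voltage': float(payload['voltage'])},
        }])

    await nc.subscribe('telemetry.battery', cb=handler)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.run_forever()  # keep the subscription alive
```

A few dozen lines, and the robot’s telemetry is queryable with the same tools the cloud world already uses.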

Software Building Blocks for Robots

Note that this is not an exhaustive list of the possible tools, frameworks, & libraries that exist in the cloud world. I’ve tried to select software that is open source, lightweight, does not require other cloud infrastructure to operate (as many robotic systems do not have persistent and reliable internet connections), and has a community around it.

  • Process isolation (e.g. containers) :
    • Docker : Container runtime / engine
    • Balena : Moby-based Container Engine for Embedded, IoT, and Edge uses
    • Moby : Open framework to assemble specialized container systems
    • RKT : Security-minded, standards-based container engine
  • Distributed message busses :
    • NATS : High performance messaging system
    • NSQ : Distributed messaging platform
  • Data serialization for those message busses :
    • Protocol Buffers : Language-neutral, platform-neutral, extensible mechanism for serializing structured data.
    • FlatBuffers : Efficient cross-platform serialization library that allows access to serialized data without parsing or unpacking.
    • Cap’n Proto : A data interchange format and an in-memory representation, allowing access to structured data without unpacking.
  • Key Value Store / Truth Store :
    • FoundationDB : ACID transactions in a distributed database
    • CockroachDB : ACID transactions in a distributed database
  • Service Discovery :
    • CoreDNS : Plugin based DNS server designed for service discovery
    • HashiCorp Consul : Service Discovery & Failure Detection
  • Distributed tracing :
    • Jaeger : Open source, end-to-end distributed tracing. Monitor and troubleshoot transactions in complex distributed systems.
    • OpenZipkin : Gather timing data needed to troubleshoot latency problems in micro-service architectures. It manages both the collection and lookup of this data.
    • OpenCensus : Distribution of libraries that automatically collects traces and metrics from your app, displays them locally, and sends them to any analysis tool.
    • opentracing.io : Vendor-neutral APIs and instrumentation for distributed tracing.
  • Time series databases :
    • InfluxDB : Database designed specifically for time-series data. High performance with a powerful query language, and downsampling / retention policies.
    • TimescaleDB : Time-series DB built as a PostgreSQL extension
  • Data / Log Collectors :
  • Secret stores :
    • HashiCorp Vault : Centralized Secrets Management and Encryption as a Service
    • Trousseau : File based encrypted key-value store
  • Performance monitoring / alerting :
    • Prometheus : Systems monitoring and alerting toolkit. Has an internal TSDB engine with a metric-focused query and alerting system (see the sketch after this list).
    • Kapacitor : Real-Time Stream Processing Engine for InfluxDB, allowing for anomaly detection and flexible alerting and actioning.
  • Dynamic dashboards and metric visualization :
    • Grafana : Query, visualize, alert on and understand your metrics no matter where they are stored. Supports InfluxDB, Prometheus, MySQL, PostgreSQL, and many others.
    • Chronograf : Dashboard for InfluxDB and interface to Kapacitor.
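
To make the performance-monitoring row concrete, here is a hedged sketch using the official prometheus_client Python library; the metric names, port, and the simulated voltage source are illustrative only.

```python
# Hedged sketch: expose robot health metrics for Prometheus to scrape, using
# the official prometheus_client library. Metric names, port, and the fake
# voltage source are illustrative only.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

battery_voltage = Gauge('robot_battery_voltage_volts', 'Main battery voltage')
estop_events = Counter('robot_estop_events_total', 'Emergency stop activations')
# estop_events.inc() would be called from the robot's safety handler.

if __name__ == '__main__':
    start_http_server(9100)  # metrics served at http://<robot>:9100/metrics
    while True:
        battery_voltage.set(24.0 + random.uniform(-0.5, 0.5))  # stand-in for real telemetry
        time.sleep(5)
```

Prometheus scrapes that endpoint on its own schedule, and charting the values in Grafana or alerting on a low-voltage condition becomes configuration rather than robot code.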

Of course, not all cloud applications need all of this functionality, and not all robotic systems will either. But to me, it seems we should be building robot planning / processing / etc. services on top of these robust tools, and not rolling our own.
