Site Reliability Engineering at Starship | by Martin Pihlak | Starship Technologies
Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but much of it actually runs in the backend: remote control, route finding, matching robots to customers and fleet health management, as well as interactions with customers and merchants. All of this needs to run 24×7, without interruptions, and scale dynamically to match the workload.
SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDb is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging, Kafka is the platform of choice, and we use it for almost everything except shipping video streams from robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.
A good portion of SRE time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, be it fine-tuning autoscaling settings, adding Pod disruption policies or optimizing Spot instance usage. Sometimes it is like laying bricks: simply installing a Helm chart to provide specific functionality. But often the “bricks” have to be carefully picked and evaluated (is Loki good for log management, is a service mesh worth having and, if so, which one), and occasionally the functionality does not exist anywhere and has to be written from scratch. When this happens we usually turn to Python and Golang, but also Rust and C if necessary.
Another big piece of infrastructure that SRE is responsible for is data and databases. Starship started with a single monolithic MongoDb, a strategy that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting robots by the thousand. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering and microservice database architecture. On top of that, we are constantly developing tools and automation to manage the current database infrastructure. Examples: adding MongoDb observability with a custom sidecar proxy that analyzes database traffic, enabling PITR support for databases, automating regular failover and recovery tests, collecting metrics for Kafka re-sharding, and enabling data retention.
Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship’s production. While SRE is occasionally called upon to deal with infrastructure outages, the more impactful work is done in preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having a rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!
A day in the life of an SRE
Arriving at work some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there is anything interesting there.
Find that MongoDb connection latencies spiked during the night. Digging into the Prometheus metrics with Grafana, find that this happens while the backups are running. Why is this suddenly a problem when we have been running those backups for years? It turns out that we compress the backups very aggressively to save on network and storage costs, and this consumes all available CPU. It looks like the load on the database has grown just enough to make this noticeable. This happens on a standby node and does not affect production, but it is still a problem should the primary fail. Add a Jira item to fix it.
In passing, change the MongoDb prober code (Golang) to add more histogram buckets to get a better understanding of the latency distribution. Run a Jenkins pipeline to put the new probe into production.
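To illustrate why extra buckets help, here is a minimal sketch of Prometheus-style cumulative bucket counting. The actual prober is written in Go; this uses plain Python purely for illustration, and the bucket bounds and sample latencies are hypothetical, not Starship's real configuration.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (assumptions for illustration).
COARSE_BUCKETS = [0.1, 1.0, 10.0]
FINE_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def bucket_counts(latencies, upper_bounds):
    """Return cumulative counts per bucket, Prometheus style ("le" = less or equal).

    The final entry is the implicit +Inf bucket, which counts every observation.
    """
    per_bucket = [0] * (len(upper_bounds) + 1)
    for value in latencies:
        # bisect_left returns the index of the first bound that is >= value;
        # values above every bound land in the extra +Inf slot.
        per_bucket[bisect_left(upper_bounds, value)] += 1

    cumulative = []
    running = 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return cumulative


# Made-up connection latencies in seconds.
SAMPLE_LATENCIES = [0.004, 0.008, 0.03, 0.04, 0.2, 0.3, 0.9, 4.0]

coarse = bucket_counts(SAMPLE_LATENCIES, COARSE_BUCKETS)
fine = bucket_counts(SAMPLE_LATENCIES, FINE_BUCKETS)
```

With the coarse bounds, half of the samples collapse into the single "up to 0.1 s" bucket. The finer bounds separate the 5 ms connections from the 50 ms ones, which is the kind of detail needed to see how a latency spike is distributed.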
At 10 a.m. there is a standup meeting: share your updates with the team and learn what others have been up to, such as setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDb connectivity issues, and piloting canary deployments with Flagger.
After the meeting, resume the planned work for the day. One of the things I planned to do today was to set up an additional Kafka cluster in a test environment. We run Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available now? No, not going there: too much magic, and I want more explicit control over my statefulsets. Raw YAML it is. An hour and a half later a new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a config change. Generating the credentials for the applications required a small bash script to set up the accounts in Zookeeper. One bit left dangling was setting up Kafka Connect to capture database change log events: it turns out that the test databases are not running in ReplicaSet mode and Debezium cannot get the oplog from them. Backlog this and move on.
Now it is time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of our systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I will set up a load test with hey to overload the microservice for route calculations. Deploy it as a Kubernetes job called “haymaker” and hide it well enough that it does not immediately show up in the Linkerd service mesh (yes, evil 😈). Later, run the “Wheel” exercise and take note of any gaps we have in playbooks, metrics, alerts and so on.
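For readers who have not used a load generator, the core idea can be sketched in a few lines: fire many concurrent requests at one endpoint and summarize failures and latency. This is only an illustrative stand-in for a real tool, written with the Python standard library. The names `fetch` and `run_load` are hypothetical, and nothing here reflects how the actual “haymaker” job is built.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import urlopen


def fetch(url, timeout=5.0):
    """Issue one GET request and return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as response:
            response.read()
            ok = 200 <= response.status < 300
    except (OSError, HTTPException):
        # URLError, HTTPError (4xx/5xx) and socket timeouts are OSError subclasses.
        ok = False
    return ok, time.monotonic() - start


def run_load(request_fn, total_requests, concurrency):
    """Call request_fn total_requests times using `concurrency` worker threads."""
    if total_requests < 1:
        raise ValueError("total_requests must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: request_fn(), range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    return {
        "requests": total_requests,
        "failures": failures,
        "p50": latencies[len(latencies) // 2],  # approximate median
        "max": latencies[-1],
    }


# Example call against a hypothetical test endpoint (not a real Starship URL):
#   report = run_load(lambda: fetch("http://route-service.test/healthz"), 1000, 50)
```

Passing the request as a callable keeps the concurrency logic separate from the HTTP details, so the same harness can be exercised with a stub before it is pointed at a real service.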
In the last few hours of the day, block all interruptions and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as a streaming asynchronous one (Rust + Tokio) and want to find out how well it works with real data. It turns out there is a bug somewhere in the guts of the parser and I need to add deep logging to figure it out. Find a wonderful tracing library for Tokio and get carried away with it…
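The essence of a streaming parser is that it must cope with documents arriving in arbitrary chunks. As a rough sketch of that idea (synchronous Python rather than the Rust + Tokio code described above, and not the actual Mongoproxy implementation), the hypothetical `BsonFramer` below relies only on the BSON framing rule: every document starts with a little-endian int32 holding its total size and ends with a 0x00 byte. It assumes a stream of back-to-back BSON documents and ignores the MongoDB wire protocol headers that a real proxy would also have to handle.

```python
import struct


class BsonFramer:
    """Split an arbitrarily chunked byte stream into complete BSON documents."""

    MIN_DOC_SIZE = 5  # 4-byte length prefix plus the trailing 0x00 terminator

    def __init__(self, max_doc_size=16 * 1024 * 1024):
        # 16 MiB matches MongoDB's documented maximum BSON document size.
        self._buffer = bytearray()
        self._max_doc_size = max_doc_size

    def feed(self, chunk):
        """Append a chunk and return every document that is now complete."""
        self._buffer.extend(chunk)
        documents = []
        while len(self._buffer) >= 4:
            # The prefix is a little-endian int32 covering the whole document.
            (size,) = struct.unpack_from("<i", self._buffer, 0)
            if not self.MIN_DOC_SIZE <= size <= self._max_doc_size:
                raise ValueError(f"implausible BSON document size: {size}")
            if len(self._buffer) < size:
                break  # incomplete document, wait for more bytes
            document = bytes(self._buffer[:size])
            if document[-1] != 0:
                raise ValueError("BSON document is missing its 0x00 terminator")
            documents.append(document)
            del self._buffer[:size]
        return documents
```

Calling `feed` with each chunk read from a socket returns zero or more whole documents and keeps any partial tail buffered for the next call. An async version would wrap the same state machine around an awaitable read.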
Disclaimer: the events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with co-workers have been edited out. We are hiring.