A few more books

The Art of Immutable Architecture: Theory and Practice of Data Management in Distributed Systems by Michael L. Perry

To be honest I’m not sure what I thought about this book. It contains material on distributed systems and the difficulty of achieving consensus, and also explains how immutability helps us work around some of the problems. In particular there is a good discussion of using CRDTs. The author then goes on to describe his technique of historical modelling, which brings together immutability and eventual consistency to give us a way to architect distributed systems.
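To make the CRDT idea concrete, here is a minimal sketch of a grow-only counter (a G-Counter), one of the simplest CRDTs. It is not code from the book; the class and method names are my own assumptions. Each replica only increments its own slot, and merging takes the element-wise maximum, so replicas converge regardless of the order or number of merges.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments seen from that replica

    def increment(self, amount=1):
        # A replica only ever touches its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Max is commutative, associative and idempotent, which is what
        # lets replicas exchange state in any order and still converge.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
b.merge(a)  # merging the same state again changes nothing
```

After the merges both replicas hold the same state and report the same total.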

Staff Engineer: Leadership beyond the Management Track by Will Larson

I really enjoyed the first half of this book which describes the Staff Engineer role, in an attempt to define what it is. There is lots of good advice in this section about how to do the role well – how to influence, how to present ideas and how to get the role in the first place.

The second half of the book is a series of interviews with Staff Engineers from a number of companies, where the various people are asked about what the role means in their company and how they got the role. It also asks them what an average day looks like. I must admit that I found this part of the book hard work and gave up reading the stories.

There’s also a large set of references to articles and blog posts which a Staff Engineer should read. I found this list really useful and have been working my way through the references.

Category Theory for Programmers by Bartosz Milewski

I worked my way through this book (yet again) and I absolutely love it. The author explains the material really well, giving the motivations behind the various concepts and going into a good discussion of how they relate to programming. It won’t be long before I read it all again.

This book is put together from a series of blog posts, and the material is also available as an online set of lectures.

Posted in Uncategorized | Leave a comment

Stream Processing with Apache Flink

Stream Processing with Apache Flink: Fundamentals, Implementation and Operation of Streaming Applications by Fabian Hueske and Vasiliki Kalavri

I absolutely loved this book. I’d previously done a lot of reading about Streaming Systems, and after all of that theory it was good to see details about a concrete implementation.

The book starts with a good introductory chapter on stateful stream processing, talking about the different types of data processing and how systems evolved towards stream processing. The next chapter is on stream processing fundamentals and discusses dataflow graphs and the different semantics around time – event time and processing time. There is also a discussion on watermarks, which are required to allow the system to close windows and push results further through the pipeline.
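A rough sketch of how a watermark lets a system close event-time windows. This is an illustration only and not Flink's API: the constants `WINDOW` and `MAX_DELAY` and the function name are assumptions. The watermark here is simply the largest event time seen minus a fixed allowance for out-of-order data.

```python
from collections import defaultdict

WINDOW = 10     # tumbling window length in event-time units (assumed)
MAX_DELAY = 3   # how far out of order events are allowed to be (assumed)


def tumbling_counts(event_times):
    """Count events per tumbling window, emitting a window once the
    watermark passes its end. Events for already-closed windows are dropped.
    Returns (emitted windows in order, windows still open)."""
    open_windows = defaultdict(int)
    emitted = []
    watermark = float("-inf")
    for t in event_times:  # iteration order is arrival (processing) order
        start = (t // WINDOW) * WINDOW
        if start + WINDOW <= watermark:
            continue  # late event: its window has already been closed
        open_windows[start] += 1
        watermark = max(watermark, t - MAX_DELAY)
        # Close every window whose end is at or before the watermark.
        for s in sorted(w for w in open_windows if w + WINDOW <= watermark):
            emitted.append((s, open_windows.pop(s)))
    return emitted, dict(open_windows)


emitted, still_open = tumbling_counts([1, 4, 12, 3, 14, 2, 25])
```

In this run the event with time 3 arrives out of order but before the watermark reaches 10, so it is still counted; the event with time 2 arrives after the first window has closed and is dropped.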

The next chapter gives an overview of the Flink architecture, walking through the various components from the job manager to the various task managers. There is a good discussion of state and the consistent checkpointing mechanism that Flink offers as a way of allowing the stream to restart if something breaks in the pipeline. These processing streams are designed to run continuously for months, so we need a way to restart and get back to the current time if something goes wrong.

There’s a chapter on setting up a Flink development environment, with the authors showing you how to run a small example Flink application. After that we really get into the implementation. There is a chapter describing the DataStream API which talks about transformations on the streams, how the streams can be executed in parallel and how keys (which are used for partitioning) and rich functions can be defined. Rich functions offer a startup and a shutdown action as well as the standard method for processing data values as they pass through the stream. The authors then cover time-based and window-based operators, which also leads to a discussion of timers (used, for example, in session windows, where the window will shut if there is a sufficient time difference between some elements) and how the system handles late events (which arrive after the watermark has progressed).
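The session window idea can be sketched in a few lines. This is a batch simplification rather than how Flink does it (Flink works incrementally with timers and window merging); the function name and the choice that a difference of exactly `gap` stays in the same session are assumptions.

```python
def sessions(timestamps, gap):
    """Group event times into sessions: a new session starts whenever the
    time since the previous event is greater than `gap`."""
    result = []
    for t in sorted(timestamps):
        if result and t - result[-1][-1] <= gap:
            result[-1].append(t)   # close enough: extend the current session
        else:
            result.append([t])     # gap exceeded: the old session is shut
    return result


grouped = sessions([1, 2, 3, 10, 11, 30], gap=5)
```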

There is more detail about implementation in the next chapters on stateful operators, which show how you’d write your own operators, which need to interact with the checkpointing mechanism. The next chapter covers reading and writing to external systems – in order to achieve exactly-once semantics we need to have input sources that can be rewound, and transactional sinks to allow the data to be committed exactly once. Other guarantees are available depending on the sources and sinks, and there is a good discussion to illustrate this. This chapter talks about two common sources, files and Kafka, which both have desirable properties like restartability.
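A toy sketch of why a rewindable source and a transactional sink together give exactly-once output. None of this is Flink's API; the class, the function and the `fail_at` parameter are hypothetical. Writes are only made visible when a checkpoint completes, and after a failure the source is rewound to the last checkpointed offset.

```python
class TransactionalSink:
    """Buffers writes and only makes them visible on commit."""

    def __init__(self):
        self.committed = []
        self.pending = []

    def write(self, value):
        self.pending.append(value)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def abort(self):
        self.pending = []


def run(records, sink, start_offset, checkpoint_every, fail_at=None):
    """Process records[start_offset:], doubling each value.
    Returns the offset to resume from (the last checkpointed offset)."""
    checkpointed = start_offset
    for offset in range(start_offset, len(records)):
        if offset == fail_at:
            sink.abort()          # simulated crash: uncommitted writes are lost
            return checkpointed
        sink.write(records[offset] * 2)
        if (offset + 1) % checkpoint_every == 0:
            sink.commit()         # checkpoint: output and offset move together
            checkpointed = offset + 1
    sink.commit()
    return len(records)


records = [1, 2, 3, 4, 5]
sink = TransactionalSink()
resume = run(records, sink, 0, checkpoint_every=2, fail_at=3)  # crashes mid-run
final = run(records, sink, resume, checkpoint_every=2)         # rewind and retry
```

The record at offset 2 is processed twice, but its first result was never committed, so it appears in the output once.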

The last two chapters talk about the operational side of things, such as how you should set up a cluster for running Flink, and how you manage it for running long term.

The book is really good. Looking at the lower levels of implementation really helped me understand streaming systems in a lot more detail, and the book’s many examples make it quite clear what is happening.

[And if you are interested in how to get started with Kafka, see this blog post]


Let’s Flink about it some more

I’m still interested in the idea of streaming systems, despite one of the recent Software Daily podcasts suggesting that the industry is moving away from this style.

There’s a free O’Reilly book, Streaming Integration, that is an introduction to why streaming is important to a business. Getting accurate results quickly from data can be a big business advantage.

There are some videos from Flink Forward 2020 concerning the use of SQL as a query language for streaming data. And I have ordered some books on Flink to get a better understanding of how it all works.

Of course, lots of these systems use micro-batching to actually do the processing. It is, however, an interesting question how you take queries that use map, aggregates like sum, and join over multisets of data and work out how to efficiently recalculate the results given some extra input data. There’s an interesting implementation here and an explanation of how you might go about doing this, which is certainly worth a watch.
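For the simplest case, a per-key sum, the incremental recalculation can be sketched as below. This is my own illustration, not the implementation linked above; the class name and the delta format are assumptions. The input is treated as a multiset, and each change carries a signed multiplicity, so a retraction is just a negative change.

```python
from collections import Counter


class IncrementalSum:
    """Maintains SUM(value) GROUP BY key over a multiset of (key, value) rows."""

    def __init__(self):
        self.rows = Counter()  # (key, value) -> multiplicity in the multiset
        self.sums = Counter()  # key -> current sum

    def apply(self, delta):
        """delta: iterable of ((key, value), change), change being +n or -n.
        Updates the sums without rescanning old rows; returns the keys touched."""
        changed = set()
        for (key, value), change in delta:
            self.rows[(key, value)] += change
            self.sums[key] += value * change
            changed.add(key)
        return changed

    def recompute(self):
        # The slow, from-scratch answer, kept only to check the fast path.
        totals = Counter()
        for (key, value), count in self.rows.items():
            totals[key] += value * count
        return totals


q = IncrementalSum()
q.apply([(("a", 1), +1), (("a", 2), +1), (("b", 5), +1)])
touched = q.apply([(("a", 2), -1)])  # retract one row
```

Joins are harder because a change on one side has to be matched against the current state of the other side, but the same signed-delta idea applies.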

In other .NET related news, if you are thinking of porting an ASP.NET MVC application to .NET Core then this ebook is a good read, and this ebook is a discussion of how you might move an ASP.NET WebForms application to Blazor. If you think LINQ is amazing, Reaqtor has been made open source, and there are some good introductory blog posts about the extension of Rx to support remote services including the serialization of expression trees for transmission across the wire (Bonsai trees) and taking snapshots of service state to allow us to restart. The book, A Little History of Reaqtor, which is available on the front page of the website, tells the story of how this was all developed over the years.

Oh, and if you are interested in logic and proof theory, this podcast was a great historical overview of the development of the subject.


Processing unbounded data with SQL-like languages

I’ve been doing lots of reading and listening to various podcasts about processing unbounded data using streaming variants of SQL. I did a quick lightning talk at work on the subject.


And yet more links

The trouble with reading Hacker News and other sites every day is that I end up with a long list of links to interesting posts that tend to collect until I get a chance to read them. Here’s a set of links from recent days.

Container networking is simple which gives a set of ip and nsenter commands to understand how Docker sets up its networking.
Kubernetes apply v patch v replace which talks about why there are three different kubectl commands
crun, an OCI container runtime written in C
Writing your first Kubernetes operator
Kubernetes failure stories
Get a shell into a Kubernetes node
Kubernetes operator best practices
Docker image history modification
Docker without docker

A compile-time dependency injection framework that uses source generators to do the work
Target typed expressions in C#

Go modules

How Akka clusters work
Migrating millions of concurrent websockets to Envoy
How we scaled GitHub using a sharded rate limiter
Making MsQUIC blazingly fast

TLB and pagewalk coherence in x86 processors
All about thread local storage
Speculating the x86 instruction set
The microarchitecture behind Meltdown

Software development topics I’ve changed my mind on
Cupid, the back story

Column store in SQL Server

Combinators, a centennial view
The visitor pattern is Church Encoding


And a little gRPC

I recently did a lightning talk at work on gRPC. The slides are available here.


Some more books

As usual, I got a load of interesting books that I bought with credits I was given for Xmas.

gRPC Up and Running: Building Cloud Native Applications with Go and Java for Docker and Kubernetes by Kasun Indrasiri and Danesh Kuruppu

This is a fairly short book at less than 200 pages, but it gives a good introduction to gRPC, including many worked examples in both Go and Java. It talks you through some of the different ways that applications can interact, including brief coverage of Thrift and GraphQL, and then jumps into a worked example implemented in both Go and Java so you can see how the various interface specifications are mapped into the two languages. gRPC lets clients and servers stream values as part of a method invocation, so these streams need to manifest themselves naturally in the implementation language. The book then discusses how everything is implemented on top of the HTTP/2 protocol, and then looks at some advanced features like interceptors (for actions before and after send and receive, such as authorization), load balancing and deadlines (you can fail a call if it takes too long). There are then chapters on securing the communication channel, testing via a CI pipeline and some other useful projects.

C++ Move Semantics: The Complete Guide by Nicolai Josuttis

Move semantics seem like a valuable optimization in a call-by-value language, though the interaction where the moved-from object is put into a default state is still very weird to me. Anyway, this book does a good job of describing the semantics and the why of move semantics. I enjoyed it but it all feels very complicated, and it appears to be easy to get into the domain of undefined behaviour.

The Go Programming Language by Alan Donovan and Brian Kernighan

I have read this book in the past, but have just started writing Go at work, so thought I should give it another read. The book is a little old now and doesn’t cover parts of Go like the module system, though there are many blog posts that explain this part of the language. The book is rich with examples that help communicate the style of Go programming, and the authors are happy to express strong views on aspects such as testing and interfaces. This is a really good read and covers the language really well.

What We Cannot Know by Marcus du Sautoy

This book explores where human understanding currently ends and how far it might expand in the future in a number of domains, from quantum physics and cosmology to logic and artificial intelligence. It’s a good read and very interesting.


Getting up to speed on K8s

I’ve just moved teams at work, and that has given me a chance to work on a system that runs on top of Kubernetes. It’s been a while, maybe two years, since I last did anything serious with Kubernetes, so it was time to get out some of the old books.

Kubernetes Up and Running by Kelsey Hightower, Brendan Burns and Joe Beda is a great introduction to the main ideas behind Kubernetes, and the main components. The book is fairly short, but is full of simple examples to demonstrate ideas such as Services and Pods, and gives a great overview of Kubernetes and how to use it.

The one thing that it misses (for me) is how it all works under the covers, and I enjoyed Managing Kubernetes by Brendan Burns and Craig Tracey which gives you some insight into this with chapters on installing Kubernetes on a cluster, the API server and the scheduler which put things into perspective. The book is also aimed at describing how you maintain a real world K8s cluster and so there are additional chapters on user management, authorization, networking and disaster recovery.

The next important thing to learn about is Operators and how you can use custom resource definitions and the desired state reconciliation loop to define your own higher-level concepts. There are blog posts like this one that help to explain this.
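The reconciliation loop at the heart of an operator can be sketched without any Kubernetes machinery. This is a toy model, not the Kubernetes client API: the state is a plain dict from a resource name to a replica count, and the function names and action tuples are assumptions. The point is that the controller compares desired state with observed state and emits only the actions needed to close the gap.

```python
def reconcile(desired, actual):
    """Return the actions needed to move `actual` towards `desired`.
    Both arguments map a resource name to a replica count."""
    actions = []
    for name, replicas in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, replicas))
        elif actual[name] != replicas:
            actions.append(("scale", name, replicas))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, None))
    return actions


def apply_actions(actions, actual):
    """Stand-in for calling the cluster API: mutate the observed state."""
    for verb, name, replicas in actions:
        if verb == "delete":
            del actual[name]
        else:
            actual[name] = replicas


desired = {"web": 3, "db": 1}
actual = {"web": 1, "cache": 2}
plan = reconcile(desired, actual)
apply_actions(plan, actual)
```

Running `reconcile` again after the plan has been applied returns no actions, which is the steady state a real controller keeps returning to.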

You’re going to need some understanding of Docker to really understand how it all hangs together and I enjoyed Docker Up and Running by Kane and Matthias as a way to understand this. This talks a lot about how to build reliable containers for production use, and goes into some of the low level details of debugging containers.

It’s also worthwhile looking at some of the issues people have had with Kubernetes in the past like these failure stories – I ran into the missing anti-affinity rules problem in the past when I worked on a K8s system. There’s also an interesting article on how it might be designed now.

There are a couple of additional books such as Designing Distributed Systems by Brendan Burns which talks about some of the common patterns such as sidecars which are included into pods to add extra functionality. I also read Practical Microservices with Dapr and .NET by Davide Bedin which offers a more Microsoft focused view of using Kubernetes and the Distributed Application Runtime. This book covers the major ideas of Dapr like service discovery, state management and virtual actors and shows their use in the context of .NET Core applications.


MapReduce for the win

MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems by Donald Miner and Adam Shook

This book is fairly old now – the version I read was published in December 2012, and I was lucky enough to find it in a pile of books that I had been intending to read for some time. I’ve also been doing a lot of reading about processing Big Data using streaming systems, and have become interested in the so-called Stream-Table duality. These kinds of systems have the intention of being more real time than older systems designed using MapReduce, and so it’s good to read a book that highlights all of the great things that MapReduce can do.

The book is full of practical examples. It is aimed at Hadoop (1.0.3) and uses a Stack Overflow dataset for the example implementation of many of the patterns.

The first chapter of the book covers a brief history of MapReduce, and contains a pointer to the original paper, MapReduce: Simplified Data Processing on Large Clusters. The example in this chapter is the canonical word count in Hadoop.

This is then followed by chapters on various patterns.

Summarization patterns looks at ways of counting in order to analyse data – various metrics like average, min and max values, standard deviation and generating inverted indexes. This chapter also mentions the use of a combiner – using a reduce-like operation before the data has been shuffled in order to improve performance and cut down on the amount of data that needs to be transferred.
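The effect of a combiner is easy to see in a small in-memory simulation of word count. This is plain Python rather than Hadoop code, and the function names are assumptions; each inner list stands for the input split handled by one mapper, and `transferred` counts the key/value pairs that would cross the network in the shuffle.

```python
from collections import Counter, defaultdict


def map_phase(line):
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    # Reduce-like step run on the mapper's own output, before the shuffle.
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())


def word_count(splits, use_combiner):
    """Returns (final counts, number of pairs sent through the shuffle)."""
    shuffled = defaultdict(list)
    transferred = 0
    for split in splits:
        pairs = [p for line in split for p in map_phase(line)]
        if use_combiner:
            pairs = combine(pairs)
        transferred += len(pairs)
        for word, n in pairs:
            shuffled[word].append(n)
    # Reduce phase: sum the partial counts for each word.
    return {word: sum(ns) for word, ns in shuffled.items()}, transferred


splits = [["a b a", "a c"], ["b b", "c a"]]
plain_counts, plain_sent = word_count(splits, use_combiner=False)
combined_counts, combined_sent = word_count(splits, use_combiner=True)
```

Both runs produce the same counts, but the combined run shuffles 6 pairs instead of 9. This only works because addition is associative and commutative; an average, for example, needs the combiner to pass along sums and counts rather than partial averages.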

Filtering patterns looks at efficiently slimming down the data, and includes discussions on generating Top-N and using Bloom filters to make it possible to quickly get rid of data we are not interested in. For the latter pattern, we need to build the filter and then make it accessible to the machines doing the processing.
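A minimal Bloom filter sketch, to show the two steps of the pattern: build the filter from the keys of interest, then use it in the map phase to discard records. This is an illustration rather than the book's Hadoop code; the sizes, the use of SHA-256 for the hash positions and the sample data are all assumptions. A Bloom filter never gives a false negative but may give false positives, so anything it lets through may still need an exact check later.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive several hash positions by salting one strong hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


# Step 1: build the filter from the users we care about.
interesting = BloomFilter()
for user in ["alice", "bob"]:
    interesting.add(user)

# Step 2: each mapper uses the filter to drop records cheaply.
records = [("alice", 1), ("carol", 7), ("bob", 2), ("dave", 9)]
kept = [r for r in records if interesting.might_contain(r[0])]
```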

Data organisation patterns looks at efficiently partitioning, binning and sorting the data.

Join patterns looks at how we efficiently implement the various join patterns on top of MapReduce. This includes joining against the data set that we are processing, as well as joining against external static data.
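One of the join styles, the reduce-side join, can be sketched in memory as follows. Again this is plain Python rather than Hadoop code, with assumed names and sample data: the map step tags each record with the data set it came from and emits it under the join key, the shuffle groups by key, and the reduce step pairs up the two sides for each key.

```python
from collections import defaultdict


def reduce_side_join(users, comments):
    """Inner join on user id.
    users: iterable of (user_id, name); comments: iterable of (user_id, text)."""
    groups = defaultdict(lambda: {"U": [], "C": []})
    # Map + shuffle: tag each record with its source and group by join key.
    for user_id, name in users:
        groups[user_id]["U"].append(name)
    for user_id, text in comments:
        groups[user_id]["C"].append(text)
    # Reduce: for each key, emit the cross product of the two sides.
    return sorted(
        (user_id, name, text)
        for user_id, sides in groups.items()
        for name in sides["U"]
        for text in sides["C"]
    )


users = [(1, "ann"), (2, "bob")]
comments = [(1, "hi"), (1, "bye"), (3, "lost")]
joined = reduce_side_join(users, comments)
```

Keys that appear on only one side (user 2 and comment author 3 here) produce no output in an inner join; an outer join would emit them with a placeholder for the missing side.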

There are then two chapters that talk more about the meta issues: how jobs can be chained in Hadoop, and how we can join different MapReduce jobs together to make things more efficient. There is also a chapter on various input and output patterns in Hadoop.

The book was a quick but enjoyable read, and it makes it really clear how powerful the MapReduce model is, as it allows a vast number of processing methods to be expressed using the same framework.


Effective STL

Effective STL – 50 Specific Ways To Improve Your Use of the Standard Template Library by Scott Meyers

While I was at Facebook, I had the chance to write some C++ code, and C++ seems to have really moved on since I last wrote it. In particular, facilities like move semantics took some work to understand. The language now has some nice features like lambda expressions (though the lack of garbage collection makes them a little trickier to use than in C#).

This book is a little out of date, but the content was still good for helping me understand the STL, though I think I learned most of my C++ from watching lots of CppCon talks and reading various posts on this blog, which includes this guide to a feature of C++20.
