Some Kafka and some algorithms

Mastering Kafka Streams and ksqlDB: Building real-time data systems by example by Mitch Seymour

I’m fascinated by the duality between streams and tables, the mixing of data at rest and data that flows through a pipeline. This was a great read to understand how these notions are realised inside Kafka. The book consists of a number of chapters that introduce the concepts and then use the concepts in a set of examples, which really brings out the understanding. The examples are also available on GitHub so you can try them out yourself.

The book starts with a rapid introduction to Kafka, going over topics and partitions, clusters and brokers. It then goes through the concepts behind Kafka Streams, looking at how they relate to scalability, reliability and maintainability. There is some discussion of processor topologies and a look at the differences between the high-level DSL and the lower-level Processor API. This chapter also covers the stream-table duality, looking at KStream and KTable. The next two chapters look at stateless and stateful processing, where the examples concern processing a Twitter stream and a video game leaderboard. This is followed by a chapter on Windows and Time, and then a chapter on advanced state management which includes a description of rebalancing. The last chapter in this section of the book looks in more detail at the Processor API.
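To make the stream-table duality concrete, here is a rough sketch of my own in plain Python. It is not code from the book and does not use the Kafka Streams API; the function names are hypothetical, and I am assuming that a `None` value stands in for a delete (tombstone).

```python
def stream_to_table(changelog):
    """Replay a changelog stream: the latest value per key is the table."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # assumed convention: None deletes the key
        else:
            table[key] = value
    return table


def table_to_stream(old, new):
    """Emit the change events that turn table `old` into table `new`."""
    events = [(key, None) for key in sorted(old.keys() - new.keys())]
    for key, value in new.items():
        if old.get(key) != value:
            events.append((key, value))
    return events


changelog = [("alice", 10), ("bob", 5), ("alice", 12), ("bob", None)]
table = stream_to_table(changelog)
print(table)  # {'alice': 12}
```

Replaying the events produced by `table_to_stream({}, table)` gives back the same table, which is the sense in which a table is a snapshot of a stream and a stream is the history of a table.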

Section two of the book looks at ksqlDB. In this section we use an extended SQL to express our queries. The first chapter looks at how ksqlDB differs from a standard SQL database, and is followed by a chapter on connectors, which are the interface you can use to get data from other systems and move it into this world. There are then two chapters on stream processing using ksqlDB.

The last section of the book contains a single chapter on Testing, Monitoring and Deployment.

This book is a really good read, and gives you a really good mental model of what is happening as Kafka runs.

Algorithms and Data Structures in Action by Marcello La Rocca

The book does a good job in the introduction of explaining why data structures matter. It then has three sections: improving over basic data structures, multidimensional queries, and planar graphs and minimal crossing numbers. It’s the choice of data structures that I really enjoyed – they are often extensions of the basic data structures we’ve all come across before, or more modern data structures applicable to large scale computing.

The basic data structures section covers priority queues (d-way heaps), treaps, Bloom filters, disjoint sets, tries and radix tries, and finally studies how to implement an LRU cache.
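As a reminder of what an LRU cache does, here is a minimal sketch of my own in Python. It is not the book's implementation: I lean on the standard library's `OrderedDict` to track recency, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()   # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # reading counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used
cache.put("c", 3)   # evicts "b"
```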

The multidimensional queries section looks at nearest neighbours, multidimensional data indexing using k-d trees, approximate nearest neighbours for image retrieval, applications of nearest neighbour search, clustering, and MapReduce and how it can be used to implement clustering.

The third section, on planar graphs, looks at minimum distance paths, graph embeddings, and uses this to study three ways of getting approximate solutions: gradient descent, simulated annealing and genetic algorithms. I thought this third section was really good, using the earlier problem to inspire the need for the approximate search algorithms.

The book contains working code that shows you how to implement the algorithms, which makes the explanations really clear, and all of the algorithms that the author chose are really interesting.

ADAPTIVE OPTIMIZATION FOR SELF: RECONCILING HIGH PERFORMANCE WITH EXPLORATORY PROGRAMMING by Urs Hölzle

This is a PhD thesis on the implementation of adaptive (JIT) compilation for the SELF language. It’s full of great ideas like polymorphic inline caches, dynamic de-optimization and on-stack replacement, which were all quite novel at the time. The thesis isn’t too long, and contains clear explanations and measurements that give you a good idea of the improvements.

Posted in Uncategorized | Leave a comment

And a summer of reading

It was great that the lockdown came to an end over the summer, and it was great to use the time away to read through some of the books that had queued up.

Reinforcement Learning: An Introduction by Richard S Sutton and Andrew G Barto

This is quite a large academic book, but a great read if you have the time. It starts simple and works its way through the various models for reinforcement learning, covering all of the variations. It sets out a model of policies and value functions, and then looks at the various algorithms that can be used to allow a computer to learn for itself by playing multiple scenarios. I liked the way it started with solvable Markov Decision Processes, and then generalized to the point where the algorithm is actually learning the value functions (which are implemented via a neural network). I may not have had the time to understand everything in detail, but it gave a great overview of how the magic happens.
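To give a flavour of the "solvable" end of that spectrum, here is a small value iteration sketch of my own in Python. The MDP is a toy I made up for illustration (a four-state corridor where stepping into the last state pays a reward of 1); none of it is taken from the book.

```python
def value_iteration(n_states=4, gamma=0.9, tol=1e-9):
    """Compute optimal state values for a toy corridor MDP (an assumption)."""
    terminal = n_states - 1
    values = [0.0] * n_states   # the terminal state keeps value 0

    def step(state, action):
        # Deterministic move left (-1) or right (+1), clamped to the corridor.
        nxt = min(max(state + action, 0), terminal)
        reward = 1.0 if nxt == terminal else 0.0
        return nxt, reward

    while True:
        delta = 0.0
        for s in range(terminal):
            # Bellman optimality backup: best one-step reward plus
            # discounted value of the state we land in.
            best = max(r + gamma * values[n]
                       for n, r in (step(s, a) for a in (-1, 1)))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


values = value_iteration()
print(values)  # approximately [0.81, 0.9, 1.0, 0.0]
```

The values fall off by a factor of `gamma` for each extra step away from the reward, which is the kind of structure that the later, approximate methods have to learn rather than compute exactly.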

A Philosophy of Software Design by John Ousterhout

To me this book was full of great advice and well reasoned arguments as to why the advice should be followed. I came across the book when reading a blog post that criticized “Clean Code”, which was an always recommended book in the past, but which had a few things that always seemed wrong to me (and that book doesn’t really give arguments for why its advice is good). That’s why I think this book is so good – the author discusses trade-offs and doesn’t come across as advocating some perfect solution.

Web Development with Blazor by Jimmy Engstrom

This book introduces you to Blazor by developing client-side and server-side Blazor applications that implement a blogging platform. Personally I found the book a little hard to follow, as the typesetting often made it unclear what the text was actually describing. However, it was a good basic introduction to Blazor, and the author can talk about his experiences developing Blazor applications.

Androids: The Team That Built the Android Operating System by Chet Haase

This book covers the history of Android, from the days before it was acquired by Google through to just after the first few public releases. I liked the history and the explanation of how some of the features like Intents came about.

Don’t Be Evil: The Case Against Big Tech by Rana Foroohar

I feel uneasy about how big tech is taking people’s data and using it for strange purposes. This book outlines why big tech is dangerous, from the ability to influence elections to the use of personal information in different scenarios. The book is well worth a read for understanding how we are at a turning point where more laws may be needed to protect the democracy that we have.

Software Architecture Patterns by Mark Richards

This was a free download from O’Reilly and gives some details about 5 architecture styles, from layered architectures to Microservices.

Software Engineering at Google: Lessons Learned from Programming Over Time

An amazing book, rich with advice about team processes and implementation at a company like Google. I guarantee that you will see something from your daily engineering job in a different light after reading this book.


Develop C# as if it were a dynamic language?

I’ve always been a massive fan of Edit and Continue, which used to be implemented as a method inside the CLR’s debugging APIs that allowed you to patch an assembly to change the metadata and IL instructions. In the past it was hard to figure out how to use it, but now that Roslyn lets us look into the compiler source, it is much easier to see how the metadata and IL patch files are generated.

Anyway, this whole feature has been rebranded as “Hot Reload” and the API is available as a standard C# library method in order to support “dotnet watch”. I did a lightning talk at work on how this works under the covers, and the slides are here. The GitHub repository also contains some examples of the new experience.


A few more books

The Art of Immutable Architecture: Theory and Practice of Data Management in Distributed Systems by Michael L. Perry

To be honest I’m not sure what I thought about this book. It contains material on distributed systems and the difficulty of achieving consensus, and also explains how immutability helps us work around some of the problems. In particular there is a good discussion of using CRDTs. The author then goes on to describe his technique of historical modelling, which brings together immutability and eventual consistency to give us a way to architect distributed systems.

Staff Engineer: Leadership beyond the Management Track by Will Larson

I really enjoyed the first half of this book which describes the Staff Engineer role, in an attempt to define what it is. There is lots of good advice in this section about how to do the role well – how to influence, how to present ideas and how to get the role in the first place.

The second half of the book is a series of interviews with Staff Engineers from a number of companies, where the various people are asked about what the role means in their company and how they got the role. It also asks them what an average day looks like. I must admit that I found this part of the book hard work and gave up reading the stories.

There’s also a large set of references to articles and blog posts which a Staff Engineer should read. I found this list really useful and have been working my way through the references.

Category Theory for Programmers by Bartosz Milewski

I worked my way through this book (yet again) and I absolutely love the book. The author explains the material really well, giving the motivations behind the various concepts and going into a good discussion of how they relate to programming. It won’t be long before I read it all again.

This book is put together from a series of blog posts, and the material is also available as an online set of lectures.


Stream Processing with Apache Flink

Stream Processing with Apache Flink: Fundamentals, Implementation and Operation of Streaming Applications by Fabian Hueske and Vasiliki Kalavri

I absolutely loved this book. I’d previously done a lot of reading about Streaming Systems, and after all of that theory it was good to see details about a concrete implementation.

The book starts with a good introductory chapter on stateful stream processing, talking about the different types of data processing and how systems evolved towards stream processing. The next chapter is on stream processing fundamentals and discusses dataflow graphs and the different semantics around time – event time and processing time. There is also a discussion on watermarks, which are required to allow the system to close windows and push results further through the pipeline.
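To illustrate how a watermark lets the system close event-time windows, here is a simplified sketch of my own in plain Python. It does not use Flink's API; the function name, the fixed out-of-orderness bound and the sample events are all assumptions made for the example.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size, max_out_of_orderness):
    """Sum values per tumbling event-time window.

    `events` is a list of (event_time, value) pairs in arrival order.
    Returns (results, late): closed windows as (window_start, total) in
    the order they were emitted, plus events that arrived too late.
    """
    open_windows = defaultdict(int)
    results, late = [], []
    watermark = float("-inf")
    for event_time, value in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            late.append((event_time, value))   # its window already closed
        else:
            open_windows[start] += value
        # The watermark trails the highest event time seen by a fixed bound.
        watermark = max(watermark, event_time - max_out_of_orderness)
        # Any window that ends at or before the watermark can be emitted.
        closable = sorted(w for w in open_windows if w + window_size <= watermark)
        for w in closable:
            results.append((w, open_windows.pop(w)))
    return results, late


events = [(1, 10), (4, 20), (3, 5), (12, 1), (2, 7), (25, 2), (8, 100)]
results, late = tumbling_window_sums(events, window_size=10, max_out_of_orderness=5)
print(results)  # [(0, 42), (10, 1)]
print(late)     # [(8, 100)]
```

The event at time 2 arrives out of order but is still counted, because the watermark had only reached 7 by then. The event at time 8 arrives after the watermark has passed 10, so it is reported as late. The window starting at 20 stays open because no later watermark has closed it.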

The next chapter gives an overview of the Flink architecture, walking through the various components from the job manager to the task managers. There is a good discussion of state and the consistent checkpointing mechanism that Flink offers as a way of allowing the stream to restart if something breaks in the pipeline. These processing streams are designed to run continuously for months, so we need a way to restart and get back to the current time if something goes wrong.

There’s a chapter on setting up a Flink development environment, with the authors showing you how to run a small example Flink application. After that we really get into the implementation. There is a chapter describing the DataStream API which talks about transformations on the streams, how the streams can be executed in parallel and how keys (which are used for partitioning) and rich functions can be defined. Rich functions offer a start-up and shutdown action as well as the standard method for processing data values as they pass through the stream. The authors then cover time-based and window-based operators, which also leads to a discussion of timers (used for example in session windows, where the window will close if there is a sufficient time difference between some elements) and how the system handles late events (which arrive after the watermark has progressed).
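The session window idea is easy to sketch outside of Flink. The following is my own plain-Python illustration, not the DataStream API: the function name is hypothetical, it works on a complete list of timestamps rather than a live stream, and I am assuming that two elements share a session when they are strictly less than `gap` apart.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least `gap` of inactivity.

    Returns a list of (first_timestamp, last_timestamp, element_count).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)   # still within the current session
        else:
            sessions.append([t])     # gap exceeded: start a new session
    return [(s[0], s[-1], len(s)) for s in sessions]


print(session_windows([1, 2, 3, 10, 11, 30], gap=5))
# [(1, 3, 3), (10, 11, 2), (30, 30, 1)]
```

In a real streaming system the sessions cannot be computed from a sorted list like this, which is where the timers come in: something has to fire when the gap has elapsed without a new element arriving.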

There is more detail about implementation in the next chapters on stateful operators, which show how you’d write your own operators, which need to interact with the checkpointing mechanism. The next chapter covers reading and writing to external systems – in order to achieve exactly-once semantics we need to have input sources that can be rewound, and transactional sinks to allow the data to be committed exactly once. Other guarantees are available depending on the sources and sinks, and there is a good discussion to illustrate this. This chapter talks about two common sources, files and Kafka, which both have desirable properties like restartability.
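Here is a heavily simplified sketch of my own, in Python, of why a rewindable source plus a transactional sink gives exactly-once output. It is not Flink's checkpointing or two-phase commit protocol; the class, the function and the simulated crash are all hypothetical.

```python
class TransactionalSink:
    """Buffers writes until commit; abort throws the buffer away."""

    def __init__(self):
        self.committed = []
        self._pending = []

    def write(self, record):
        self._pending.append(record)

    def commit(self):
        self.committed.extend(self._pending)
        self._pending.clear()

    def abort(self):
        self._pending.clear()


def run(source, sink, start_offset, checkpoint_every, fail_at=None):
    """Process source[start_offset:] and return the offset to resume from.

    The sink is committed together with the source offset at each
    checkpoint. `fail_at` simulates a crash just before that offset.
    """
    offset = committed_offset = start_offset
    try:
        while offset < len(source):
            if offset == fail_at:
                raise RuntimeError("simulated crash")
            sink.write(source[offset] * 2)
            offset += 1
            if offset % checkpoint_every == 0:
                sink.commit()
                committed_offset = offset
        sink.commit()
        return len(source)
    except RuntimeError:
        sink.abort()             # uncommitted output is discarded
        return committed_offset  # rewind the source to the last checkpoint


source = [1, 2, 3, 4, 5]
sink = TransactionalSink()
resume_from = run(source, sink, 0, checkpoint_every=2, fail_at=3)
final_offset = run(source, sink, resume_from, checkpoint_every=2)
print(sink.committed)  # [2, 4, 6, 8, 10]
```

The record at offset 2 is processed twice, once before the crash and once after the rewind, but its first output was never committed, so it appears in the sink exactly once.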

The last two chapters talk about the operational side of things, such as how you should set up a cluster for running Flink, and how you manage it for running long term.

The book is really good. Looking at the lower levels of implementation really helped me understand streaming systems in a lot more detail, and the book’s many examples make it quite clear what is happening.

[And if you are interested in how to get started with Kafka, see this blog post]


Let’s Flink about it some more

I’m still interested in the idea of streaming systems, despite one of the recent Software Daily podcasts suggesting that the industry is moving away from this style.

There’s a free O’Reilly book, Streaming Integration, that is an introduction to why streaming is important to a business. Getting accurate results quickly from data can be a big business advantage.

There are some videos from Flink Forward 2020 concerning the use of SQL as a query language for streaming data. And I have ordered some books on Flink to get a better understanding of how it all works.

Of course, lots of these systems use micro-batching to actually do the processing. It is, however, an interesting question how you take queries using map, aggregates like sum, and join over multisets of data and work out how to efficiently recalculate results given some extra input data. There’s an interesting implementation here and an explanation of how you might go about doing this, which is certainly worth a watch.
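For the simplest case, a grouped sum, the incremental idea can be sketched in a few lines of Python. This is my own toy, not the implementation linked above: the class name is hypothetical, and I am assuming that changes arrive as (key, value, multiplicity) triples where a multiplicity of +1 inserts a row into the multiset and -1 retracts one.

```python
from collections import defaultdict


class IncrementalSum:
    """Maintains a per-key SUM over a multiset under inserts and retractions."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.counts = defaultdict(int)

    def apply(self, delta):
        """Apply a batch of changes and return only the keys whose result changed.

        A result of None means the key no longer has any rows.
        """
        changed = set()
        for key, value, multiplicity in delta:
            self.totals[key] += value * multiplicity
            self.counts[key] += multiplicity
            changed.add(key)
        output = {}
        for key in changed:
            if self.counts[key] == 0:
                del self.totals[key], self.counts[key]
                output[key] = None
            else:
                output[key] = self.totals[key]
        return output


view = IncrementalSum()
first = view.apply([("a", 3, 1), ("b", 4, 1), ("a", 5, 1)])
second = view.apply([("a", 3, -1), ("b", 4, -1)])
print(first)   # {'a': 8, 'b': 4} (key order may vary)
print(second)  # {'a': 5, 'b': None} (key order may vary)
```

Each batch costs work proportional to the size of the change, not the size of the data seen so far. Sum is the easy case because it has an inverse; aggregates like max, and joins, need more state to be kept around.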

In other .NET related news, if you are thinking of porting an ASP.NET MVC application to .NET Core then this ebook is a good read, and this ebook is a discussion of how you might move an ASP.NET WebForms application to Blazor. If you think LINQ is amazing, Reaqtor has been made open source, and there are some good introductory blog posts about the extension of Rx to support remote services including the serialization of expression trees for transmission across the wire (Bonsai trees) and taking snapshots of service state to allow us to restart. The book, A Little History of Reaqtor, which is available on the front page of the website, tells the story of how this was all developed over the years.

Oh, and if you are interested in logic and proof theory, this podcast was a great historical overview of the development of the subject.


Processing unbounded data with SQL-like languages

I’ve been doing lots of reading and listening to various podcasts about processing unbounded data using streaming variants of SQL. I did a quick lightning talk at work on the subject.


And yet more links

The trouble with reading Hacker News and other sites every day is that I end up with a long list of links to interesting posts that tend to collect until I get a chance to read them. Here’s a set of links from recent days.

Container networking is simple, which gives a set of ip and nsenter commands to understand how Docker sets up its networking.
Kubernetes apply v patch v replace which talks about why there are three different kubectl commands
crun, an OCI container runtime written in C
Writing your first Kubernetes operator
Kubernetes failure stories
Get a shell into a Kubernetes node
Kubernetes operator best practices
Docker image history modification
Docker without docker

A compile-time dependency injection framework that uses source generators to do the work
Target typed expressions in C#

Go modules

How Akka clusters work
Migrating millions of concurrent websockets to Envoy
How we scaled GitHub using a sharded rate limiter
Making MsQuic blazingly fast

TLB and pagewalk coherence in x86 processors
All about thread local storage
Speculating the x86 instruction set
The microarchitecture behind Meltdown

Software development topics I’ve changed my mind on
Cupid, the back story

Column store in SQL Server

Combinators, a centennial view
The visitor pattern is Church Encoding


And a little gRPC

I recently did a lightning talk at work on gRPC. The slides are available here.


Some more books

As usual, I got a load of interesting books that I bought with credits I was given for Xmas.

gRPC Up and Running: Building Cloud Native Applications With Go and Java for Docker and Kubernetes by Kasun Indrasiri and Danesh Kuruppu

This is a fairly short book at less than 200 pages, but gives a good introduction to gRPC, including many worked examples in both Go and Java. It talks you through some of the different ways that applications can interact, including brief coverage of Thrift and GraphQL, and then jumps into a worked example implemented in both Go and Java so you can see how the various interface specifications are mapped into the two languages. gRPC lets clients and servers stream values as part of a method invocation, so these streams need to manifest themselves naturally in the implementation language. The book then discusses how everything is implemented on top of the HTTP/2 protocol, and then looks at some advanced features like interceptors (for before and after send and receive actions like authorization), load balancing and deadlines (you can fail a call if it takes too long). There are then chapters on securing the communication channel, testing via a CI pipeline and some other useful projects.

C++ Move Semantics: The Complete Guide by Nicolai Josuttis

Move semantics seem like a valuable optimization in a call-by-value language, though the interaction where the moved-from object is put into a default state is still very weird to me. Anyway, this book does a good job of describing the semantics and the why of move semantics. I enjoyed it, but it all feels very complicated, and it appears to be easy to get into the domain of undefined behaviour.

The Go Programming Language by Alan Donovan and Brian Kernighan

I have read this book in the past, but have just started writing Go at work, so thought I should give it another read. The book is a little old now and doesn’t cover parts of Go like the module system, though there are many blog posts that explain this part of the language. The book is rich with examples that help communicate the style of Go programming, and the authors are happy to express strong views on aspects such as testing and interfaces. This is a really good read and covers the language really well.

What We Cannot Know by Marcus du Sautoy

This book explores where human understanding currently ends and how far it might expand in the future in a number of domains, from quantum physics and cosmology to logic and artificial intelligence. It’s a good read and very interesting.
