I’m fascinated by the duality between streams and tables, the mixing of data at rest and data that flows through a pipeline. This was a great read for understanding how these notions are realised inside Kafka. The book consists of a number of chapters that introduce the concepts and then use them in a set of examples, which really makes the understanding stick. The examples are also available on GitHub so you can try them out yourself.
The book starts with a rapid introduction to Kafka, going over topics and partitions, clusters and brokers. It then goes through the concepts behind Kafka Streams, looking at how they relate to scalability, reliability and maintainability. There is some discussion of processor topologies, and a look at the differences between the high-level DSL and the lower-level Processor API. This chapter also looks at the stream-table duality, introducing KStream and KTable. The next two chapters look at stateless and stateful processing, where the examples concern processing a Twitter stream and a video game leaderboard. This is followed by a chapter on Windows and Time, and then a chapter on advanced state management which includes a description of rebalancing. The last chapter in this section of the book looks in more detail at the Processor API.
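The stream-table duality the book describes can be illustrated with a toy sketch (this is my illustration of the idea, not Kafka's actual API): a table is what you get by folding a stream of keyed events, keeping the latest value per key.

```python
# Toy sketch of the stream-table duality (not Kafka's API):
# a table is the latest value per key, folded from a stream of events.

def fold_stream_into_table(events):
    """Fold a stream of (key, value) events into a table; latest value wins."""
    table = {}
    for key, value in events:
        table[key] = value
    return table

# A stream of score updates, in the spirit of the book's leaderboard example.
stream = [("alice", 10), ("bob", 5), ("alice", 25)]
table = fold_stream_into_table(stream)
# table is now {"alice": 25, "bob": 5}
```

Going the other way, emitting each table update as a changelog event turns the table back into a stream, which is the other half of the duality.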
Section two of the book looks at ksqlDB. In this section we use an extended SQL to express our queries. The first chapter looks at how ksqlDB differs from a standard SQL database, and is followed by a chapter on connectors, the interface for getting data from other systems and moving it into this world. There are then two chapters on stream processing using ksqlDB.
The last section of the book contains a single chapter on Testing, Monitoring and Deployment.
This book is a really good read, and gives you a clear mental model of what is happening as Kafka runs.
Algorithms and Data Structures in Action by Marcello La Rocca
The book does a good job in the introduction of explaining why data structures matter. It then has three sections: improving on basic data structures; multidimensional queries; and planar graphs and minimal crossing numbers. It’s the choice of data structures that I really enjoyed – they are often extensions of the basic data structures we’ve all come across before, or more modern data structures applicable to large-scale computing.
The basic data structures section covers priority heaps (d-way heaps), treaps, Bloom filters, disjoint sets, tries and radix tries, and finally studies how to implement an LRU cache.
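To give a flavour of one of these structures, here is a minimal Bloom filter sketch (my illustration, not the book's code): k hash functions set k bits per added item, so lookups can give false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: probabilistic set membership."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(item))
```

The trade-off is purely between memory (`size`) and the false-positive rate, which is what makes the structure so useful at large scale.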
The multidimensional queries section looks at nearest neighbours, multidimensional data indexing using k-d trees, approximate nearest neighbours for image retrieval, applications of nearest neighbour search, clustering, and how MapReduce can be used to implement clustering.
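The baseline these chapters improve on is the brute-force nearest-neighbour query, sketched below (illustrative only – a k-d tree, as the book describes, answers the same query without scanning every point in low dimensions).

```python
import math

def nearest_neighbour(points, query):
    """Return the point closest to `query` by Euclidean distance (O(n) scan)."""
    return min(points, key=lambda p: math.dist(p, query))

points = [(0, 0), (5, 5), (2, 1)]
closest = nearest_neighbour(points, (1, 1))
# closest is (2, 1)
```

A k-d tree trades this linear scan for a spatial partition of the points, pruning whole subtrees whose bounding regions are farther than the best candidate found so far.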
The third section, on planar graphs, looks at minimum-distance paths and graph embeddings, and uses these to motivate three ways of getting approximate solutions: gradient descent, simulated annealing and genetic algorithms. I thought this third section was really good, using the earlier problems to inspire the need for the approximate search algorithms.
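Simulated annealing, the second of those three techniques, fits in a few lines. This is my own toy sketch, not the book's code: improvements are always accepted, while worse moves are accepted with probability exp(-delta/temperature), which decays as the temperature cools, letting the search escape local minima early on and settle later.

```python
import math
import random

def simulated_annealing(cost, start, steps=5000, temp=10.0, cooling=0.999, seed=0):
    """Minimise `cost` from `start` by a random walk with cooling acceptance."""
    rng = random.Random(seed)
    x, best = start, start
    for _ in range(steps):
        candidate = x + rng.uniform(-1, 1)        # propose a nearby move
        delta = cost(candidate) - cost(x)
        # Always accept improvements; accept worse moves with decaying probability.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        if cost(x) < cost(best):
            best = x
        temp = max(temp * cooling, 1e-9)          # cool the temperature
    return best

# Minimise a simple one-dimensional cost function with minimum at x = 3.
result = simulated_annealing(lambda x: (x - 3) ** 2, start=-10.0)
```

The cooling schedule (`temp`, `cooling`) is the knob that trades exploration against convergence, and picking it well is much of the craft.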
The book contains working code that shows you how to implement the algorithms, which makes the explanations really clear, and all of the algorithms that the author chose are really interesting.
This is a PhD thesis on the implementation of adaptive (JIT) compilation for the SELF language. It’s full of great ideas like polymorphic inline caches, dynamic de-optimization and on-stack replacement, which were all quite novel at the time. The thesis isn’t too long, and contains clear explanations and measurements that give you a good idea of the improvements.
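The core idea behind a polymorphic inline cache can be sketched conceptually (this is my illustration of the idea, not SELF's actual machine-level implementation): each call site remembers the methods it has resolved for the receiver types it has actually seen, so repeated calls skip the generic lookup.

```python
class CallSite:
    """Conceptual polymorphic inline cache: one method cache per call site."""

    def __init__(self, message):
        self.message = message
        self.cache = {}   # receiver type -> resolved method, for this site only

    def send(self, receiver):
        receiver_type = type(receiver)
        method = self.cache.get(receiver_type)
        if method is None:
            # Slow path: full method lookup, then extend this site's cache.
            method = getattr(receiver_type, self.message)
            self.cache[receiver_type] = method
        return method(receiver)

class Square:
    def __init__(self, s): self.s = s
    def area(self): return self.s * self.s

class Circle:
    def __init__(self, r): self.r = r
    def area(self): return 3.14159 * self.r * self.r

site = CallSite("area")
areas = [site.send(shape) for shape in [Square(2), Circle(1), Square(3)]]
# The site has now cached one entry per receiver type it has seen.
```

A side benefit the thesis exploits is that the cache doubles as per-call-site type feedback, which the adaptive compiler can use to drive inlining decisions.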