I was lucky to have a six-week sabbatical over the summer, and felt that it would be a good time to read up on the technologies behind some of the large-scale distributed systems that are around at the moment. This book is a great read for getting up to speed.
It has three sections. The first is on the foundations of data systems, and starts with a quick discussion of what the words reliability, scalability and maintainability actually mean. The book then moves on to the various data models, where the author discusses the birth of NoSQL, query languages and the various graph databases. The underlying storage implementations are covered, including B-trees, SSTables and LSM-trees, along with various indexing structures. The section finishes with a discussion of data encoding and evolution.
The second section covers distributed data, with chapters on replication, partitioning and the rather slippery notion of a transaction. Distributed systems can fail in many interesting ways, all covered in the next chapter, including some discussion of Byzantine faults. The final chapter in the section talks about consistency and consensus. Throughout the discussion the author is really happy to go into low-level implementation details, and every chapter has a list of references to papers that you can consult for more information.
The final section is on derived data: how do we process the mass of data that we have accumulated? The first chapter is on batch processing, which covers MapReduce and later variants. This is followed by a chapter on stream processing. The final chapter of the book sets out the author’s ideas for the future.
This book is a great read. It goes into loads of implementation details, which help the reader really get to grips with the ideas, though it might take more than a single read to understand everything that is covered.