I just finished reading Martin Kleppmann’s book ”Designing Data-Intensive Applications”. Reading it cover to cover takes time, a lot of time, but it is definitely worth it.
It is not a book for anybody, it is a book for software developers and designers who work with systems that handle a lot of data and have high demands on availability and data integrity. The book is quite technical so it is preferable if you have a good understanding of the infrastructure of your system and the different needs it is supposed to meet.
The book is divided into three different parts, each made up of several chapters. The first part is called Foundations of Data Systems and consists of four different chapters. Examples of topics being explained are Reliability, Scalablity, and Maintainability, SQL and NoSQL, datastructures used in databases, different ways of encoding data, and modes of dataflow. The details are quite complex and you will most probably not be able to just scim over the pages and expect to be able to follow along.
The second part is called Distributed Data and has five chapters. It discusses Replication, Partitioning, Transactions, Troubles with Distributed Systems, and Consistency and Consensus. After reading this part of the book I remembered thinking that it is amazing that we have these big complex systems that actually work. Kleppman describes so many things that can go wrong in a system that you start to wonder how in the world it is possible that things actually do work…most of the time.
The third, and last, part of the book is called Derived Data and consists of three chapters. Here Kleppman describes different types of data processing, Batch Processing and Stream Processing. I am not at all familiar with the concepts of distributed filesystems, MapReduce, and other details discussed in the Batch Processing chapter so I found it a bit hard to keep focus while reading about it. However, the Stream Processing chapter was very interesting.
To sum up. I really enjoyed reading this book. I think it was a great read and it really helped me get a better understanding of the system I am working with (maybe most importantly what it is NOT designed for). I recommend anyone working with larger data intensive systems to read it, it will take time, but it is well invested time.
Finally, I would like to thank Kristoffer who told me to read this book. Thank you!