Reliable, Scalable and Maintainable Systems

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

I was lucky to have a six-week sabbatical over the summer, and felt that it would be a good time to read up on the technologies behind some of the large-scale distributed systems that are around at the moment. This book is a great read for getting up to speed.

It has three sections. The first is on the foundations of data systems, and starts with a quick discussion of what the words reliability, scalability and maintainability actually mean. The book then moves on to the various data models, where the author discusses the birth of NoSQL, query languages and the various graph databases. The underlying implementations are covered, including B-trees, SSTables and LSM-trees, and various indexing structures. The section finishes with a discussion of data encoding and evolution.

The second section covers distributed data, and there are chapters on replication, partitioning and the rather slippery notion of a transaction. Distributed systems can fail in many interesting ways, all covered in the next chapter, including some discussion of Byzantine faults. The final chapter in the section talks about consistency and consensus. Throughout, the author is happy to go into low-level implementation details, and each chapter has a list of references to papers that you can consult for more information.

The final section is on derived data – how do we process the mass of data that we have accumulated? The first chapter is on batch processing, which covers MapReduce and later variants. This is followed by a chapter on stream processing. The final chapter of the book is the author’s ideas for the future.

This book is a great read. It goes into loads of implementation detail, which helps the reader really get to grips with the material, though it might take more than a single read to understand the many ideas that are covered.


Designing systems for scalability

I’ve been doing some reading on designing systems for scalability, and I thought I could quickly post some of the useful YouTube videos that I have found. There are numerous videos on YouTube working through specific system design problems and solutions, but I haven’t included the ones that I watched here.

Eventually I came across this video on system design, which gives a good list of the various technologies that are used in some of the most scalable applications available today.

This is an introduction to how Twitter is implemented, and mentions ideas like fanning out to Redis and Memcached. There are also videos about Facebook and Instagram.

The choice of database is obviously important, and it is useful to understand in-memory databases like Redis. Transactions also come up, via myths and surprises, and how transaction isolation levels relate to the CAP theorem.

Uber deals with some of its reliability requirements by storing data on their drivers’ mobile phones.

GraphQL came up several times as an alternative to REST APIs. It often requires fewer round trips, and makes tool support easy by using a schema. There is an introduction here and the coding of a server (which explains what you can do about the N+1 problem using an online demo system).
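The usual fix for the N+1 problem is to batch the individual loads that resolvers make into one request. Here is a minimal language-agnostic sketch of that batching idea in Python (the `BatchLoader` class and `batch_fetch` callback are hypothetical names of mine, not from the talks linked above):

```python
class BatchLoader:
    """Collect every key needed during one resolution pass, then fetch
    them in a single batched call instead of one query per key."""

    def __init__(self, batch_fetch):
        self._batch_fetch = batch_fetch  # keys -> {key: value}, one round trip
        self._pending = set()
        self._cache = {}

    def want(self, key):
        # Resolvers register the keys they will need, without fetching yet.
        if key not in self._cache:
            self._pending.add(key)

    def dispatch(self):
        # One batched fetch for everything registered so far.
        if self._pending:
            self._cache.update(self._batch_fetch(sorted(self._pending)))
            self._pending.clear()

    def get(self, key):
        return self._cache[key]
```

With N resolvers each wanting one record, this issues a single batched query rather than N separate ones.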

There is a good general talk about lessons learned here.

I had heard about Bloom filters before, but hadn’t come across the Count-min sketch algorithm.
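For a flavour of what the Count-min sketch does, here is a minimal Python sketch (the width, depth and hashing choices are illustrative, not from any particular implementation):

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in sub-linear space.

    estimate() never under-counts: hash collisions can only inflate a
    counter, and taking the minimum across rows keeps the inflation small.
    """

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent-ish hash per row, derived by salting with the row.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._indexes(item))
```

The appeal for large-scale systems is that memory use is fixed by width × depth, regardless of how many distinct items flow past.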



Kubernetes is the winner

Kubernetes: Up & Running by Kelsey Hightower, Brendan Burns and Joe Beda

Everywhere you go these days, it’s all about containers and how they should be orchestrated. Software Engineering Daily had a great series about several container management systems, and so it was time to get the book about Kubernetes, by several of the founders of the project. There is a recent blog post on the history of the project here.

The book itself is really good. It explains the need for an orchestration framework, and demonstrates the various parts of the Kubernetes system. It starts by showing you how to deploy a Kubernetes cluster and works through the use of the kubectl commands. It moves on to explain pods, and the labels and annotations that you can attach to the containers that are being managed. This is very hands-on, working against a demonstration container that the authors have made available.

The following chapters cover service discovery, ReplicaSets, DaemonSets, Jobs and ConfigMaps, and then there is a chapter that covers deployments and upgrades. The last two chapters cover how you integrate storage with your applications and how to deploy some real-world applications.

The book, as you would expect, covers the material really well. If you want to try the material out on the Azure cloud, the Azure documentation contains some worked tutorials.

If you need to understand Docker a little better, then I found this post useful. Ben Hall also did a recent talk on other container technologies. A competing idea is serverless, and there is a recent paper that looks at the implementation behind this for the three major cloud platforms.



What are micro-tasks in the browser all about?

I gave a quick lightning talk at work about micro-tasks in the browser, based on a recent talk by Jake Archibald. The slides are available here.


Let’s get started with Docker

Essential Docker for ASP.NET Core MVC by Adam Freeman

We are allowed to spend time at work on a Friday afternoon exploring new technologies, so a colleague and I decided to work through this book. Microsoft have recently started supporting Docker running on Windows, and I thought this would be an interesting way to see how well the Windows Docker eco-system has been progressing. Also, this book targets ASP.NET Core 1.1 and I wanted to see if things were easier with the latest 2.1 version.

The first two chapters in the book are a really brief introduction to Docker, followed by a list of the docker utility’s commands.

Installing Docker on Windows was really easy, requiring us only to run an installer. We did have to turn on Hyper-V for Docker to use. This clashed with the Oracle VirtualBox that we typically use for testing, but fortunately I had a spare machine on which I could leave Hyper-V turned on.

In chapter four of the book you write a fairly simple ASP.NET Core application which you then publish.

dotnet publish --framework netcoreapp2.0 --configuration Release --output dist

This application is then copied across to a Docker container as part of the Dockerfile

FROM microsoft/aspnetcore:2.0.3
COPY dist /app
WORKDIR /app
EXPOSE 80/tcp
ENV ASPNETCORE_URLS http://+:80
ENTRYPOINT ["dotnet", "dockerplay.dll"]

which we can then use to build a Docker container.

docker build . -t apress/exampleapp -f Dockerfile

The next chapter of the book deals with Volumes and Software Defined Networking. Volumes allow you to define some storage which can be attached to a container – this allows the container to run an application that writes to the file system to store its state, say a database. When we need to rebuild the container we can then re-attach the file system to the new container, and hence not lose any data.

This is where we diverged a little from the book. The book targets Linux and MySQL, whereas we wanted to use SQL Server running on Windows.

For this we pulled a pre-built image containing SQL Server.

docker pull microsoft/mssql-server-windows-express

And then used a volume to store the state.

docker volume create --name testdata

docker run -d -p 7002:1433 -e sa_password=ffddfdfdfdfd -e ACCEPT_EULA=Y -v testdata:c:\data microsoft/mssql-server-windows-express

The book moves on to SDN and the demo application uses two different network segments – one for the frontend and one for the backend. In the book, a proxy is used to load balance across the three servers that are set up.

Unfortunately there was no HAProxy that would run in a Windows container, so we decided to use NGINX. Again, we had to build our own container for this, and I couldn’t build on Nano Server (because my Windows drive had become corrupted).


FROM microsoft/windowsservercore
ENV VERSION 1.13.9

SHELL ["powershell", "-command"]
RUN Invoke-WebRequest -Uri http://nginx.org/download/nginx-1.13.9.zip -OutFile c:\nginx-$ENV:VERSION-win64.zip; \
	Expand-Archive -Path C:\nginx-$ENV:VERSION-win64.zip -DestinationPath C:\ -Force; \
	Remove-Item -Path c:\nginx-$ENV:VERSION-win64.zip -Confirm:$False; \
	Rename-Item -Path c:\nginx-$ENV:VERSION -NewName nginx

# Make sure that the container always uses the default DNS servers, which are hosted by dockerd.exe
RUN Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name ServerPriorityTimeLimit -Value 0 -Type DWord; \
	Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name ScreenDefaultServers -Value 0 -Type DWord; \
	Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name ScreenUnreachableServers -Value 0 -Type DWord
	
# Shorten DNS cache times
RUN Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxCacheTtl -Value 30 -Type DWord; \
	Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' -Name MaxNegativeCacheTtl -Value 30 -Type DWord

COPY nginx.conf c:/nginx/conf

WORKDIR /nginx
EXPOSE 80
CMD ["nginx", "-g", "\"daemon off;\""]

We had to write a config file that knew about the three instances that we wanted to load balance across:

#user  nobody;
worker_processes  1;

error_log  logs/error.log;
error_log  logs/error.log  notice;
error_log  logs/error.log  info;

#pid        logs/nginx.pid;


events {
    worker_connections  1024;
}


http {
    upstream myapp1 {
        server dockerplay_mvc_1;
        server dockerplay_mvc_2;
        server dockerplay_mvc_3;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://myapp1;
        }
    }
}

We could run the various commands documented in the book to start the instances and add them to the right network. We could then load balance using NGINX, and we could refresh the web page to see that requests were being served by different machines at different times.

[There is a little too much hardwired in by name for my taste. The SDN inside docker runs a DNS that lets you look up other containers by name to get their IP address]

The next chapter of the book looks at Docker Compose. This gives you a way to wire things up using a single configuration file.

version: "3"

volumes:
  testdata:

networks:
  frontend2:
  backend2:

services:

  sqlexpress2:
    image: "microsoft/mssql-server-windows-express"
    volumes: 
      - testdata:c:\data
    networks: 
      - backend2
    environment:
      - sa_password=fddfdfdfsff
      - ACCEPT_EULA=Y

  dbinit:
    build:
      context: .
      dockerfile: Dockerfile
    networks:
      - backend2
    environment:
      - INITDB=true
      - DBHOST=sqlexpress2
      - DBPORT=1433
    depends_on:
      - sqlexpress2

  mvc:
    build:
      context: .
      dockerfile: Dockerfile
    networks:
      - backend2
      - frontend2
    environment:
      - DBHOST=sqlexpress2
      - DBPORT=1433
    depends_on:
      - sqlexpress2
    ports: 
      - 4020:4020 
      - 4021:4021

  loadbalancer:
    image: nginx
    build:
      context: ..\nginx
      dockerfile: Dockerfile
    ports: 
      - 8112:80
    networks:
      - frontend2

This is a really neat technology, allowing you to scale the various components up and down. Unfortunately for us, we didn’t have an easy way to reconfigure the load balancer when the scaling happens. In the book, the load balancer configuration has “links” and “volumes” lines that allow the compose file to pass details of the instantiations of the load-balanced service. We didn’t have time to look into this.

The next chapter in the book looks at Docker Swarm. There was no equivalent on Windows, so we didn’t try it.

The last chapter of the book looks at allowing debugger access into the container. Visual Studio can do this if you run the appropriate components, but we didn’t try too hard to get this working. Later versions of Visual Studio can build containers and automatically configure them to allow debugger access.

I think our main observation was that Windows Docker seems to be a long way behind Docker on Linux.

The book was good as a set of instructions to follow, with the brief explanations helping a little to understand what was going on. Using a book that was a version behind was a good way of forcing us to debug and understand what was happening a little better.

On a related note, there’s an interview that discusses Service Fabric which is used to run loads of the Azure infrastructure.


Reactive extensions in action

Rx.NET In Action by Tamir Dresher

The reactive extensions have been around for a long time. I remember coming across them in C# something like a decade ago, but I don’t think I’ve seen a book or documentation that covers the whole of the implementation – sure, people spend a lot of time talking about the various combinators and hot and cold observables, but they don’t spend much time talking about schedulers and the threading model that sits below the system.

Part one of the book, which consists of three chapters, gives a basic introduction to Reactive programming, and also covers some of the C# you need to make use of the Rx libraries.

By way of some examples, the first chapter introduces us to the idea of making events first-class objects, the IObservable interface and its duality with IEnumerable, and points out the differences between the push and pull models of event delivery. The author goes on to look at the properties described in the Reactive Manifesto. We are also introduced to marble diagrams, which allow us to visualise the various interactions.

Chapter two takes us through a “Hello, Rx” application. This time it isn’t Google Suggest, which used to be the canonical example used in various write-ups. In this book we look at a stock tracker application. This allows the author to cover how standard .NET events can be converted easily into event streams, and the author gets a chance to talk a little about the threading concerns. I think that’s great, as threading is often hidden under the covers in tutorials, but as soon as you want the events to be processed by a GUI you get into the GUI library’s threading requirements.

Chapter three covers functional thinking in C#. Rx.NET encourages you to structure processing as a pipeline, with events feeding into the top of the pipeline, various filtering and processing happening in the middle, and then elements subscribing to the resulting output of the pipeline. This is a mechanism that the functional style handles very well.

The second part of the book has chapters on various Rx.NET concepts.

Chapter four starts with creating observables, which is demonstrated by writing an observer that logs the received events to the console (and we’ll use this observer throughout the book). Of course, writing things yourself gives you a chance to break the protocol of IObservable – in particular the grammar that the messages must follow:

(OnNext) * (OnError | OnCompleted)

It is therefore often better to use the Rx library’s helpers for defining your own classes, so the author points to the ObservableBase class, which makes it easy to define your own named types; better still, there are many overloads of Observable.Create that avoid the need to name a new type.

var ob = Observable.Create<int>(observer =>
{
    Console.WriteLine("Started");
    Task.Run(async () =>
    {
        await Task.Delay(TimeSpan.FromSeconds(2));
        observer.OnNext(2);
        observer.OnCompleted();
    });
    return () => { Console.WriteLine("Finished"); };
});

ob.Subscribe(x => Console.WriteLine(x));

This chapter also looks at converting the various .NET event styles to observables, and looks at converting from Enumerable to Observable and back again. We also see some of the more primitive observables that handle looping and single values.

var evensBelow50 = Observable.Generate(0, x => x < 50, x => x + 2, x => x);
var singleValue = Observable.Return(10);
var neverFinish = Observable.Never<int>();
var empty = Observable.Empty<int>();
var _ = Observable.Throw<int>(new Exception("Bang"));

Chapter five covers how you make observables from asynchronous code. It starts with looking at async friendly versions of Observable.Create

var ob = Observable.Create<int>(async (observer, ct) =>
{
    Console.WriteLine("Started");
    ct.Register(() => Console.WriteLine("Finished"));
    await Task.Delay(TimeSpan.FromSeconds(2));
    ct.ThrowIfCancellationRequested();
    observer.OnNext(2);
    ct.ThrowIfCancellationRequested();
    observer.OnCompleted();
});

And then looks at the conversions between Task and Observable handled by the ToObservable method, and how SelectMany and Concat can be used to link different computations together.

Chapter six looks at the observer/observable relationship, in particular how to delay and re-subscribe to the observable. We walk through the DelaySubscription method and various other operators like SkipWhile and TakeUntil.

var ob = Observable.Range(1, 5).Do(x => Console.WriteLine(x));

Some of the ideas are put together in a drawing application where the code tracks the mouse and the mouse button up and down lead to event streams starting and ending.

Chapter seven looks at controlling the temperature of observables. Observables can be categorised as hot or cold. Here cold means that the observable replays a set of events to each subscriber, whereas a hot observable only delivers new events. In the case of the hot observable, if you weren’t subscribed when the event happened, then you don’t get to see it.
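The hot/cold distinction is easy to sketch outside Rx.NET. A toy Python version (the class names are mine, not the book’s, and this ignores OnError/OnCompleted entirely):

```python
class ColdObservable:
    """Cold: each subscriber gets the whole sequence replayed from the start."""

    def __init__(self, events):
        self._events = list(events)

    def subscribe(self, on_next):
        for event in self._events:
            on_next(event)


class HotObservable:
    """Hot: only subscribers present when emit() fires see the event."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, on_next):
        self._subscribers.append(on_next)

    def emit(self, event):
        for on_next in list(self._subscribers):
            on_next(event)
```

A late subscriber to the cold observable still sees everything; a late subscriber to the hot one has simply missed the earlier events.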

We start with an ISubject. Instances of this interface can act as an observer and as an observable, and Rx provides four types that implement this interface – Subject, AsyncSubject, ReplaySubject and BehaviorSubject. The book covers what all of these subjects do, and how they can be used to proxy hot and cold observables to give you something with various interesting behaviours.

Chapters eight and nine go through the many operators, from Max and Count all the way through to operators for partitioning an incoming event stream into a set of windowed buffers.
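Rx.NET’s actual Buffer operators are richer than this, but the count/skip windowing idea behind them can be sketched in a few lines of Python (an illustration of the concept, not the library’s implementation):

```python
def buffer(events, count, skip=None):
    """Partition an event sequence into buffers of `count` items.

    A new buffer starts every `skip` items; skip defaults to count,
    giving non-overlapping windows (buffers overlap when skip < count).
    """
    skip = skip or count
    events = list(events)
    return [events[i:i + count]
            for i in range(0, len(events), skip)
            if events[i:i + count]]
```

With skip equal to count you get tumbling windows; with skip of 1 you get a sliding window over the stream.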

Chapter ten talks about concurrency and synchronisation, and is the best explanation I have read of this side of the Rx world. There are many types of IScheduler that are implemented by the library, ranging from a scheduler that uses threads from the thread pool to schedulers that hijack the current thread and don’t return until a series of actions have finished.
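The “hijack the current thread and don’t return until the work is done” style of scheduler is essentially a trampoline. A minimal Python sketch of that idea (my own simplification, not Rx.NET’s implementation):

```python
from collections import deque

class CurrentThreadScheduler:
    """Trampoline: the outermost schedule() call runs the queue on the
    calling thread and only returns once every queued action has finished.
    Nested schedule() calls just enqueue instead of recursing."""

    def __init__(self):
        self._queue = None

    def schedule(self, action):
        if self._queue is not None:
            # Already draining on this thread: defer the action.
            self._queue.append(action)
            return
        self._queue = deque([action])
        try:
            while self._queue:
                self._queue.popleft()()  # actions may schedule more work
        finally:
            self._queue = None
```

The payoff is that recursively scheduled work runs iteratively in FIFO order, avoiding unbounded stack growth.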

The last chapter talks about error handling and recovery, and it also touches on the subject of backpressure. It is also very good and informative.

It is worth also mentioning that the book has three appendices – some general coverage of asynchronous programming in .NET, a section on the Disposables that the Rx library offers, and a section on testing Rx which talks about how you might unit test your code and use test schedulers to control the execution.

It was really good to have a single place that covered all of this material. Typically you can find some of this in blog posts spread all over the internet, but having it in a consistent story that develops over eleven chapters is brilliant.

I also noticed that the pre-release System.Reactive Nuget package contains code around the IQbservable interface. It will be interesting to see where that goes in the future.


Stack allocated closures make it into C#

Back in the days when I worked on Lisp compilers, I remember adding stack allocation of the closure records when calling local functions. In C# 7, we now have local functions and it is interesting to look at the optimisations that are applied for these.

Just for a baseline, let’s have a quick look at the implementation of lambda expressions which close over variables in the current method. The implementation, which has been around for a long time, re-homes the locals into a heap-allocated (Display) class. This extends the lifetime of the variables, allowing the reference from the lambda expression to govern their lifetime.

        static Func<int,int,int> Check(int a, int b)
        {
            return (x, y) => (x + y + a + b);
        }

This is converted into code that has the following form. “a” and “b” have been re-homed into the heap-allocated instance.

private static Func<int, int, int> Check(int a, int b)
{
    <>c__DisplayClass1_0 class_ = new <>c__DisplayClass1_0();
    class_.a = a;
    class_.b = b;
    return new Func<int, int, int>(class_.b__0);
}

The DisplayClass has the following definition, where we see the fields corresponding to the captured variables; the definition of the lambda method is encoded into this class too.

[CompilerGenerated]
private sealed class <>c__DisplayClass1_0
{
    public int a;
    public int b;

    internal int b__0(int x, int y)
    {
        return (((x + y) + this.a) + this.b);
    }
}

Local functions take us to code that has the following form.

        static Func<int, int, int> Check2(int a, int b)
        {
            return Local;
            
            int Local(int x, int y)
            {
                return (x + y + a + b);
            }
        }

The generated code is slightly different,

private static Func<int, int, int> Check2(int a, int b)
{
    <>c__DisplayClass2_0 class_ = new <>c__DisplayClass2_0();
    class_.a = a;
    class_.b = b;
    return new Func<int, int, int>(class_.g__Local|0);
}

We have the same style of DisplayClass, with the body of the local added as a method (as expected).

[CompilerGenerated]
private sealed class <>c__DisplayClass2_0
{
    public int a;
    public int b;

    internal int g__Local|0(int x, int y)
    {
        return (((x + y) + this.a) + this.b);
    }
}

However, there are now more optimisation possibilities. First, if the local function is scoped to the method in which it is defined, then it would be good to avoid the heap allocation.

        static int Check3(int a, int b)
        {
            return Local(1,2) + Local(3,4);

            int Local(int x, int y)
            {
                return (x + y + a + b);
            }
        }

This is indeed what happens.

private static int Check3(int a, int b)
{
    <>c__DisplayClass3_0 class_;
    class_.a = a;
    class_.b = b;
    return (g__Local|3_0(1, 2, ref class_) + g__Local|3_0(3, 4, ref class_));
}

The DisplayClass has been optimised to a struct

[CompilerGenerated]
private struct <>c__DisplayClass3_0
{
    public int a;
    public int b;
}

and the body has been added as a static method to the class that contains the method defining the local function

[CompilerGenerated]
internal static int g__Local|3_0(int x, int y, ref <>c__DisplayClass3_0 class_Ref1)
{
    return (((x + y) + class_Ref1.a) + class_Ref1.b);
}

The compiler has essentially noticed that the local method cannot escape from the method that uses it, and hence we can try to avoid the heap allocation.

We should also quickly look at the case where the local method doesn’t capture any locals.

        static int Check4(int a, int b)
        {
            return Local(1, 2) + Local(3, 4);

            int Local(int x, int y)
            {
                return (x + y);
            }
        }

In this case, the method compiles to the following

private static int Check4(int a, int b)
{
    return (g__Local|4_0(1, 2) + g__Local|4_0(3, 4));
}

and the local method is simply defined as a static method in the defining class

[CompilerGenerated]
internal static int g__Local|4_0(int x, int y)
{
    return (x + y);
}

While we are here we could quickly cover one memory management gotcha around closures and their implementation.

        static (Func<int,int,int>, Func<int,int,int>) Check(int a, int b)
        {
            return ((x, y) => (x + y + a), (x, y) => (x + y + b));
        }

The implementation decides to put the local variables into a single DisplayClass

private static ValueTuple<Func<int, int, int>, Func<int, int, int>> Check(int a, int b)
{
    <>c__DisplayClass1_0 class_ = new <>c__DisplayClass1_0();
    class_.a = a;
    class_.b = b;
    return new ValueTuple<Func<int, int, int>, Func<int, int, int>>(new Func<int, int, int>(class_.b__0), new Func<int, int, int>(class_.b__1));
}

This means that if either of the returned lambda expressions is alive (from the point of view of the GC), then the variables “a” and “b” are both still alive. This might not seem to matter too much, but if “a” and “b” were large objects (for example), it does mean that their lifetime can be extended further than you might expect.
