PipelineDB is an open-source relational database that runs SQL queries continuously on streams, incrementally storing the results in tables. A compelling feature for prospective users is that it is built on the PostgreSQL core and can be used as a drop-in replacement for PostgreSQL without any application code changes.
Flocker is a container data volume manager that is designed to allow databases like PipelineDB to easily run in containers in production. When running a database in production, you have to think about things like recovering from host failure. Flocker provides tools for managing data volumes across a cluster of machines like you have in a production environment. For example, as a PipelineDB container is scheduled between hosts in response to server failure, Flocker can automatically move its associated data volume between hosts at the same time. This means that when your PipelineDB container starts up on a new host, it has its data. This operation can be accomplished manually using the Flocker API or CLI, or automatically by a container orchestration tool that Flocker integrates with like Docker Swarm, Kubernetes or Mesos.
In this example, we’ll manually remove the container and move it to another node using Docker Swarm, Docker Compose, and Swarm’s constraints feature. We’ll also run a service that continuously streams data into our PipelineDB database. Future blog posts will show how to do the same thing using all of the orchestration tools that Flocker supports.
Why run PipelineDB with Docker?
As your database workload scales up, you will want to make sure your PipelineDB server has enough CPU, RAM, and network bandwidth to handle near-term and long-term capacity needs. Running your PipelineDB server in a container makes it portable, so you can manually or automatically move that container to a more powerful machine with ease. You can also respond better to system failures like crashed servers. This is where Flocker comes in: it makes sure your data directory moves to the new host along with your container, reducing downtime and headaches. The same reasoning applies if you want to downgrade the host to a more affordable machine with moderate performance.
What you will learn
In this tutorial, we will demonstrate running a basic PipelineDB server container on a host machine using Docker. The PipelineDB container will be created with the Flocker plugin for Docker, declared within our Docker Compose file. Flocker will automatically create a dataset and mount it on your host to store PipelineDB’s /mnt/pipelinedb/data directory.
When the PipelineDB container is shut down and started on a new host using the same --volume-driver flocker flag, Flocker will automatically recognize that the container has moved, unmount its data volume from the first host, and remount it on the new host. This means that when your PipelineDB container starts up, it will have all of its data.
In this tutorial there are 4 nodes:
- 1 client node that we will execute Docker commands from
- 1 master node with the Flocker Control Service and Swarm Master installed
- 2 nodes with the Flocker Agent services and Docker installed (our database is going to move between these two nodes)
In this example, we will be running our nodes on Amazon EC2 and creating and attaching volumes from Amazon’s EBS service.
Getting your Swarm cluster set up
We have a simple walkthrough on setting up this 3-node cluster on Amazon Web Services using CloudFormation.
If you are using an existing cluster or setting up your Swarm cluster manually, restart the Docker daemons with an engine label on each agent node: flocker-node=1 for the first node and flocker-node=2 for the second. Swarm’s constraint scheduler will use these labels to place the container on a specific node.
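One way to add such a label, assuming a recent Docker Engine that reads /etc/docker/daemon.json (on older versions the label goes into the daemon’s startup flags instead), is:

```shell
# Prepare a daemon configuration fragment with the node label.
# Use flocker-node=1 on the first node and flocker-node=2 on the second.
cat > daemon.json <<'EOF'
{
  "labels": ["flocker-node=1"]
}
EOF
# Copy it into place and restart Docker on that node, e.g.:
#   sudo cp daemon.json /etc/docker/daemon.json
#   sudo systemctl restart docker
```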
Running PipelineDB with Docker on Node 1
SSH into our client node using its public IP address.
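For example (the key name and IP below are placeholders, and the login user depends on your AMI):

```shell
CLIENT_IP=203.0.113.10   # placeholder: your client node's public IP
# BatchMode and a short timeout keep the command from hanging if the
# address is wrong.
ssh -o BatchMode=yes -o ConnectTimeout=3 -i ~/.ssh/flocker-demo.pem ubuntu@"$CLIENT_IP"
```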
Clone this GitHub repo for the sample Docker Compose files you’ll use with your cluster.
Create our PipelineDB container using this Docker Compose file.
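The repo’s exact file isn’t reproduced here, but a minimal sketch of what it looks like (the image name, volume name, and port are assumptions) uses Compose v1 syntax, the flocker volume driver, and a Swarm scheduling constraint:

```yaml
# docker-compose.yml (sketch)
pipeline:
  image: pipelinedb/pipelinedb
  ports:
    - "5432:5432"
  volume_driver: flocker
  volumes:
    - "pipeline-data:/mnt/pipelinedb/data"
  environment:
    # Swarm constraint: schedule this container on the node labeled
    # flocker-node=1
    - "constraint:flocker-node==1"
```

Running docker-compose up -d against the Swarm master then creates the container on node 1.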
We can confirm that a volume was created and attached when this container was created by checking with the Flocker Control Service.
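With flockerctl installed on the client node, something like the following lists the cluster’s datasets and their states (the control-service address is a placeholder):

```shell
CONTROL_SERVICE=203.0.113.20   # placeholder: master node's address
flockerctl --control-service="$CONTROL_SERVICE" list
# The dataset backing the pipeline container should be listed as
# "attached" on node 1.
```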
Let’s connect to our pipeline container using psql.
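For example (node 1’s IP is a placeholder; PipelineDB’s default database and user are both named pipeline):

```shell
NODE1_IP=203.0.113.30   # placeholder: node 1's public IP
# Short connection timeout so a wrong address fails fast.
PGCONNECT_TIMEOUT=3 psql -h "$NODE1_IP" -p 5432 -U pipeline pipeline
```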
Once you have logged into your PipelineDB server, execute the following SQL commands.
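The exact SQL isn’t reproduced in this excerpt; a representative example from PipelineDB’s Twitter tutorial creates a stream of raw tweets and a continuous view that extracts hashtags (the names tweet_stream and tagstream follow that example and are assumptions here):

```shell
NODE1_IP=203.0.113.30   # placeholder: node 1's public IP
cat > tweets.sql <<'SQL'
-- A stream of raw tweets...
CREATE STREAM tweet_stream (content json);
-- ...and a continuous view extracting each tweet's hashtags
CREATE CONTINUOUS VIEW tagstream AS
  SELECT json_array_elements(content #> '{entities,hashtags}') ->> 'text' AS tag
  FROM tweet_stream;
SQL
PGCONNECT_TIMEOUT=3 psql -h "$NODE1_IP" -p 5432 -U pipeline pipeline -f tweets.sql
```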
Get Twitter app credentials
Before we can launch our containerized Twitter stream service, we will need to retrieve Twitter app credentials and user tokens.
Run Twitter stream service to generate some data
For this container to run correctly you will need to set it up with environment variables that allow it to connect to the Twitter pipeline. You can get your credentials by registering a Twitter application at https://apps.twitter.com/.
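A sketch of launching the service with its credentials (the environment variable names and image name are assumptions; check the stream service’s own documentation for the real ones):

```shell
# Placeholders: paste the values from your Twitter app here.
CONSUMER_KEY=your-consumer-key
CONSUMER_SECRET=your-consumer-secret
ACCESS_TOKEN=your-access-token
ACCESS_TOKEN_SECRET=your-access-token-secret

docker run -d --name twitter-stream \
  -e TWITTER_CONSUMER_KEY="$CONSUMER_KEY" \
  -e TWITTER_CONSUMER_SECRET="$CONSUMER_SECRET" \
  -e TWITTER_ACCESS_TOKEN="$ACCESS_TOKEN" \
  -e TWITTER_ACCESS_TOKEN_SECRET="$ACCESS_TOKEN_SECRET" \
  your-twitter-stream-image
```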
Move the PipelineDB container to node 2
Now let’s remove our container from node 1. This step is necessary because the same dataset cannot be mounted on multiple hosts at once.
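Assuming the Compose service is named pipeline, from the repo directory on the client node:

```shell
SERVICE=pipeline   # assumption: the service name in docker-compose.yml
# Stop and remove the container so its Flocker dataset can detach from
# node 1.
docker-compose stop "$SERVICE"
docker-compose rm -f "$SERVICE"
```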
Start up our pipeline container on node 2
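To do this, change the Swarm constraint in the Compose file and recreate the service (the label names follow the earlier setup):

```shell
# In docker-compose.yml, change the constraint
#   - "constraint:flocker-node==1"
# to
#   - "constraint:flocker-node==2"
# then recreate the service; Swarm schedules it on node 2 and Flocker
# reattaches the dataset there.
docker-compose up -d
```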
You will see your Flocker dataset move from “attached” -> “detached” -> “attached” states.
Flocker will detect that you want to run the same container on this host, unmount the volume from node 1 and mount it on node 2, and your PipelineDB server will still have all of its data.
Let’s connect to our PipelineDB container via psql again and verify.
Check for our data.
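For example, querying the continuous view from the earlier example (node 2’s IP is a placeholder, and the view name tagstream assumes the earlier naming):

```shell
NODE2_IP=203.0.113.31   # placeholder: node 2's public IP
PGCONNECT_TIMEOUT=3 psql -h "$NODE2_IP" -p 5432 -U pipeline pipeline \
  -c 'SELECT tag, count(*) AS mentions FROM tagstream GROUP BY tag ORDER BY mentions DESC LIMIT 10;'
```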
This was a basic example of manually migrating a database container from one node to another. PipelineDB’s documentation covers additional options such as replication, streaming replication, and high availability. Using Flocker along with a mix of these strategies will give you a resilient PipelineDB cluster with persistent storage.
We’d love to hear your feedback!