At ClusterHQ, we're huge fans of KubeCon, so last week's announcement that KubeCon is joining forces with the Cloud Native Computing Foundation is fantastic news! You can read through our blog to see some of our favorite talks from KubeCon EU last month, complete with video, slides, and transcription, including the talk below by Michael Ward, Principal Systems Architect at Pearson.
We hope you enjoy the presentation as much as we did!
ITNW (If This Now What): Orchestrating an Enterprise
Presented at KubeCon EU
So, ITNW, If This Now What: orchestrating an enterprise. We're going to go through a lot of stuff today. I'm going to try and really move through some slides to get to the good stuff. I've got a couple demos to work through. Hopefully you guys will walk away thinking, wow, those guys at Pearson are doing some cool stuff. I always like a few good quotes from Albert Einstein, in particular, "Education is what remains after one has forgotten everything one learned in school." Today we're going to go through a little bit of business justification, right? Pearson is a massive company. We're going to talk about that a little bit today. We have to justify what we're going to do, especially when we're talking about doing tooling for our entire development base, which is 400 development teams at the moment. We're going to talk a little bit about requirements, build pipelines, third party resources in Kubernetes, an Ingress controller with SSL integration to Vault, and ChatOps.
I'm Michael Ward, the principal systems architect of our enterprise platform-as-a-service, called Bitesize internally. Some of you may know me as Devoperandi. I blog at www.devoperandi.com. Feel free to reach out, but I will warn you: after today I'm out of pocket for two weeks, so if I don't respond, don't get upset about it.
[1:35] So, Pearson, we work for Pearson. We have a small team. Pearson is 171 years old. We've been around for a very long time, and we've gone through whole generations of different focuses. Originally, we were a construction company. Then back in the mid-1800s, we turned into kind of a publishing company. We've owned companies like Madame Tussauds and The Economist. Again, we're 171 years old, with 40,000 employees across 70 countries. That's really important because our focus in the past has been very diversified. We haven't had any particular focus. Over the last 20 years or so, we've moved towards education. We've moved from not just printing and publishing education books and providing those throughout the world; we're also now focusing on digital education. What's really important about that fact, and honestly I've heard some flak about it, is that we're not just interested in providing education to the people that have money. We're interested in providing education and knowledge to those that don't. That's why we're moving into regions like Africa, so that we can provide everyone in the world the same level of education, so they all have the opportunities that they want and need.
Pearson is no small potatoes
[3:03] Pearson is not small. We've got 36,000 servers, 2,000 applications, and 400 development teams. When I say 2,000 applications, I mean some of these applications that are not legacy are made up of 10 to 30 microservices per application. What we're really talking about are stacks. This is hard. Every time we have a problem, it's a big problem. There is no small problem in our world. Simple solutions are what we absolutely must have in every scenario. We really have to think about what we're doing.
Open source at Pearson
[3:40] On top of that, Pearson is again education focused, and that's what open source is about, right? Open source is about sharing knowledge, sharing code, and everyone as a whole contributing back to a community to make it better. As you can see, we've released several projects completely open source. We have a Kubernetes pack for StackStorm, which we're going to talk about a little bit more later. We're even going to demo that. Then we've got an ingress controller that we originally wrote prior to the nginx alpha being able to manage multiple namespaces. We've got some integration with SSL and that sort of thing that we're really going to deep dive into.
[4:23] First off, I really, I planned this for the end of my talk, but I wanted to show you guys something cool right at the beginning. We have complete ChatOps integration for building projects in Kubernetes. We have a master Hubot, and that Hubot reaches out and creates a brand new namespace with Jenkins and a Hubot for each development team and each project. From there we can deploy out containers and do all kinds of cool stuff. It even integrates to the point where we can create brand new Bitbucket repositories or git repositories, send you email invites, all kinds of stuff with literally a single command.
[5:06] You guys want to see it? All right, let's do it. Just to show that we actually have Hubot running, and he is alive right now. You can see we've got the platforms bot up, and already getting a … all right. As you can see, the platform bot is up. You can see in my command line that my personal platform bot is responding. I can say create project, kubecondemo28, because I've practiced it that many times. All right, so literally just like that, and here I'll make this a lot bigger for you. So just like that, with literally a single command, we've created a brand new namespace in our production cluster. We've created a default service account, right? We don't have service accounts turned off in Kubernetes; that's a big security hole. We've created another Hubot room. When we create a new project for our developers, we create them their own HipChat room to communicate with, and the Hubot that we're going to deploy into the environment will actually chat back to that Hubot room automagically. Then they can type their own commands through HipChat, or any other IM platform, to be able to say, hey, what do you want me to do? Run Jenkins jobs, all that kind of cool stuff.
You can see we've created some default build files. Those go into our git repository. We've created a new Hubot, and we've created a Bitbucket repository. Let's take a look real quick, and I don't have my email of course. You can see that because I put my email in there, I've been invited back to this project. What's really cool is I've also been invited to have access to the Bitbucket repository. Now I can easily pull down that repo, clone it, and push on to the next thing. Provided the application is in the right state, we can literally start to bring on applications in minutes, not hours, not months, not anything. So pretty cool stuff. First demo success.
[8:05] All right, business justification. We're talking about build pipelines and business justification for build pipelines, because that's kind of a central point of this talk. We went to our stakeholders and we said, "So what's important to you guys?" Obviously, for the business it's cost; for developers and management it's time and ease of use; standardization, compliance, and visibility for the security team; and ease of use, agility, and minimal disruption for QA and perf. Within our organization we have a pretty bad habit of running perf tests against our development environments and really hosing the entire environment, preventing developers from doing anything for the next 3 days.
[8:46] Let's just talk about the business cost aspect. We went around to some of our development teams and we said, "So how long does it take you to create a build pipeline?" Most of them don't even have QA and perf testing and SESO integrated. They're telling us 3 months maybe, 3 to 6 months with 2 developers, to get to basically a good starting point. What that amounts to is $50,000 per development team just in development time. When you have 400+ development teams, that comes out to 20 million dollars just in development time. Forget the cost of plugins, forget all of that extra stuff, this is just development time. On top of that, they have recurring costs, right? Recurring time from developers in order to manage this build pipeline. When you're talking about 400 development teams, you're talking 2 million dollars a year just in managing that. We wanted to reduce that. We don't want our developers spending time on that. We don't even want them to touch Jenkins or Travis or anything like that. We just want them committing code. That's what they're good at; that's what they enjoy. Just because I can, a couple quotes from some developers: "We spent six months solid building a good starting point." "We never upgrade Jenkins once stable, because we can't get time for it."
[10:17] All right, so stakeholder requirements. We just talked about all the pain points and the various pieces that these stakeholders want, so we turned them into requirements. We said: reduce migration costs, standardization, performance testing, compliance visibility, ease of use, quality testing, reduce time to manage. But we also had our own requirements, right? If you've ever been in the ops world, which I have, you want this stuff to be geographically distributed. Pearson's massive. We're in 70 countries. We've got data centers everywhere. We can't afford to be single-homed. We have to be geographically distributed. Jenkins, when you've got 400 or more Jenkins machines, they have to be cattle. They need to be self-configuring, they need to be scalable, and we need feedback loops back to our developers. Not just a "hey it's broke, hey it's still broke"; we need more information, right? That's why we've added a Hubot in every namespace. As soon as a Jenkins build fails, it'll communicate back to Hubot and tell us about it.
High level product design
[11:23] Here was our high level product design from our director, who said, here's basically what we want to do. We want a developer to be able to commit code, continue to develop and iterate on that, run through automation testing, and then run a global deployment to many, many different datacenters. We want it all to happen in under an hour. Not too ambitious, but pretty good when you start accounting for SESO and QA and perf testing, that sort of thing.
[11:57] Alright, build pipelines. There's a real key concept that we need to talk about here. That's that right now you're seeing, Lord knows, how many Jenkins servers here, but you have to realize that our developers never touch these Jenkins machines. They literally configure from one place, inside their git repository, one time, with three files, and it works everywhere. It doesn't matter where you are. You can see, we've got AWS, we've got our own private data center on OpenStack, we could be in Ireland. We have different namespaces, console-dev, stage, prod, and there can be any order of that across the entire world.
[12:43] What we did is we created a standard process by which to accomplish this. Something that would allow our developers to move teams, because that happens regularly when you have 400 development teams, but be able to utilize the same exact development process. Another thing is we wanted to avoid the situation where there's just one developer on the team who really knows the build CI/CD process. We wanted anybody to be able to use it, so it would be simple, right? Excuse me. What we've done is we've got Jenkins deployed into every namespace and every cluster across the world, every datacenter. What happens is it's a vanilla Jenkins, never been configured, never done anything, right? It automatically gets set up with what we call a seed job, and that seed job reaches out to our developers' repository. That seed job then starts to self-configure Jenkins: "Hey, I need to be able to build for ruby applications, or python, or Java," or "Hey, I need to be able to just deploy. I don't have to build anything." We want that flexibility. We don't want to build the same image in 100 different namespaces across the world for one application. We just want to build it once and then utilize it everywhere else. Also, we need the flexibility of being able to have tests and different sets of tests per environment, whether it's in Ireland, or Germany, or Singapore. Basically, what one of my colleagues, Simas, came up with, along with Jeremy Darling, who's a developer, was 3 simple yaml files. Based off these 3 simple yaml files, anywhere from 10 to 50 lines of code, we can distribute this thing throughout the entire world. I know it sounds kind of crazy, but it works; it's amazing.
Alright, so let's talk about code a little bit. Let's talk about these yamls. Every project, and think of a project as a development team here. It can be many different applications within that team, but a project is synonymous for us with a development team. This is actual code running for our console application. Hed is higher ed. We have a base image. We're telling Jenkins, hey, we're going to deploy for node.js 4.2.3. Now Jenkins knows to go out and get node.js 4.2.3. Oh, by the way, we have some components here: console-server, console-ui, console-stub, and these are all components of a single microservice build. Our developers requested the flexibility of not just creating a microservice, but actually being able to break that into smaller bits of code so that they can manage it more easily. That's what we've done here; we call them components. Then down here at the bottom we've said we want to be able to specify what the runtime command is, right?
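The slide itself isn't reproduced in this transcript, but pieced together from the description, an application.yaml along these lines would capture it. The field names below are illustrative, not Pearson's exact schema:

```yaml
# Hypothetical application.yaml sketch; field names are illustrative
project: hed                 # "hed" = higher ed
base_image: node:4.2.3       # Jenkins pulls node.js 4.2.3 as the base
components:                  # smaller bits of one microservice build
  - console-server
  - console-ui
  - console-stub
command: node server.js      # the runtime command for the service
```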
Alright, so that was application.yaml, pretty easy. Build.yaml. With build.yaml, all we're doing is specifying a set of dependencies by which we can configure our Jenkins machine. We're specifying what the names of the components are that we're going to create. We're specifying the repositories down here that Jenkins needs to reference. Now we're not just hitting one repository; we could hit 100. It doesn't really matter. We can just keep adding to this list as our developers see fit. Actually, the developers have complete control over this. We're basically telling Jenkins where the other branch and repo locations are that it needs to be aware of. We're then telling Jenkins how we build this, and then we're also telling Jenkins a little bit more information about another piece of code in this example, called console-stub.
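Again as a sketch, with illustrative field names and repository URLs rather than the real schema, a build.yaml in that spirit might look like:

```yaml
# Hypothetical build.yaml sketch; field names and URLs are illustrative
dependencies:
  - nodejs                   # tools the vanilla Jenkins self-installs
components:
  - name: console-server
    repos:                   # as many repos/branches as the team needs
      - url: ssh://git@bitbucket.example.com/hed/console-server.git
        branch: master
    build: npm install && npm test
  - name: console-stub       # extra detail for a second piece of code
    repos:
      - url: ssh://git@bitbucket.example.com/hed/console-stub.git
        branch: master
    build: npm install
```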
[16:52] We have one more yaml file, environments.yaml. What’s really cool about this, and I didn’t really display it, is that, well actually I did. It’s right there next to environments. In this case, we’re going to build for our development environment. It’s going to be console-test. Our next environment is staging. What that means is after dev has launched, we’ll automatically route onto our staging environment and build there too, or deploy there and run a completely separate set of tests. Pretty nice. We’ve defined a service, very much synonymous with Kubernetes services and ingresses because we run ingress load balancers on our cluster. We’re basically defining a couple environment variables. We can define a port, we can say “okay, here’s console-dev.pearson.com.” That’s going to be our publicly available endpoint. This means that Jenkins automatically builds our ingresses for us, which gets added back into our ingress controller.
Then we've got tests. We don't want our QA and SESO and perf testing teams to be left out of the process. We want them to know that they always have a seat at the table. What's really cool about this is all we do is specify another repository, another branch. We can provide shell commands, we can point to an API endpoint, we can do anything. All the QA and SESO guys have to do is say, "Hey developers, this is where you need to point to for your tests." Pretty rad, because now QA and SESO have complete control over how that application is going to get tested. Is it unit testing, is it integration testing, is it end-to-end testing, are we doing breach management testing? It doesn't matter. The developers don't even have to know about it. All they have to do is point to the test location, and we're done. It's that easy.
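Pulling the environment, service, and test pieces together, an environments.yaml along the lines described might look like this sketch. The layout and names are assumptions based on the talk, not the actual file format:

```yaml
# Hypothetical environments.yaml sketch; names are illustrative
environments:
  - name: dev
    namespace: console-test
    services:
      - name: console
        port: 3000
        domain: console-dev.pearson.com  # Jenkins builds the ingress from this
    tests:                               # owned by QA/SESO, not the devs
      - repo: ssh://git@bitbucket.example.com/qa/console-tests.git
        branch: master
        commands:
          - ./run-unit-tests.sh
  - name: staging                        # deploys here after dev passes
    namespace: console-stage
    tests:
      - repo: ssh://git@bitbucket.example.com/qa/console-perf.git
        branch: master
        commands:
          - ./run-perf-tests.sh
```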
[19:05] All right, so what requirements have we met with this? We've met geographic distribution, because even though we may have the same application across many different clusters, it's based on the namespace. If we have console-test as an example, console-test reaches back to the same git repository and says, "Hey, I'm console-test. Tell me what the hell to do." They're self-configuring, obviously, because now they've been told what dependencies to pull from. They're standardized, meaning across 400 development teams the process is really pretty similar. Really the only difference is the commands, right? It's pretty easy to use, although we're working on that. Again, they're cattle; it's scalable. We don't have this big monolithic Jenkins machine per dev team. We might have 12 Jenkins machines across one dev team, but those Jenkins machines can be a couple hundred megabytes in size. They may or may not be doing much. What's even better, and what we're working on right now, I don't know if you have a skill set in this area, is scaling those out automatically based off of the load that's generated in this configuration.
If you guys have some insight into that, we'd love to hear from you. We reduced migration costs. One of our developers, actually the developer that gave me some of those comments you saw earlier, was able to help another team come on in like 15 or 20 minutes. Fifteen or 20 minutes to migrate from an existing virtual environment to getting started building on PaaS. That's pretty rad. How many people can say that here? Hands? All right, I think it's pretty rad.
[21:08] All right, base builds. In order to make our environment fast, what we've done is we've created what we call base builds. We've said, basically, we're going to have Java, and Python, and node.js, and PHP, and whatever applications our developers want to do, and we're going to provide them as base Docker images. It still goes through our entire build process. What I'm saying is, we still go through an entire development process and testing and QA and security on all of these base images. That way, when security says node.js 4.2.3 is ready, they've already signed off from a compliance perspective; they're happy with that. Is that kind of cool? How many of you can say, hey, SESO's already signed off on what I've done and I haven't touched anything yet? Pretty awesome.
Then, after that, obviously it goes through our normal feedback process, back to Hubot, back to our core team, but Jenkins can do all of this stuff in an automated fashion. Again, what's great about this: these are just base builds; these are not the application we're going to use. We're just using this as a method to secure our base infrastructure so that we don't have to do as much testing when we do our app-level development. If SESO and QA have already bought into these base build images and said, hey, those are good to go, then really all they have to do is test the delta of the application, and of course any additional libs that are added onto it. We don't prevent that; that's totally fine. Alright, so after we get done with the base build, as you can see, we just push nginx 1.1.7 to our registry, and it's ready for consumption. They can just specify it inside of their yaml, and they're good to go. What have we done here? We've created a standard process across the enterprise by which to create and level-set base images. We've ensured compliance, because as long as it passed the tests that the SESO guys and the QA guys set forth, we're good. We've determined we have at least a minimal base set of security involved. How nice.
[23:35] All right, so app builds. App builds are very similar to base builds. The only difference is that we require a base image, in this case the nginx 1.1.7 that we've already verified and determined has met a given security and QA and perf testing requirement. Then we inject our application code in here, currently in the form of deb packages. All we do is take a series of deb packages and plop them into the right directory; we've created a standardized directory format. We throw them in, Jenkins populates that as another application image, then tags it and throws it into our registry. What have we done now? We've increased speed, because we're not trying to build an entire base image and the application image at the same time. We've included performance testing. We're now scalable, because once it's in the registry we can just deploy that image from anywhere. We've met SESO's compliance requirements, and quality testing, SESO, and performance are all kind of built into the environment.
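As a sketch, the app-build step amounts to layering packages on top of the approved base, something like this Dockerfile. The registry path and directory layout are made up for illustration:

```dockerfile
# Hypothetical app build: deb packages on top of a pre-approved base image
FROM registry.example.com/base/nginx:1.1.7

# Standardized directory for application artifacts
COPY build/*.deb /opt/app/packages/
RUN dpkg -i /opt/app/packages/*.deb && rm -rf /opt/app/packages
```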
Manage External Resources with Kubernetes
[24:52] Man, I feel like I'm moving through this really fast. How much time do I have? All right, so what else? We've really been doing a lot of amazing things inside of our team. One of these is managing external resources with Kubernetes. What that means is, there's something called a third party resource in Kubernetes. Basically, using StackStorm, which is an open source platform, we're able to watch the Kubernetes API to create new cool things. It could be databases, it could be anything you want, but what's really cool about this is we're managing it from Kubernetes. Imagine being able to deploy an RDS database and saying, I want RDS, I want it to be mysql, aurora, or whatever you want, and I want to be able to determine the size, how much storage it has, everything. All you do is push a little bit of code up to Kubernetes and the rest is handled for you. It then takes that data, and I don't have a good slide for it, it takes that data and populates it into our dynamic configuration store, Consul by Hashicorp, and populates the secrets into Vault. So when our applications spin up, they automatically ingest those as environment variables. Developers don't even have to know what those endpoints are. All they have to do is call the same generic environment variable forever.
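In the Kubernetes 1.x-era API, this pattern was a ThirdPartyResource definition plus instances of it that a watcher reacts to. A sketch might look like the following; the prsn.io group and the spec fields are illustrative guesses, not the actual schema:

```yaml
# Hypothetical ThirdPartyResource sketch; group and fields are illustrative
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: mysql-database.prsn.io
description: "A MySQL/RDS database provisioned via StackStorm"
versions:
  - name: v1
---
# An instance of that resource, which a StackStorm sensor watching the
# Kubernetes API would react to by provisioning the database
apiVersion: prsn.io/v1
kind: MysqlDatabase
metadata:
  name: console-db
spec:
  engine: aurora
  storage: 100Gi
  instance_class: db.r3.large
```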
[26:31] Do you guys want to see that? Kind of rad huh? Alright, don’t fail me now. Alright, so let me show you what that looks like real quick. As I mentioned, this is just one config file. We’re saying hey I want a database, I want it to be mysql. Currently it’s a third party resource, we give it a little bit of description, we tell it which version we want it to run on. So we currently have one database running, but we want many many many more. Let’s go ahead and create that sucker. No. Oh man, doesn’t this always happen? Right when you need it. It only worked like 100 times before I got up on stage. Apparently I didn’t make a sacrifice to the demo gods today. Excuse me, one minute, ignore everything you’re seeing. Well, no wonder. I think I lost my … no. That’s what I get for trying two demos in one day right? All right, so while I’m working on this who has questions? By the way I got a pass that said delegate on it, and they scratched that out and wrote speaker. I said, “This is excellent. I’m going to delegate my speaking engagement to somebody else. It’s brilliant.”
Sarah: Does anybody have questions? Yes, hang on okay we’ll start here, and then go there.
Audience: Adoption, so you've got 400 development teams on the old platforms. Did you have to move any of them across with a bit more of a stick than a carrot?
Michael: No, is the answer. Honestly, I think this is mostly a human element. Some development teams don't want to change, and that's perfectly acceptable. What we did was we brought them in early and we started warming them up to it. It's kind of like hot chocolate in the microwave; it takes time to warm them up. Then, what we started doing, again, we started investing in them. We said, so what do you want, what are your frustrations, how can we help? They gave us what they thought were impossible frustrations that no one would ever solve, no one would ever complete. We spent a lot of time on that. We said, okay, how can we solve this in a way that would really be meaningful to developers? At that point it's kind of like that guy that stands out at Woodstock, and he dances alone for like 80% of the video, right? Finally somebody says, well, that looks fine, I'm going to run out there with him. Then you've got two people dancing for another 90% of the time, and by the end of the YouTube video, an entire swarm of 4,000 people is having a mosh pit in the middle of Woodstock. That's kind of how it is.
Sarah: Community for the win.
Audience: I wanted to know how you manage upgrading Jenkins, since it's so easy to spin up a new instance in a new namespace. How are you managing that, since Jenkins has a history of remote code execution **, which is kind of natural given how it works? How do you manage upgrading Jenkins across the fleet without disrupting developers' work, because there are so many instances …
Michael: Yeah, it's huge, right? That's why developers never upgrade Jenkins. What we've done, as I mentioned earlier, is Jenkins is self-configuring. Regardless of what you're running, it configures itself. We can literally just kill off a Jenkins pod, and when it comes back up it re-configures itself. Now, we use Jenkins as a base image in our process. What that means is, before it ever gets to our developers, we've already run through a series of testing against it. We can even incorporate some of our developers' yaml files that I was talking about earlier, and we can say, is this going to run on top of this? We have the ability to steal those; we don't have to have access to the repositories. The only thing that does is Jenkins. Then we can duplicate that set. The other piece is, and that's an excellent question, it's going to lead into something really cool.
We just designed a process by which to upgrade, and we have two different upgrade paths for infrastructure. One is the Kubernetes-based infrastructure. The other one is the pod-based infrastructure. Essentially, we're building out something really cool, because we have what we call global services. Jenkins is a global service, Hubot is a global service, caching is a global service. All these different things, and we need to be able to upgrade those efficiently across a distributed environment. What we're doing is creating what we call a brain to enable that. Essentially it will upgrade everything for us. We can do that at any time. We can set schedules, we can do all kinds of cool stuff, but because Jenkins is literally just an image to us, we can determine very easily whether it's going to be successful for our developers.
Sarah: We have 3 or 4 more minutes, do you want to close up or take another question?
Michael: Well, you know since my second demo failed utterly, I’m happy to take questions.
Sarah: Alright, awesome. We’ve got one way here at the back so I’m going to chase that way.
Audience: What are you doing for your networking?
Michael: Within the Kubernetes cluster?
Michael [34:38]: Currently it's Flannel. It's worked quite well for us. We're going to be moving onto other things. We saw Romana for the first time here during this conference, so we're going to look into that. We've seen some other networking pieces that we really want to take a deep dive into. Currently we don't allow much exposure; for example, our registry is actually a pod within our cluster that other pods, or that Kubernetes, calls in order to deploy pods. We don't really expose much outside of our clusters. As a matter of fact, I think I have a good slide on this. Within our environment, we're currently on AWS, and we have the capability to deploy into OpenStack, but we deploy in tiers. We've got a load balancing tier that is spread across multiple AZs. All of these are auto scaling groups, by the way. All of our minions, all of our masters, and what we call our load balancing minions are all in auto-scaling groups. As you can see here, we've got our load balancing minion tier, and those are the only machines that have any exposure to the internet. All of our backend machines here have zero exposure to the internet, zero direct exposure at all. We've reduced our footprint significantly. But yeah, Flannel is currently the way we're routing. We've tested some other things, but that's currently what we're using.
[36:04] Real quick, I do want to talk about something really cool that Martin Devlin on my team created, and it's freaking brilliant. You guys have all heard about the Nginx alpha controller that can route traffic and listen on 80 or 443. The problem with that is SSL generally has been painful, to say the least. I've heard some other guys at the conference saying they're just adding SSL certs to Kubernetes secrets, which is fine, but Kubernetes secrets are really just an obfuscation of the data. They're not actually encrypted. That was a problem for us because our compliance guy said, no, that isn't going to work. What we've done is we've automated population of SSL certs into Hashicorp Vault. Then, whenever we create an Ingress, which again is deployed through Jenkins, our Nginx controller pulls that in as a backend, and then we have a little Go program, or agent, that runs on those pods and pulls in the SSL certs based off the domain name. What's really cool about that is we can now manage our SSL certs from a secure store, in Vault. By the way, I didn't really mention this very heavily, but please understand that the vast majority of what we're doing is entirely open source, or we will be open sourcing it once our code looks a little better.
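The agent itself isn't shown in the talk, but the core of it is small. Here's a minimal sketch in Go; the secret/ssl/&lt;host&gt; path layout and the address/token handling are assumptions for illustration, not Pearson's actual code:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// vaultPathFor maps an ingress hostname to the Vault path where the
// agent expects that domain's TLS cert and key. The layout is illustrative.
func vaultPathFor(host string) string {
	return "secret/ssl/" + host
}

// fetchSecret reads a secret from Vault's HTTP API using a client token.
func fetchSecret(vaultAddr, token, path string) ([]byte, error) {
	req, err := http.NewRequest("GET", vaultAddr+"/v1/"+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Vault-Token", token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	// In the real agent, the host would come from the Ingress, and the
	// address and token from the pod environment, e.g.:
	//   fetchSecret(os.Getenv("VAULT_ADDR"), os.Getenv("VAULT_TOKEN"),
	//               vaultPathFor(host))
	fmt.Println(vaultPathFor("console-dev.pearson.com"))
}
```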
Audience: How can people find that open source?
Michael: Great lead in. We've done a lot of integration. We've created the first Kubernetes pack for StackStorm, which is an IFTTT-style, event-driven automation platform, and it is inside the public contrib. Again, Martin Devlin created the SSL integration. We're currently submitting that as a pull request to contrib for Kubernetes. This Jenkins build pipeline here, we've got to turn it into a plugin so it's really easily installable for you guys. We'll be making that open source as well. Then we've got our ChatOps integration that Simas and I built, which we'll be open sourcing very soon. Stay tuned; again, my blog is devoperandi.com. I post a lot there and reference a lot of these projects, so feel free to reference that. We'll definitely post there as well as on our Pearson blog, so that you know once we finally open source this stuff.
Sarah: Okay, we have one super short question.
Audience: Are you going to post these slides on your blog?
Michael: Yes, we will. We'll post these slides on the blog, absolutely. One quick thing: etcd restore. We currently have the ability to restore etcd from files into a completely new cluster, so we can do A/B deploys of clusters. It's not pretty yet, but it is working. With the stuff that Brandon Philips was talking about, we'll be moving that way. Forget you ever saw this slide.
About the speaker
Michael Ward is the Principal Systems Architect at Pearson, responsible for leading technical design of an enterprise Platform-as-a-Service based on Kubernetes. Prior to Pearson, Michael spent many years in the industry in various roles, including Chief of Site Reliability at Ping Identity, the Identity Security company. Take him for a beer and pick his brains on anything you like. You might even come away with something valuable. No cost:benefit ratio implied or guaranteed. Michael blogs at http://www.devoperandi.com/
More from KubeCon
Check out other talks from KubeCon!