On Trend: A Saks 5th Avenue Case Study

The Future of Containers in the Enterprise

February 10, 2016

New York

How you doing? So, as Bryan said, my name is Matthew Pick. I work for Hudson's Bay Company, the digital side of Hudson's Bay Company. We work on five websites right now: saksfifthavenue.com, offfifth.com, lordandtaylor.com, thebay.com, and labaie.com, which is the French site for thebay.com.

So the infrastructure engineering position at HBC Digital is an interesting one. We have many cross-functional teams that take on projects, and our engineers are spread across several projects. Embedded in each of these cross-functional teams is an infrastructure engineering team member, and they are really in charge of DevOps.

They handle continuous integration and continuous delivery, they work closely with their teams on technology assessments, and they handle architecture, performance, and capacity planning; like I said, it's basically a DevOps role. We've been working with Docker and containers for about a year now. At around the five-month mark we launched Docker containers in production for a pretty big part of two of our sites, and that's why I'm here: to walk you through our journey getting there.

I feel like I have a lot of room up here though; I feel like I should do some interpretive Docker dance or something, but I won't do that. So this all began around two years ago, when we embarked on a project to bring microservices to HBC. We have this monolithic ecommerce app, and the idea was to take a page, at the time the product detail page.

For those who don't work in ecommerce, the product detail page is what comes up when you click on an item; it displays things like the price, the product copy, pictures, that kind of thing. So we took that page and broke each part of it out into a small microservice. The idea was to make the page more agile, and the project was a really huge success.

At the time, we had no way to deliver microservices to development, and word came down that we were going to start expanding our microservices and getting more and more of them. So the team got together and we started really experimenting with Docker. This was known as the spit, duct tape, and chicken wire phase of Docker, because we spent a lot of time taking our existing processes and cramming them into a container ecosystem. We stood up an internal Docker repository that would promptly crash every other day and forget images. It's really fun to go from developer to developer and sheepishly explain why their images have completely disappeared into the ether. That's kind of what we were dealing with at the time. But we saw huge potential in this technology to deliver true, immutable infrastructure to production.

I know immutable infrastructure is one of those words; in tech we love our buzzwords. But the idea is rooted in the idea of immutable programming objects: when you push something to, let's say, production, you don't ever alter it. If you need to make a change, you blow it away and re-create it. At the time we were doing immutable infrastructure as kind of a half measure.

We were using Puppet to deliver immutable infrastructure. To deploy our microservices, our recipes would blow the entire directory away and then re-copy everything: MD5 each file and copy the microservices over, which took forever. So our deploys took forever, and they were very easy to break. I'm sure plenty of people here have used Puppet; it's a great tool, but it's complex.

When we got the word that we were going to be expanding our microservices, we wanted to move 100% to containerization. At the time we were running 100 microservices spread across six bare-metal servers. The teams got together to redesign our container ecosystem around repeatable processes that we could use across multiple sites.

That was really important. But we're also infrastructure engineers, and I get called "the blocker" about 20 times a day, so we all live under deadlines. We had to pull together a tactical solution as well as a strategic solution. The tactical solution was a way to dip our feet into this technology without cannonballing into it.

We would use our existing virtual machine technology, bring up some VMs, and put the containers there. We would partner with the developers to bring in Docker in a way we could use and iterate on, while at the same time getting together to work on a long-term strategic solution.

So we started with the repository for our microservices. The developers had come up with a template for microservices, and we wanted that template to put forth infrastructure standards as well as coding standards. So we got together, and in each of these microservice repos you'd have a Dockerfile and a Docker Compose file.

What was nice about this is that, from the Dockerfile standpoint, a developer could pull down the repository, build a container, and run it locally or on what we call a slot (a slot is a development environment). And with the Docker Compose file, we could modify it as the service goes out to our environments, QA, pre-production, and production; that lets us layer infrastructure changes on top without altering the container, which is pretty important.
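To make that concrete, here is a minimal sketch of what the Compose file in one of those microservice repos might look like; the service name, port, and settings are hypothetical, not HBC's actual template. The Dockerfile sitting next to it is what `build: .` points at.

```yaml
# docker-compose.yml checked into the microservice repo (hypothetical example).
product-service:
  build: .                  # built from the Dockerfile in the same repo
  ports:
    - "8080"                # expose the container port; the host port is assigned per environment
  environment:
    - APP_ENV=development   # local/slot default, overridden downstream
```

A developer can run `docker-compose up` locally or on a slot and get the same container that will eventually ship.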

So we rebuilt our continuous integration and continuous delivery processes from the ground up. We were using ThoughtWorks Go as our continuous integration tool, and we decided to change from building pipelines ad hoc to templating. The idea was that you would build a template with all the steps you need to build or deploy an application, and from there you would join microservices to that template.

This made it easy to change on the fly when you're dealing with a technology as new as Docker. Now, I know containers aren't a new concept, but Docker kind of is, and a lot of the tools surrounding Docker are just now coming out of beta. When you're dealing with technology like that, it's important to be able to change quickly, so we use this templating to make changes and cascade them down to all the pipelines.

We also moved at that time from my spit, duct tape, and chicken wire repository to an enterprise-level artifact repository; we're using Artifactory, which has plugins for Docker and the Node package manager. For deploys, I talked before about the Docker Compose file. We use it extensively to layer on infrastructure changes, things like setting the reliable ports that we used.

We were using Rancher for container orchestration (I'll get to that in a second), but to layer those changes on top of the containers, we made the change in the Docker Compose file. That way, again, we're not altering the container from the development standpoint; we alter things as they go out to an environment, and we don't touch the container at all, because we're only changing the Compose file.
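As an illustration of that layering (not the actual HBC config), the production variant of the same Compose file might pin the image tag and the host port while leaving the container itself untouched:

```yaml
# Hypothetical production Compose file for the same service: it points at the
# tagged image in the artifact repository instead of a local build, and pins
# the "reliable" host port that the application pool expects.
product-service:
  image: artifactory.example.com:5000/product-service:1234
  ports:
    - "9001:8080"
  environment:
    - APP_ENV=production
```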

And we use the same version number that ThoughtWorks Go uses for our artifacts. Typically a build would start, it would compile, we would check in the compiled artifact, then we would build our Docker image, tag it with that same build number, and check it into our artifact repository.
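Those steps map to something like the following sketch; the registry address, service name, and build command are made up, and GO_PIPELINE_LABEL is the build label that ThoughtWorks Go exposes to its jobs.

```sh
#!/bin/sh
# Sketch of the build -> tag -> push flow described above (hypothetical names).
set -e

REGISTRY=artifactory.example.com:5000
SERVICE=product-service
TAG=${GO_PIPELINE_LABEL:-dev}   # same number Go stamps on the compiled artifact

# 1. Compile and check in the application artifact (whatever the service's build tool is).
./gradlew clean build

# 2. Build the Docker image and tag it with the same build number.
docker build -t "$REGISTRY/$SERVICE:$TAG" .

# 3. Check the image into the artifact repository.
docker push "$REGISTRY/$SERVICE:$TAG"
```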

And as I said before, we settled on Rancher for container orchestration. We did a technology assessment, and since we were, as usual, under deadlines, we landed on Rancher because it was easy to use and easy to set up. It gave all of our cross-functional teams really nice views into their environments and gave them the sense that they could go check log files and everything straight through a GUI.

So we had a container ecosystem that we were really happy with and psyched to use, and we started to talk about putting together a launch plan. This launch plan had to have minimal customer impact, which is important. We have an application pool in front of our microservices.

I'm sure everybody who has some kind of application out in production has some kind of application pool. The idea was: bring up our containers on the VMs, extend our application pool so it covers both the bare-metal applications and the VMs, send live production traffic so it hits both the bare-metal apps and the containers, monitor, and then bring down the bare-metal applications when we're satisfied.

There was a real concern at the time about how these containers would perform. We left room in the plan for load and performance testing, but to be honest, there's nothing quite like production-level traffic against an application, against new architecture, against anything. At HBC we've really adopted a dark-launch mentality.

We try our best to launch new architecture or new applications in production with little to no customer impact. This is done through toggles, through things like load balancing, through a wide variety of ways, whatever's best for the project. So at the same time, and I don't know if this was foolhardy or very brave, we decided to test the performance of our containers.

We would launch them during our busiest season: Black Friday and Cyber Monday. >> [LAUGH] >> Right. That would really give us an idea of how these things were going to perform in production, or we would crash and burn, everyone would cry, and we would get yelled at. We were supposed to spend a few days in this mixed-mode soft launch.

We spent weeks there. We had all kinds of problems, and if I could offer any advice on launching something like this in production, it would be to expect problems. Plan for failure. You're going to have issues. Be prepared to iterate, be prepared to throw technology out if it's not working for you, and be prepared to redesign from the ground up.

Especially with new technology like this. Many of the issues we faced came both from our team's inexperience running containers at this level and from the relative immaturity, the beta state, of some of the tools. We would upgrade Docker versions and then not be able to push images to our repository. We would upgrade Rancher, lose the database, and then have to rejoin containers to Rancher because they were all gone. But our biggest issue was with space on the host filesystem.

Right after we shut down the bare-metal applications and everything was running on our Docker containers, the Docker daemon just started mysteriously crashing. We noticed we were running out of space on the host filesystem. We couldn't figure out why, because we had spent all this time creating automated jobs that cleaned up after Docker, that cleaned up images, that cleaned up log files, but we could not get space back.
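For reference, cleanup jobs of that sort usually look something like this (a generic sketch, not HBC's actual scripts); none of it helps when the space is being consumed somewhere the cleanup can't reach, which is exactly what was happening here.

```sh
#!/bin/sh
# Typical Docker housekeeping of that era: remove stopped containers, dangling
# images, and old logs. The log path is hypothetical.
docker rm $(docker ps -aq -f status=exited) 2>/dev/null
docker rmi $(docker images -q -f dangling=true) 2>/dev/null
find /var/log/docker-apps -name '*.log' -mtime +3 -delete 2>/dev/null
```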

We went as far as changing our Docker Compose files to connect a volume. We were going to use the volume as a bucket for our log files, connect each container to it, dump the log files there, and clean them up and delete them as quickly as possible. The host operating system was still not getting space back.

I still have nightmares of waking up in the morning, checking space, stopping Docker, restarting the host, sometimes removing Docker entirely, rejoining and redeploying the containers, and then going in front of the ops team, who we were driving insane, like Frank Drebin when the fireworks store is on fire, saying, "Nothing to see here, it's all fine, move along."

But we finally figured it out. Quick show of hands: how many people here are running Docker in production? Okay. Leave your hand up if you are using the devicemapper loopback driver. Yeah. You guys are geniuses, because we didn't realize it until we actually looked it up, and the first thing that came up in Google was: don't use the devicemapper loopback driver in production.

So we switched to the overlay filesystem, and that fixed all the problems.
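For anyone hitting the same wall, the storage driver is a daemon-level setting, not something per container; a rough sketch of checking and changing it (file locations vary by distro and Docker version, and images have to be re-pulled after the switch):

```sh
# See which storage driver the daemon is currently using.
docker info | grep -i 'storage driver'

# On Docker daemons of that era the driver was usually set via a startup flag,
# e.g. in /etc/default/docker or a systemd drop-in:
#   DOCKER_OPTS="--storage-driver=overlay"
# Newer daemons read it from /etc/docker/daemon.json instead:
#   { "storage-driver": "overlay2" }
#
# Either way: stop the daemon, change the setting, start it again, and redeploy,
# because images stored under the old driver are not visible to the new one.
```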

But you go into a project like this expecting technical issues. What I didn't expect was having to manage a cultural shift around containers. The biggest issue the teams had was selling this technology to cross-functional teams, from merchandising to leadership to QA to the data team; getting all the teams on the same page about containers is not an easy task.

I mean, some of these teams are tech savvy, and some of them don't need to be; they have no need to understand what a hypervisor is. But if it's going to production, it has to go through pre-production, and pre-production is where these teams spend a lot of their time. QA has to know what's in the QA environment, they have to know how to modify the QA environment, and they have to understand what a Docker container is. So, funny story: I was really, really psyched about this technology.

I would go home and just talk about it all the time. My wife at the time, who has to listen to me drone on about whatever I'm excited about, started calling containers my Tupperware project. >> [LAUGH] >> And then she called Docker my pants project. So I sat her down and said, "Melis, this is the technology, this is what it's all about," and I'm watching her eyes glaze over, and I got really upset, because I'm crazy, so I'm like, "Melis, why aren't you listening to me? Don't you understand?" She's like, "Matt, you're talking to me like I'm one of your engineers.

You're talking to me in nerd-speak. Captain Kirk does not care how Scotty bypassed broken dilithium crystals to fix the engines on the Enterprise. All he cares about is that the Enterprise is fixed and that Scotty looks like a miracle worker, because Scotty is a miracle worker." So we had to adopt a different way to sell this. I'm a terrible salesman; most engineers are terrible salesmen.

We had to reassess how we approached knowledge transfer. We had to embrace the idea of teaching the un-geeky. Instead of talking about what the technology is, talk about what the technology brings, how it enhances day-to-day processes: the idea of one build of an app that's deployed the same way from development to QA to production, never tampered with.

That is a way easier sell. The idea of an app that self-heals, that will restart if it crashes, is a way easier sell than me talking about the underlying technology. We had to adopt a real showcase mentality, soup to nuts. We went ahead and built a microcosmic version of our target production environment with Rancher on top of it, and let the teams get in there and play around with Rancher, play around with being able to access log files.

That gave them a kind of power and control over the environments that they never really had before we brought containers in. And at that point we were Scotty. We were heroes; they loved this technology. That went as far as having to explain things to the infrastructure operations team, and we have a really great operations team.

But they had to reassess how they would support this technology in production. Again, immutable infrastructure: it's hard to explain to a guy who is used to getting into Linux, digging in, and making changes in production that that is not what you should be doing anymore. If you are doing that, we have fundamentally broken our QA and testing chain, because we've taken this container through development, QA, and production, and if you have to get into that container to make a change in production, mistakes have happened somewhere.

So it's about getting that mentality through: the idea that if we have eight microservices that are all containers and three of them go down, it's not a four-alarm fire that you have to wake up at 4 AM for. They'll probably restart, and the other five, if we've done our capacity planning, should handle the load; you can wake up at 9 AM and ask, okay, what happened, what's broken? We are still constantly in the process of making this transition.

At the same time we were also focusing on the strategic solution. We did a bunch of technology assessments and landed on Triton and Joyent as the future state for our containers. We also wanted to get to a place where we were less reliant on Puppet for config management. We've gotten to a point now where all Puppet does is bring a config file to the Docker host; we connect that config directory to volumes in the containers, and that's where the config management lives, again with the idea of never altering the containers.
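In Compose terms that pattern looks roughly like this (paths and names are hypothetical): Puppet owns the directory on the host, and the container just mounts it.

```yaml
# Config-as-volume sketch: Puppet drops environment config into a host directory,
# and the container mounts that directory read-only instead of baking config
# into the image.
product-service:
  image: artifactory.example.com:5000/product-service:1234
  volumes:
    - /etc/hbc/product-service:/opt/app/config:ro
```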

We also want to get to a point where we don't have to rely on Puppet recipes anymore. The Puppet node files got kind of insane with the number of variables we were carrying for config management, so we also landed on moving to Consul and key/value pairs.
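The Consul side of that is just key/value reads and writes over Consul's HTTP API; a hypothetical sketch of the kind of keys that would replace those Puppet variables:

```sh
# Store a per-environment setting in Consul's KV store (hypothetical key and host).
curl -X PUT -d '9001' http://consul.example.com:8500/v1/kv/product-service/prod/port

# Read it back at startup instead of parsing a Puppet-managed file.
# Values come back base64-encoded in JSON; ?raw returns just the value.
curl http://consul.example.com:8500/v1/kv/product-service/prod/port?raw
```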

But again, selling all of this to leadership is not that easy; selling the idea of container-native infrastructure to someone who has no idea what a container is, or what a hypervisor is, is not an easy task. I made all kinds of graphs. Part of the reason I don't have anything up here is that aesthetically I'm terrible; I'm not a front-end developer, I'm a back-end infrastructure engineer. You don't want to give me access to Visio and the color wheel. It's bad. But I made all these designs and graphs, and I partnered with other tech leadership to sell this idea to other members of the leadership team.

We brought in the big guns: we partnered with actual technology leaders, with Joyent and with people from Consul and so on, to bring this idea to our leadership. Again, the idea was embracing a better way to sell the technology, to get the culture to embrace it, and it's been a huge success. The transition from native bare-metal applications to containers was a huge success. All future apps and projects are being designed with containers in mind and will use containers and the surrounding technology extensively.

Build-wise, we've gotten to the point where we even couple our monitoring with the containers; New Relic is in there. We're also getting to a point where, when a container joins container orchestration, it will hopefully create or remove Nagios alerts. And we've moved to a spot where we've actually containerized New Relic and nginx.

We started using nginx for software load balancing, and that is fully containerized in production with CI built around it. We'd like to get to a point where we don't have to modify it when we have either a new microservice or a new member of an existing microservice. Right now it's kind of clunky, right? We have to redeploy nginx and add the new member into the conf file.
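The conf file in question is essentially a static upstream block; a simplified, hypothetical version of what has to be edited and redeployed whenever a member is added:

```nginx
# Hypothetical upstream: every container serving this microservice is listed by
# hand, which is why adding a member currently means rebuilding and redeploying nginx.
upstream product_service {
    server 10.10.1.11:9001;
    server 10.10.1.12:9001;
}

server {
    listen 80;
    location /product-service/ {
        proxy_pass http://product_service/;
    }
}
```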

We'd like to get to a point where that scales automatically.
So, that's what's ahead. I want to thank everybody for listening to my trials and tribulations. Thank you to Joyent and the rest of the sponsors for having me here, and I want to thank the HBC Digital teams, who are extraordinary people and a joy to work with. Some of the development teams and my engineering teams are here, and they're excellent. Thank you, guys.

>> [APPLAUSE]

Speaker:

Matthew Pick: Technical Manager, HBC Digital (Saks 5th Ave)