Here’s the scene: your company was one of the pioneers in bringing retail to the internet. Your store has millions of monthly visitors. During your massively promoted sales, you see HUGE spikes in traffic. Every sale counts. And everything is running on one giant Java application.
This story isn’t fiction; it’s Robin Glen’s job. In this interview, he tells us how his dev team re-architected the massive designer fashion site Net-a-Porter, and about the organizational changes their technology teams made to do it.
What’s the focus of the tech team at Net-a-Porter?
So, the tech team is split into cross-functional, customer-focused sub-teams; my team is made up of full-stack developers, test specialists, product owners and delivery managers. We’re responsible for the listing pages, product pages and site search, which is basically the entire product catalog. Right now our team is replatforming our architecture into micro-services and working on this concept of “headless commerce”. This allows the entire e-commerce platform to be front-end agnostic.
How did the decision to move to “headless commerce” come about?
When Net-a-Porter was started, our e-commerce site was one big giant Java application, which is probably quite a common situation. Our delivery rate was too slow and we wanted to move towards continuous integration, but we had this big application that didn’t have great testing around it. We needed to improve and evolve.
Was there a specific problem you were trying to solve?
We get a huge amount of web traffic whenever we launch a sale. There’s a massive amount of physical hardware you need to have for a short-term sale, which results in a lot of redundancy during normal times of traffic. This was obviously inefficient and a pretty big problem. We decided to break out the sales part of the site into a listing application in the cloud so we could have horizontally scalable applications to handle sale traffic.
This was a huge success and kicked off the concept of DevOps at Net-a-Porter: developers were doing the ops and we were building full-stack applications ourselves. This really opened the floodgates. As soon as the sale application was out, we began to look at every part of the site to see how we could make it scale horizontally. We are still on that journey.
How’s it going so far?
We’re still pushing towards Continuous Integration. We can do multiple releases a day. We don’t need regression testing, so we can actually move forward a lot faster. At a company of our size, innovation obviously gets a lot more expensive because you have to stay so far ahead, so all these things are helping us to keep moving forward.
Another interesting thing is the cultural shift that has occurred as a result of this change. It’s promoted a lot of ownership. Everyone in our team is able to quickly iterate and find out what’s been successful and not successful. The developers who write the applications are now also responsible for supporting them and for setting up monitoring alerts to make sure things are running.
It sounds like monitoring and alerting are pretty critical to ensure things are running. What role do performance metrics and analytics play in that process?
My colleague Matthew Green and I were tasked with answering the question “How can we make the website more performant?”. But in order to do that, we first had to answer the question “How can we measure our current performance?”. You can’t improve what you can’t measure.
So we started tracking browser performance metrics and reporting that data to Keen IO. We even extended this to collect performance metrics on client-side API calls. So now if I want to know how long it’s taking to see what’s in someone’s basket, or how long it takes to add something to the basket, we can do that.
As our experience and understanding grew, we started to add more granular metrics to help us identify and diagnose issues quickly.
For example, when the website throws a 5xx error we can dive into real customer errors and I can say, “This page errors the most. In this country, that is where the problem is happening.” We use Keen to track any type of error with complete granularity. It enables us to identify errors in a customer’s experience across all parts of the site. Of course, with an e-commerce site, errors can be very expensive if you don’t diagnose and fix them quickly.
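To make the “which page, in which country” question concrete, here is a minimal sketch of that kind of breakdown. It is not Net-a-Porter’s code and does not use Keen’s API; the event shape and the field names (`status`, `page`, `country`) are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical error events; the field names are illustrative assumptions,
# not the schema Net-a-Porter actually sends to Keen.
events = [
    {"status": 503, "page": "/product", "country": "GB"},
    {"status": 500, "page": "/product", "country": "GB"},
    {"status": 502, "page": "/search", "country": "US"},
    {"status": 404, "page": "/product", "country": "GB"},  # not a 5xx, ignored
]


def top_error_sources(events, limit=3):
    """Count 5xx events per (page, country) pair, most frequent first."""
    counts = Counter(
        (event["page"], event["country"])
        for event in events
        if 500 <= event["status"] <= 599
    )
    return counts.most_common(limit)


for (page, country), count in top_error_sources(events):
    print(f"{page} in {country}: {count} server errors")
```

Grouping on the pair rather than on each field separately is what lets you say both “this page” and “this country” in the same breath.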
What makes my job easier is knowing: Are things working? Are they being used? If they are not, we can alert on them. We can do out-of-hours calls like, “Okay, the 500 errors have gone up. Send a text to the team,” or “The website is running slow. Send a text.”
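An alert rule like “the 500 errors have gone up, send a text” can be sketched as a simple threshold check. The thresholds and function names below are hypothetical, chosen only to show the shape of the rule; the interview does not describe the actual alerting setup.

```python
def should_alert(total_requests, server_errors,
                 max_error_rate=0.02, min_requests=100):
    """Return True when the 5xx rate in a time window exceeds the threshold.

    Both default values are assumptions for illustration. The min_requests
    guard avoids paging the team over a handful of requests at quiet times.
    """
    if total_requests < min_requests:
        return False
    return server_errors / total_requests > max_error_rate


def alert_text(total_requests, server_errors):
    """Build the text message that would be sent to the on-call team."""
    rate = server_errors / total_requests
    return f"5xx error rate is {rate:.1%} ({server_errors} of {total_requests} requests)"


# Example window: 1,000 requests, 50 of which returned a 5xx status.
if should_alert(1000, 50):
    print(alert_text(1000, 50))
```

Alerting on a rate rather than a raw count keeps the rule meaningful during a sale, when traffic and therefore absolute error counts are much higher than normal.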
Why did you choose Keen IO as your monitoring solution?
Keen gave me the granularity I wanted, complete control. How we use the data, visualize the data, it’s all in our control. We were doing a lot of work to get the metrics we needed, but the monitoring tools we had were not granular enough to tell us if our improvements were working. The data was not democratized. I wanted to be able to monitor anything and make it readily available.
What we’re using Keen for now is monitoring granular performance and availability, and for that it’s been perfect. We have a lot of ideas about what we want to do in the future, and Keen’s openness gives us endless potential.
As an example, we tag all of our events with the build number of each application. We know how long it takes for a ticket to move through our workflow, we know how long it takes to run our build and test scripts, and now we know how long it takes for customers to start interacting with new features. This data can give accurate development estimates to our business stakeholders.
So we can now accurately say how long a story or feature has taken from inception to real customer interaction.
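As a rough sketch of how build-tagged events make that measurement possible: find the earliest customer event carrying a given build number and subtract the time the ticket was created. The event fields (`build`, `timestamp`), the build number and the dates are all hypothetical, not taken from Net-a-Porter’s systems.

```python
from datetime import datetime


def first_interaction_by_build(events):
    """Map each build number to the earliest customer event tagged with it."""
    first_seen = {}
    for event in events:
        build = event["build"]
        timestamp = event["timestamp"]
        if build not in first_seen or timestamp < first_seen[build]:
            first_seen[build] = timestamp
    return first_seen


def lead_time(ticket_created, build, events):
    """Time from ticket creation to the first customer event for a build.

    Returns None if no customer has interacted with that build yet.
    """
    first_seen = first_interaction_by_build(events)
    if build not in first_seen:
        return None
    return first_seen[build] - ticket_created


# Hypothetical data: a ticket opened on 1 March, shipped in build 412.
events = [
    {"build": 412, "timestamp": datetime(2016, 3, 5, 14, 30)},
    {"build": 412, "timestamp": datetime(2016, 3, 5, 11, 0)},
    {"build": 411, "timestamp": datetime(2016, 3, 2, 9, 0)},
]
print(lead_time(datetime(2016, 3, 1, 9, 0), 412, events))
```

Because the build number travels with every event, nothing else is needed to connect a deployment to real customer behaviour.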
What are you most excited about for the coming few months at Net-a-Porter tech?
We are market leaders in luxury fashion e-commerce and we’re on the path to making our customer experience even better. All of this re-platforming is going to unlock innovation. I want our customer experiences to feel as slick as a native app.
I’m a big believer in the web as a platform. The reason that it took off originally is because it’s frictionless. You don’t need to find a website in a store, buy it, download it, install it and launch it before you use it. I totally understand why native apps took the lead: they felt more responsive, they work offline, they give you access to the device’s hardware, and overall they gave users a better mobile customer experience. This even led some people to proclaim “the web is dead”, and for a while maybe it was. Browser technology was in flux, no one could agree where mobile web was going, and it stagnated. We fell behind.
This however seems to be changing. New web APIs are coming through rapidly, and thanks in part to Google’s great work, there’s no reason that we shouldn’t be able to create native-feeling apps on the web right now.
My goal is to show it’s possible to build, test, deploy, and monitor “progressive web apps” in production, at scale, for a large e-commerce website.
Robin will be joining us in San Francisco on May 17th to dive deeper into how Net-a-Porter scaled their developer culture. If you’re interested in hearing more about their team, the infrastructure they’ve built, and their experience at Google I/O, be sure to register for our event on May 17th. Hope to see you there!