Let’s get ready to rumble!
Black Friday and Cyber Monday: who hasn’t heard of these bargain days that mark the start of the Christmas season? We have all seen the news articles and videos of people all but storming a shop (and sometimes fellow customers) to get the best deals. Black Friday originated in the US, but it is now also very popular in Europe. Marketing then added Cyber Monday, which extends the Black Friday bargains to online shops. All great for us as customers, but what is the effect on your webshop? Especially now, during Corona, when the number of customers allowed in a physical store is limited. One thing is for sure: luckily we can expect fewer injuries from bargain attacks. But what about the online shop? Is your infrastructure prepared? Good preparation is key during these periods. Customers still expect a good experience when buying online, and that covers not only the actual shopping but also payment and the logistics in the backend. Most trouble occurs at peak moments, when the system is overwhelmed by traffic and eventually crashes.
But what can we do as engineers? SREs in particular have to be prepared and take precautions in collaboration with the DevOps teams. Below is my approach to building a solid, performant and reliable system.
Key differentiators are still:
- Responsive website
- Great user experience
- Accurate stock
- Visible and fast customer service support
- Simple and guided order process
- Stable and secure payment provider
Business plan / approach
It all starts with aligning with the business: identify its sales goals and marketing strategy.
- Ask what target volume of customers the business wants to attract. This already gives you a good indication of how much load to expect. Large discounts and exclusive deals in particular tend to attract many customers.
- Identify the various ways of attracting customers. What is the exact marketing approach and which campaigns are planned? Which strategies and digital channels will be used? Think of a customer mailing, social media, a special landing page or the webshop itself. Targeted campaigns can give insight into when peak moments can be expected.
Learn from the past
Sometimes it is good to start with some reflection. Just as you learn from failure, look at the figures of the last two years. Start by consulting your website statistics. Most companies use a tool such as Google Analytics or SiteSpect, which provides metrics on page loads, top searches, top products, peak moments and successful payments. For better insight into the application, include data from APM tools like Dynatrace, Elastic or New Relic in your analysis. These give a good view of the user experience. For cloud native environments, look into Prometheus metrics. For online applications, pay extra attention to the golden metrics: latency, traffic (requests per second), errors (failure rate) and saturation (how full the service is).
Did your company experience issues during a previous season? Look up the incident records and benefit from their post-mortems. In particular, check how the issues were solved and whether the underlying risks still exist.
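To make the golden metrics mentioned above concrete, here is a minimal Python sketch that summarises one observation window of request data. Everything in it is an assumption for illustration: the `Request` record, the function names, the choice of the 95th percentile for latency, counting only 5xx responses as errors, and using busy workers as the saturation measure. In practice your APM tool or Prometheus would give you these numbers directly.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # response time of the request
    status: int         # HTTP status code returned to the customer


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Summarise one observation window into the four golden metrics."""
    durations = sorted(r.duration_ms for r in requests)
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": busy_workers / total_workers,
    }


# Made-up sample: ten requests observed in a five-second window.
sample = [Request(100 + 10 * i, 200) for i in range(9)] + [Request(190, 503)]
print(golden_signals(sample, window_seconds=5, busy_workers=40, total_workers=50))
```

Comparing these four values for last year's peak hour with a normal day gives a first feeling for how far the system was stretched.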
Identify a solid baseline
Now that we have good input, we can define our capacity plan. Yes, the cloud gives us elasticity, but we still need a solid baseline so that we neither waste resources nor disappoint customers. Middleware services and the underlying infrastructure (such as queueing and search systems, databases, instances, firewalls or network bandwidth) still have capacity limits we can hit. In addition, scaling out horizontally takes time, and your system needs to handle that delay gracefully without annoying customers with ugly HTTP error codes. As a cherry on top, you will identify potential bottlenecks to fix and verify that auto-scaling lets you handle peak loads easily.
The first step is to take the metrics and try to predict the initial baseline. In the simplest case, we extrapolate the current number of users and throughput to the expected load, based on what we learned from the previous figures. Are we done? No, this is just the start.
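The simple extrapolation described above could look like the following Python sketch. All names and numbers are hypothetical: the growth factor, the campaign uplift, the throughput per instance and the headroom percentage are inputs you would take from your own statistics and from the business, not values this article prescribes.

```python
import math


def estimate_baseline(last_peak_rps, growth_factor, campaign_uplift,
                      rps_per_instance, headroom=0.25):
    """Extrapolate last year's peak and size a baseline with some headroom.

    Returns the expected peak in requests per second and the number of
    instances needed to serve it, including the safety margin.
    """
    expected_peak = last_peak_rps * growth_factor * campaign_uplift
    target_rps = expected_peak * (1 + headroom)
    instances = math.ceil(target_rps / rps_per_instance)
    return expected_peak, instances


# Made-up figures: 1200 rps last year, 25% organic growth,
# 50% extra traffic expected from campaigns, 150 rps per instance.
expected_peak, instances = estimate_baseline(
    last_peak_rps=1200,
    growth_factor=1.25,
    campaign_uplift=1.5,
    rps_per_instance=150,
)
print(f"expected peak: {expected_peak} rps, baseline: {instances} instances")
```

The result is only a starting point for the load tests, which show whether the throughput assumed per instance actually holds for the whole customer journey.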
To identify a solid baseline and prove a certain level of performance, it is good practice to run several load test scenarios. Don’t rely on a single functional scenario, and try to include all services and components that can affect the customer journey, especially business-critical parts like the webshop, payment and the logistics system. Increase and/or optimise resources until you reach the minimal baseline, and ensure that your system can easily scale above it without any customer disruption. Use your testing and monitoring tools to measure all the metrics, especially user satisfaction (Apdex) and the golden metrics you agreed on (SLIs/SLOs).
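As a reminder of how the Apdex score mentioned above is calculated: requests at or below a threshold T count as satisfied, requests between T and 4T count as tolerating (and weigh half), and everything slower counts as frustrated. The Python sketch below applies that formula to a list of response times from a load test; the function name, the threshold of 0.5 seconds and the sample data are assumptions for illustration.

```python
def apdex(response_times_s, threshold_s):
    """Apdex score: (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(response_times_s)


# Made-up response times in seconds from one load test run.
measured = [0.2, 0.3, 0.4, 0.5, 0.9, 1.5, 2.0, 2.5, 3.0, 0.1]
score = apdex(measured, threshold_s=0.5)
print(f"Apdex: {score:.2f}")  # 5 satisfied, 3 tolerating, 2 frustrated
```

Tracking this score per load test scenario makes it easy to see at which load level customers start to notice the slowdown.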
Prove resilience
But what if your system still breaks under heavy load, cloud infrastructure errors or unidentified single points of failure? Can it recover smoothly without any major issues? This is the most forgotten stage! Everything seems to be ready, but under heavy load or during cloud provider issues your system fails. To prevent this situation we have resilience testing, better known as chaos engineering. Use information from past incidents and known critical services to decide what to validate. Write down the hypothesis and define your experiment to prove actual resilience and to validate the detection systems that are in place. For this you can use, for example, Chaos Toolkit or Gremlin.
After all, we want a reliable and performant system.
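The idea of writing down a hypothesis and then running an experiment against it can be sketched in a few lines of Python. Note that this is not the Chaos Toolkit or Gremlin API; it is a toy model with hypothetical names (`PaymentGateway`, `checkout`, `run_experiment`) that only shows the shape of an experiment: check the steady state, inject a fault, check again, and always roll back.

```python
class PaymentGateway:
    """Hypothetical stand-in for a payment provider client."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def charge(self, amount):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        return f"paid {amount} via {self.name}"


def checkout(amount, gateways):
    """Try each gateway in order and return the first successful receipt."""
    for gateway in gateways:
        try:
            return gateway.charge(amount)
        except ConnectionError:
            continue
    raise RuntimeError("all payment gateways failed")


def run_experiment(steady_state, inject_fault, rollback):
    """Check the hypothesis before and during the fault, then roll back."""
    if not steady_state():
        return "aborted: system was not in steady state"
    inject_fault()
    try:
        return "hypothesis held" if steady_state() else "hypothesis failed"
    finally:
        rollback()


primary, backup = PaymentGateway("primary"), PaymentGateway("backup")


def customers_can_pay():
    """Hypothesis: a checkout succeeds even if one provider is down."""
    try:
        return checkout(10, [primary, backup]).startswith("paid")
    except RuntimeError:
        return False


result = run_experiment(
    steady_state=customers_can_pay,
    inject_fault=lambda: setattr(primary, "healthy", False),
    rollback=lambda: setattr(primary, "healthy", True),
)
print(result)
```

In a real experiment the fault would be injected into infrastructure rather than a Python object, and the steady-state check would query your monitoring, which at the same time validates that your detection systems notice the problem.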
Dry-run “lean process”
When the moment is there, be prepared. Ensure that, a week before Black Friday, you have validated the runbooks and dry-run the likely scenarios. This helps to uncover delays that could occur during a real incident. A dry run in this context means a rehearsal with all involved teams, in which you practise the various situations that can happen and how to respond to them. Ideally all involved teams participate: SRE, DevOps teams, the business and major incident management. Everyone should be told which communication channels matter most (like Slack or Teams), how to use and interpret the various observability tools and dashboards, and how to handle specific alerts (OpsGenie, PagerDuty) when they fire.
The end result should be an optimal process with clear runbooks for observing and monitoring the system and, if necessary, for immediately triaging and mitigating incidents.
Judgement day
Always start with a briefing, as this helps to underline the importance of the day. It could be a brief message during the standup. Teams should be confident that the SREs are closely watching their detection systems and are ready for incident response. Chaos should be avoided, so good communication and alignment during an outage are crucial. Also inform non-IT staff, such as customer service, who are in direct contact with customers through their known service channels and/or chats.
I hope your systems all handled the load well this year! Otherwise this article can help you next year with planning, organising and setting up a proper approach.
My last wish for you: buy your new gadget, survive Black Friday and Cyber Monday, and enjoy the holiday season!
If you have questions about this article or one of its topics, I’m happy to help!
#SRE #cloud #observability #monitoring #retailers #webshops #onlinecommerce #sales #business