EPOS resilience: planning for the Saturday afternoon, not the Tuesday morning
Most EPOS faults are tolerable on a quiet Tuesday. The same fault on a Saturday afternoon in December is a different conversation. Design for the worst trading hour, not the average.
Retailers tend to specify EPOS reliability based on uptime averages. Ninety-nine point something, measured across the year, looks reassuring on a slide. That's the wrong question. The right one is: when this goes wrong at the worst possible moment, what happens in the next twenty minutes?
Twenty minutes is roughly how long customers will stand in a stalled queue on a busy Saturday afternoon. After that they walk, and they don't come back the same day. The financial damage is concentrated in those twenty minutes; the rest of the week barely matters.
Where the time goes when it goes wrong
A typical EPOS outage in a multi-site retailer looks like this. Something breaks at 14:07. The store manager tries the obvious things until 14:14. They ring the helpdesk at 14:15. The helpdesk asks for screenshots and a description. The ticket gets logged at 14:22. The right engineer picks it up at 14:35. By the time anyone is looking at the actual fault, the queue is out of the door and the day is broken.
Most of the lost time is in the human handoff, not the engineering. Better infrastructure helps. Better escalation helps more.
Three concrete asks
First, every store should have a tested offline mode for both EPOS and card payments. Card terminals that can take a payment without a live link, EPOS that can record a sale and reconcile later, and staff who have actually practised it. Untested offline modes are not offline modes; they're hopeful diagrams.
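The mechanics underneath that are not complicated. Here is a minimal sketch of the store-and-forward pattern an offline mode typically relies on, with a local append-only journal standing in for whatever your EPOS actually uses (the journal path and the post_to_head_office hook are illustrative, not any vendor's API):

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Local journal for sales taken while the live link is down.
# OFFLINE_JOURNAL and post_to_head_office are illustrative names only.
OFFLINE_JOURNAL = Path("offline_sales.jsonl")


def record_sale(basket: list[dict], online: bool, post_to_head_office) -> str:
    """Record a sale; journal it locally if the live link is unavailable."""
    sale = {
        "sale_id": str(uuid.uuid4()),
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "lines": basket,
    }
    if online:
        post_to_head_office(sale)
    else:
        # Append-only journal: nothing is lost if the till restarts mid-outage.
        with OFFLINE_JOURNAL.open("a") as f:
            f.write(json.dumps(sale) + "\n")
    return sale["sale_id"]


def reconcile(post_to_head_office) -> int:
    """Replay journalled sales once connectivity returns."""
    if not OFFLINE_JOURNAL.exists():
        return 0
    replayed = 0
    for line in OFFLINE_JOURNAL.read_text().splitlines():
        post_to_head_office(json.loads(line))  # made idempotent upstream by sale_id
        replayed += 1
    OFFLINE_JOURNAL.unlink()
    return replayed
```

The practising matters more than the pattern: staff need to know which button starts this, not just that it exists.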
Second, the network into every store should fail over to 4G or 5G automatically, and you should know it's happened before the store manager does. A monitored dual-circuit setup with a small SD-WAN appliance covers most of this for a sensible monthly cost.
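Knowing it has happened is the part that usually gets skipped. A rough sketch of the check, assuming a probe host that is only reachable over the fixed-line circuit and a webhook on your monitoring platform (both addresses are placeholders):

```python
import socket
import urllib.request

# Illustrative endpoints: PRIMARY_PROBE is a host only reachable over the
# fixed-line circuit; ALERT_WEBHOOK is whatever your monitoring platform exposes.
PRIMARY_PROBE = ("203.0.113.10", 443)
ALERT_WEBHOOK = "https://monitoring.example.internal/alerts"


def primary_circuit_up(timeout: float = 3.0) -> bool:
    """Cheap reachability check for the fixed-line path."""
    try:
        with socket.create_connection(PRIMARY_PROBE, timeout=timeout):
            return True
    except OSError:
        return False


def check_store(store_id: str) -> None:
    """Raise an alert the moment a store is trading over its 4G/5G backup."""
    if primary_circuit_up():
        return
    body = f'{{"store": "{store_id}", "state": "failed over to cellular"}}'
    req = urllib.request.Request(
        ALERT_WEBHOOK,
        data=body.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)  # head office knows before the store does


if __name__ == "__main__":
    check_store("042")
```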
Third, the escalation path from till to head office shouldn't go through a single phone number that nobody answers on weekends. There should be an explicit P1 path that reaches a named on-call engineer within two minutes, twenty-four seven, and the store managers should know it exists.
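The routing rule itself is trivial, which is rather the point. A sketch, with a hard-coded rota standing in for whatever your paging tool provides (the names and numbers are invented):

```python
from datetime import datetime, timezone

# Illustrative on-call rota: in practice this comes from your paging tool,
# not a hard-coded dict.
ON_CALL_ROTA = {
    "weekday": {"name": "A. Engineer", "phone": "+44 7700 900001"},
    "weekend": {"name": "B. Engineer", "phone": "+44 7700 900002"},
}


def current_on_call(now: datetime | None = None) -> dict:
    now = now or datetime.now(timezone.utc)
    slot = "weekend" if now.weekday() >= 5 else "weekday"
    return ON_CALL_ROTA[slot]


def raise_incident(store_id: str, summary: str, priority: str, page) -> None:
    """P1s skip the general queue and page the named on-call engineer directly."""
    if priority == "P1":
        engineer = current_on_call()
        page(engineer["phone"], f"P1 at store {store_id}: {summary}")
    else:
        # Everything else can go through the normal ticket queue.
        print(f"Ticket logged for store {store_id}: {summary}")


if __name__ == "__main__":
    raise_incident("042", "tills cannot take card payments", "P1",
                   page=lambda phone, msg: print(f"PAGE {phone}: {msg}"))
```

The hard part isn't the code; it's agreeing who the named engineer is and making sure store managers know the path exists.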
Monitoring per store, not per estate
Estate-wide uptime numbers are useful for a board pack and useless for trading. What matters is per-store visibility. Is store 042 currently degraded? Did store 113's payment terminal lose its connection ten minutes ago? Is there a pattern across stores in the same region that suggests a circuit issue?
Mature retailers have a single screen in head office that shows the trading state of every store in near real time, with alerting tuned tightly enough that the support partner is already moving before the store manager has noticed.
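The underlying model is simple: per-store heartbeats per component, a staleness threshold, and a view that sorts degraded stores to the top. A minimal sketch, with a two-minute threshold as an assumption you would tune per component:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

HEARTBEAT_STALE = timedelta(minutes=2)  # illustrative; tune per component


@dataclass
class StoreHealth:
    store_id: str
    # component name -> last heartbeat received (EPOS, payments, broadband...)
    heartbeats: dict[str, datetime] = field(default_factory=dict)

    def beat(self, component: str) -> None:
        self.heartbeats[component] = datetime.now(timezone.utc)

    def degraded_components(self) -> list[str]:
        now = datetime.now(timezone.utc)
        return [c for c, seen in self.heartbeats.items()
                if now - seen > HEARTBEAT_STALE]

    def trading_state(self) -> str:
        down = self.degraded_components()
        return "OK" if not down else f"DEGRADED ({', '.join(down)})"


def estate_view(stores: list[StoreHealth]) -> None:
    """The 'single screen': one line per store, degraded stores first."""
    for store in sorted(stores, key=lambda s: s.trading_state() == "OK"):
        print(f"store {store.store_id}: {store.trading_state()}")
```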
Black Friday and the rest of December
Peak trading periods are not a good time to discover that the platform doesn't scale, the firewall rules are wrong, or a payment provider's TLS certificate is about to expire. They are an even worse time to be planning capacity.
The retailers who get through peak well have done their hard thinking in September. Load testing the ecommerce platform, capacity-checking the network, running a peak readiness drill with the EPOS partner, and freezing changes from mid-November onwards. None of this is exotic. It just doesn't happen by accident.
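The certificate check in particular is mechanical enough to script as part of the September run-through. A sketch that warns when a payment endpoint's certificate is inside a renewal window (the hostname and the 45-day threshold are illustrative):

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 45  # enough runway to renew before a mid-November change freeze


def cert_days_remaining(host: str, port: int = 443) -> int:
    """Days until the TLS certificate presented by host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), timezone.utc)
    return (expires - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    # Illustrative hostname: substitute your payment provider's live endpoint.
    days = cert_days_remaining("payments.example.com")
    if days < WARN_DAYS:
        print(f"Certificate expires in {days} days - renew before the freeze")
```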
The vendor stack is your stack
Retail tech is rarely a single vendor. EPOS, payments, ecommerce, stock and head office all come from different places. When something breaks, finger-pointing between vendors can easily eat the twenty minutes that mattered.
Having a partner who sits across the stack and takes accountability for the user-visible outcome is worth more than any individual SLA. The customer doesn't care which vendor's component failed; they care whether they could pay.
What good looks like
A monitored estate where store managers know within a minute that something has degraded, head office knows before the customer complaint arrives, and the support partner is already working on it. Offline modes that have been used in anger, not just configured. An escalation path that's measured in minutes, not hours.
That's not exotic. It's just deliberate. The retailers that have it didn't get there by buying a better EPOS. They got there by treating the trading day as the design constraint.
Communicating to head office and stores
One underrated element of resilience is the communication layer. When a store is degraded, the store manager needs to know what's happening, what the workaround is, and when normal service will resume. Head office needs the same information at an aggregated level.
A simple internal status page, updated automatically by the monitoring platform, removes the worst of the noise. Store managers stop ringing the helpdesk for updates. Trading directors stop ringing IT directors. The support engineers can focus on the fault rather than the phone.
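The page itself can be almost embarrassingly simple. A sketch that renders per-store status to a static internal page each time the monitoring platform updates (the feed here is a hard-coded dict purely for illustration):

```python
from datetime import datetime, timezone
from pathlib import Path

# store_id -> (state, message shown to store managers and head office).
# In practice this feed comes from the monitoring platform, not a dict.
STORE_STATUS = {
    "042": ("DEGRADED", "Card payments on 4G backup - offline mode in use"),
    "113": ("OK", "Trading normally"),
}


def render_status_page(statuses: dict, out: Path = Path("status.html")) -> None:
    """Write a simple internal status page; re-run on every monitoring update."""
    updated = datetime.now(timezone.utc).strftime("%H:%M UTC")
    rows = "\n".join(
        f"<tr><td>{sid}</td><td>{state}</td><td>{msg}</td></tr>"
        for sid, (state, msg) in sorted(statuses.items())
    )
    out.write_text(
        f"<html><body><h1>Store trading status (updated {updated})</h1>"
        f"<table>{rows}</table></body></html>"
    )


if __name__ == "__main__":
    render_status_page(STORE_STATUS)
```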
A final habit worth building: a weekly trading-day review with the IT partner during peak. Fifteen minutes, every Monday, looking at the previous week's incidents per store, the patterns, and the one thing to fix this week.
That cadence is the difference between an estate that improves and one that just survives. Over a peak season it compounds into a meaningfully calmer trading day.