Few things can tarnish the reputation of a public sector organisation faster than systems failing at the very moment citizens need them most.
Whether it’s the annual slowdown of the HMRC Self Assessment portal in the run-up to deadline day, or learner drivers finding that securing a driving test slot is often harder than the test itself, performance failures become headlines.
For public sector IT leaders, the mantra has become “scale or fail.” Ensuring that digital services can handle peak loads and remain reliable under stress is absolutely critical.
That’s why modern QA strategies put a heavy emphasis on performance testing, load simulation, and resilience engineering. By testing how systems behave at “citizen scale”, with thousands or even millions of users, and under adverse conditions, teams can identify weaknesses and fortify services before they go live.
The goal is simple: plan and test now so there aren’t any unpleasant surprises when real demand hits.
Life is unpredictable. And this creates spiky demand
By their nature, Government services often serve tens of millions of citizens, and usage can fluctuate wildly.
Unlike a commercial application with well-understood peaks and troughs, public service applications often see quiet periods followed by a sudden deluge, triggered by a political, societal or life event, or an approaching deadline.
For example, when Covid restrictions were lifted, demand for citizen services dramatically surged on top of the backlogs the restrictions created, catching many services off guard. Similarly, voter registrations surged ahead of the 2024 general election; land registry transactions broke records ahead of stamp duty rises; and applications for pension credit jumped 145% in the four months following the government's announcement of cost-cutting plans.
It means that designing for the “average” load is not enough. Citizen services must be optimised to handle worst-case scenarios without crashing or slowing to a crawl.
This is where performance testing comes in. QA teams create load test scenarios that simulate large numbers of concurrent users and transactions. Using load testing tools or cloud-based test harnesses, they generate traffic that hits the system just like real users would, simulating activity through clicking, searching, submitting forms, and so on.
This virtual user load is progressively ramped up to see at what point the system’s response time becomes unacceptable or errors start occurring.
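As a flavour of what this looks like in practice, here is a minimal sketch of a load test script using Locust, one of the open source tools discussed later. The endpoints and the journey are illustrative placeholders rather than a real service.

```python
# citizen_journey.py - a minimal Locust sketch; endpoints are illustrative placeholders.
from locust import HttpUser, task, between

class CitizenUser(HttpUser):
    # Pause between actions like a real user would, rather than hammering the server
    wait_time = between(2, 8)

    @task
    def apply_for_service(self):
        # A simple end-to-end journey: browse, search, then submit a form
        self.client.get("/start")
        self.client.get("/search", params={"q": "eligibility"})
        self.client.post("/applications", json={"reference": "TEST-0001"})
```

Running something like `locust -f citizen_journey.py --host https://staging.example.gov.uk --headless --users 10000 --spawn-rate 100` would ramp towards 10,000 concurrent virtual users at 100 new users per second, while the tool records response times and error rates for each step.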
These tests often reveal bottlenecks. They could, for example, reveal that a central database can’t handle more than X queries per second, or an authentication service rate-limits after Y logins per minute.
By discovering these limits in a controlled test environment, the team can then address them long before actual users encounter them.
One best practice is to test not only expected peak load but surge beyond peak. For instance, if you forecast 100,000 concurrent users at peak, you might test to 200,000 or more to see how the system behaves. This builds a cushion.
Indeed, many teams now test to the system’s breaking point, intentionally finding how much load it takes to cause failure, so they know the true capacity. They can then set scaling rules or procurement decisions accordingly.
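One way to express this in a tool such as Locust is a custom load shape that steps well past the forecast peak. The figures below are placeholders to be replaced with your own forecasts.

```python
# step_beyond_peak.py - illustrative step-load shape; user counts and timings are placeholders.
from locust import LoadTestShape

class StepBeyondPeak(LoadTestShape):
    # (end time in seconds, target users, spawn rate) - steps from half of the
    # forecast peak up to double it, to find where the system actually breaks.
    stages = [
        (600, 50_000, 500),     # 50% of forecast peak
        (1200, 100_000, 500),   # forecast peak
        (1800, 150_000, 500),   # 150% of peak
        (2400, 200_000, 500),   # 200% of peak - the "cushion" test
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None  # all stages complete: stop the test
```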
In cloud-hosted systems, performance testing also helps tune auto-scaling configurations. It means that teams can observe at what load new instances spawn, whether they spawn fast enough, and whether there’s any dip in performance during that scale-up time.
These are things that can often only be learnt by pushing the system in pre-production environments.
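If the service runs on a cloud provider’s auto-scaling groups (AWS is assumed here purely for illustration), a small script run alongside the load test can record how quickly new instances actually come into service. The group name below is a placeholder.

```python
# watch_scale_up.py - records in-service instance counts while a load test runs.
# Assumes an AWS auto-scaling group; the group name is a placeholder.
import time
import boto3

autoscaling = boto3.client("autoscaling")

def watch_scale_up(group_name: str, duration_minutes: int = 15) -> None:
    deadline = time.time() + duration_minutes * 60
    while time.time() < deadline:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"][0]
        in_service = sum(
            1 for i in group["Instances"] if i["LifecycleState"] == "InService"
        )
        print(f"{time.strftime('%H:%M:%S')}  in-service instances: {in_service}")
        time.sleep(30)

if __name__ == "__main__":
    watch_scale_up("citizen-service-asg")
```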
A key insight from performance testing is that sometimes upstream or downstream systems are the choke points. A service might be designed and optimised to scale, but if it relies on an older legacy database or an external API with its own limits, the whole service is likely to fail under high load. That’s why comprehensive performance tests will include those dependencies or use realistic stubs for them.
For example, if a welfare application system were to check against a downstream NHS system, the test must account for how that NHS system performs under heavy concurrent checks. If the external dependency can only handle 50 requests/second, then the system design might need a request queue or caching to avoid overwhelming it.
Catching such issues in testing enables teams to implement workarounds, like graceful degradation or back-off strategies, to maintain overall service availability.
After all, it’s far better to discover that “Service X will throttle us after 100 calls/minute” in a planned test than during a real user surge.
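A minimal sketch of one such back-off strategy is shown below, assuming the dependency signals throttling with an HTTP 429 response; the retry limits and behaviour are illustrative.

```python
# backoff_client.py - exponential back-off with jitter for a rate-limited dependency.
import random
import time
import requests

def call_downstream(session: requests.Session, url: str, payload: dict,
                    max_retries: int = 5) -> requests.Response:
    for attempt in range(max_retries):
        response = session.post(url, json=payload, timeout=5)
        if response.status_code != 429:
            return response  # not throttled (success, or a genuine error to handle upstream)
        # Honour the dependency's Retry-After header if present, else back off exponentially
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else min(2 ** attempt, 30)
        time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids synchronised retries
    raise RuntimeError("Dependency still throttling after retries - degrade gracefully here")
```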
Load, stress or soak?
Performance testing isn’t a single monolithic practice. It comprises several techniques, each serving a slightly different purpose:
- Load Testing: This is the basic approach of applying a specific expected load to the system and measuring its performance. For instance, simulate 10,000 users using the system simultaneously and see if response times stay within 2 seconds. Load tests usually target normal peak conditions, essentially answering a “does this handle what we expect regularly?” question.
- Stress Testing: This approach pushes a system beyond normal loads to see how it behaves when stressed to or beyond its technical limits. The purpose is often to find the breaking point or to ensure the system fails gracefully. Perhaps, for example, the system should return an error message instead of just hanging. The stress test will answer questions like: what happens if 5x our peak users show up? Does the system crash, or does it start shedding load politely by returning a “please try again later” type message?
- Spike Testing: This variant of stress testing mimics scenarios where a sudden sharp increase in load is applied. It evaluates the system’s ability to scale up quickly, or handle the shock of a rapid and unplanned spike in traffic (a sketch of such a profile follows this list).
- Endurance/Soak Testing: This form of testing runs the system at high load for an extended period, often over many hours or days, to see if its performance holds up over time. It can unearth issues like memory leaks, slow memory bloat, or degradation caused by resource exhaustion that only manifests after a long period of usage. As guidance from GDS notes, soak tests under peak load can uncover long-term reliability problems and are crucial for services meant to be always-on, to ensure they can sustain heavy usage day after day.
- Configuration Testing: Here, different system configurations (like varying number of app server instances, different database sizes, etc.) are tested under load to find the optimal setup, or to validate scaling linearity.
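To illustrate the difference between a steady ramp and a spike, the sketch below (again using Locust’s load shape mechanism, with placeholder numbers) holds a quiet baseline and then jumps to many times that load at once.

```python
# spike_shape.py - illustrative spike profile; user counts and timings are placeholders.
from locust import LoadTestShape

class SuddenSpike(LoadTestShape):
    def tick(self):
        run_time = self.get_run_time()
        if run_time < 300:
            return 1_000, 100      # quiet baseline for five minutes
        if run_time < 900:
            return 20_000, 5_000   # sudden 20x jump, spawned as fast as possible
        if run_time < 1200:
            return 1_000, 100      # back to baseline - does the system recover?
        return None                # stop the test
```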
In public sector projects, teams often automate these tests and integrate them into pre-release pipelines. For example, before any major release, a performance test suite might automatically deploy the new version to a staging environment and simulate one hour of peak traffic to ensure no regressions in speed. The metrics collected (throughput, response times, error rates, resource utilisation) are compared to baseline. If the new version is slower or uses significantly more CPU, then a red flag is raised to investigate before approving the release.
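The comparison step can be a very small script in the pipeline. The sketch below assumes the test harness exports its summary metrics as JSON, with hypothetical field names such as p95_ms and error_rate.

```python
# compare_to_baseline.py - fails the pipeline on a performance regression.
# Assumes the load test exports summary metrics as JSON; field names are hypothetical.
import json
import sys

MAX_P95_REGRESSION = 1.10   # allow up to 10% slower at the 95th percentile
MAX_ERROR_RATE = 0.01       # allow up to 1% failed requests

def check(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    failures = []
    if current["p95_ms"] > baseline["p95_ms"] * MAX_P95_REGRESSION:
        failures.append(f"p95 regressed: {baseline['p95_ms']}ms -> {current['p95_ms']}ms")
    if current["error_rate"] > MAX_ERROR_RATE:
        failures.append(f"error rate too high: {current['error_rate']:.2%}")

    for failure in failures:
        print("PERFORMANCE REGRESSION:", failure)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1], sys.argv[2]))
```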
Testing for failure, not just load
The concept of resilience testing has gained traction as systems become more distributed and complex. Issues like network outages, hardware failures, or even cyberattacks like DoS attacks, can impact service continuity.
Resilience testing often overlaps with security and disaster recovery testing. For example, testers might simulate a denial-of-service flood on non-critical endpoints to see if the system’s defence mechanisms (rate limiters, firewalls) kick in and protect the core service. Or, in cloud environments, it could be as simple as randomly turning off a cluster node, in the style of Netflix’s Chaos Monkey, to verify that auto-healing and redundancy are effective.
The Netflix approach is now being adopted beyond the streaming provider, and is used in some government and defence contexts. The rationale is that you want to find latent issues under controlled conditions rather than during a real crisis. For instance, a chaos test might reveal that if the primary identity verification service is down, the whole application hangs because a timeout is too long or there’s no fallback. That insight can lead to adding a secondary verification service or a circuit breaker that skips verification after a timeout with an apology message, thereby keeping the system partially functional. Government platforms especially “must stay online during critical events” and cannot afford single points of failure.
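A circuit breaker of the kind described can be very small. The sketch below is a simplified, illustrative version rather than a production implementation; the thresholds are placeholders, and the fallback is whatever degraded behaviour (such as the apology message) the service has agreed.

```python
# circuit_breaker.py - simplified sketch; thresholds and fallback behaviour are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback, *args, **kwargs):
        # While the circuit is open, skip the failing dependency and degrade gracefully
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_seconds:
                return fallback()
            self.opened_at = None   # half-open: allow a trial call through
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # open the circuit
            return fallback()
        self.failures = 0
        return result
```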
For example, imagine a digital voting or census survey platform is being developed. The delivery team would want to test not just that it can handle millions of submissions, but also its resilience. A scenario might ask what would happen if a sudden surge at 10pm coincided with a database failover. Through such tests, teams can bolster their architecture (adding redundancy, tuning failover thresholds, etc.) and their operational playbooks (having runbooks for recovery steps). It instils confidence that even if multiple things went wrong, like a hardware failure right at peak traffic, the service would degrade gracefully, not catastrophically.
Another angle is long-term capacity management. Performance testing results feed into capacity planning: How many servers, what kind of network bandwidth, how large a database cluster is needed to meet demand and headroom? While over-provisioning wastes money, under-provisioning tempts disaster. The data from load tests help identify this sweet spot.
While government services are moving to cloud infrastructure, allowing for elastic capacity, it’s still possible to run out of capacity, say if unreasonably low budgets are set or autoscaling isn’t configured properly. Performance tests can simulate a scale-up and ensure that, say, 10 new instances can launch within 2 minutes when traffic spikes. It’s also an opportunity to see the cost implications: testing 7x normal load might reveal that the system can scale, but would incur significant cost at that level, prompting discussions of what the balance of risk versus pragmatism should be.
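That cost conversation can be grounded with some rough arithmetic. Every figure in the sketch below (requests per second, per-instance throughput, hourly price) is an assumption to be replaced with real measurements and pricing.

```python
# surge_cost_estimate.py - back-of-envelope capacity and cost at 7x peak.
# All figures are assumptions for illustration only.
import math

NORMAL_PEAK_RPS = 2_000          # assumed normal peak, requests per second
SURGE_FACTOR = 7
RPS_PER_INSTANCE = 500           # assumed sustainable throughput per instance
HOURLY_COST_PER_INSTANCE = 0.40  # assumed price, in pounds per instance-hour
HEADROOM = 1.3                   # keep roughly 30% spare capacity

instances = math.ceil(NORMAL_PEAK_RPS * SURGE_FACTOR * HEADROOM / RPS_PER_INSTANCE)
print(f"Instances needed at 7x peak: {instances}")
print(f"Approximate cost at that level: £{instances * HOURLY_COST_PER_INSTANCE:.2f}/hour")
```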
The tools and techniques to test at scale
Achieving realistic performance tests requires the right tools and data.
There are several open source and commercial options, including Apache JMeter, Gatling, k6, and Locust. These tools allow scripting of user scenarios and can simulate thousands of virtual users.
GDS teams, for example, frequently use Gatling, with scenario recorders to build out tests quickly.
The choice of tool often depends on the tech stack and the familiarity of the team, but all aim to produce a high load with realistic interactions.
Realism is key. A common mistake is to create a very synthetic test that doesn’t mimic actual user behaviour, for example hitting the same endpoint with a fixed delay. Good performance testers incorporate human-like think times, varied paths, and appropriate data variations to simulate real usage patterns. They might use production logs to inform the distribution of actions, e.g., 60% of users do query A, 30% do query B, 10% other actions. This helps ensure the simulated load is representative; otherwise the team may optimise for a scenario that never happens while missing one that does.
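Translating that kind of distribution into a test script is straightforward. The Locust sketch below uses hypothetical endpoints, with task weights mirroring the 60/30/10 split and randomised think times instead of a fixed delay.

```python
# realistic_citizen.py - weighted tasks and think times; endpoints are hypothetical.
from locust import HttpUser, task, between

class RealisticCitizen(HttpUser):
    # Human-like pauses between actions, drawn from a range rather than fixed
    wait_time = between(5, 30)

    @task(6)   # roughly 60% of actions
    def query_a(self):
        self.client.get("/search", params={"q": "eligibility"})

    @task(3)   # roughly 30% of actions
    def query_b(self):
        self.client.get("/applications/status")

    @task(1)   # the remaining ~10% of actions
    def browse_guidance(self):
        self.client.get("/guidance/how-to-apply")
```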
Another important aspect is the test environment. Ideally, performance tests should run in an environment that is as close to production as possible. The closer to real, the more accurate the findings. This means creating environments of similar configurations, data volumes, and connected systems. But, in reality, setting up a full-scale replica can be expensive, so sometimes budget-stretched teams will test on a subset and extrapolate.
Monitoring during tests is non-negotiable: capturing metrics from application servers, databases, network components and other parts of the service as the test runs is fundamental. Analysing these results is just as important as running the test itself, because they surface and help diagnose performance pain points.
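If the environment already exposes metrics through something like Prometheus (assumed here purely for illustration, along with the endpoint and metric name), those measurements can be pulled programmatically during or after a run.

```python
# query_latency.py - pulls 95th percentile latency from Prometheus during a test run.
# The Prometheus endpoint and metric name are assumptions for illustration.
import requests

PROMETHEUS_URL = "http://prometheus.staging.internal:9090"

def p95_latency_seconds(job: str, window: str = "5m") -> float:
    query = (
        "histogram_quantile(0.95, "
        f'sum(rate(http_request_duration_seconds_bucket{{job="{job}"}}[{window}])) by (le))'
    )
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    response.raise_for_status()
    result = response.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")
```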
The confidence that digital services can weather the storm
Performance testing and resilience engineering give public sector organisations the confidence that their digital services can weather the storm. It’s about proactively finding the limits and failure points, and then pushing those limits out further or adding safeguards.
For a CTO, seeing a performance test report that says “System A can handle seven times current volume with response times under two seconds” is more than reassuring; it’s strategic insight too. It means the service can support future growth, policy changes that lead to spikes in uptake, or emergency situations that drive usage, all without a scramble for urgent fixes.
Government services carry a mandate to be available, reliable and accessible to all citizens who need to use them. With rigorous performance and resilience testing, CIOs can ensure their services meet this mandate while rolling out high-quality digital government innovation.