Today a handful of our users experienced a temporary service outage across 10 or so of our services.
The outage was caused by an upgrade that went wrong. We were bringing new servers into our cluster, and they put excessive load on our existing servers. We cancelled the upgrade and then noticed that our blog service (PrimaryBlogger) was serving 5x more pages than average. We had noticed a gradual increase in page views, but this week the load was well above our expectations.
To resolve the issue we put a temporary heavy load holding page on PrimaryBlogger. This reduced the load enough for us to finish the replication and bring the other servers into the cluster.
Usually we do this kind of upgrade at 11pm GMT, when our load is relatively small. Today, however, we had to start early, or we would not have been able to serve web pages.
We aim for 100% uptime, and we have a lot of hardware and technology serving web pages, but sometimes we make a mistake and things go wrong. Unfortunately, this was one of those times. Stability is mission critical, and we recognise how frustrating it is not to be able to access our websites.
No premium websites or websites covered by an SLA were affected by this outage. Only our free services were affected.
What have we learned from today’s outage?
We should have a heavy load page by default. If we can’t serve a page because of heavy load, you should be informed straight away.
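For the technically curious, here is a rough sketch of what a default heavy load page could look like. It is an illustration only, not a description of how PrimaryBlogger works: it assumes a Python WSGI application, and the names `HeavyLoadGuard`, `HOLDING_PAGE`, `max_in_flight` and `retry_after_seconds` are made up for this example. The idea is to count the requests currently being handled and, once a limit is reached, answer straight away with a 503 holding page instead of letting requests queue up.

```python
import threading

# Hypothetical holding page; the real wording and markup would differ.
HOLDING_PAGE = (
    b"<html><body><h1>We are under heavy load</h1>"
    b"<p>Please try again in a few minutes.</p></body></html>"
)


class HeavyLoadGuard:
    """WSGI middleware that serves a holding page when too many requests are in flight.

    Simplification: a request counts as "in flight" only while the wrapped
    app's callable is running, so streamed response bodies are not tracked.
    """

    def __init__(self, app, max_in_flight=100, retry_after_seconds=60):
        self.app = app
        self.max_in_flight = max_in_flight  # assumed limit, tune per service
        self.retry_after_seconds = retry_after_seconds
        self._in_flight = 0
        self._lock = threading.Lock()

    def __call__(self, environ, start_response):
        # Decide under the lock whether this request may proceed.
        with self._lock:
            overloaded = self._in_flight >= self.max_in_flight
            if not overloaded:
                self._in_flight += 1

        if overloaded:
            # Tell the visitor immediately, and hint when to retry.
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/html; charset=utf-8"),
                    ("Content-Length", str(len(HOLDING_PAGE))),
                    ("Retry-After", str(self.retry_after_seconds)),
                ],
            )
            return [HOLDING_PAGE]

        try:
            return self.app(environ, start_response)
        finally:
            # Always release the slot, even if the app raised.
            with self._lock:
                self._in_flight -= 1
```

Wrapping an existing app would then look like `application = HeavyLoadGuard(application, max_in_flight=200)`. The 503 status and the `Retry-After` header are standard HTTP, so browsers and crawlers can tell the page is temporary rather than the real content.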
To contact us today, call us on 01274 649731 or get in touch here.