Recently I was asked to help fixing a mysql server issue. The mysql server couldn’t start on a somewhat older Ubuntu (=Trusty).
I checked /var/log/mysql/error.log, and it said something like that it might be a mysql bug or even the mysql binaries (or libraries?) may not be for this platform. WTF?
Finally the customer explained that there had been a power outage, and even the UPS had been failing, resulting a corrupted database. So the innodb database was in a pretty bad shape. OK, let’s try to bring it back to life by healing it:
No cigar. I tried up to 4 which is the highest recommended or safe(?) value (a higher than 4 value may permanently corrupt data files) according to the official mysql docs. Still no luck. Because at this point I had nothing (more) to lose (there was no backup of the database, and customer couldn’t start mysqld), I took a deep breadth, and told the customer to prepare for even the worst (ie. data loss), and tried –innodb-force-recovery=5, and then –innodb-force-recovery=6.
The last attempt was successful in a sense that at least mysqld started, but it was logging the following message in every second:
InnoDB: Waiting for the background threads to start.
OK, then I fixed the command, and managed to start mysqld finally:
After that it was possible to make it work with the usual: service mysql start command. Fortunately (after a quick sanity check on the piler database) it seemed that no data were lost, despite the dreaded –innodb-force-recovery=6 settings.
The moral of the story:
always have a working power supply backed up with a working UPS
be sure to backup the piler mysql database at least daily
additionally you may setup a master-slave mysql replication as well
Infrastructure monitoring (HW, OS, processes, network) is important, but not enough, because it can’t tell about the application health, neither the customers’ perspective of your applications.
Health checks may tell you whether your application is available or not. However, such tests should be done from a certain “distance”, as close to your users as possible. A health check may be fine checked from the next host in the same data center. But what if your host becomes unavailable from the Internet, because the network access of the datacenter is down? Then your green health check results won’t help the users. So synthetic tests are best performed from another location, another datacenter, etc. Also note that uptime is not the same as availability.
Real-User Monitoring (RUM) helps you understand the behavior of your users better. Using some monitoring tools you may follow your users’ journeys on your site to detect behavioral bottlenecks in your applications, and even the need for design optimizations. Developers may identify and fix page load problems and performance bottlenecks in the browser.
However, resource usage, customer experience, and availability only can tell whether it’s working, but can’t tell why it’s not working. This is where the logs of your application may help you. Logs can help identify problems that have occurred, and pinpoint areas where improvements of the application are necessary.
You also need some metrics of the application as well. Piler has a tool (pilerstats) to reveal some inside info about the application: