Fixing a corrupt InnoDB database

Recently I was asked to help fix a mysql server issue: the mysql server couldn't start on a somewhat older Ubuntu release (Trusty).

I checked /var/log/mysql/error.log, and it said something to the effect that it might be a mysql bug, or that the mysql binaries (or libraries?) might not be built for this platform. WTF?

Finally the customer explained that there had been a power outage, and even the UPS had failed, resulting in a corrupted database. So the innodb database was in pretty bad shape. OK, let's try to bring it back to life by healing it:

mysqld --user=mysql --datadir=/var/lib/mysql --innodb-force-recovery=1

No cigar. I tried values up to 4, which is the highest safe setting according to the official mysql docs (a value above 4 may permanently corrupt the data files). Still no luck. Since at this point I had nothing more to lose (there was no backup of the database, and the customer couldn't start mysqld anyway), I took a deep breath, told the customer to prepare for the worst (i.e. data loss), and tried --innodb-force-recovery=5, then --innodb-force-recovery=6.

The last attempt was successful in the sense that mysqld at least started, but it kept logging the following message every second:

InnoDB: Waiting for the background threads to start.

OK, then I extended the command with --innodb-purge-threads=0, and finally managed to start mysqld:

mysqld --user=mysql --datadir=/var/lib/mysql --innodb-force-recovery=6 --innodb-purge-threads=0

After that it was possible to start it with the usual service mysql start command. Fortunately (after a quick sanity check on the piler database) it seemed that no data were lost, despite the dreaded --innodb-force-recovery=6 setting.
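
Had the sanity check failed, the procedure recommended by the mysql docs is to dump all databases while the server runs in recovery mode, re-initialize the data directory, then reload the dump. A rough sketch (the paths and the mysql_install_db step are assumptions matching a Trusty-era mysql):

mysqldump --all-databases > /root/all-databases.sql
service mysql stop
mv /var/lib/mysql /var/lib/mysql.corrupt
mysql_install_db --user=mysql --datadir=/var/lib/mysql
service mysql start
mysql < /root/all-databases.sql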

The moral of the story:

  • always have a working power supply backed up by a working UPS
  • be sure to back up the piler mysql database at least daily (see the cron sketch below)
  • additionally, you may set up master-slave mysql replication as well
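
For the daily backup a simple cron entry is enough. A minimal sketch (the credentials file, database name and target path are assumptions, adjust them to your installation):

# /etc/cron.d/piler-db-backup
# dump the piler database every night at 02:30 (% must be escaped in cron)
30 2 * * * root mysqldump --defaults-file=/etc/mysql/debian.cnf piler | gzip > /var/backups/piler-$(date +\%F).sql.gz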

slapd high memory usage in docker

I installed slapd in Docker, and it was using 712 MB of memory even with only a few entries.

The fix is to run slapd after issuing ulimit -n 1024 (slapd sizes its connection table based on the file descriptor limit, which is huge in docker by default), e.g.

#!/bin/bash

# lower the file descriptor limit before starting slapd,
# otherwise it sizes its connection table for the huge docker default
ulimit -n 1024

slapd -d3

Starting slapd with such a wrapper script has improved the situation considerably:

$ docker stats --no-stream --format \
"table {{.Container}}\t{{.MemUsage}}" slapd

CONTAINER   MEM USAGE / LIMIT
slapd       3.855MiB / 1.944GiB
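
Alternatively, you can enforce the limit from the outside instead of using a wrapper script, since docker supports per-container ulimits (the image name here is just a placeholder):

docker run -d --name slapd --ulimit nofile=1024:1024 my-slapd-image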

Application performance monitoring (APM)

I've just read a blog series (App in a box) by Peter Hack at https://www.dynatrace.com/news/blog/app-in-a-box/, https://www.dynatrace.com/news/blog/app-in-a-box-customer-perspective/ and https://www.dynatrace.com/news/blog/app-in-a-box-part-3-logs/.

Infrastructure monitoring (HW, OS, processes, network) is important, but not enough, because it tells you nothing about application health, nor about your customers' perspective of your applications.

Health checks may tell you whether your application is available or not. However, such tests should be done from a certain "distance", as close to your users as possible. A health check may look fine when run from the next host in the same data center, but what if your host becomes unreachable from the Internet because the datacenter's network access is down? Then your green health check results won't help the users. So synthetic tests are best performed from another location, another datacenter, etc. Also note that uptime is not the same as availability.
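
A minimal synthetic check along these lines can be a simple cron'd script on a host outside your datacenter (the URL and the alert recipient are assumptions):

#!/bin/bash

# probe the site from far away; alert if the request fails,
# returns an HTTP error, or takes longer than 10 seconds
if ! curl -sf -m 10 -o /dev/null https://example.com/health; then
    echo "health check failed for example.com" | mail -s "ALERT: site down" ops@example.com
fi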

Real-User Monitoring (RUM) helps you understand the behavior of your users better. With such monitoring tools you may follow your users' journeys on your site to detect behavioral bottlenecks in your applications, and even spot where design optimizations are needed. Developers may identify and fix page load problems and performance bottlenecks in the browser.

However, resource usage, customer experience, and availability can only tell you whether it's working; they can't tell you why it's not. This is where the logs of your application may help. Logs can help identify problems that have occurred, and pinpoint areas where improvements of the application are necessary.

You need some metrics from the application itself as well. Piler has a tool (pilerstats) to reveal some inside info about the application:

{
  "rcvd": 4,
  "size": 83808,
  "ssize": 16016,
  "sphx": 0,
  "ram_bytes": 18822431,
  "disk_bytes": 204260152,
  "error_emails": 6,
  "last_email": 4630,
  "smtp_response": 0.95
}
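
Since the output is JSON, it's easy to wire into an alerting script, e.g. with jq (assuming pilerstats prints the above JSON to stdout; the threshold and the alert command are my choices):

#!/bin/bash

# alert if piler reports any emails it failed to process
errors=$(pilerstats | jq '.error_emails')
if [ "$errors" -gt 0 ]; then
    echo "piler reported $errors error emails" | mail -s "piler alert" ops@example.com
fi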

Grade A SSL report

It's important to set up ssl/tls encryption properly, and, like all good students, to achieve an "A" grade.

The test can be done for instance at https://www.ssllabs.com/ssltest/index.html

All you have to do is enter your site name, then wait while several tests and checks are performed. Provided that you use nginx, try the following ssl/tls options:

ssl_session_cache shared:le_nginx_SSL:1m;
ssl_session_timeout 1440m;
ssl_protocols TLSv1.1 TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;

ssl_ciphers "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-DES-CBC3-SHA:ECDHE-RSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:!DSS";
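
After reloading nginx you can quickly verify the negotiated protocol and cipher from the command line (example.com stands for your own site):

$ openssl s_client -connect example.com:443 -tls1_2 < /dev/null 2>/dev/null | grep -E 'Protocol|Cipher'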

Sidenotes for a mail migration

I've decided to migrate my emails to another host. The new setup is a postfix sandwich config with a spam filter in the middle. By the way, I use dovecot as the virtual transport to have my emails sorted into different folders, including spam into the Junk folder.

So far so good (even though it took a hell of a troubleshooting session to figure out why the spam filter had become terribly slow; it's fixed by now, anyway). Then I tested with some emails, and got a mail loop error back for my@email. WTF?

The culprit was the already existing Delivered-To: header in the emails. The D flag makes the pipe daemon prepend a Delivered-To: header, and postfix reports a mail loop when that header is already present. So I fixed master.cf by dropping the D flag:

dovecot unix - n n - - pipe
  flags=Rhu user=vmail:vmail argv=/usr/lib/dovecot/deliver -d ${recipient} -e

and it started to work properly.

However, I just couldn't let go of the question why it hadn't worked with the original DRhu flags.

Then I fixed the test email: I removed every header that had been added by either the local postfix or the content filter, reverted the flags to DRhu, and it worked!

dovecot unix - n n - - pipe
  flags=DRhu user=vmail:vmail argv=/usr/lib/dovecot/deliver -d ${recipient} -e

The moral of the story: test with new emails. Or at least remove any locally added headers 🙂
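
If you do want to reuse previously delivered messages for testing, something like formail (from the procmail package) can strip the locally added headers first; which headers to remove depends on your setup, the list below is just an example:

formail -I "Delivered-To:" -I "X-Spam-Status:" < old-message.eml > clean-test.eml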

Docker vs. systemd

I’ve decided to setup a few docker hosts. I needed to access them remotely, so I deployed the necessary CA and server keys and certs (see https://docs.docker.com/engine/security/https/#create-a-ca-server-and-client-keys-with-openssl for more). So far, so good.

I knew that docker had to be instructed to use these files, and also to listen on 0.0.0.0. So I edited /etc/default/docker (on Ubuntu Bionic), restarted the docker daemon, and nothing happened.

I rushed to the docker site to figure out what the heck was going on, and ended up at https://docs.docker.com/engine/reference/commandline/dockerd/#daemon-configuration-file, which told me that /etc/default/docker is ignored when docker is managed by systemd; you must use /etc/docker/daemon.json instead.

I’ve created the file:

{
  "hosts": ["0.0.0.0:2376"],
  "tlsverify": true,
  "tlscacert": "/etc/docker/ca.pem",
  "tlscert": "/etc/docker/server-cert.pem",
  "tlskey": "/etc/docker/server-key.pem"
}

then restarted docker, and still nothing. This time the -H fd:// option in the /lib/systemd/system/docker.service file caused the trouble, preventing docker from listening on 0.0.0.0:

ExecStart=/usr/bin/dockerd -H fd://

Fear not, the fix is to remove -H fd:// as follows:

ExecStart=/usr/bin/dockerd

Then run systemctl daemon-reload && systemctl restart docker, and you should be able to connect to docker on the remote host.
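
Note that /lib/systemd/system/docker.service belongs to the docker package, so an upgrade may overwrite your change. A more durable approach (a sketch of the standard systemd way) is a drop-in override:

mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/override.conf <<'EOF'
[Service]
# clear the packaged ExecStart (with its -H fd:// option), then set our own
ExecStart=
ExecStart=/usr/bin/dockerd
EOF
systemctl daemon-reload && systemctl restart docker

Then verify the TLS setup from your workstation (the host name and file names follow the docker TLS guide linked above):

docker --tlsverify --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem -H tcp://your-docker-host:2376 info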

Disable udp ports for Jenkins

I've noticed that Jenkins has an unpleasant habit of listening on two UDP ports (5353 and 33848) on all interfaces, even if it was told to listen on 127.0.0.1:8080.

These ports are used for auto-discovery: multicast DNS on 5353 and UDP broadcast/multicast on 33848. You may not need either of them, and you can disable them by adding the following options:

-Dhudson.DNSMultiCast.disabled=true -Dhudson.udp=-1

eg.

java -Dhudson.DNSMultiCast.disabled=true -Dhudson.udp=-1 -jar jenkins.war --httpListenAddress=127.0.0.1 --httpPort=8080 --daemon --logfile=/home/jenkins/jenkins.log

See the Jenkins docs for more: https://wiki.jenkins.io/display/JENKINS/Features+controlled+by+system+properties
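
To verify that the UDP ports are indeed gone after a restart, a quick check (ss comes with the iproute2 package):

$ ss -ulpn | grep java

If the options took effect, the command should print nothing.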