How to communicate with the Ops team

There's a good article from Rackforest titled Don't Throw Your Code Over The Wall: 5 Ways To Work With Ops Engineers. I had been meaning to send it out earlier, but I only just noticed the post was still in draft. Let's not let this old translation go to waste.

1. Provide a detailed description

Don't just say you need, say, mysql; specify the desired/preferred version, whether replication is needed (and if so, what kind), and so on. Also state how many resources your application needs (disk, CPU, memory); in short, everything the Ops team needs to understand the project.

2. Usable logging

Write log entries that tell the Ops team what the problem is and where to start looking.

3. Have a rollback plan

If there's a nasty surprise after an upgrade, you must be able to roll back to the previous, still-working version.

4. A clearly communicated SLA is required

Already when planning the hardware environment, you need to know how many nines of availability are required. An expected uptime of 99.999% allows about 5 minutes of downtime per year. You also have to explain clearly whether, say, the downtime caused by a reboot results in measurable revenue loss, or just a few users grumbling a bit.
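The nines arithmetic above is easy to get wrong, so here is a minimal sketch (plain Python, assuming a 365-day year) that converts an availability target into a yearly downtime budget:

```python
# Convert an availability target (in percent) into the allowed
# downtime per year, using a 365-day year (525600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600

def downtime_budget_minutes(availability_percent):
    """Yearly downtime budget in minutes for a given availability target."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0

if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        print(f"{nines}% uptime allows {downtime_budget_minutes(nines):.2f} minutes/year")
```

For five nines this gives about 5.26 minutes per year, matching the "5 minutes" figure above.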

Using cAdvisor to get a peek into Docker

You may use the Docker Engine API to get some basic information from Docker.

Try the following to get JSON output:

curl -s --unix-socket /var/run/docker.sock http://localhost/containers/json | python -m json.tool

To get data for a single container:

curl -s --unix-socket /var/run/docker.sock http://localhost/containers/c101546a3690/json | python -m json.tool

Pros: simple
Cons: no aggregation, no visualization.
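If you want more than pretty-printed JSON, you can post-process the API response yourself. A minimal sketch in Python: the sample payload below mimics the shape of the /containers/json response (Id, Names, and State are real fields of the Docker Engine API; the values are invented for illustration):

```python
import json

# Sample payload shaped like Docker's GET /containers/json response
# (field names are real; the values are made up for illustration).
sample = '''
[
  {"Id": "c101546a3690", "Names": ["/web"], "State": "running", "Status": "Up 2 hours"},
  {"Id": "9f2d41aabbcc", "Names": ["/db"],  "State": "exited",  "Status": "Exited (0) 1 day ago"}
]
'''

def summarize(payload):
    """Return (name, state) pairs from a /containers/json payload."""
    return [(c["Names"][0].lstrip("/"), c["State"]) for c in json.loads(payload)]

for name, state in summarize(sample):
    print(f"{name}: {state}")
```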

To take it to the next level, give cAdvisor a shot: it taps the Docker API and gives you a visual, historical view of what's going on inside Docker. cAdvisor runs as a Docker container:

docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest

Then simply visit http://localhost:8080 in your browser.

Pros: visual
Cons: limited timeframe, limited metrics

You may also want to push these data into a time series database, e.g. InfluxDB or similar, and visualize them with Grafana, getting better graphs and a longer history.
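Shipping the numbers to InfluxDB is mostly a matter of formatting them into its line protocol. A minimal sketch (the measurement, tag, and field names here are made up; the line-protocol syntax itself is InfluxDB's):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one InfluxDB line-protocol point: measurement,tags fields [timestamp]."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    line = f"{measurement},{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line

# One point per container, e.g. POSTed to /write?db=docker on an InfluxDB 1.x server
print(to_line_protocol("mem", {"container": "cadvisor"}, {"usage_bytes": 3855000}))
```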

Or you may give sysdig a chance to provide much more. See https://dzone.com/storage/assets/9981079-dzone-refcard236-dockermonitoring.pdf for more on the topic.

Retire your old stuff with retire.js

I stumbled upon an article on DZone, 5 Quick Wins for Securing Continuous Delivery, mentioning retire.js, a JavaScript tool that scans a given web page for components with known security vulnerabilities. Note that it has an add-on for both Firefox and Chrome.

So I installed the Firefox add-on, and it was worth it, because I just learned the terrible truth: jQuery versions 1.x and 2.x have some unfixed issues.

So I updated the piler enterprise configs to use the most recent versions of jQuery and other JS libraries from CDNs, and now the add-on is happy with the piler enterprise GUI.

There's actually a new config option, called JS_CODE in config.php, so you can make it use local copies of the JS libraries if your users are on a network without Internet access.
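At its core, what retire.js does is compare component versions against a database of known-vulnerable ranges. A toy sketch of that idea, using the jQuery case above (this is illustrative only, not retire.js's actual database or logic; the 3.0.0 floor is an assumption based on the 1.x/2.x issues mentioned):

```python
def parse_version(v):
    """Turn a dotted version string like '1.12.4' into a comparable tuple."""
    return tuple(int(x) for x in v.split("."))

def is_outdated(version, minimum="3.0.0"):
    """True if the component version is below the assumed safe floor."""
    return parse_version(version) < parse_version(minimum)

for v in ("1.12.4", "2.2.4", "3.6.0"):
    print(v, "outdated" if is_outdated(v) else "ok")
```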

Fixing a corrupt InnoDB database

Recently I was asked to help fix a MySQL server issue: the MySQL server couldn't start on a somewhat older Ubuntu (Trusty).

I checked /var/log/mysql/error.log, and it said something like it might be a MySQL bug, or even that the binaries (or libraries?) may not have been built for this platform. WTF?

Finally the customer explained that there had been a power outage, and even the UPS had failed, resulting in a corrupted database. So the InnoDB database was in pretty bad shape. OK, let's try to bring it back to life by healing it:

mysqld --user=mysql --datadir=/var/lib/mysql --innodb-force-recovery=1

No cigar. I tried values up to 4, which is the highest recommended (safe?) value according to the official MySQL docs (a value higher than 4 may permanently corrupt data files). Still no luck. Because at this point I had nothing more to lose (there was no backup of the database, and the customer couldn't start mysqld anyway), I took a deep breath, told the customer to prepare for the worst (i.e. data loss), and tried --innodb-force-recovery=5, then --innodb-force-recovery=6.

The last attempt was successful in the sense that mysqld at least started, but it kept logging the following message every second:

InnoDB: Waiting for the background threads to start.

OK, then I amended the command, and finally managed to start mysqld:

mysqld --user=mysql --datadir=/var/lib/mysql --innodb-force-recovery=6 --innodb-purge-threads=0

After that it was possible to start it with the usual service mysql start command. Fortunately (after a quick sanity check on the piler database) it seemed that no data were lost, despite the dreaded --innodb-force-recovery=6 setting.

The moral of the story:

  • always have a working power supply backed up with a working UPS
  • be sure to back up the piler mysql database at least daily
  • additionally you may set up master-slave mysql replication as well

slapd high memory usage in docker

I installed slapd in Docker, and it was using 712 MB of memory even with just a few entries.

The fix is to lower the open file limit before running slapd (containers often inherit a very high nofile limit, and slapd sizes some of its tables from it), e.g.


ulimit -n 1024

slapd -d3

Starting slapd with such a wrapper script has improved the situation considerably:

$ docker stats --no-stream --format \
"table {{.Container}}\t{{.MemUsage}}" slapd

slapd 3.855MiB / 1.944GiB
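To see the connection between ulimit -n and the memory footprint, you can inspect the open-files limit a process inherits; inside a container it is often 1048576 or more, while ulimit -n 1024 lowers the soft limit before slapd starts. A quick way to check it with Python's resource module:

```python
import resource

# Soft/hard limits on open file descriptors, as seen by this process.
# Run this inside the container to see what slapd would inherit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")
```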


Application performance monitoring (APM)

I've just read the App in a box blog series by Peter Hack at https://www.dynatrace.com/news/blog/app-in-a-box/, https://www.dynatrace.com/news/blog/app-in-a-box-customer-perspective/ and https://www.dynatrace.com/news/blog/app-in-a-box-part-3-logs/.

Infrastructure monitoring (HW, OS, processes, network) is important, but not enough, because it tells you nothing about application health, nor about your customers' perspective of your applications.

Health checks may tell you whether your application is available or not. However, such tests should be performed from a certain "distance", as close to your users as possible. A health check may look fine when run from the next host in the same data center. But what if your host becomes unreachable from the Internet because the datacenter's network access is down? Then your green health check results won't help the users. So synthetic tests are best performed from another location, another datacenter, etc. Also note that uptime is not the same as availability.
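The "distance" argument boils down to separating the probe from the verdict: the same check succeeds from inside the datacenter and fails from outside. A minimal sketch of a health-check verdict function (the status-code policy and the stub fetchers are assumptions; plug in any real HTTP client as the fetch callable):

```python
def is_healthy(fetch, url):
    """Run one synthetic check: fetch(url) returns an HTTP status code,
    or raises OSError on network failure (e.g. the uplink is down)."""
    try:
        status = fetch(url)
    except OSError:
        return False  # unreachable counts as down, wherever the probe runs
    return 200 <= status < 400

# Stub fetchers standing in for probes run from different locations:
from_same_dc = lambda url: 200            # local probe sees the app as fine

def from_internet(url):
    raise OSError("no route to host")     # remote probe can't even connect

print(is_healthy(from_same_dc, "https://example.com/health"))   # True
print(is_healthy(from_internet, "https://example.com/health"))  # False
```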

Real-User Monitoring (RUM) helps you understand the behavior of your users better. Using some monitoring tools you may follow your users’ journeys on your site to detect behavioral bottlenecks in your applications, and even the need for design optimizations. Developers may identify and fix page load problems and performance bottlenecks in the browser.

However, resource usage, customer experience, and availability can only tell you whether it's working; they can't tell you why it's not working. This is where your application's logs may help. Logs can help identify problems that have occurred, and pinpoint areas where improvements to the application are necessary.

You also need some metrics from the application itself. Piler has a tool (pilerstats) to reveal some inside info about the application:

{
"rcvd": 4,
"size": 83808,
"ssize": 16016,
"sphx": 0,
"ram_bytes": 18822431,
"disk_bytes": 204260152,
"error_emails": 6,
"last_email": 4630,
"smtp_response": 0.95
}
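A few derived numbers fall out of that output directly, for example a compression ratio and the average message size. A quick sketch using the sample values above (the field meanings are my reading of the output, not official documentation):

```python
import json

# A subset of the pilerstats sample output shown above.
stats = json.loads('''
{
  "rcvd": 4,
  "size": 83808,
  "ssize": 16016,
  "ram_bytes": 18822431,
  "disk_bytes": 204260152
}
''')

compression_ratio = stats["size"] / stats["ssize"]   # original vs. stored size
avg_message_size = stats["size"] / stats["rcvd"]     # bytes per received email
print(f"compression ratio: {compression_ratio:.2f}x")
print(f"avg message size: {avg_message_size:.0f} bytes")
```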

Grade A SSL report

It's important to set up SSL/TLS encryption properly and, like most good students, achieve an "A" grade.

The test can be done, for instance, at https://www.ssllabs.com/ssltest/index.html

All you have to do is enter your site name, then wait until the various tests and checks are performed. Provided that you run nginx, try the following SSL/TLS options:

ssl_session_cache shared:le_nginx_SSL:1m;
ssl_session_timeout 1440m;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;
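A tiny sanity check for the ssl_protocols line: SSL Labs has been capping grades for servers that still accept legacy protocols, so it's worth flagging them before running the online test. A sketch (the legacy-protocol list reflects SSL Labs grading as I understand it):

```python
# Protocols that degrade an SSL Labs grade (assumption based on their grading notes).
LEGACY = {"SSLv3", "TLSv1", "TLSv1.1"}

def legacy_protocols(ssl_protocols_line):
    """Return the legacy protocols found in an nginx ssl_protocols directive."""
    tokens = ssl_protocols_line.rstrip(";").split()
    assert tokens[0] == "ssl_protocols", "expected an ssl_protocols directive"
    return sorted(LEGACY.intersection(tokens[1:]))

print(legacy_protocols("ssl_protocols TLSv1.1 TLSv1.2 TLSv1.3;"))  # ['TLSv1.1']
print(legacy_protocols("ssl_protocols TLSv1.2 TLSv1.3;"))          # []
```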