Post-Mortem report on yesterday's situation

7 years ago • updated 7 years ago • 10

Yesterday morning at around 08:00 GMT, I finished a series of planned infrastructure upgrades, relocations and optimizations. Everything went as planned, with barely anyone noticing. At around 12:00 GMT, Monero block 1685555 was mined, triggering the Monero hardfork. I noticed this primarily because I saw 100% rejected shares coming in from all our Monero pool edge servers. A quick investigation revealed a bug in our codebase that was not caught during testing because Monero uses a different value on Mainnet than on Testnet. This bug was fixed quickly and the updated code was deployed to the pool servers, stopping the influx of rejected Monero shares.

Roughly 90 minutes later I noticed that hashrates for pretty much all other pools - except the Verge pools (which run a different stratum software) and Bittube - were down close to zero. Inspecting the live logs revealed the cause: pretty much all shares were being rejected by the pools. I spent the next 4 hours going through the pool code's changelog, inspecting each and every change for clues to a change that might be causing the rejections. I couldn't find any. I then checked whether non-code-related problems might be causing the rejections, but once again came up empty-handed. This took another 2 hours.

My next approach was cloning the pools on a standby server and testing each pool in isolation. To my surprise they worked flawlessly - identical hardware, setup and configuration. The only difference was that I tested them in isolation, not as a cluster of pools as on the production servers. So I started adding pools back to my test cluster until I was able to reproduce the problem. This happened as soon as I added a Monero pool. After some tests it became evident that handling Monero shares inside the pool process caused other hash calculations to error out. The culprit was a bug in the Cryptonight v2 code that was borrowed from pre-2.8 versions of the XMRig miner.
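For anyone wanting to apply the same isolation approach, here is a rough sketch of the "add pools back one at a time" procedure. It is only an illustration: the pool names and the health-check callback are hypothetical stand-ins, not part of the actual pool software.

```python
from typing import Callable, List, Optional, Sequence


def find_first_breaking_pool(
    pools: Sequence[str],
    cluster_is_healthy: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add pools to a test cluster one by one.

    Returns the first pool whose addition makes the cluster unhealthy,
    or None if the cluster stays healthy with every pool added.
    """
    cluster: List[str] = []
    for pool in pools:
        cluster.append(pool)
        if not cluster_is_healthy(cluster):
            return pool
    return None


def fake_health_check(cluster: List[str]) -> bool:
    # Hypothetical stand-in for "run the test cluster and watch for
    # rejected shares"; here the cluster breaks once "monero" is present.
    return "monero" not in cluster


# Hypothetical pool names, used for illustration only.
culprit = find_first_breaking_pool(
    ["coin-a", "coin-b", "monero", "coin-c"], fake_health_check
)
print(culprit)
```

In a real setup the health check would be the slow part (start the cluster, submit shares, inspect the logs), which is why adding one pool per step keeps the result unambiguous.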

Since it was already late, I was tired, and I didn't yet know which bug was causing the trouble with the other pools, the only viable solution was to shut down the Monero pool for the remainder of the night. This morning I was able to pinpoint the bug, fix it and redeploy the Monero pool. All in all, close to 16 hours of work caused by a time bomb planted some weeks ago for a coin hardfork. #pool-op-life

Replies 10

7 years ago

Thanks for the quick fix and attention (as always!). And of course the explanation is greatly appreciated.
Cheers

7 years ago

Thanks, Oliver.

7 years ago

I still get socket errors :(

https://pastebin.com/3m1KeTav

7 years ago

That doesn't look right. Remove the "stratum+ssl://" part from "pool_address" in pools.txt; the miner log should then look like this:

[2018-10-19 22:43:27] : Fast-connecting to xmr.coinfoundry.org:3133 pool ...
[2018-10-19 22:43:27] : TLS fingerprint [xmr.coinfoundry.org:3133] SHA256:DyvyNe/NGgJqU6XVPQZz6yq29nrh//eAFu+dxHYSq8Q=
[2018-10-19 22:43:27] : Pool xmr.coinfoundry.org:3133 connected. Logging in...
[2018-10-19 22:43:28] : Difficulty changed. Now: 25000.
[2018-10-19 22:43:28] : Pool logged in.

Don't forget "use_tls" : true in the same file.
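
To make the two changes concrete, here is a small sketch that strips a URL-style scheme from a pool address and pairs the result with the TLS flag. The function name is made up for illustration; only the "pool_address" and "use_tls" keys and the host and port come from this thread, and the printed JSON merely illustrates the two values, not the full layout of pools.txt.

```python
import json


def normalize_pool_address(address: str) -> str:
    """Strip a scheme prefix such as 'stratum+ssl://' and return host:port."""
    _, separator, rest = address.partition("://")
    # If there is no "://", partition() leaves the whole string in the
    # first slot, so return the address unchanged.
    return rest if separator else address


# Hypothetical helper output: the two settings discussed above.
entry = {
    "pool_address": normalize_pool_address("stratum+ssl://xmr.coinfoundry.org:3133"),
    "use_tls": True,
}
print(json.dumps(entry, indent=2))
```

With TLS enabled through "use_tls", the address itself only needs the host and port.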