The Approximately Monthly Zoomer
Server Upgrade 2: Electric Boogaloo
2025-11-20
Now that I’ve got my new server up and running, it’s time to start fresh. I could just plop the old SSDs into the new server and call it a day, but I thought I’d use this opportunity to freshly install everything, with a little more intention and thoughtfulness.
Low-Hanging Chorus Fruit
Let’s start with something easy and get a Minecraft server up and running. Since this is the least important of all my VMs, I thought I’d just give it an old laptop SSD to itself, so I can still use the storage capacity and not have to worry about it degrading the rest of the system. Installed Debian on it, Java, some configs for a nice shell, then Minecraft; Bob’s your uncle.
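In case you’re curious, the server part itself boils down to very little. A sketch (the package name and memory sizes are just what I’d pick for a small server, and server.jar is whatever server build you downloaded):

root@minecraft-debian:~# apt install default-jre-headless
root@minecraft-debian:~# # from the directory containing server.jar, after agreeing to eula.txt:
root@minecraft-debian:~# java -Xms2G -Xmx4G -jar server.jar nogui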
The Server Keeps Hanging
Every few hours to days the whole Proxmox server becomes unresponsive and I have no idea why, but I do know that it only started after installing the Minecraft VM. I disabled Minecraft and everything else on the VM, but it still kept happening. After a reboot it worked fine again, so I was a little confused. Maybe a BIOS setting that forces the server to go to sleep when idle? At first having to reboot didn’t bother me, but in the long run this is unsustainable, so I checked the Proxmox logs.
-- Boot 0566b7c92eb84d7ca67724464869d645 --
Aug 20 13:17:01 pve CRON[41137]: pam_unix(cron:session): session closed for user root
Aug 20 13:17:01 pve CRON[41138]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 20 13:17:01 pve CRON[41137]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 20 12:17:01 pve CRON[31720]: pam_unix(cron:session): session closed for user root
Aug 20 12:17:01 pve CRON[31721]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 20 12:17:01 pve CRON[31720]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 20 11:17:01 pve CRON[22299]: pam_unix(cron:session): session closed for user root
Aug 20 11:17:01 pve CRON[22300]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 20 11:17:01 pve CRON[22299]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Nothing to see really, so I checked the logs of the Minecraft VM, which were equally terse.
Aug 32 13:49:45 minecraft-debian systemd[1]: systemd-timesyncd.service: State 'stop-watchdog' timed out. Killing.
Aug 32 13:48:15 minecraft-debian systemd[1]: systemd-timesyncd.service: Killing process 354 (systemd-timesyn) with signal SIGABRT.
Aug 32 13:48:15 minecraft-debian systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!
Aug 32 13:17:01 minecraft-debian CRON[607]: pam_unix(cron:session): session closed for user root
Aug 32 13:17:01 minecraft-debian CRON[608]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Aug 32 13:17:01 minecraft-debian CRON[607]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
So whatever causes this hang probably interferes with IO as well, since otherwise there would be all kinds of messages written to the syslog. I didn’t know how to investigate this exactly, so I decided to put a GPU into my server, attach a monitor, and just watch what happens while it’s happening; maybe there are error messages flying around that simply can’t be written to disk. Sure enough, it turned out to be something IO-related that generated a huge amount of errors scrolling across the screen, giving me at least somewhere to start.
I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0xa08800 phys_seg 1 prio class 0
scsi_io_completion_action: 366 callbacks suppressed
sd 4:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
sd 4:0:0:0: [sda] tag#10 CDB: Write(10) 2a 00 00 00 08 00 00 00 08 00
blk_print_req_error: 366 callbacks suppressed
Maybe the drive is failing (it is old, after all), but the SMART overall-health self-assessment test says “PASSED”. Before doing anything drastic I decided to replace the SSD’s cables; maybe it’s something stupid like that. What bothers me most is having to wait hours or days for it to fail again after every attempted fix.
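That verdict, by the way, comes from smartctl (in the smartmontools package); assuming the SSD shows up as /dev/sda on the host, the check looks like this:

root@pve:~# apt install smartmontools
root@pve:~# smartctl -H /dev/sda    # the overall-health self-assessment quoted above
SMART overall-health self-assessment test result: PASSED
root@pve:~# smartctl -a /dev/sda    # full attribute dump; reallocated/pending sectors are the telling ones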
It Wasn’t the Cable
Even with new data and power cables it still kept failing (which is actually good for me, because I get to keep using my pretty and colorful SATA cables). In the VM’s hardware section within Proxmox I saw that an EFI disk was configured, which is weird since I’m using the default SeaBIOS, so I reinstalled grub onto the disk from within the VM and removed the EFI disk in Proxmox (which was pointing at /dev/sda5 for some reason, the VM’s swap partition).
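Roughly what that looked like, as a sketch; assume the VM has ID 101 and sees its disk as /dev/sda:

root@minecraft-debian:~# grub-install /dev/sda    # inside the VM: put the BIOS bootloader back on the disk
root@minecraft-debian:~# update-grub
root@pve:~# qm set 101 --delete efidisk0          # on the host: drop the stray EFI disk from the VM config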
The Server Keeps Hanging (But in Purple!)
The server kept crashing, but it behaved slightly differently. I was able to ping the Proxmox host, but not SSH into it or look at the web UI, and I also noticed it rebooting several times thanks to the attached monitor and the fans spinning up. The error messages kept changing and were seemingly unrelated to the actual problem I was facing, but after some searching I found people online saying that disabling XMP for their RAM helped with sporadic hangs, so I disabled XMP and set the frequency manually. An overnight memtest reported no issues, and after two hours… twelve hours… 24 hours… 48 hours of uptime there were no crashes! And literally as I was writing that, the screen went purple and the distressed penguin on it wanted to tell me the kernel panicked! Okay, no biggie, let’s reboot and see what this was about with journalctl -r:
-- Boot 2f071f3f81b74e56ba34c35ca286a90c --
Aug 42 12:29:06 pve pvedaemon[1254]: <root@pam> successful auth for user 'root@pam'
Aug 42 12:26:28 pve pvedaemon[1253]: <root@pam> successful auth for user 'root@pam'
Aug 42 12:17:01 pve CRON[233453]: pam_unix(cron:session): session closed for user root
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The logs are empty. Again. I had other things to attend to so it remained in this regularly failing state for a while.
Finally Some New Logs!
After a while of letting the server reflect upon its issues and updating everything, I saw new logs, which only appeared once or twice across all the reboots:
Oct 44 05:38:53 pve kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 309s! [server:976]
Oct 44 05:38:53 pve kernel: Sending NMI from CPU 0 to CPUs 7:
Oct 44 05:38:53 pve kernel: nmi_backtrace_stall_check: CPU 5: NMIs are not reaching exc_nmi() handler, last activity: 1189209 jiffies ago.
Oct 44 05:38:53 pve kernel: Sending NMI from CPU 0 to CPUs 5:
Oct 44 05:38:53 pve kernel: nmi_backtrace_stall_check: CPU 3: NMIs are not reaching exc_nmi() handler, last activity: 3488832 jiffies ago.
Oct 44 05:38:53 pve kernel: Sending NMI from CPU 0 to CPUs 3:
Oct 44 05:38:53 pve kernel: nmi_backtrace_stall_check: CPU 2: NMIs are not reaching exc_nmi() handler, last activity: 2339538 jiffies ago.
Oct 44 05:38:53 pve kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 283s! [server:976]
Oct 44 05:38:53 pve kernel: CPU: 4 PID: 17 Comm: rcu_preempt Tainted: P
The CPU was soft-locking, which would explain why it wasn’t able to write to disk. I thought about the ways this could happen, because it rarely crashed under load, only when it was basically idling. Then it hit me: Linux and waking up from sleep have a really long and complicated history together. I searched the internet to check whether my chipset and CPU combination have any issues waking up from sleep on Linux, and they did! Not only on Linux but also on Windows, so apparently this is a hardware issue. Waking the CPU back up out of deep idle states (low-power C-states) doesn’t work as smoothly as it should after the voltage has dropped. It is possible to force the CPU cores to never go below a certain voltage, even in low power states. On my motherboard this setting could be found under Power Supply Control: Typical Current Idle (alternatively you could also set the Global C-States to Disabled). Some old PSUs think the computer has gone to sleep or turned itself off when it draws too little power, so this setting also helps prevent putting the PSU into such an unrecoverable state.
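If you want to see what the kernel itself is doing on that front, the idle states it uses are visible in sysfs, and capping the deepest allowed state via a kernel parameter is a commonly suggested software-side alternative to the BIOS toggle. A sketch (whether processor.max_cstate actually helps depends on the idle driver in use):

root@pve:~# grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name     # which idle states exist
root@pve:~# grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage    # how often each has been entered
root@pve:~# # alternative route: add processor.max_cstate=1 to GRUB_CMDLINE_LINUX_DEFAULT
root@pve:~# # in /etc/default/grub, then run update-grub and reboot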
I think it worked
root@pve:~# uptime
20:24:51 up 17 days, 15:36
17 days and counting - I don’t want to jinx it but I think this finally fixed the issue.
Logging Logging Logging
If you have more than a few VMs, you definitely need a central point to view all your logs. There is simply no way you will go and check the logs of all your VMs regularly if you have to do it separately for each of them, let alone if you have the attention span of a goldfish like me. Apparently an Elasticsearch-Logstash-Kibana (ELK) stack is the thing™ kids use these days, so of course I went ahead and searched for something simpler I could use. Some people seem to like Grafana Loki, so why not give it a try before eventually settling on the thing everybody uses anyway.
There was just one minor hiccup while installing Grafana Loki: their installation docs tell you to install Promtail, which, once you’ve installed it and gone through the rest of the docs, you’ll notice is deprecated. Luckily, uninstalling it and installing Alloy, the new and preferred collector, is fairly straightforward.
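Shipping a VM’s systemd journal to Loki with Alloy turned out to be pleasantly small. A minimal sketch of /etc/alloy/config.alloy, assuming Loki is reachable at 192.0.2.10:3100 (placeholder address):

// tail the systemd journal and hand every entry to the Loki writer below
loki.source.journal "system" {
  forward_to = [loki.write.default.receiver]
  labels     = { job = "systemd-journal", host = "minecraft-debian" }
}

// push collected entries to Loki's HTTP API
loki.write "default" {
  endpoint {
    url = "http://192.0.2.10:3100/loki/api/v1/push"
  }
}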
I got Grafana, Loki, and Alloy running and played around with them for a while, but at some point I saw that Alloy’s systemd service kept failing, probably because it should run as its own user called alloy and I had edited its config files or run it as root at some point. That should be an easy fix though! cd /etc/alloy and sudo chown -R alloy:alloy *. It really should have been an easy fix, but what I actually entered was cd /etc and sudo chown -R alloy:alloy *, completely butchering my /etc directory. Luckily I can just spin up a new VM, and this massive fumble happened while I’m still setting everything up. Let this serve as a reminder that automatic backups should be one of the first things to set up, arguably even before logging, so let’s do that now™. Oh and right, let’s just do the ELK stack instead; only indexing the metadata of logs doesn’t really suit my use-case anyway.
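As for the backups: in Proxmox the scheduled jobs live under Datacenter → Backup in the web UI, but the tool doing the work is vzdump, so a one-off run (for a hypothetical VM 101 onto the local storage) looks roughly like this:

root@pve:~# vzdump 101 --storage local --mode snapshot --compress zstd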
