Proxmox VE update

Proxmox VE 8 to 9 Upgrade

Introduction

This document describes performance issues observed after upgrading a Proxmox VE cluster from version 8.x to 9.x. Although the upgrade itself was successful, it resulted in critical stability problems with virtual machines (VMs) during backup job execution.


1. Problem Description

After the update, while backup jobs were running in Snapshot mode, services inside the virtual machines became unavailable. The symptoms included:

  • Application interruptions and connection timeouts.
  • Critical kernel errors in the virtual machine logs:
    • BUG: soft lockup - CPU#X stuck for XXs!
    • rcu: INFO: rcu_preempt detected stalls on CPUs/tasks
  • High load messages on the Proxmox host:
    • perf: interrupt took too long (...)

Notably, before the upgrade to Proxmox 9, the same backup jobs on the same infrastructure caused no such issues.
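The guest-side symptoms can be confirmed quickly from the kernel log. A minimal sketch, run inside each affected VM (whether dmesg or journalctl is available depends on the distribution):

```shell
# Minimal sketch: scan the guest's kernel messages for the two signatures
# observed during backup windows. Run inside each affected VM.
{ dmesg 2>/dev/null; journalctl -k --no-pager 2>/dev/null; } \
  | grep -E 'soft lockup|rcu_preempt detected stalls' \
  || echo "no lockup/stall messages found"
```

Correlating the timestamps of these messages with the backup job's start time is what tied the freezes to the backup window.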


2. Key System Configuration

  • Hypervisor: Proxmox VE 9.x (upgraded from 8.x)
  • Storage: LVM on a hardware RAID controller.
  • Guest Systems: Debian 12, Gentoo, CentOS 9.
  • Backup Mode: Snapshot.
  • Backup Architecture: Backups are created locally on the Proxmox host, then pulled via rsync to a NAS server on the office network.
  • QEMU Guest Agent: Installed, updated, and running on all virtual machines.

3. Root Cause Diagnosis

The root cause is a performance regression in how Proxmox 9 interacts with LVM snapshots. LVM snapshots operate in copy-on-write (CoW) mode: every write the guest makes to the origin volume first forces a copy of the affected block into the snapshot area, which generates significant extra I/O and load for the duration of the backup. The newer kernel and QEMU stack in Proxmox 9 appears to be more sensitive to the resulting latency. Combined with the guest workload, this leads to temporary I/O starvation inside the virtual machine, which the guest kernel reports as soft lockups and RCU stalls (the VM effectively freezes).
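The CoW pressure can be observed directly while a backup runs: the Data% column reported by lvs shows how full each snapshot's CoW area is. A sketch, assuming stock LVM tools and no hardcoded volume names:

```shell
# Sketch: list active snapshot LVs and their CoW fill level (Data%).
# Rapid growth of Data% while a backup runs indicates heavy CoW traffic
# against the origin volume.
if command -v lvs >/dev/null 2>&1; then
  lvs -o lv_name,origin,data_percent --select 'origin != ""' \
    || echo "lvs failed (root privileges may be required)"
else
  echo "lvs not available on this host"
fi
```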


4. Applied Solutions

Standard measures, such as applying a bandwidth rate limit to the backup job, proved insufficient. The problem was ultimately resolved by applying the following steps:

Step 1: Setting I/O Priority for Backup Jobs (Most Effective Fix)

The most crucial step was to lower the I/O priority for the backup process (vzdump). This was set globally on each Proxmox host by editing the /etc/vzdump.conf file and adding the line:

ionice: 8

The value 8 corresponds to the idle I/O scheduling class, the lowest possible priority: the backup process only uses the disk when no other process needs it.
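Whether the setting has taken effect can be checked during a backup window. A sketch; the worker process name and the expected "idle" class follow the vzdump documentation, but verify against your version:

```shell
# Sketch: report the I/O scheduling class of the oldest running vzdump
# worker. With "ionice: 8" in /etc/vzdump.conf it should report "idle".
pid=$(pgrep -o vzdump 2>/dev/null)
if [ -n "$pid" ]; then
  ionice -p "$pid"
else
  echo "no vzdump process currently running"
fi
```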

Step 2: Disabling Transparent Huge Pages (THP) in Guest VMs

As an optimization and to eliminate soft lockup and rcu stall errors, Transparent Huge Pages were disabled in the guest operating systems. This was achieved by adding the transparent_hugepage=never kernel parameter to the GRUB configuration.
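The change can be sketched as follows; the paths assume a Debian-style GRUB setup, CentOS uses grub2-mkconfig, and Gentoo's boot layout may differ:

```shell
# Sketch: disable THP permanently in a guest and verify after reboot.
# 1) In /etc/default/grub, append the parameter to the kernel command line:
#      GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# 2) Regenerate the GRUB config and reboot:
#      update-grub && reboot   # Debian; CentOS: grub2-mkconfig -o /boot/grub2/grub.cfg
# 3) After reboot, the active mode should show [never]:
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null \
  || echo "THP sysfs interface not present on this kernel"
```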

Step 3: Rescheduling Backups

Additionally, to minimize resource contention, backup jobs were moved to off-peak hours (2:00-5:00 AM) when the activity on the virtual machines is lowest.
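On current Proxmox releases, cluster-wide backup jobs are stored in /etc/pve/jobs.cfg and are also editable in the GUI under Datacenter → Backup. A rescheduled job might look like the fragment below; the job ID, storage name, and VM selection are placeholders, not the actual production values:

```
vzdump: backup-nightly-01
        schedule 02:30
        storage local
        mode snapshot
        all 1
        enabled 1
```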