Soft lockup in smp_call_function_many_cond during TLB flush on Framework 16 (Ryzen AI 300 + RTX 5070) #136

@ijbaird

Description

Device Information

System Model or SKU

Framework Laptop 16 (AMD Ryzen™ AI 300 Series)

BIOS VERSION

3.04

Describe the bug

Summary

Kernel soft lockup occurs during TLB flush operations on the Framework Laptop 16 with an AMD Ryzen AI 300 series CPU and an NVIDIA GeForce RTX 5070 discrete GPU. Multiple CPUs become stuck in smp_call_function_many_cond() waiting for IPI acknowledgment during flush_tlb_mm_range(), causing a complete system freeze that requires a hard power-off.

System Information

Hardware: Framework Laptop 16 (AMD Ryzen AI 300 Series)
Mainboard: FRANMHCP09
BIOS: 03.04 (2025-11-06)
CPU: AMD Ryzen AI 9 HX 370 (or AI 7 350) - Strix Halo
iGPU: AMD Radeon 860M (integrated)
dGPU: NVIDIA GeForce RTX 5070 (Graphics Module)
OS: Fedora 43
Kernel: 6.17.9-300.fc43.x86_64 (PREEMPT_DYNAMIC)
Desktop: GNOME (Wayland)

Problem Description

The system experiences complete freezes requiring hard power-off. The kernel watchdog reports soft lockups on multiple CPUs simultaneously, all stuck in the same code path waiting for inter-processor interrupts (IPIs) to complete during TLB shootdown operations.
The lockups occur during normal desktop usage; no specific trigger has been identified. Affected processes include gnome-shell, tailscaled, glycin-svg, and abrt-dump-journ, suggesting that any process performing memory operations can trigger the issue.

Kernel Log Evidence

Primary soft lockup (CPU#12 - gnome-shell):
watchdog: BUG: soft lockup - CPU#12 stuck for 27s! [gnome-shell:5507]
CPU#12 Utilization every 4s during lockup:
#1: 100% system, 0% softirq, 1% hardirq, 0% idle
#2: 100% system, 0% softirq, 0% hardirq, 0% idle
#3: 100% system, 0% softirq, 1% hardirq, 0% idle
#4: 100% system, 0% softirq, 1% hardirq, 0% idle
#5: 100% system, 0% softirq, 0% hardirq, 0% idle

RIP: 0010:smp_call_function_many_cond+0x114/0x560
Call Trace:

? __pfx_flush_tlb_func+0x10/0x10
on_each_cpu_cond_mask+0x24/0x40
flush_tlb_mm_range+0x153/0x1f0
tlb_finish_mmu+0x79/0x1e0
do_mprotect_pkey+0x4e6/0x540
__x64_sys_mprotect+0x1f/0x30
do_syscall_64+0x7e/0x250

Concurrent lockup (CPU#13 - tailscaled):
watchdog: BUG: soft lockup - CPU#13 stuck for 27s! [tailscaled:2149]
RIP: 0010:smp_call_function_many_cond+0x114/0x560
Call Trace:

? __pfx_flush_tlb_func+0x10/0x10
on_each_cpu_cond_mask+0x24/0x40
flush_tlb_mm_range+0x153/0x1f0
ptep_clear_flush+0x61/0x70
wp_page_copy+0x2a8/0x740
__handle_mm_fault+0x551/0x6a0
handle_mm_fault+0x111/0x360
do_user_addr_fault+0x21a/0x690
exc_page_fault+0x74/0x180

Extended lockup (CPU#9 - 100 seconds):
watchdog: BUG: soft lockup - CPU#9 stuck for 100s! [glycin-svg:148083]

Analysis

All lockups show identical RIP: smp_call_function_many_cond+0x114/0x560
All are waiting in TLB flush path (flush_tlb_mm_range)
Register analysis shows RBX = 0x16 (CPU 22) consistently, suggesting CPU 22 is not responding to IPIs
100% system time with 0% idle indicates a busy-wait spin loop (see the sketch after this list)
Kernel is tainted with NVIDIA out-of-tree modules (OE flags)
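
A minimal userspace sketch of the wait pattern these backtraces point at (illustration only, not the kernel's actual csd/IPI code; the flag array and function names here are invented):

/* Conceptual sketch only; the real code lives in kernel/smp.c
 * (smp_call_function_many_cond and csd_lock_wait) and differs in detail. */
#include <stdatomic.h>

/* One "call pending" flag per target CPU: the sender sets it, fires the IPI,
 * and the target CPU's IPI handler clears it after running flush_tlb_func(). */
static atomic_int csd_pending[64];

static void wait_for_ipi_acks(const int *targets, int ntargets)
{
    for (int i = 0; i < ntargets; i++) {
        /* Busy-wait with no sleep or preemption point: if one target (here,
         * apparently CPU 22) never services its IPI, this loop never exits,
         * which is what the watchdog reports as 100% system time, 0% idle. */
        while (atomic_load(&csd_pending[targets[i]]))
            ;
    }
}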

Loaded GPU-Related Modules
nvidia_uvm(OE)
nvidia_drm(OE)
nvidia_modeset(OE)
nvidia(OE)
amdgpu
amdxcp

Hypothesis

This appears to be a hybrid graphics (PRIME) issue in which the NVIDIA driver or the AMD iGPU driver holds interrupts disabled for too long, preventing a CPU from acknowledging the TLB shootdown IPI. The other CPUs then spin indefinitely waiting for that acknowledgment (a sketch of this failure mode follows the list below).
Potential contributing factors:

PRIME render offload synchronization
GPU runtime power management state transitions
MES (Micro Engine Scheduler) timeouts on AMD iGPU
Display mux switching between iGPU and dGPU
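
A hedged illustration of the suspected failure mode (hypothetical driver-style code, not taken from the NVIDIA or amdgpu sources): any path that keeps local interrupts disabled on one CPU for too long stops that CPU from taking the TLB-shootdown IPI, so every other CPU in flush_tlb_mm_range() spins waiting for the acknowledgment.

/* Hypothetical kernel-module fragment, for illustration only. */
#include <linux/spinlock.h>
#include <linux/io.h>

static DEFINE_SPINLOCK(example_lock);   /* hypothetical lock name */

static void long_irqs_off_section(void __iomem *mmio)
{
    unsigned long flags;

    spin_lock_irqsave(&example_lock, flags);

    /* While interrupts are off on this CPU, the TLB-shootdown IPI queued by
     * smp_call_function_many_cond() cannot be delivered here. If this polling
     * loop stalls (e.g. during a GPU power-state transition), every other CPU
     * waiting on the flush spins until the watchdog fires. */
    while (!(readl(mmio) & 0x1))    /* 0x1 is a made-up status bit */
        cpu_relax();

    spin_unlock_irqrestore(&example_lock, flags);
}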

Steps to Reproduce

Use a Framework Laptop 16 with the Ryzen AI 300 mainboard and the RTX 5070 Graphics Module
Install Fedora 43 (or another distro with kernel 6.11+)
Install the NVIDIA proprietary drivers (RPM Fusion or direct)
Use the system normally with both GPUs active (hybrid/PRIME mode)
The system will eventually freeze, typically within hours of use (a stressor that may shorten this is sketched below)
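
No deterministic trigger has been identified, but because the backtraces land in the mprotect and write-protect page-fault TLB-flush paths, a userspace stressor that forces frequent TLB shootdowns may shorten the time to reproduce. This is only a sketch under that assumption, not a confirmed trigger; tlb_stress.c and its parameters are made up for illustration.

/* tlb_stress.c - hypothetical stressor: forces frequent TLB shootdowns by
 * toggling page protections on a shared mapping from several threads,
 * exercising the same mprotect -> flush_tlb_mm_range path seen in the
 * lockup backtraces. Build with: cc -O2 -pthread tlb_stress.c -o tlb_stress */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *region;
static size_t region_len;
static long page_size;

/* Repeatedly drop and restore write permission; each permission-reducing
 * mprotect() forces a TLB flush, which becomes IPIs to the other CPUs. */
static void *toggler(void *arg)
{
    (void)arg;
    for (;;) {
        mprotect(region, region_len, PROT_READ);
        mprotect(region, region_len, PROT_READ | PROT_WRITE);
    }
    return NULL;
}

/* Keep TLB entries for the region hot on other CPUs by reading every page. */
static void *toucher(void *arg)
{
    volatile unsigned char sink;
    (void)arg;
    for (;;)
        for (size_t i = 0; i < region_len; i += (size_t)page_size)
            sink = region[i];
    return NULL;
}

int main(void)
{
    pthread_t t[4];

    page_size = sysconf(_SC_PAGESIZE);
    region_len = 64 * (size_t)page_size;
    region = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(region, 0, region_len);

    pthread_create(&t[0], NULL, toggler, NULL);
    pthread_create(&t[1], NULL, toggler, NULL);
    pthread_create(&t[2], NULL, toucher, NULL);
    pthread_create(&t[3], NULL, toucher, NULL);

    pause();            /* run until interrupted */
    return 0;
}

Running one or two instances alongside normal hybrid-graphics desktop use should raise the rate of cross-CPU TLB flushes without changing the underlying failure mode.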

Expected Behavior

TLB flush IPIs should complete in a timely manner without causing system lockup.

Actual Behavior

Multiple CPUs become stuck waiting for IPI acknowledgment, causing complete system freeze.

Additional Information

This issue is being reported by multiple Framework 16 users with the new Ryzen AI 300 + RTX 5070 configuration. Similar issues have been observed on desktop systems with AMD Ryzen 9000 series + RTX 5070/5080.
Related reports:

Framework Community: AMD GPU MES Timeouts thread
Arch Linux Forums: "System freeze but no crash since nvidia-open 575.x"
Framework GitHub Issue Tracker: FW16 Freeze then Hang (FTH) #58

Attachments

Full journalctl log is attached to this issue.
