In the Linux kernel, the following vulnerability has been resolved: net/mlx5e: Take state lock during tx timeout reporter mlx5e_safe_reopen_channels() requires the state lock taken. The referenced changed in the Fixes tag removed the lock to fix another issue. This patch adds it back but at a later point (when calling mlx5e_safe_reopen_channels()) to avoid the deadlock referenced in the Fixes tag.
In the Linux kernel, the following vulnerability has been resolved: bpf: fix recursive lock when verdict program return SK_PASS When the stream_verdict program returns SK_PASS, it places the received skb into its own receive queue, but a recursive lock eventually occurs, leading to an operating system deadlock. This issue has been present since v6.9. ''' sk_psock_strp_data_ready write_lock_bh(&sk->sk_callback_lock) strp_data_ready strp_read_sock read_sock -> tcp_read_sock strp_recv cb.rcv_msg -> sk_psock_strp_read # now stream_verdict return SK_PASS without peer sock assign __SK_PASS = sk_psock_map_verd(SK_PASS, NULL) sk_psock_verdict_apply sk_psock_skb_ingress_self sk_psock_skb_ingress_enqueue sk_psock_data_ready read_lock_bh(&sk->sk_callback_lock) <= dead lock ''' This topic has been discussed before, but it has not been fixed. Previous discussion: https://lore.kernel.org/all/6684a5864ec86_403d20898@john.notmuch
In the Linux kernel, the following vulnerability has been resolved: LoongArch: Fix sleeping in atomic context for PREEMPT_RT Commit bab1c299f3945ffe79 ("LoongArch: Fix sleeping in atomic context in setup_tlb_handler()") changes the gfp flag from GFP_KERNEL to GFP_ATOMIC for alloc_pages_node(). However, for PREEMPT_RT kernels we can still get a "sleeping in atomic context" error: [ 0.372259] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [ 0.372266] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1 [ 0.372268] preempt_count: 1, expected: 0 [ 0.372270] RCU nest depth: 1, expected: 1 [ 0.372272] 3 locks held by swapper/1/0: [ 0.372274] #0: 900000000c9f5e60 (&pcp->lock){+.+.}-{3:3}, at: get_page_from_freelist+0x524/0x1c60 [ 0.372294] #1: 90000000087013b8 (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x50/0x140 [ 0.372305] #2: 900000047fffd388 (&zone->lock){+.+.}-{3:3}, at: __rmqueue_pcplist+0x30c/0xea0 [ 0.372314] irq event stamp: 0 [ 0.372316] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 0.372322] hardirqs last disabled at (0): [<9000000005947320>] copy_process+0x9c0/0x26e0 [ 0.372329] softirqs last enabled at (0): [<9000000005947320>] copy_process+0x9c0/0x26e0 [ 0.372335] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 0.372341] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.12.0-rc7+ #1891 [ 0.372346] Hardware name: Loongson Loongson-3A5000-7A1000-1w-CRB/Loongson-LS3A5000-7A1000-1w-CRB, BIOS vUDK2018-LoongArch-V2.0.0-prebeta9 10/21/2022 [ 0.372349] Stack : 0000000000000089 9000000005a0db9c 90000000071519c8 9000000100388000 [ 0.372486] 900000010038b890 0000000000000000 900000010038b898 9000000007e53788 [ 0.372492] 900000000815bcc8 900000000815bcc0 900000010038b700 0000000000000001 [ 0.372498] 0000000000000001 4b031894b9d6b725 00000000055ec000 9000000100338fc0 [ 0.372503] 00000000000000c4 0000000000000001 000000000000002d 0000000000000003 [ 0.372509] 0000000000000030 0000000000000003 00000000055ec000 0000000000000003 [ 0.372515] 900000000806d000 9000000007e53788 00000000000000b0 0000000000000004 [ 0.372521] 0000000000000000 0000000000000000 900000000c9f5f10 0000000000000000 [ 0.372526] 90000000076f12d8 9000000007e53788 9000000005924778 0000000000000000 [ 0.372532] 00000000000000b0 0000000000000004 0000000000000000 0000000000070000 [ 0.372537] ... [ 0.372540] Call Trace: [ 0.372542] [<9000000005924778>] show_stack+0x38/0x180 [ 0.372548] [<90000000071519c4>] dump_stack_lvl+0x94/0xe4 [ 0.372555] [<900000000599b880>] __might_resched+0x1a0/0x260 [ 0.372561] [<90000000071675cc>] rt_spin_lock+0x4c/0x140 [ 0.372565] [<9000000005cbb768>] __rmqueue_pcplist+0x308/0xea0 [ 0.372570] [<9000000005cbed84>] get_page_from_freelist+0x564/0x1c60 [ 0.372575] [<9000000005cc0d98>] __alloc_pages_noprof+0x218/0x1820 [ 0.372580] [<900000000593b36c>] tlb_init+0x1ac/0x298 [ 0.372585] [<9000000005924b74>] per_cpu_trap_init+0x114/0x140 [ 0.372589] [<9000000005921964>] cpu_probe+0x4e4/0xa60 [ 0.372592] [<9000000005934874>] start_secondary+0x34/0xc0 [ 0.372599] [<900000000715615c>] smpboot_entry+0x64/0x6c This is because in PREEMPT_RT kernels normal spinlocks are replaced by rt spinlocks and rt_spin_lock() will cause sleeping. Fix it by disabling NUMA optimization completely for PREEMPT_RT kernels.
In the Linux kernel, the following vulnerability has been resolved: block: Prevent potential deadlocks in zone write plug error recovery Zone write plugging for handling writes to zones of a zoned block device always execute a zone report whenever a write BIO to a zone fails. The intent of this is to ensure that the tracking of a zone write pointer is always correct to ensure that the alignment to a zone write pointer of write BIOs can be checked on submission and that we can always correctly emulate zone append operations using regular write BIOs. However, this error recovery scheme introduces a potential deadlock if a device queue freeze is initiated while BIOs are still plugged in a zone write plug and one of these write operation fails. In such case, the disk zone write plug error recovery work is scheduled and executes a report zone. This in turn can result in a request allocation in the underlying driver to issue the report zones command to the device. But with the device queue freeze already started, this allocation will block, preventing the report zone execution and the continuation of the processing of the plugged BIOs. As plugged BIOs hold a queue usage reference, the queue freeze itself will never complete, resulting in a deadlock. Avoid this problem by completely removing from the zone write plugging code the use of report zones operations after a failed write operation, instead relying on the device user to either execute a report zones, reset the zone, finish the zone, or give up writing to the device (which is a fairly common pattern for file systems which degrade to read-only after write failures). This is not an unreasonnable requirement as all well-behaved applications, FSes and device mapper already use report zones to recover from write errors whenever possible by comparing the current position of a zone write pointer with what their assumption about the position is. The changes to remove the automatic error recovery are as follows: - Completely remove the error recovery work and its associated resources (zone write plug list head, disk error list, and disk zone_wplugs_work work struct). This also removes the functions disk_zone_wplug_set_error() and disk_zone_wplug_clear_error(). - Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write plug whenever a write opration targetting the zone of the zone write plug fails. This flag indicates that the zone write pointer offset is not reliable and that it must be updated when the next report zone, reset zone, finish zone or disk revalidation is executed. - Modify blk_zone_write_plug_bio_endio() to set the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed write BIO. - Modify the function disk_zone_wplug_set_wp_offset() to clear this new flag, thus implementing recovery of a correct write pointer offset with the reset (all) zone and finish zone operations. - Modify blkdev_report_zones() to always use the disk_report_zones_cb() callback so that disk_zone_wplug_sync_wp_offset() can be called for any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. This implements recovery of a correct write pointer offset for zone write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within the range of the report zones operation executed by the user. - Modify blk_revalidate_seq_zone() to call disk_zone_wplug_sync_wp_offset() for all sequential write required zones when a zoned block device is revalidated, thus always resolving any inconsistency between the write pointer offset of zone write plugs and the actual write pointer position of sequential zones.
In the Linux kernel, the following vulnerability has been resolved: ALSA: us122l: Use snd_card_free_when_closed() at disconnection The USB disconnect callback is supposed to be short and not too-long waiting. OTOH, the current code uses snd_card_free() at disconnection, but this waits for the close of all used fds, hence it can take long. It eventually blocks the upper layer USB ioctls, which may trigger a soft lockup. An easy workaround is to replace snd_card_free() with snd_card_free_when_closed(). This variant returns immediately while the release of resources is done asynchronously by the card device release at the last close. The loop of us122l->mmap_count check is dropped as well. The check is useless for the asynchronous operation with *_when_closed().
In the Linux kernel, the following vulnerability has been resolved: ALSA: caiaq: Use snd_card_free_when_closed() at disconnection The USB disconnect callback is supposed to be short and not too-long waiting. OTOH, the current code uses snd_card_free() at disconnection, but this waits for the close of all used fds, hence it can take long. It eventually blocks the upper layer USB ioctls, which may trigger a soft lockup. An easy workaround is to replace snd_card_free() with snd_card_free_when_closed(). This variant returns immediately while the release of resources is done asynchronously by the card device release at the last close. This patch also splits the code to the disconnect and the free phases; the former is called immediately at the USB disconnect callback while the latter is called from the card destructor.
In the Linux kernel, the following vulnerability has been resolved: ALSA: usx2y: Use snd_card_free_when_closed() at disconnection The USB disconnect callback is supposed to be short and not too-long waiting. OTOH, the current code uses snd_card_free() at disconnection, but this waits for the close of all used fds, hence it can take long. It eventually blocks the upper layer USB ioctls, which may trigger a soft lockup. An easy workaround is to replace snd_card_free() with snd_card_free_when_closed(). This variant returns immediately while the release of resources is done asynchronously by the card device release at the last close.
In the Linux kernel, the following vulnerability has been resolved: mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation When compiling kernel source 'make -j $(nproc)' with the up-and-running KASAN-enabled kernel on a 256-core machine, the following soft lockup is shown: watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760] CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95 Workqueue: events drain_vmap_area_work RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0 Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75 RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202 RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949 RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50 RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800 R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39 R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0 Call Trace: <IRQ> ? watchdog_timer_fn+0x2cd/0x390 ? __pfx_watchdog_timer_fn+0x10/0x10 ? __hrtimer_run_queues+0x300/0x6d0 ? sched_clock_cpu+0x69/0x4e0 ? __pfx___hrtimer_run_queues+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? ktime_get_update_offsets_now+0x7f/0x2a0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? hrtimer_interrupt+0x2ca/0x760 ? __sysvec_apic_timer_interrupt+0x8c/0x2b0 ? sysvec_apic_timer_interrupt+0x6a/0x90 </IRQ> <TASK> ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? smp_call_function_many_cond+0x1d8/0xbb0 ? __pfx_do_kernel_range_flush+0x10/0x10 on_each_cpu_cond_mask+0x20/0x40 flush_tlb_kernel_range+0x19b/0x250 ? srso_return_thunk+0x5/0x5f ? kasan_release_vmalloc+0xa7/0xc0 purge_vmap_node+0x357/0x820 ? __pfx_purge_vmap_node+0x10/0x10 __purge_vmap_area_lazy+0x5b8/0xa10 drain_vmap_area_work+0x21/0x30 process_one_work+0x661/0x10b0 worker_thread+0x844/0x10e0 ? srso_return_thunk+0x5/0x5f ? __kthread_parkme+0x82/0x140 ? __pfx_worker_thread+0x10/0x10 kthread+0x2a5/0x370 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Debugging Analysis: 1. The following ftrace log shows that the lockup CPU spends too much time iterating vmap_nodes and flushing TLB when purging vm_area structures. (Some info is trimmed). kworker: funcgraph_entry: | drain_vmap_area_work() { kworker: funcgraph_entry: | mutex_lock() { kworker: funcgraph_entry: 1.092 us | __cond_resched(); kworker: funcgraph_exit: 3.306 us | } ... ... kworker: funcgraph_entry: | flush_tlb_kernel_range() { ... ... kworker: funcgraph_exit: # 7533.649 us | } ... ... kworker: funcgraph_entry: 2.344 us | mutex_unlock(); kworker: funcgraph_exit: $ 23871554 us | } The drain_vmap_area_work() spends over 23 seconds. There are 2805 flush_tlb_kernel_range() calls in the ftrace log. * One is called in __purge_vmap_area_lazy(). * Others are called by purge_vmap_node->kasan_release_vmalloc. purge_vmap_node() iteratively releases kasan vmalloc allocations and flushes TLB for each vmap_area. - [Rough calculation] Each flush_tlb_kernel_range() runs about 7.5ms. -- 2804 * 7.5ms = 21.03 seconds. -- That's why a soft lock is triggered. 2. Extending the soft lockup time can work around the issue (For example, # echo ---truncated---
In the Linux kernel, the following vulnerability has been resolved: Bluetooth: iso: Fix circular lock in iso_listen_bis This fixes the circular locking dependency warning below, by releasing the socket lock before enterning iso_listen_bis, to avoid any potential deadlock with hdev lock. [ 75.307983] ====================================================== [ 75.307984] WARNING: possible circular locking dependency detected [ 75.307985] 6.12.0-rc6+ #22 Not tainted [ 75.307987] ------------------------------------------------------ [ 75.307987] kworker/u81:2/2623 is trying to acquire lock: [ 75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO) at: iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308021] but task is already holding lock: [ 75.308022] ffff8fdd61a10078 (&hdev->lock) at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308053] which lock already depends on the new lock. [ 75.308054] the existing dependency chain (in reverse order) is: [ 75.308055] -> #1 (&hdev->lock){+.+.}-{3:3}: [ 75.308057] __mutex_lock+0xad/0xc50 [ 75.308061] mutex_lock_nested+0x1b/0x30 [ 75.308063] iso_sock_listen+0x143/0x5c0 [bluetooth] [ 75.308085] __sys_listen_socket+0x49/0x60 [ 75.308088] __x64_sys_listen+0x4c/0x90 [ 75.308090] x64_sys_call+0x2517/0x25f0 [ 75.308092] do_syscall_64+0x87/0x150 [ 75.308095] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 75.308098] -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}: [ 75.308100] __lock_acquire+0x155e/0x25f0 [ 75.308103] lock_acquire+0xc9/0x300 [ 75.308105] lock_sock_nested+0x32/0x90 [ 75.308107] iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308128] hci_connect_cfm+0x6c/0x190 [bluetooth] [ 75.308155] hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth] [ 75.308180] hci_le_meta_evt+0xe7/0x200 [bluetooth] [ 75.308206] hci_event_packet+0x21f/0x5c0 [bluetooth] [ 75.308230] hci_rx_work+0x3ae/0xb10 [bluetooth] [ 75.308254] process_one_work+0x212/0x740 [ 75.308256] worker_thread+0x1bd/0x3a0 [ 75.308258] kthread+0xe4/0x120 [ 75.308259] ret_from_fork+0x44/0x70 [ 75.308261] ret_from_fork_asm+0x1a/0x30 [ 75.308263] other info that might help us debug this: [ 75.308264] Possible unsafe locking scenario: [ 75.308264] CPU0 CPU1 [ 75.308265] ---- ---- [ 75.308265] lock(&hdev->lock); [ 75.308267] lock(sk_lock- AF_BLUETOOTH-BTPROTO_ISO); [ 75.308268] lock(&hdev->lock); [ 75.308269] lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO); [ 75.308270] *** DEADLOCK *** [ 75.308271] 4 locks held by kworker/u81:2/2623: [ 75.308272] #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work+0x443/0x740 [ 75.308276] #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)), at: process_one_work+0x1ce/0x740 [ 75.308280] #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3} at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308304] #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2}, at: hci_connect_cfm+0x29/0x190 [bluetooth]
In the Linux kernel, the following vulnerability has been resolved: netfilter: IDLETIMER: Fix for possible ABBA deadlock Deletion of the last rule referencing a given idletimer may happen at the same time as a read of its file in sysfs: | ====================================================== | WARNING: possible circular locking dependency detected | 6.12.0-rc7-01692-g5e9a28f41134-dirty #594 Not tainted | ------------------------------------------------------ | iptables/3303 is trying to acquire lock: | ffff8881057e04b8 (kn->active#48){++++}-{0:0}, at: __kernfs_remove+0x20 | | but task is already holding lock: | ffffffffa0249068 (list_mutex){+.+.}-{3:3}, at: idletimer_tg_destroy_v] | | which lock already depends on the new lock. A simple reproducer is: | #!/bin/bash | | while true; do | iptables -A INPUT -i foo -j IDLETIMER --timeout 10 --label "testme" | iptables -D INPUT -i foo -j IDLETIMER --timeout 10 --label "testme" | done & | while true; do | cat /sys/class/xt_idletimer/timers/testme >/dev/null | done Avoid this by freeing list_mutex right after deleting the element from the list, then continuing with the teardown.
In the Linux kernel, the following vulnerability has been resolved: Revert "ALSA: firewire-lib: operate for period elapse event in process context" Commit 7ba5ca32fe6e ("ALSA: firewire-lib: operate for period elapse event in process context") removed the process context workqueue from amdtp_domain_stream_pcm_pointer() and update_pcm_pointers() to remove its overhead. With RME Fireface 800, this lead to a regression since Kernels 5.14.0, causing an AB/BA deadlock competition for the substream lock with eventual system freeze under ALSA operation: thread 0: * (lock A) acquire substream lock by snd_pcm_stream_lock_irq() in snd_pcm_status64() * (lock B) wait for tasklet to finish by calling tasklet_unlock_spin_wait() in tasklet_disable_in_atomic() in ohci_flush_iso_completions() of ohci.c thread 1: * (lock B) enter tasklet * (lock A) attempt to acquire substream lock, waiting for it to be released: snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed() in update_pcm_pointers() in process_ctx_payloads() in process_rx_packets() of amdtp-stream.c ? tasklet_unlock_spin_wait </NMI> <TASK> ohci_flush_iso_completions firewire_ohci amdtp_domain_stream_pcm_pointer snd_firewire_lib snd_pcm_update_hw_ptr0 snd_pcm snd_pcm_status64 snd_pcm ? native_queued_spin_lock_slowpath </NMI> <IRQ> _raw_spin_lock_irqsave snd_pcm_period_elapsed snd_pcm process_rx_packets snd_firewire_lib irq_target_callback snd_firewire_lib handle_it_packet firewire_ohci context_tasklet firewire_ohci Restore the process context work queue to prevent deadlock AB/BA deadlock competition for ALSA substream lock of snd_pcm_stream_lock_irq() in snd_pcm_status64() and snd_pcm_stream_lock_irqsave() in snd_pcm_period_elapsed(). revert commit 7ba5ca32fe6e ("ALSA: firewire-lib: operate for period elapse event in process context") Replace inline description to prevent future deadlock.
In the Linux kernel, the following vulnerability has been resolved: tpm: Lock TPM chip in tpm_pm_suspend() first Setting TPM_CHIP_FLAG_SUSPENDED in the end of tpm_pm_suspend() can be racy according, as this leaves window for tpm_hwrng_read() to be called while the operation is in progress. The recent bug report gives also evidence of this behaviour. Aadress this by locking the TPM chip before checking any chip->flags both in tpm_pm_suspend() and tpm_hwrng_read(). Move TPM_CHIP_FLAG_SUSPENDED check inside tpm_get_random() so that it will be always checked only when the lock is reserved.
In the Linux kernel, the following vulnerability has been resolved: scsi: ufs: core: Fix another deadlock during RTC update If ufshcd_rtc_work calls ufshcd_rpm_put_sync() and the pm's usage_count is 0, we will enter the runtime suspend callback. However, the runtime suspend callback will wait to flush ufshcd_rtc_work, causing a deadlock. Replace ufshcd_rpm_put_sync() with ufshcd_rpm_put() to avoid the deadlock.
In the Linux kernel, the following vulnerability has been resolved: drm/xe: Drop VM dma-resv lock on xe_sync_in_fence_get failure in exec IOCTL Upon failure all locks need to be dropped before returning to the user. (cherry picked from commit 7d1a4258e602ffdce529f56686925034c1b3b095)
In the Linux kernel, the following vulnerability has been resolved: drm/panthor: Lock XArray when getting entries for the VM Similar to commit cac075706f29 ("drm/panthor: Fix race when converting group handle to group object") we need to use the XArray's internal locking when retrieving a vm pointer from there. v2: Removed part of the patch that was trying to protect fetching the heap pointer from XArray, as that operation is protected by the @pool->lock.
In the Linux kernel, the following vulnerability has been resolved: mm/thp: fix deferred split unqueue naming and locking Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases.
In the Linux kernel, the following vulnerability has been resolved: mptcp: init: protect sched with rcu_read_lock Enabling CONFIG_PROVE_RCU_LIST with its dependence CONFIG_RCU_EXPERT creates this splat when an MPTCP socket is created: ============================= WARNING: suspicious RCU usage 6.12.0-rc2+ #11 Not tainted ----------------------------- net/mptcp/sched.c:44 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by mptcp_connect/176. stack backtrace: CPU: 0 UID: 0 PID: 176 Comm: mptcp_connect Not tainted 6.12.0-rc2+ #11 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822) mptcp_sched_find (net/mptcp/sched.c:44 (discriminator 7)) mptcp_init_sock (net/mptcp/protocol.c:2867 (discriminator 1)) ? sock_init_data_uid (arch/x86/include/asm/atomic.h:28) inet_create.part.0.constprop.0 (net/ipv4/af_inet.c:386) ? __sock_create (include/linux/rcupdate.h:347 (discriminator 1)) __sock_create (net/socket.c:1576) __sys_socket (net/socket.c:1671) ? __pfx___sys_socket (net/socket.c:1712) ? do_user_addr_fault (arch/x86/mm/fault.c:1419 (discriminator 1)) __x64_sys_socket (net/socket.c:1728) do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) That's because when the socket is initialised, rcu_read_lock() is not used despite the explicit comment written above the declaration of mptcp_sched_find() in sched.c. Adding the missing lock/unlock avoids the warning.
In the Linux kernel, the following vulnerability has been resolved: net: nfc: fix deadlock between nfc_unregister_device and rfkill_fop_write A deadlock can occur between nfc_unregister_device() and rfkill_fop_write() due to lock ordering inversion between device_lock and rfkill_global_mutex. The problematic lock order is: Thread A (rfkill_fop_write): rfkill_fop_write() mutex_lock(&rfkill_global_mutex) rfkill_set_block() nfc_rfkill_set_block() nfc_dev_down() device_lock(&dev->dev) <- waits for device_lock Thread B (nfc_unregister_device): nfc_unregister_device() device_lock(&dev->dev) rfkill_unregister() mutex_lock(&rfkill_global_mutex) <- waits for rfkill_global_mutex This creates a classic ABBA deadlock scenario. Fix this by moving rfkill_unregister() and rfkill_destroy() outside the device_lock critical section. Store the rfkill pointer in a local variable before releasing the lock, then call rfkill_unregister() after releasing device_lock. This change is safe because rfkill_fop_write() holds rfkill_global_mutex while calling the rfkill callbacks, and rfkill_unregister() also acquires rfkill_global_mutex before cleanup. Therefore, rfkill_unregister() will wait for any ongoing callback to complete before proceeding, and device_del() is only called after rfkill_unregister() returns, preventing any use-after-free. The similar lock ordering in nfc_register_device() (device_lock -> rfkill_global_mutex via rfkill_register) is safe because during registration the device is not yet in rfkill_list, so no concurrent rfkill operations can occur on this device.
In the Linux kernel, the following vulnerability has been resolved: mptcp: avoid deadlock on fallback while reinjecting Jakub reported an MPTCP deadlock at fallback time: WARNING: possible recursive locking detected 6.18.0-rc7-virtme #1 Not tainted -------------------------------------------- mptcp_connect/20858 is trying to acquire lock: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_try_fallback+0xd8/0x280 but task is already holding lock: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&msk->fallback_lock); lock(&msk->fallback_lock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by mptcp_connect/20858: #0: ff1100001da18290 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x114/0x1bc0 #1: ff1100001db40fd0 (k-sk_lock-AF_INET#2){+.+.}-{0:0}, at: __mptcp_retrans+0x2cb/0xaa0 #2: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0 stack backtrace: CPU: 0 UID: 0 PID: 20858 Comm: mptcp_connect Not tainted 6.18.0-rc7-virtme #1 PREEMPT(full) Hardware name: Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x6f/0xa0 print_deadlock_bug.cold+0xc0/0xcd validate_chain+0x2ff/0x5f0 __lock_acquire+0x34c/0x740 lock_acquire.part.0+0xbc/0x260 _raw_spin_lock_bh+0x38/0x50 __mptcp_try_fallback+0xd8/0x280 mptcp_sendmsg_frag+0x16c2/0x3050 __mptcp_retrans+0x421/0xaa0 mptcp_release_cb+0x5aa/0xa70 release_sock+0xab/0x1d0 mptcp_sendmsg+0xd5b/0x1bc0 sock_write_iter+0x281/0x4d0 new_sync_write+0x3c5/0x6f0 vfs_write+0x65e/0xbb0 ksys_write+0x17e/0x200 do_syscall_64+0xbb/0xfd0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fa5627cbc5e Code: 4d 89 d8 e8 14 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 002b:00007fff1fe14700 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fa5627cbc5e RDX: 0000000000001f9c RSI: 00007fff1fe16984 RDI: 0000000000000005 RBP: 00007fff1fe14710 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff1fe16920 R13: 0000000000002000 R14: 0000000000001f9c R15: 0000000000001f9c The packet scheduler could attempt a reinjection after receiving an MP_FAIL and before the infinite map has been transmitted, causing a deadlock since MPTCP needs to do the reinjection atomically from WRT fallback. Address the issue explicitly avoiding the reinjection in the critical scenario. Note that this is the only fallback critical section that could potentially send packets and hit the double-lock.
In the Linux kernel, the following vulnerability has been resolved: nvdimm: Fix firmware activation deadlock scenarios Lockdep reports the following deadlock scenarios for CXL root device power-management, device_prepare(), operations, and device_shutdown() operations for 'nd_region' devices: Chain exists of: &nvdimm_region_key --> &nvdimm_bus->reconfig_mutex --> system_transition_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(system_transition_mutex); lock(&nvdimm_bus->reconfig_mutex); lock(system_transition_mutex); lock(&nvdimm_region_key); Chain exists of: &cxl_nvdimm_bridge_key --> acpi_scan_lock --> &cxl_root_key Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&cxl_root_key); lock(acpi_scan_lock); lock(&cxl_root_key); lock(&cxl_nvdimm_bridge_key); These stem from holding nvdimm_bus_lock() over hibernate_quiet_exec() which walks the entire system device topology taking device_lock() along the way. The nvdimm_bus_lock() is protecting against unregistration, multiple simultaneous ops callers, and preventing activate_show() from racing activate_store(). For the first 2, the lock is redundant. Unregistration already flushes all ops users, and sysfs already prevents multiple threads to be active in an ops handler at the same time. For the last userspace should already be waiting for its last activate_store() to complete, and does not need activate_show() to flush the write side, so this lock usage can be deleted in these attributes.
A deadlock flaw was found in the Linux kernel’s BPF subsystem. This flaw allows a local user to potentially crash the system.
In the Linux kernel, the following vulnerability has been resolved: nfc: nci: fix circular locking dependency in nci_close_device nci_close_device() flushes rx_wq and tx_wq while holding req_lock. This causes a circular locking dependency because nci_rx_work() running on rx_wq can end up taking req_lock too: nci_rx_work -> nci_rx_data_packet -> nci_data_exchange_complete -> __sk_destruct -> rawsock_destruct -> nfc_deactivate_target -> nci_deactivate_target -> nci_request -> mutex_lock(&ndev->req_lock) Move the flush of rx_wq after req_lock has been released. This should safe (I think) because NCI_UP has already been cleared and the transport is closed, so the work will see it and return -ENETDOWN. NIPA has been hitting this running the nci selftest with a debug kernel on roughly 4% of the runs.
In the Linux kernel, the following vulnerability has been resolved: drm/xe/guc_submit: add missing locking in wedged_fini Any non-wedged queue can have a zero refcount here and can be running concurrently with an async queue destroy, therefore dereferencing the queue ptr to check wedge status after the lookup can trigger UAF if queue is not wedged. Fix this by keeping the submission_state lock held around the check to postpone the free and make the check safe, before dropping again around the put() to avoid the deadlock. (cherry picked from commit d28af0b6b9580b9f90c265a7da0315b0ad20bbfd)
In the Linux kernel, the following vulnerability has been resolved: i2c: stm32f7: Do not prepare/unprepare clock during runtime suspend/resume In case there is any sort of clock controller attached to this I2C bus controller, for example Versaclock or even an AIC32x4 I2C codec, then an I2C transfer triggered from the clock controller clk_ops .prepare callback may trigger a deadlock on drivers/clk/clk.c prepare_lock mutex. This is because the clock controller first grabs the prepare_lock mutex and then performs the prepare operation, including its I2C access. The I2C access resumes this I2C bus controller via .runtime_resume callback, which calls clk_prepare_enable(), which attempts to grab the prepare_lock mutex again and deadlocks. Since the clock are already prepared since probe() and unprepared in remove(), use simple clk_enable()/clk_disable() calls to enable and disable the clock on runtime suspend and resume, to avoid hitting the prepare_lock mutex.
In the Linux kernel, the following vulnerability has been resolved: dma-buf/sw-sync: don't enable IRQ from sync_print_obj() Since commit a6aa8fca4d79 ("dma-buf/sw-sync: Reduce irqsave/irqrestore from known context") by error replaced spin_unlock_irqrestore() with spin_unlock_irq() for both sync_debugfs_show() and sync_print_obj() despite sync_print_obj() is called from sync_debugfs_show(), lockdep complains inconsistent lock state warning. Use plain spin_{lock,unlock}() for sync_print_obj(), for sync_debugfs_show() is already using spin_{lock,unlock}_irq().
In the Linux kernel, the following vulnerability has been resolved: riscv:uprobe fix SR_SPIE set/clear handling In riscv the process of uprobe going to clear spie before exec the origin insn,and set spie after that.But When access the page which origin insn has been placed a page fault may happen and irq was disabled in arch_uprobe_pre_xol function,It cause a WARN as follows. There is no need to clear/set spie in arch_uprobe_pre/post/abort_xol. We can just remove it. [ 31.684157] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1488 [ 31.684677] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 76, name: work [ 31.684929] preempt_count: 0, expected: 0 [ 31.685969] CPU: 2 PID: 76 Comm: work Tainted: G [ 31.686542] Hardware name: riscv-virtio,qemu (DT) [ 31.686797] Call Trace: [ 31.687053] [<ffffffff80006442>] dump_backtrace+0x30/0x38 [ 31.687699] [<ffffffff80812118>] show_stack+0x40/0x4c [ 31.688141] [<ffffffff8081817a>] dump_stack_lvl+0x44/0x5c [ 31.688396] [<ffffffff808181aa>] dump_stack+0x18/0x20 [ 31.688653] [<ffffffff8003e454>] __might_resched+0x114/0x122 [ 31.688948] [<ffffffff8003e4b2>] __might_sleep+0x50/0x7a [ 31.689435] [<ffffffff80822676>] down_read+0x30/0x130 [ 31.689728] [<ffffffff8000b650>] do_page_fault+0x166/x446 [ 31.689997] [<ffffffff80003c0c>] ret_from_exception+0x0/0xc
In the Linux kernel, the following vulnerability has been resolved: scsi: storvsc: Remove WQ_MEM_RECLAIM from storvsc_error_wq storvsc_error_wq workqueue should not be marked as WQ_MEM_RECLAIM as it doesn't need to make forward progress under memory pressure. Marking this workqueue as WQ_MEM_RECLAIM may cause deadlock while flushing a non-WQ_MEM_RECLAIM workqueue. In the current state it causes the following warning: [ 14.506347] ------------[ cut here ]------------ [ 14.506354] workqueue: WQ_MEM_RECLAIM storvsc_error_wq_0:storvsc_remove_lun is flushing !WQ_MEM_RECLAIM events_freezable_power_:disk_events_workfn [ 14.506360] WARNING: CPU: 0 PID: 8 at <-snip->kernel/workqueue.c:2623 check_flush_dependency+0xb5/0x130 [ 14.506390] CPU: 0 PID: 8 Comm: kworker/u4:0 Not tainted 5.4.0-1086-azure #91~18.04.1-Ubuntu [ 14.506391] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022 [ 14.506393] Workqueue: storvsc_error_wq_0 storvsc_remove_lun [ 14.506395] RIP: 0010:check_flush_dependency+0xb5/0x130 <-snip-> [ 14.506408] Call Trace: [ 14.506412] __flush_work+0xf1/0x1c0 [ 14.506414] __cancel_work_timer+0x12f/0x1b0 [ 14.506417] ? kernfs_put+0xf0/0x190 [ 14.506418] cancel_delayed_work_sync+0x13/0x20 [ 14.506420] disk_block_events+0x78/0x80 [ 14.506421] del_gendisk+0x3d/0x2f0 [ 14.506423] sr_remove+0x28/0x70 [ 14.506427] device_release_driver_internal+0xef/0x1c0 [ 14.506428] device_release_driver+0x12/0x20 [ 14.506429] bus_remove_device+0xe1/0x150 [ 14.506431] device_del+0x167/0x380 [ 14.506432] __scsi_remove_device+0x11d/0x150 [ 14.506433] scsi_remove_device+0x26/0x40 [ 14.506434] storvsc_remove_lun+0x40/0x60 [ 14.506436] process_one_work+0x209/0x400 [ 14.506437] worker_thread+0x34/0x400 [ 14.506439] kthread+0x121/0x140 [ 14.506440] ? process_one_work+0x400/0x400 [ 14.506441] ? kthread_park+0x90/0x90 [ 14.506443] ret_from_fork+0x35/0x40 [ 14.506445] ---[ end trace 2d9633159fdc6ee7 ]---
In the Linux kernel, the following vulnerability has been resolved: padata: Always leave BHs disabled when running ->parallel() A deadlock can happen when an overloaded system runs ->parallel() in the context of the current task: padata_do_parallel ->parallel() pcrypt_aead_enc/dec padata_do_serial spin_lock(&reorder->lock) // BHs still enabled <interrupt> ... __do_softirq ... padata_do_serial spin_lock(&reorder->lock) It's a bug for BHs to be on in _do_serial as Steffen points out, so ensure they're off in the "current task" case like they are in padata_parallel_worker to avoid this situation.
In the Linux kernel, the following vulnerability has been resolved: posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime() If get_clock_desc() succeeds, it calls fget() for the clockid's fd, and get the clk->rwsem read lock, so the error path should release the lock to make the lock balance and fput the clockid's fd to make the refcount balance and release the fd related resource. However the below commit left the error path locked behind resulting in unbalanced locking. Check timespec64_valid_strict() before get_clock_desc() to fix it, because the "ts" is not changed after that. [pabeni@redhat.com: fixed commit message typo]
In the Linux kernel, the following vulnerability has been resolved: dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata Following concurrent processes: P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_read_block_bitmap_nowait ext4_read_bh_nowait submit_bh dm_submit_bio do_worker process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata down_write(&pmd->root_lock) - LOCK B __destroy_persistent_data_objects dm_block_manager_destroy dm_bufio_client_destroy unregister_shrinker down_write(&shrinker_rwsem) thin_map | dm_thin_find_block ↓ down_read(&pmd->root_lock) --> ABBA deadlock , which triggers hung task: [ 76.974820] INFO: task kworker/u4:3:63 blocked for more than 15 seconds. [ 76.976019] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.978521] task:kworker/u4:3 state:D stack:0 pid:63 ppid:2 [ 76.978534] Workqueue: dm-thin do_worker [ 76.978552] Call Trace: [ 76.978564] __schedule+0x6ba/0x10f0 [ 76.978582] schedule+0x9d/0x1e0 [ 76.978588] rwsem_down_write_slowpath+0x587/0xdf0 [ 76.978600] down_write+0xec/0x110 [ 76.978607] unregister_shrinker+0x2c/0xf0 [ 76.978616] dm_bufio_client_destroy+0x116/0x3d0 [ 76.978625] dm_block_manager_destroy+0x19/0x40 [ 76.978629] __destroy_persistent_data_objects+0x5e/0x70 [ 76.978636] dm_pool_abort_metadata+0x8e/0x100 [ 76.978643] metadata_operation_failed+0x86/0x110 [ 76.978649] commit+0x6a/0x230 [ 76.978655] do_worker+0xc6e/0xd90 [ 76.978702] process_one_work+0x269/0x630 [ 76.978714] worker_thread+0x266/0x630 [ 76.978730] kthread+0x151/0x1b0 [ 76.978772] INFO: task test.sh:2646 blocked for more than 15 seconds. [ 76.979756] Not tainted 6.1.0-rc4-00011-g8f17dd350364-dirty #910 [ 76.982111] task:test.sh state:D stack:0 pid:2646 ppid:2459 [ 76.982128] Call Trace: [ 76.982139] __schedule+0x6ba/0x10f0 [ 76.982155] schedule+0x9d/0x1e0 [ 76.982159] rwsem_down_read_slowpath+0x4f4/0x910 [ 76.982173] down_read+0x84/0x170 [ 76.982177] dm_thin_find_block+0x4c/0xd0 [ 76.982183] thin_map+0x201/0x3d0 [ 76.982188] __map_bio+0x5b/0x350 [ 76.982195] dm_submit_bio+0x2b6/0x930 [ 76.982202] __submit_bio+0x123/0x2d0 [ 76.982209] submit_bio_noacct_nocheck+0x101/0x3e0 [ 76.982222] submit_bio_noacct+0x389/0x770 [ 76.982227] submit_bio+0x50/0xc0 [ 76.982232] submit_bh_wbc+0x15e/0x230 [ 76.982238] submit_bh+0x14/0x20 [ 76.982241] ext4_read_bh_nowait+0xc5/0x130 [ 76.982247] ext4_read_block_bitmap_nowait+0x340/0xc60 [ 76.982254] ext4_mb_init_cache+0x1ce/0xdc0 [ 76.982259] ext4_mb_load_buddy_gfp+0x987/0xfa0 [ 76.982263] ext4_discard_preallocations+0x45d/0x830 [ 76.982274] ext4_clear_inode+0x48/0xf0 [ 76.982280] ext4_evict_inode+0xcf/0xc70 [ 76.982285] evict+0x119/0x2b0 [ 76.982290] dispose_list+0x43/0xa0 [ 76.982294] prune_icache_sb+0x64/0x90 [ 76.982298] super_cache_scan+0x155/0x210 [ 76.982303] do_shrink_slab+0x19e/0x4e0 [ 76.982310] shrink_slab+0x2bd/0x450 [ 76.982317] drop_slab+0xcc/0x1a0 [ 76.982323] drop_caches_sysctl_handler+0xb7/0xe0 [ 76.982327] proc_sys_call_handler+0x1bc/0x300 [ 76.982331] proc_sys_write+0x17/0x20 [ 76.982334] vfs_write+0x3d3/0x570 [ 76.982342] ksys_write+0x73/0x160 [ 76.982347] __x64_sys_write+0x1e/0x30 [ 76.982352] do_syscall_64+0x35/0x80 [ 76.982357] entry_SYSCALL_64_after_hwframe+0x63/0xcd Funct ---truncated---
In the Linux kernel, the following vulnerability has been resolved: Bluetooth: L2CAP: Fix deadlock in l2cap_conn_del() l2cap_conn_del() calls cancel_delayed_work_sync() for both info_timer and id_addr_timer while holding conn->lock. However, the work functions l2cap_info_timeout() and l2cap_conn_update_id_addr() both acquire conn->lock, creating a potential AB-BA deadlock if the work is already executing when l2cap_conn_del() takes the lock. Move the work cancellations before acquiring conn->lock and use disable_delayed_work_sync() to additionally prevent the works from being rearmed after cancellation, consistent with the pattern used in hci_conn_del().
In the Linux kernel, the following vulnerability has been resolved: drm/msm/mdp5: Fix global state lock backoff We need to grab the lock after the early return for !hwpipe case. Otherwise, we could have hit contention yet still returned 0. Fixes an issue that the new CONFIG_DRM_DEBUG_MODESET_LOCK stuff flagged in CI: WARNING: CPU: 0 PID: 282 at drivers/gpu/drm/drm_modeset_lock.c:296 drm_modeset_lock+0xf8/0x154 Modules linked in: CPU: 0 PID: 282 Comm: kms_cursor_lega Tainted: G W 5.19.0-rc2-15930-g875cc8bc536a #1 Hardware name: Qualcomm Technologies, Inc. DB820c (DT) pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drm_modeset_lock+0xf8/0x154 lr : drm_atomic_get_private_obj_state+0x84/0x170 sp : ffff80000cfab6a0 x29: ffff80000cfab6a0 x28: 0000000000000000 x27: ffff000083bc4d00 x26: 0000000000000038 x25: 0000000000000000 x24: ffff80000957ca58 x23: 0000000000000000 x22: ffff000081ace080 x21: 0000000000000001 x20: ffff000081acec18 x19: ffff80000cfabb80 x18: 0000000000000038 x17: 0000000000000000 x16: 0000000000000000 x15: fffffffffffea0d0 x14: 0000000000000000 x13: 284e4f5f4e524157 x12: 5f534b434f4c5f47 x11: ffff80000a386aa8 x10: 0000000000000029 x9 : ffff80000cfab610 x8 : 0000000000000029 x7 : 0000000000000014 x6 : 0000000000000000 x5 : 0000000000000001 x4 : ffff8000081ad904 x3 : 0000000000000029 x2 : ffff0000801db4c0 x1 : ffff80000cfabb80 x0 : ffff000081aceb58 Call trace: drm_modeset_lock+0xf8/0x154 drm_atomic_get_private_obj_state+0x84/0x170 mdp5_get_global_state+0x54/0x6c mdp5_pipe_release+0x2c/0xd4 mdp5_plane_atomic_check+0x2ec/0x414 drm_atomic_helper_check_planes+0xd8/0x210 drm_atomic_helper_check+0x54/0xb0 ... ---[ end trace 0000000000000000 ]--- drm_modeset_lock attempting to lock a contended lock without backoff: drm_modeset_lock+0x148/0x154 mdp5_get_global_state+0x30/0x6c mdp5_pipe_release+0x2c/0xd4 mdp5_plane_atomic_check+0x290/0x414 drm_atomic_helper_check_planes+0xd8/0x210 drm_atomic_helper_check+0x54/0xb0 drm_atomic_check_only+0x4b0/0x8f4 drm_atomic_commit+0x68/0xe0 Patchwork: https://patchwork.freedesktop.org/patch/492701/
In the Linux kernel, the following vulnerability has been resolved: driver core: fix potential deadlock in __driver_attach In __driver_attach function, There are also AA deadlock problem, like the commit b232b02bf3c2 ("driver core: fix deadlock in __device_attach"). stack like commit b232b02bf3c2 ("driver core: fix deadlock in __device_attach"). list below: In __driver_attach function, The lock holding logic is as follows: ... __driver_attach if (driver_allows_async_probing(drv)) device_lock(dev) // get lock dev async_schedule_dev(__driver_attach_async_helper, dev); // func async_schedule_node async_schedule_node_domain(func) entry = kzalloc(sizeof(struct async_entry), GFP_ATOMIC); /* when fail or work limit, sync to execute func, but __driver_attach_async_helper will get lock dev as will, which will lead to A-A deadlock. */ if (!entry || atomic_read(&entry_count) > MAX_WORK) { func; else queue_work_node(node, system_unbound_wq, &entry->work) device_unlock(dev) As above show, when it is allowed to do async probes, because of out of memory or work limit, async work is not be allowed, to do sync execute instead. it will lead to A-A deadlock because of __driver_attach_async_helper getting lock dev. Reproduce: and it can be reproduce by make the condition (if (!entry || atomic_read(&entry_count) > MAX_WORK)) untenable, like below: [ 370.785650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 370.787154] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000 [ 370.788865] Call Trace: [ 370.789374] <TASK> [ 370.789841] __schedule+0x482/0x1050 [ 370.790613] schedule+0x92/0x1a0 [ 370.791290] schedule_preempt_disabled+0x2c/0x50 [ 370.792256] __mutex_lock.isra.0+0x757/0xec0 [ 370.793158] __mutex_lock_slowpath+0x1f/0x30 [ 370.794079] mutex_lock+0x50/0x60 [ 370.794795] __device_driver_lock+0x2f/0x70 [ 370.795677] ? driver_probe_device+0xd0/0xd0 [ 370.796576] __driver_attach_async_helper+0x1d/0xd0 [ 370.797318] ? driver_probe_device+0xd0/0xd0 [ 370.797957] async_schedule_node_domain+0xa5/0xc0 [ 370.798652] async_schedule_node+0x19/0x30 [ 370.799243] __driver_attach+0x246/0x290 [ 370.799828] ? driver_allows_async_probing+0xa0/0xa0 [ 370.800548] bus_for_each_dev+0x9d/0x130 [ 370.801132] driver_attach+0x22/0x30 [ 370.801666] bus_add_driver+0x290/0x340 [ 370.802246] driver_register+0x88/0x140 [ 370.802817] ? virtio_scsi_init+0x116/0x116 [ 370.803425] scsi_register_driver+0x1a/0x30 [ 370.804057] init_sd+0x184/0x226 [ 370.804533] do_one_initcall+0x71/0x3a0 [ 370.805107] kernel_init_freeable+0x39a/0x43a [ 370.805759] ? rest_init+0x150/0x150 [ 370.806283] kernel_init+0x26/0x230 [ 370.806799] ret_from_fork+0x1f/0x30 To fix the deadlock, move the async_schedule_dev outside device_lock, as we can see, in async_schedule_node_domain, the parameter of queue_work_node is system_unbound_wq, so it can accept concurrent operations. which will also not change the code logic, and will not lead to deadlock.
In the Linux kernel, the following vulnerability has been resolved: powerpc/pci: Fix get_phb_number() locking The recent change to get_phb_number() causes a DEBUG_ATOMIC_SLEEP warning on some systems: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 1 lock held by swapper/1: #0: c157efb0 (hose_spinlock){+.+.}-{2:2}, at: pcibios_alloc_controller+0x64/0x220 Preemption disabled at: [<00000000>] 0x0 CPU: 0 PID: 1 Comm: swapper Not tainted 5.19.0-yocto-standard+ #1 Call Trace: [d101dc90] [c073b264] dump_stack_lvl+0x50/0x8c (unreliable) [d101dcb0] [c0093b70] __might_resched+0x258/0x2a8 [d101dcd0] [c0d3e634] __mutex_lock+0x6c/0x6ec [d101dd50] [c0a84174] of_alias_get_id+0x50/0xf4 [d101dd80] [c002ec78] pcibios_alloc_controller+0x1b8/0x220 [d101ddd0] [c140c9dc] pmac_pci_init+0x198/0x784 [d101de50] [c140852c] discover_phbs+0x30/0x4c [d101de60] [c0007fd4] do_one_initcall+0x94/0x344 [d101ded0] [c1403b40] kernel_init_freeable+0x1a8/0x22c [d101df10] [c00086e0] kernel_init+0x34/0x160 [d101df30] [c001b334] ret_from_kernel_thread+0x5c/0x64 This is because pcibios_alloc_controller() holds hose_spinlock but of_alias_get_id() takes of_mutex which can sleep. The hose_spinlock protects the phb_bitmap, and also the hose_list, but it doesn't need to be held while get_phb_number() calls the OF routines, because those are only looking up information in the device tree. So fix it by having get_phb_number() take the hose_spinlock itself, only where required, and then dropping the lock before returning. pcibios_alloc_controller() then needs to take the lock again before the list_add() but that's safe, the order of the list is not important.
In the Linux kernel, the following vulnerability has been resolved: ring-buffer: Fix reader locking when changing the sub buffer order The function ring_buffer_subbuf_order_set() updates each ring_buffer_per_cpu and installs new sub buffers that match the requested page order. This operation may be invoked concurrently with readers that rely on some of the modified data, such as the head bit (RB_PAGE_HEAD), or the ring_buffer_per_cpu.pages and reader_page pointers. However, no exclusive access is acquired by ring_buffer_subbuf_order_set(). Modifying the mentioned data while a reader also operates on them can then result in incorrect memory access and various crashes. Fix the problem by taking the reader_lock when updating a specific ring_buffer_per_cpu in ring_buffer_subbuf_order_set().
In the Linux kernel, the following vulnerability has been resolved: net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]: kworker/0:16/14617 is trying to acquire lock: ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652 [...] but task is already holding lock: ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572 The neighbor entry turned to NUD_FAILED state, where __neigh_event_send() triggered an immediate probe as per commit cd28ca0a3dd1 ("neigh: reduce arp latency") via neigh_probe() given table lock was held. One option to fix this situation is to defer the neigh_probe() back to the neigh_timer_handler() similarly as pre cd28ca0a3dd1. For the case of NTF_MANAGED, this deferral is acceptable given this only happens on actual failure state and regular / expected state is NUD_VALID with the entry already present. The fix adds a parameter to __neigh_event_send() in order to communicate whether immediate probe is allowed or disallowed. Existing call-sites of neigh_event_send() default as-is to immediate probe. However, the neigh_managed_work() disables it via use of neigh_event_send_probe(). [0] <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_deadlock_bug kernel/locking/lockdep.c:2956 [inline] check_deadlock kernel/locking/lockdep.c:2999 [inline] validate_chain kernel/locking/lockdep.c:3788 [inline] __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027 lock_acquire kernel/locking/lockdep.c:5639 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604 __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline] _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334 ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652 ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170 ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508 ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650 ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742 neigh_probe+0xc2/0x110 net/core/neighbour.c:1040 __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201 neigh_event_send include/net/neighbour.h:470 [inline] neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307 worker_thread+0x657/0x1110 kernel/workqueue.c:2454 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK>
In the Linux kernel, the following vulnerability has been resolved: iavf: Fix reset error handling Do not call iavf_close in iavf_reset_task error handling. Doing so can lead to double call of napi_disable, which can lead to deadlock there. Removing VF would lead to iavf_remove task being stuck, because it requires crit_lock, which is held by iavf_close. Call iavf_disable_vf if reset fail, so that driver will clean up remaining invalid resources. During rapid VF resets, HW can fail to setup VF mailbox. Wrong error handling can lead to iavf_remove being stuck with: [ 5218.999087] iavf 0000:82:01.0: Failed to init adminq: -53 ... [ 5267.189211] INFO: task repro.sh:11219 blocked for more than 30 seconds. [ 5267.189520] Tainted: G S E 5.18.0-04958-ga54ce3703613-dirty #1 [ 5267.189764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5267.190062] task:repro.sh state:D stack: 0 pid:11219 ppid: 8162 flags:0x00000000 [ 5267.190347] Call Trace: [ 5267.190647] <TASK> [ 5267.190927] __schedule+0x460/0x9f0 [ 5267.191264] schedule+0x44/0xb0 [ 5267.191563] schedule_preempt_disabled+0x14/0x20 [ 5267.191890] __mutex_lock.isra.12+0x6e3/0xac0 [ 5267.192237] ? iavf_remove+0xf9/0x6c0 [iavf] [ 5267.192565] iavf_remove+0x12a/0x6c0 [iavf] [ 5267.192911] ? _raw_spin_unlock_irqrestore+0x1e/0x40 [ 5267.193285] pci_device_remove+0x36/0xb0 [ 5267.193619] device_release_driver_internal+0xc1/0x150 [ 5267.193974] pci_stop_bus_device+0x69/0x90 [ 5267.194361] pci_stop_and_remove_bus_device+0xe/0x20 [ 5267.194735] pci_iov_remove_virtfn+0xba/0x120 [ 5267.195130] sriov_disable+0x2f/0xe0 [ 5267.195506] ice_free_vfs+0x7d/0x2f0 [ice] [ 5267.196056] ? pci_get_device+0x4f/0x70 [ 5267.196496] ice_sriov_configure+0x78/0x1a0 [ice] [ 5267.196995] sriov_numvfs_store+0xfe/0x140 [ 5267.197466] kernfs_fop_write_iter+0x12e/0x1c0 [ 5267.197918] new_sync_write+0x10c/0x190 [ 5267.198404] vfs_write+0x24e/0x2d0 [ 5267.198886] ksys_write+0x5c/0xd0 [ 5267.199367] do_syscall_64+0x3a/0x80 [ 5267.199827] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 5267.200317] RIP: 0033:0x7f5b381205c8 [ 5267.200814] RSP: 002b:00007fff8c7e8c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 5267.201981] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5b381205c8 [ 5267.202620] RDX: 0000000000000002 RSI: 00005569420ee900 RDI: 0000000000000001 [ 5267.203426] RBP: 00005569420ee900 R08: 000000000000000a R09: 00007f5b38180820 [ 5267.204327] R10: 000000000000000a R11: 0000000000000246 R12: 00007f5b383c06e0 [ 5267.205193] R13: 0000000000000002 R14: 00007f5b383bb880 R15: 0000000000000002 [ 5267.206041] </TASK> [ 5267.206970] Kernel panic - not syncing: hung_task: blocked tasks [ 5267.207809] CPU: 48 PID: 551 Comm: khungtaskd Kdump: loaded Tainted: G S E 5.18.0-04958-ga54ce3703613-dirty #1 [ 5267.208726] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.11.0 11/02/2019 [ 5267.209623] Call Trace: [ 5267.210569] <TASK> [ 5267.211480] dump_stack_lvl+0x33/0x42 [ 5267.212472] panic+0x107/0x294 [ 5267.213467] watchdog.cold.8+0xc/0xbb [ 5267.214413] ? proc_dohung_task_timeout_secs+0x30/0x30 [ 5267.215511] kthread+0xf4/0x120 [ 5267.216459] ? kthread_complete_and_exit+0x20/0x20 [ 5267.217505] ret_from_fork+0x22/0x30 [ 5267.218459] </TASK>
In the Linux kernel, the following vulnerability has been resolved: rxrpc: Fix locking in rxrpc's sendmsg Fix three bugs in the rxrpc's sendmsg implementation: (1) rxrpc_new_client_call() should release the socket lock when returning an error from rxrpc_get_call_slot(). (2) rxrpc_wait_for_tx_window_intr() will return without the call mutex held in the event that we're interrupted by a signal whilst waiting for tx space on the socket or relocking the call mutex afterwards. Fix this by: (a) moving the unlock/lock of the call mutex up to rxrpc_send_data() such that the lock is not held around all of rxrpc_wait_for_tx_window*() and (b) indicating to higher callers whether we're return with the lock dropped. Note that this means recvmsg() will not block on this call whilst we're waiting. (3) After dropping and regaining the call mutex, rxrpc_send_data() needs to go and recheck the state of the tx_pending buffer and the tx_total_len check in case we raced with another sendmsg() on the same call. Thinking on this some more, it might make sense to have different locks for sendmsg() and recvmsg(). There's probably no need to make recvmsg() wait for sendmsg(). It does mean that recvmsg() can return MSG_EOR indicating that a call is dead before a sendmsg() to that call returns - but that can currently happen anyway. Without fix (2), something like the following can be induced: WARNING: bad unlock balance detected! 5.16.0-rc6-syzkaller #0 Not tainted ------------------------------------- syz-executor011/3597 is trying to release lock (&call->user_mutex) at: [<ffffffff885163a3>] rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748 but there are no more locks to release! other info that might help us debug this: no locks held by syz-executor011/3597. ... Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_unlock_imbalance_bug include/trace/events/lock.h:58 [inline] __lock_release kernel/locking/lockdep.c:5306 [inline] lock_release.cold+0x49/0x4e kernel/locking/lockdep.c:5657 __mutex_unlock_slowpath+0x99/0x5e0 kernel/locking/mutex.c:900 rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748 rxrpc_sendmsg+0x420/0x630 net/rxrpc/af_rxrpc.c:561 sock_sendmsg_nosec net/socket.c:704 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:724 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2409 ___sys_sendmsg+0xf3/0x170 net/socket.c:2463 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2492 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae [Thanks to Hawkins Jiawei and Khalid Masum for their attempts to fix this]
In the Linux kernel, the following vulnerability has been resolved: led: qcom-lpg: Fix sleeping in atomic lpg_brighness_set() function can sleep, while led's brightness_set() callback must be non-blocking. Change LPG driver to use brightness_set_blocking() instead. BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/0 preempt_count: 101, expected: 0 INFO: lockdep is turned off. CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.1.0-rc1-00014-gbe99b089c6fc-dirty #85 Hardware name: Qualcomm Technologies, Inc. DB820c (DT) Call trace: dump_backtrace.part.0+0xe4/0xf0 show_stack+0x18/0x40 dump_stack_lvl+0x88/0xb4 dump_stack+0x18/0x34 __might_resched+0x170/0x254 __might_sleep+0x48/0x9c __mutex_lock+0x4c/0x400 mutex_lock_nested+0x2c/0x40 lpg_brightness_single_set+0x40/0x90 led_set_brightness_nosleep+0x34/0x60 led_heartbeat_function+0x80/0x170 call_timer_fn+0xb8/0x340 __run_timers.part.0+0x20c/0x254 run_timer_softirq+0x3c/0x7c _stext+0x14c/0x578 ____do_softirq+0x10/0x20 call_on_irq_stack+0x2c/0x5c do_softirq_own_stack+0x1c/0x30 __irq_exit_rcu+0x164/0x170 irq_exit_rcu+0x10/0x40 el1_interrupt+0x38/0x50 el1h_64_irq_handler+0x18/0x2c el1h_64_irq+0x64/0x68 cpuidle_enter_state+0xc8/0x380 cpuidle_enter+0x38/0x50 do_idle+0x244/0x2d0 cpu_startup_entry+0x24/0x30 rest_init+0x128/0x1a0 arch_post_acpi_subsys_init+0x0/0x18 start_kernel+0x6f4/0x734 __primary_switched+0xbc/0xc4
In the Linux kernel, the following vulnerability has been resolved: tracing/timerlat: Drop interface_lock in stop_kthread() stop_kthread() is the offline callback for "trace/osnoise:online", since commit 5bfbcd1ee57b ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()"), the following ABBA deadlock scenario is introduced: T1 | T2 [BP] | T3 [AP] osnoise_hotplug_workfn() | work_for_cpu_fn() | cpuhp_thread_fun() | _cpu_down() | osnoise_cpu_die() mutex_lock(&interface_lock) | | stop_kthread() | cpus_write_lock() | mutex_lock(&interface_lock) cpus_read_lock() | cpuhp_kick_ap() | As the interface_lock here in just for protecting the "kthread" field of the osn_var, use xchg() instead to fix this issue. Also use for_each_online_cpu() back in stop_per_cpu_kthreads() as it can take cpu_read_lock() again.
In the Linux kernel, the following vulnerability has been resolved: driver core: Fix wait_for_device_probe() & deferred_probe_timeout interaction Mounting NFS rootfs was timing out when deferred_probe_timeout was non-zero [1]. This was because ip_auto_config() initcall times out waiting for the network interfaces to show up when deferred_probe_timeout was non-zero. While ip_auto_config() calls wait_for_device_probe() to make sure any currently running deferred probe work or asynchronous probe finishes, that wasn't sufficient to account for devices being deferred until deferred_probe_timeout. Commit 35a672363ab3 ("driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires") tried to fix that by making sure wait_for_device_probe() waits for deferred_probe_timeout to expire before returning. However, if wait_for_device_probe() is called from the kernel_init() context: - Before deferred_probe_initcall() [2], it causes the boot process to hang due to a deadlock. - After deferred_probe_initcall() [3], it blocks kernel_init() from continuing till deferred_probe_timeout expires and beats the point of deferred_probe_timeout that's trying to wait for userspace to load modules. Neither of this is good. So revert the changes to wait_for_device_probe(). [1] - https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@TYAPR01MB4544.jpnprd01.prod.outlook.com/ [2] - https://lore.kernel.org/lkml/YowHNo4sBjr9ijZr@dev-arch.thelio-3990X/ [3] - https://lore.kernel.org/lkml/Yo3WvGnNk3LvLb7R@linutronix.de/
In the Linux kernel, the following vulnerability has been resolved: ixgbe: Add locking to prevent panic when setting sriov_numvfs to zero It is possible to disable VFs while the PF driver is processing requests from the VF driver. This can result in a panic. BUG: unable to handle kernel paging request at 000000000000106c PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 8 PID: 0 Comm: swapper/8 Kdump: loaded Tainted: G I --------- - Hardware name: Dell Inc. PowerEdge R740/06WXJT, BIOS 2.8.2 08/27/2020 RIP: 0010:ixgbe_msg_task+0x4c8/0x1690 [ixgbe] Code: 00 00 48 8d 04 40 48 c1 e0 05 89 7c 24 24 89 fd 48 89 44 24 10 83 ff 01 0f 84 b8 04 00 00 4c 8b 64 24 10 4d 03 a5 48 22 00 00 <41> 80 7c 24 4c 00 0f 84 8a 03 00 00 0f b7 c7 83 f8 08 0f 84 8f 0a RSP: 0018:ffffb337869f8df8 EFLAGS: 00010002 RAX: 0000000000001020 RBX: 0000000000000000 RCX: 000000000000002b RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000006 RBP: 0000000000000006 R08: 0000000000000002 R09: 0000000000029780 R10: 00006957d8f42832 R11: 0000000000000000 R12: 0000000000001020 R13: ffff8a00e8978ac0 R14: 000000000000002b R15: ffff8a00e8979c80 FS: 0000000000000000(0000) GS:ffff8a07dfd00000(0000) knlGS:00000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000106c CR3: 0000000063e10004 CR4: 00000000007726e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? ttwu_do_wakeup+0x19/0x140 ? try_to_wake_up+0x1cd/0x550 ? ixgbevf_update_xcast_mode+0x71/0xc0 [ixgbevf] ixgbe_msix_other+0x17e/0x310 [ixgbe] __handle_irq_event_percpu+0x40/0x180 handle_irq_event_percpu+0x30/0x80 handle_irq_event+0x36/0x53 handle_edge_irq+0x82/0x190 handle_irq+0x1c/0x30 do_IRQ+0x49/0xd0 common_interrupt+0xf/0xf This can be eventually be reproduced with the following script: while : do echo 63 > /sys/class/net/<devname>/device/sriov_numvfs sleep 1 echo 0 > /sys/class/net/<devname>/device/sriov_numvfs sleep 1 done Add lock when disabling SR-IOV to prevent process VF mailbox communication.
In the Linux kernel, the following vulnerability has been resolved: VMCI: Use threaded irqs instead of tasklets The vmci_dispatch_dgs() tasklet function calls vmci_read_data() which uses wait_event() resulting in invalid sleep in an atomic context (and therefore potentially in a deadlock). Use threaded irqs to fix this issue and completely remove usage of tasklets. [ 20.264639] BUG: sleeping function called from invalid context at drivers/misc/vmw_vmci/vmci_guest.c:145 [ 20.264643] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 762, name: vmtoolsd [ 20.264645] preempt_count: 101, expected: 0 [ 20.264646] RCU nest depth: 0, expected: 0 [ 20.264647] 1 lock held by vmtoolsd/762: [ 20.264648] #0: ffff0000874ae440 (sk_lock-AF_VSOCK){+.+.}-{0:0}, at: vsock_connect+0x60/0x330 [vsock] [ 20.264658] Preemption disabled at: [ 20.264659] [<ffff80000151d7d8>] vmci_send_datagram+0x44/0xa0 [vmw_vmci] [ 20.264665] CPU: 0 PID: 762 Comm: vmtoolsd Not tainted 5.19.0-0.rc8.20220727git39c3c396f813.60.fc37.aarch64 #1 [ 20.264667] Hardware name: VMware, Inc. VBSA/VBSA, BIOS VEFI 12/31/2020 [ 20.264668] Call trace: [ 20.264669] dump_backtrace+0xc4/0x130 [ 20.264672] show_stack+0x24/0x80 [ 20.264673] dump_stack_lvl+0x88/0xb4 [ 20.264676] dump_stack+0x18/0x34 [ 20.264677] __might_resched+0x1a0/0x280 [ 20.264679] __might_sleep+0x58/0x90 [ 20.264681] vmci_read_data+0x74/0x120 [vmw_vmci] [ 20.264683] vmci_dispatch_dgs+0x64/0x204 [vmw_vmci] [ 20.264686] tasklet_action_common.constprop.0+0x13c/0x150 [ 20.264688] tasklet_action+0x40/0x50 [ 20.264689] __do_softirq+0x23c/0x6b4 [ 20.264690] __irq_exit_rcu+0x104/0x214 [ 20.264691] irq_exit_rcu+0x1c/0x50 [ 20.264693] el1_interrupt+0x38/0x6c [ 20.264695] el1h_64_irq_handler+0x18/0x24 [ 20.264696] el1h_64_irq+0x68/0x6c [ 20.264697] preempt_count_sub+0xa4/0xe0 [ 20.264698] _raw_spin_unlock_irqrestore+0x64/0xb0 [ 20.264701] vmci_send_datagram+0x7c/0xa0 [vmw_vmci] [ 20.264703] vmci_datagram_dispatch+0x84/0x100 [vmw_vmci] [ 20.264706] vmci_datagram_send+0x2c/0x40 [vmw_vmci] [ 20.264709] vmci_transport_send_control_pkt+0xb8/0x120 [vmw_vsock_vmci_transport] [ 20.264711] vmci_transport_connect+0x40/0x7c [vmw_vsock_vmci_transport] [ 20.264713] vsock_connect+0x278/0x330 [vsock] [ 20.264715] __sys_connect_file+0x8c/0xc0 [ 20.264718] __sys_connect+0x84/0xb4 [ 20.264720] __arm64_sys_connect+0x2c/0x3c [ 20.264721] invoke_syscall+0x78/0x100 [ 20.264723] el0_svc_common.constprop.0+0x68/0x124 [ 20.264724] do_el0_svc+0x38/0x4c [ 20.264725] el0_svc+0x60/0x180 [ 20.264726] el0t_64_sync_handler+0x11c/0x150 [ 20.264728] el0t_64_sync+0x190/0x194
In the Linux kernel, the following vulnerability has been resolved: btrfs: fix deadlock between concurrent dio writes when low on free data space When reserving data space for a direct IO write we can end up deadlocking if we have multiple tasks attempting a write to the same file range, there are multiple extents covered by that file range, we are low on available space for data and the writes don't expand the inode's i_size. The deadlock can happen like this: 1) We have a file with an i_size of 1M, at offset 0 it has an extent with a size of 128K and at offset 128K it has another extent also with a size of 128K; 2) Task A does a direct IO write against file range [0, 256K), and because the write is within the i_size boundary, it takes the inode's lock (VFS level) in shared mode; 3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and then gets the extent map for the extent covering the range [0, 128K). At btrfs_get_blocks_direct_write(), it creates an ordered extent for that file range ([0, 128K)); 4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file range [0, 256K); 5) Task A executes btrfs_dio_iomap_begin() again, this time for the file range [128K, 256K), and locks the file range [128K, 256K); 6) Task B starts a direct IO write against file range [0, 256K) as well. It also locks the inode in shared mode, as it's within the i_size limit, and then tries to lock file range [0, 256K). It is able to lock the subrange [0, 128K) but then blocks waiting for the range [128K, 256K), as it is currently locked by task A; 7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data space. Because we are low on available free space, it triggers the async data reclaim task, and waits for it to reserve data space; 8) The async reclaim task decides to wait for all existing ordered extents to complete (through btrfs_wait_ordered_roots()). It finds the ordered extent previously created by task A for the file range [0, 128K) and waits for it to complete; 9) The ordered extent for the file range [0, 128K) can not complete because it blocks at btrfs_finish_ordered_io() when trying to lock the file range [0, 128K). This results in a deadlock, because: - task B is holding the file range [0, 128K) locked, waiting for the range [128K, 256K) to be unlocked by task A; - task A is holding the file range [128K, 256K) locked and it's waiting for the async data reclaim task to satisfy its space reservation request; - the async data reclaim task is waiting for ordered extent [0, 128K) to complete, but the ordered extent can not complete because the file range [0, 128K) is currently locked by task B, which is waiting on task A to unlock file range [128K, 256K) and task A waiting on the async data reclaim task. This results in a deadlock between 4 task: task A, task B, the async data reclaim task and the task doing ordered extent completion (a work queue task). This type of deadlock can sporadically be triggered by the test case generic/300 from fstests, and results in a stack trace like the following: [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds. [12084.034877] Not tainted 5.18.0-rc2-btrfs-next-115 #1 [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [12084.036548] task:kworker/u16:7 state:D stack: 0 pid:123749 ppid: 2 flags:0x00004000 [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [12084.036599] Call Trace: [12084.036601] <TASK> [12084.036606] __schedule+0x3cb/0xed0 [12084.036616] schedule+0x4e/0xb0 [12084.036620] btrfs_start_ordered_extent+0x109/0x1c0 [btrfs] [12084.036651] ? prepare_to_wait_exclusive+0xc0/0xc0 [12084.036659] btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs] [12084.036688] btrfs_work_helper+0xf8/0x400 [btrfs] [12084.0367 ---truncated---
In the Linux kernel, the following vulnerability has been resolved: scsi: lpfc: Move cfg_log_verbose check before calling lpfc_dmp_dbg() In an attempt to log message 0126 with LOG_TRACE_EVENT, the following hard lockup call trace hangs the system. Call Trace: _raw_spin_lock_irqsave+0x32/0x40 lpfc_dmp_dbg.part.32+0x28/0x220 [lpfc] lpfc_cmpl_els_fdisc+0x145/0x460 [lpfc] lpfc_sli_cancel_jobs+0x92/0xd0 [lpfc] lpfc_els_flush_cmd+0x43c/0x670 [lpfc] lpfc_els_flush_all_cmd+0x37/0x60 [lpfc] lpfc_sli4_async_event_proc+0x956/0x1720 [lpfc] lpfc_do_work+0x1485/0x1d70 [lpfc] kthread+0x112/0x130 ret_from_fork+0x1f/0x40 Kernel panic - not syncing: Hard LOCKUP The same CPU tries to claim the phba->port_list_lock twice. Move the cfg_log_verbose checks as part of the lpfc_printf_vlog() and lpfc_printf_log() macros before calling lpfc_dmp_dbg(). There is no need to take the phba->port_list_lock within lpfc_dmp_dbg().
In the Linux kernel, the following vulnerability has been resolved: drivers: staging: rtl8723bs: Fix deadlock in rtw_surveydone_event_callback() There is a deadlock in rtw_surveydone_event_callback(), which is shown below: (Thread 1) | (Thread 2) | _set_timer() rtw_surveydone_event_callback()| mod_timer() spin_lock_bh() //(1) | (wait a time) ... | rtw_scan_timeout_handler() del_timer_sync() | spin_lock_bh() //(2) (wait timer to stop) | ... We hold pmlmepriv->lock in position (1) of thread 1 and use del_timer_sync() to wait timer to stop, but timer handler also need pmlmepriv->lock in position (2) of thread 2. As a result, rtw_surveydone_event_callback() will block forever. This patch extracts del_timer_sync() from the protection of spin_lock_bh(), which could let timer handler to obtain the needed lock. What`s more, we change spin_lock_bh() in rtw_scan_timeout_handler() to spin_lock_irq(). Otherwise, spin_lock_bh() will also cause deadlock() in timer handler.
In the Linux kernel, the following vulnerability has been resolved: drivers: tty: serial: Fix deadlock in sa1100_set_termios() There is a deadlock in sa1100_set_termios(), which is shown below: (Thread 1) | (Thread 2) | sa1100_enable_ms() sa1100_set_termios() | mod_timer() spin_lock_irqsave() //(1) | (wait a time) ... | sa1100_timeout() del_timer_sync() | spin_lock_irqsave() //(2) (wait timer to stop) | ... We hold sport->port.lock in position (1) of thread 1 and use del_timer_sync() to wait timer to stop, but timer handler also need sport->port.lock in position (2) of thread 2. As a result, sa1100_set_termios() will block forever. This patch moves del_timer_sync() before spin_lock_irqsave() in order to prevent the deadlock.
In the Linux kernel, the following vulnerability has been resolved: loop: implement ->free_disk Ensure that the lo_device which is stored in the gendisk private data is valid until the gendisk is freed. Currently the loop driver uses a lot of effort to make sure a device is not freed when it is still in use, but to to fix a potential deadlock this will be relaxed a bit soon.
In the Linux kernel, the following vulnerability has been resolved: drivers: staging: rtl8192e: Fix deadlock in rtllib_beacons_stop() There is a deadlock in rtllib_beacons_stop(), which is shown below: (Thread 1) | (Thread 2) | rtllib_send_beacon() rtllib_beacons_stop() | mod_timer() spin_lock_irqsave() //(1) | (wait a time) ... | rtllib_send_beacon_cb() del_timer_sync() | spin_lock_irqsave() //(2) (wait timer to stop) | ... We hold ieee->beacon_lock in position (1) of thread 1 and use del_timer_sync() to wait timer to stop, but timer handler also need ieee->beacon_lock in position (2) of thread 2. As a result, rtllib_beacons_stop() will block forever. This patch extracts del_timer_sync() from the protection of spin_lock_irqsave(), which could let timer handler to obtain the needed lock.
In the Linux kernel, the following vulnerability has been resolved: cgroup/cpuset: remove kernfs active break A warning was found: WARNING: CPU: 10 PID: 3486953 at fs/kernfs/file.c:828 CPU: 10 PID: 3486953 Comm: rmdir Kdump: loaded Tainted: G RIP: 0010:kernfs_should_drain_open_files+0x1a1/0x1b0 RSP: 0018:ffff8881107ef9e0 EFLAGS: 00010202 RAX: 0000000080000002 RBX: ffff888154738c00 RCX: dffffc0000000000 RDX: 0000000000000007 RSI: 0000000000000004 RDI: ffff888154738c04 RBP: ffff888154738c04 R08: ffffffffaf27fa15 R09: ffffed102a8e7180 R10: ffff888154738c07 R11: 0000000000000000 R12: ffff888154738c08 R13: ffff888750f8c000 R14: ffff888750f8c0e8 R15: ffff888154738ca0 FS: 00007f84cd0be740(0000) GS:ffff8887ddc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f9fbe00c8 CR3: 0000000153eec001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: kernfs_drain+0x15e/0x2f0 __kernfs_remove+0x165/0x300 kernfs_remove_by_name_ns+0x7b/0xc0 cgroup_rm_file+0x154/0x1c0 cgroup_addrm_files+0x1c2/0x1f0 css_clear_dir+0x77/0x110 kill_css+0x4c/0x1b0 cgroup_destroy_locked+0x194/0x380 cgroup_rmdir+0x2a/0x140 It can be explained by: rmdir echo 1 > cpuset.cpus kernfs_fop_write_iter // active=0 cgroup_rm_file kernfs_remove_by_name_ns kernfs_get_active // active=1 __kernfs_remove // active=0x80000002 kernfs_drain cpuset_write_resmask wait_event //waiting (active == 0x80000001) kernfs_break_active_protection // active = 0x80000001 // continue kernfs_unbreak_active_protection // active = 0x80000002 ... kernfs_should_drain_open_files // warning occurs kernfs_put_active This warning is caused by 'kernfs_break_active_protection' when it is writing to cpuset.cpus, and the cgroup is removed concurrently. The commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()") made cpuset_hotplug_workfn asynchronous, This change involves calling flush_work(), which can create a multiple processes circular locking dependency that involve cgroup_mutex, potentially leading to a deadlock. To avoid deadlock. the commit 76bb5ab8f6e3 ("cpuset: break kernfs active protection in cpuset_write_resmask()") added 'kernfs_break_active_protection' in the cpuset_write_resmask. This could lead to this warning. After the commit 2125c0034c5d ("cgroup/cpuset: Make cpuset hotplug processing synchronous"), the cpuset_write_resmask no longer needs to wait the hotplug to finish, which means that concurrent hotplug and cpuset operations are no longer possible. Therefore, the deadlock doesn't exist anymore and it does not have to 'break active protection' now. To fix this warning, just remove kernfs_break_active_protection operation in the 'cpuset_write_resmask'.