In the Linux kernel, the following vulnerability has been resolved: af_unix: Fix data-races around user->unix_inflight. user->unix_inflight is changed under spin_lock(unix_gc_lock), but too_many_unix_fds() reads it locklessly. Let's annotate the write/read accesses to user->unix_inflight. BUG: KCSAN: data-race in unix_attach_fds / unix_inflight write to 0xffffffff8546f2d0 of 8 bytes by task 44798 on cpu 1: unix_inflight+0x157/0x180 net/unix/scm.c:66 unix_attach_fds+0x147/0x1e0 net/unix/scm.c:123 unix_scm_to_skb net/unix/af_unix.c:1827 [inline] unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950 unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline] unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg+0x148/0x160 net/socket.c:748 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494 ___sys_sendmsg+0xc6/0x140 net/socket.c:2548 __sys_sendmsg+0x94/0x140 net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 read to 0xffffffff8546f2d0 of 8 bytes by task 44814 on cpu 0: too_many_unix_fds net/unix/scm.c:101 [inline] unix_attach_fds+0x54/0x1e0 net/unix/scm.c:110 unix_scm_to_skb net/unix/af_unix.c:1827 [inline] unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1950 unix_seqpacket_sendmsg net/unix/af_unix.c:2308 [inline] unix_seqpacket_sendmsg+0xba/0x130 net/unix/af_unix.c:2292 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg+0x148/0x160 net/socket.c:748 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2494 ___sys_sendmsg+0xc6/0x140 net/socket.c:2548 __sys_sendmsg+0x94/0x140 net/socket.c:2577 __do_sys_sendmsg net/socket.c:2586 [inline] __se_sys_sendmsg net/socket.c:2584 [inline] __x64_sys_sendmsg+0x45/0x50 net/socket.c:2584 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 value changed: 0x000000000000000c -> 0x000000000000000d Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 44814 Comm: systemd-coredum Not tainted 6.4.0-11989-g6843306689af #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
A pivot_root race condition in fs/namespace.c in the Linux kernel 4.4.x before 4.4.221, 4.9.x before 4.9.221, 4.14.x before 4.14.178, 4.19.x before 4.19.119, and 5.x before 5.3 allows local users to cause a denial of service (panic) by corrupting a mountpoint reference counter.
In the Linux kernel, the following vulnerability has been resolved: tracing: Fix race issue between cpu buffer write and swap Warning happened in rb_end_commit() at code: if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing))) WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142 rb_commit+0x402/0x4a0 Call Trace: ring_buffer_unlock_commit+0x42/0x250 trace_buffer_unlock_commit_regs+0x3b/0x250 trace_event_buffer_commit+0xe5/0x440 trace_event_buffer_reserve+0x11c/0x150 trace_event_raw_event_sched_switch+0x23c/0x2c0 __traceiter_sched_switch+0x59/0x80 __schedule+0x72b/0x1580 schedule+0x92/0x120 worker_thread+0xa0/0x6f0 It is because the race between writing event into cpu buffer and swapping cpu buffer through file per_cpu/cpu0/snapshot: Write on CPU 0 Swap buffer by per_cpu/cpu0/snapshot on CPU 1 -------- -------- tracing_snapshot_write() [...] ring_buffer_lock_reserve() cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a'; [...] rb_reserve_next_event() [...] ring_buffer_swap_cpu() if (local_read(&cpu_buffer_a->committing)) goto out_dec; if (local_read(&cpu_buffer_b->committing)) goto out_dec; buffer_a->buffers[cpu] = cpu_buffer_b; buffer_b->buffers[cpu] = cpu_buffer_a; // 2. cpu_buffer has swapped here. rb_start_commit(cpu_buffer); if (unlikely(READ_ONCE(cpu_buffer->buffer) != buffer)) { // 3. This check passed due to 'cpu_buffer->buffer' [...] // has not changed here. return NULL; } cpu_buffer_b->buffer = buffer_a; cpu_buffer_a->buffer = buffer_b; [...] // 4. Reserve event from 'cpu_buffer_a'. ring_buffer_unlock_commit() [...] cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!! rb_commit(cpu_buffer) rb_end_commit() // 6. WARN for the wrong 'committing' state !!! Based on above analysis, we can easily reproduce by following testcase: ``` bash #!/bin/bash dmesg -n 7 sysctl -w kernel.panic_on_warn=1 TR=/sys/kernel/tracing echo 7 > ${TR}/buffer_size_kb echo "sched:sched_switch" > ${TR}/set_event while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & ``` To fix it, IIUC, we can use smp_call_function_single() to do the swap on the target cpu where the buffer is located, so that above race would be avoided.
In the Linux kernel, the following vulnerability has been resolved: Bluetooth: Fix hci_suspend_sync crash If hci_unregister_dev() frees the hci_dev object but hci_suspend_notifier may still be accessing it, it can cause the program to crash. Here's the call trace: <4>[102152.653246] Call Trace: <4>[102152.653254] hci_suspend_sync+0x109/0x301 [bluetooth] <4>[102152.653259] hci_suspend_dev+0x78/0xcd [bluetooth] <4>[102152.653263] hci_suspend_notifier+0x42/0x7a [bluetooth] <4>[102152.653268] notifier_call_chain+0x43/0x6b <4>[102152.653271] __blocking_notifier_call_chain+0x48/0x69 <4>[102152.653273] __pm_notifier_call_chain+0x22/0x39 <4>[102152.653276] pm_suspend+0x287/0x57c <4>[102152.653278] state_store+0xae/0xe5 <4>[102152.653281] kernfs_fop_write+0x109/0x173 <4>[102152.653284] __vfs_write+0x16f/0x1a2 <4>[102152.653287] ? selinux_file_permission+0xca/0x16f <4>[102152.653289] ? security_file_permission+0x36/0x109 <4>[102152.653291] vfs_write+0x114/0x21d <4>[102152.653293] __x64_sys_write+0x7b/0xdb <4>[102152.653296] do_syscall_64+0x59/0x194 <4>[102152.653299] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 This patch holds the reference count of the hci_dev object while processing it in hci_suspend_notifier to avoid potential crash caused by the race condition.
In the Linux kernel, the following vulnerability has been resolved: mptcp: fix disconnect vs accept race Despite commit 0ad529d9fd2b ("mptcp: fix possible divide by zero in recvmsg()"), the mptcp protocol is still prone to a race between disconnect() (or shutdown) and accept. The root cause is that the mentioned commit checks the msk-level flag, but mptcp_stream_accept() does acquire the msk-level lock, as it can rely directly on the first subflow lock. As reported by Christoph than can lead to a race where an msk socket is accepted after that mptcp_subflow_queue_clean() releases the listener socket lock and just before it takes destructive actions leading to the following splat: BUG: kernel NULL pointer dereference, address: 0000000000000012 PGD 5a4ca067 P4D 5a4ca067 PUD 37d4c067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 2 PID: 10955 Comm: syz-executor.5 Not tainted 6.5.0-rc1-gdc7b257ee5dd #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:mptcp_stream_accept+0x1ee/0x2f0 include/net/inet_sock.h:330 Code: 0a 09 00 48 8b 1b 4c 39 e3 74 07 e8 bc 7c 7f fe eb a1 e8 b5 7c 7f fe 4c 8b 6c 24 08 eb 05 e8 a9 7c 7f fe 49 8b 85 d8 09 00 00 <0f> b6 40 12 88 44 24 07 0f b6 6c 24 07 bf 07 00 00 00 89 ee e8 89 RSP: 0018:ffffc90000d07dc0 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888037e8d020 RCX: ffff88803b093300 RDX: 0000000000000000 RSI: ffffffff833822c5 RDI: ffffffff8333896a RBP: 0000607f82031520 R08: ffff88803b093300 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000003e83 R12: ffff888037e8d020 R13: ffff888037e8c680 R14: ffff888009af7900 R15: ffff888009af6880 FS: 00007fc26d708640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000012 CR3: 0000000066bc5001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> do_accept+0x1ae/0x260 net/socket.c:1872 __sys_accept4+0x9b/0x110 net/socket.c:1913 __do_sys_accept4 net/socket.c:1954 [inline] __se_sys_accept4 net/socket.c:1951 [inline] __x64_sys_accept4+0x20/0x30 net/socket.c:1951 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Address the issue by temporary removing the pending request socket from the accept queue, so that racing accept() can't touch them. After depleting the msk - the ssk still exists, as plain TCP sockets, re-insert them into the accept queue, so that later inet_csk_listen_stop() will complete the tcp socket disposal.
In the Linux kernel, the following vulnerability has been resolved: l2tp: close all race conditions in l2tp_tunnel_register() The code in l2tp_tunnel_register() is racy in several ways: 1. It modifies the tunnel socket _after_ publishing it. 2. It calls setup_udp_tunnel_sock() on an existing socket without locking. 3. It changes sock lock class on fly, which triggers many syzbot reports. This patch amends all of them by moving socket initialization code before publishing and under sock lock. As suggested by Jakub, the l2tp lockdep class is not necessary as we can just switch to bh_lock_sock_nested().
In the Linux kernel, the following vulnerability has been resolved: rxrpc: Fix potential data race in rxrpc_wait_to_be_connected() Inside the loop in rxrpc_wait_to_be_connected() it checks call->error to see if it should exit the loop without first checking the call state. This is probably safe as if call->error is set, the call is dead anyway, but we should probably wait for the call state to have been set to completion first, lest it cause surprise on the way out. Fix this by only accessing call->error if the call is complete. We don't actually need to access the error inside the loop as we'll do that after. This caused the following report: BUG: KCSAN: data-race in rxrpc_send_data / rxrpc_set_call_completion write to 0xffff888159cf3c50 of 4 bytes by task 25673 on cpu 1: rxrpc_set_call_completion+0x71/0x1c0 net/rxrpc/call_state.c:22 rxrpc_send_data_packet+0xba9/0x1650 net/rxrpc/output.c:479 rxrpc_transmit_one+0x1e/0x130 net/rxrpc/output.c:714 rxrpc_decant_prepared_tx net/rxrpc/call_event.c:326 [inline] rxrpc_transmit_some_data+0x496/0x600 net/rxrpc/call_event.c:350 rxrpc_input_call_event+0x564/0x1220 net/rxrpc/call_event.c:464 rxrpc_io_thread+0x307/0x1d80 net/rxrpc/io_thread.c:461 kthread+0x1ac/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 read to 0xffff888159cf3c50 of 4 bytes by task 25672 on cpu 0: rxrpc_send_data+0x29e/0x1950 net/rxrpc/sendmsg.c:296 rxrpc_do_sendmsg+0xb7a/0xc20 net/rxrpc/sendmsg.c:726 rxrpc_sendmsg+0x413/0x520 net/rxrpc/af_rxrpc.c:565 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2501 ___sys_sendmsg net/socket.c:2555 [inline] __sys_sendmmsg+0x263/0x500 net/socket.c:2641 __do_sys_sendmmsg net/socket.c:2670 [inline] __se_sys_sendmmsg net/socket.c:2667 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2667 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00000000 -> 0xffffffea
In the Linux kernel, the following vulnerability has been resolved: ocfs2: fix races between hole punching and AIO+DIO After commit "ocfs2: return real error code in ocfs2_dio_wr_get_block", fstests/generic/300 become from always failed to sometimes failed: ======================================================================== [ 473.293420 ] run fstests generic/300 [ 475.296983 ] JBD2: Ignoring recovery information on journal [ 475.302473 ] ocfs2: Mounting device (253,1) on (node local, slot 0) with ordered data mode. [ 494.290998 ] OCFS2: ERROR (device dm-1): ocfs2_change_extent_flag: Owner 5668 has an extent at cpos 78723 which can no longer be found [ 494.291609 ] On-disk corruption discovered. Please run fsck.ocfs2 once the filesystem is unmounted. [ 494.292018 ] OCFS2: File system is now read-only. [ 494.292224 ] (kworker/19:11,2628,19):ocfs2_mark_extent_written:5272 ERROR: status = -30 [ 494.292602 ] (kworker/19:11,2628,19):ocfs2_dio_end_io_write:2374 ERROR: status = -3 fio: io_u error on file /mnt/scratch/racer: Read-only file system: write offset=460849152, buflen=131072 ========================================================================= In __blockdev_direct_IO, ocfs2_dio_wr_get_block is called to add unwritten extents to a list. extents are also inserted into extent tree in ocfs2_write_begin_nolock. Then another thread call fallocate to puch a hole at one of the unwritten extent. The extent at cpos was removed by ocfs2_remove_extent(). At end io worker thread, ocfs2_search_extent_list found there is no such extent at the cpos. T1 T2 T3 inode lock ... insert extents ... inode unlock ocfs2_fallocate __ocfs2_change_file_space inode lock lock ip_alloc_sem ocfs2_remove_inode_range inode ocfs2_remove_btree_range ocfs2_remove_extent ^---remove the extent at cpos 78723 ... unlock ip_alloc_sem inode unlock ocfs2_dio_end_io ocfs2_dio_end_io_write lock ip_alloc_sem ocfs2_mark_extent_written ocfs2_change_extent_flag ocfs2_search_extent_list ^---failed to find extent ... unlock ip_alloc_sem In most filesystems, fallocate is not compatible with racing with AIO+DIO, so fix it by adding to wait for all dio before fallocate/punch_hole like ext4.
In the Linux kernel, the following vulnerability has been resolved: btrfs: use latest_dev in btrfs_show_devname The test case btrfs/238 reports the warning below: WARNING: CPU: 3 PID: 481 at fs/btrfs/super.c:2509 btrfs_show_devname+0x104/0x1e8 [btrfs] CPU: 2 PID: 1 Comm: systemd Tainted: G W O 5.14.0-rc1-custom #72 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: btrfs_show_devname+0x108/0x1b4 [btrfs] show_mountinfo+0x234/0x2c4 m_show+0x28/0x34 seq_read_iter+0x12c/0x3c4 vfs_read+0x29c/0x2c8 ksys_read+0x80/0xec __arm64_sys_read+0x28/0x34 invoke_syscall+0x50/0xf8 do_el0_svc+0x88/0x138 el0_svc+0x2c/0x8c el0t_64_sync_handler+0x84/0xe4 el0t_64_sync+0x198/0x19c Reason: While btrfs_prepare_sprout() moves the fs_devices::devices into fs_devices::seed_list, the btrfs_show_devname() searches for the devices and found none, leading to the warning as in above. Fix: latest_dev is updated according to the changes to the device list. That means we could use the latest_dev->name to show the device name in /proc/self/mounts, the pointer will be always valid as it's assigned before the device is deleted from the list in remove or replace. The RCU protection is sufficient as the device structure is freed after synchronization.
In the Linux kernel, the following vulnerability has been resolved: firmware: arm_scmi: Check mailbox/SMT channel for consistency On reception of a completion interrupt the shared memory area is accessed to retrieve the message header at first and then, if the message sequence number identifies a transaction which is still pending, the related payload is fetched too. When an SCMI command times out the channel ownership remains with the platform until eventually a late reply is received and, as a consequence, any further transmission attempt remains pending, waiting for the channel to be relinquished by the platform. Once that late reply is received the channel ownership is given back to the agent and any pending request is then allowed to proceed and overwrite the SMT area of the just delivered late reply; then the wait for the reply to the new request starts. It has been observed that the spurious IRQ related to the late reply can be wrongly associated with the freshly enqueued request: when that happens the SCMI stack in-flight lookup procedure is fooled by the fact that the message header now present in the SMT area is related to the new pending transaction, even though the real reply has still to arrive. This race-condition on the A2P channel can be detected by looking at the channel status bits: a genuine reply from the platform will have set the channel free bit before triggering the completion IRQ. Add a consistency check to validate such condition in the A2P ISR.
In the Linux kernel, the following vulnerability has been resolved: wifi: rtw89: fix potential race condition between napi_init and napi_enable A race condition can happen if netdev is registered, but NAPI isn't initialized yet, and meanwhile user space starts the netdev that will enable NAPI. Then, it hits BUG_ON(): kernel BUG at net/core/dev.c:6423! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 417 Comm: iwd Not tainted 6.2.7-slab-dirty #3 eb0f5a8a9d91 Hardware name: LENOVO 21DL/LNVNB161216, BIOS JPCN20WW(V1.06) 09/20/2022 RIP: 0010:napi_enable+0x3f/0x50 Code: 48 89 c2 48 83 e2 f6 f6 81 89 08 00 00 02 74 0d 48 83 ... RSP: 0018:ffffada1414f3548 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffa01425802080 RCX: 0000000000000000 RDX: 00000000000002ff RSI: ffffada14e50c614 RDI: ffffa01425808dc0 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000100 R12: ffffa01425808f58 R13: 0000000000000000 R14: ffffa01423498940 R15: 0000000000000001 FS: 00007f5577c0a740(0000) GS:ffffa0169fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5577a19972 CR3: 0000000125a7a000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> rtw89_pci_ops_start+0x1c/0x70 [rtw89_pci 6cbc75429515c181cbc386478d5cfb32ffc5a0f8] rtw89_core_start+0xbe/0x160 [rtw89_core fe07ecb874820b6d778370d4acb6ef8a37847f22] rtw89_ops_start+0x26/0x40 [rtw89_core fe07ecb874820b6d778370d4acb6ef8a37847f22] drv_start+0x42/0x100 [mac80211 c07fa22af8c3cf3f7d7ab3884ca990784d72e2d2] ieee80211_do_open+0x311/0x7d0 [mac80211 c07fa22af8c3cf3f7d7ab3884ca990784d72e2d2] ieee80211_open+0x6a/0x90 [mac80211 c07fa22af8c3cf3f7d7ab3884ca990784d72e2d2] __dev_open+0xe0/0x180 __dev_change_flags+0x1da/0x250 dev_change_flags+0x26/0x70 do_setlink+0x37c/0x12c0 ? ep_poll_callback+0x246/0x290 ? __nla_validate_parse+0x61/0xd00 ? __wake_up_common_lock+0x8f/0xd0 To fix this, follow Jonas' suggestion to switch the order of these functions and move register netdev to be the last step of PCI probe. Also, correct the error handling of rtw89_core_register_hw().
In the Linux kernel, the following vulnerability has been resolved: btrfs: protect folio::private when attaching extent buffer folios [BUG] Since v6.8 there are rare kernel crashes reported by various people, the common factor is bad page status error messages like this: BUG: Bad page state in process kswapd0 pfn:d6e840 page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c pfn:0xd6e840 aops:btree_aops ino:1 flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff) page_type: 0xffffffff() raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0 raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: non-NULL mapping [CAUSE] Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") changes the sequence when allocating a new extent buffer. Previously we always called grab_extent_buffer() under mapping->i_private_lock, to ensure the safety on modification on folio::private (which is a pointer to extent buffer for regular sectorsize). This can lead to the following race: Thread A is trying to allocate an extent buffer at bytenr X, with 4 4K pages, meanwhile thread B is trying to release the page at X + 4K (the second page of the extent buffer at X). Thread A | Thread B -----------------------------------+------------------------------------- | btree_release_folio() | | This is for the page at X + 4K, | | Not page X. | | alloc_extent_buffer() | |- release_extent_buffer() |- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs) | page at bytenr X (the first | | | | page). | | | | Which returned -EEXIST. | | | | | | | |- filemap_lock_folio() | | | | Returned the first page locked. | | | | | | | |- grab_extent_buffer() | | | | |- atomic_inc_not_zero() | | | | | Returned false | | | | |- folio_detach_private() | | |- folio_detach_private() for X | |- folio_test_private() | | |- folio_test_private() | Returned true | | | Returned true |- folio_put() | |- folio_put() Now there are two puts on the same folio at folio X, leading to refcount underflow of the folio X, and eventually causing the BUG_ON() on the page->mapping. The condition is not that easy to hit: - The release must be triggered for the middle page of an eb If the release is on the same first page of an eb, page lock would kick in and prevent the race. - folio_detach_private() has a very small race window It's only between folio_test_private() and folio_clear_private(). That's exactly when mapping->i_private_lock is used to prevent such race, and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method") screwed that up. At that time, I thought the page lock would kick in as filemap_release_folio() also requires the page to be locked, but forgot the filemap_release_folio() only locks one page, not all pages of an extent buffer. [FIX] Move all the code requiring i_private_lock into attach_eb_folio_to_filemap(), so that everything is done with proper lock protection. Furthermore to prevent future problems, add an extra lockdep_assert_locked() to ensure we're holding the proper lock. To reproducer that is able to hit the race (takes a few minutes with instrumented code inserting delays to alloc_extent_buffer()): #!/bin/sh drop_caches () { while(true); do echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory done } run_tar () { while(true); do for x in `seq 1 80` ; do tar cf /dev/zero /mnt > /dev/null & done wait done } mkfs.btrfs -f -d single -m single ---truncated---
In the Linux kernel, the following vulnerability has been resolved: af_unix: Fix data races in unix_release_sock/unix_stream_sendmsg A data-race condition has been identified in af_unix. In one data path, the write function unix_release_sock() atomically writes to sk->sk_shutdown using WRITE_ONCE. However, on the reader side, unix_stream_sendmsg() does not read it atomically. Consequently, this issue is causing the following KCSAN splat to occur: BUG: KCSAN: data-race in unix_release_sock / unix_stream_sendmsg write (marked) to 0xffff88867256ddbb of 1 bytes by task 7270 on cpu 28: unix_release_sock (net/unix/af_unix.c:640) unix_release (net/unix/af_unix.c:1050) sock_close (net/socket.c:659 net/socket.c:1421) __fput (fs/file_table.c:422) __fput_sync (fs/file_table.c:508) __se_sys_close (fs/open.c:1559 fs/open.c:1541) __x64_sys_close (fs/open.c:1541) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) read to 0xffff88867256ddbb of 1 bytes by task 989 on cpu 14: unix_stream_sendmsg (net/unix/af_unix.c:2273) __sock_sendmsg (net/socket.c:730 net/socket.c:745) ____sys_sendmsg (net/socket.c:2584) __sys_sendmmsg (net/socket.c:2638 net/socket.c:2724) __x64_sys_sendmmsg (net/socket.c:2753 net/socket.c:2750 net/socket.c:2750) x64_sys_call (arch/x86/entry/syscall_64.c:33) do_syscall_64 (arch/x86/entry/common.c:?) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) value changed: 0x01 -> 0x03 The line numbers are related to commit dd5a440a31fa ("Linux 6.9-rc7"). Commit e1d09c2c2f57 ("af_unix: Fix data races around sk->sk_shutdown.") addressed a comparable issue in the past regarding sk->sk_shutdown. However, it overlooked resolving this particular data path. This patch only offending unix_stream_sendmsg() function, since the other reads seem to be protected by unix_state_lock() as discussed in
In the Linux kernel, the following vulnerability has been resolved: cgroup: fix race between task migration and iteration When a task is migrated out of a css_set, cgroup_migrate_add_task() first moves it from cset->tasks to cset->mg_tasks via: list_move_tail(&task->cg_list, &cset->mg_tasks); If a css_task_iter currently has it->task_pos pointing to this task, css_set_move_task() calls css_task_iter_skip() to keep the iterator valid. However, since the task has already been moved to ->mg_tasks, the iterator is advanced relative to the mg_tasks list instead of the original tasks list. As a result, remaining tasks on cset->tasks, as well as tasks queued on cset->mg_tasks, can be skipped by iteration. Fix this by calling css_set_skip_task_iters() before unlinking task->cg_list from cset->tasks. This advances all active iterators to the next task on cset->tasks, so iteration continues correctly even when a task is concurrently being migrated. This race is hard to hit in practice without instrumentation, but it can be reproduced by artificially slowing down cgroup_procs_show(). For example, on an Android device a temporary /sys/kernel/cgroup/cgroup_test knob can be added to inject a delay into cgroup_procs_show(), and then: 1) Spawn three long-running tasks (PIDs 101, 102, 103). 2) Create a test cgroup and move the tasks into it. 3) Enable a large delay via /sys/kernel/cgroup/cgroup_test. 4) In one shell, read cgroup.procs from the test cgroup. 5) Within the delay window, in another shell migrate PID 102 by writing it to a different cgroup.procs file. Under this setup, cgroup.procs can intermittently show only PID 101 while skipping PID 103. Once the migration completes, reading the file again shows all tasks as expected. Note that this change does not allow removing the existing css_set_skip_task_iters() call in css_set_move_task(). The new call in cgroup_migrate_add_task() only handles iterators that are racing with migration while the task is still on cset->tasks. Iterators may also start after the task has been moved to cset->mg_tasks. If we dropped css_set_skip_task_iters() from css_set_move_task(), such iterators could keep task_pos pointing to a migrating task, causing css_task_iter_advance() to malfunction on the destination css_set, up to and including crashes or infinite loops. The race window between migration and iteration is very small, and css_task_iter is not on a hot path. In the worst case, when an iterator is positioned on the first thread of the migrating process, cgroup_migrate_add_task() may have to skip multiple tasks via css_set_skip_task_iters(). However, this only happens when migration and iteration actually race, so the performance impact is negligible compared to the correctness fix provided here.
In the Linux kernel, the following vulnerability has been resolved: media: rkisp1: Fix IRQ disable race issue In rkisp1_isp_stop() and rkisp1_csi_disable() the driver masks the interrupts and then apparently assumes that the interrupt handler won't be running, and proceeds in the stop procedure. This is not the case, as the interrupt handler can already be running, which would lead to the ISP being disabled while the interrupt handler handling a captured frame. This brings up two issues: 1) the ISP could be powered off while the interrupt handler is still running and accessing registers, leading to board lockup, and 2) the interrupt handler code and the code that disables the streaming might do things that conflict. It is not clear to me if 2) causes a real issue, but 1) can be seen with a suitable delay (or printk in my case) in the interrupt handler, leading to board lockup.
In the Linux kernel, the following vulnerability has been resolved: nfsd: Fix nsfd startup race (again) Commit bd5ae9288d64 ("nfsd: register pernet ops last, unregister first") has re-opened rpc_pipefs_event() race against nfsd_net_id registration (register_pernet_subsys()) which has been fixed by commit bb7ffbf29e76 ("nfsd: fix nsfd startup race triggering BUG_ON"). Restore the order of register_pernet_subsys() vs register_cld_notifier(). Add WARN_ON() to prevent a future regression. Crash info: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000012 CPU: 8 PID: 345 Comm: mount Not tainted 5.4.144-... #1 pc : rpc_pipefs_event+0x54/0x120 [nfsd] lr : rpc_pipefs_event+0x48/0x120 [nfsd] Call trace: rpc_pipefs_event+0x54/0x120 [nfsd] blocking_notifier_call_chain rpc_fill_super get_tree_keyed rpc_fs_get_tree vfs_get_tree do_mount ksys_mount __arm64_sys_mount el0_svc_handler el0_svc
In the Linux kernel, the following vulnerability has been resolved: rxrpc: Fix recv-recv race of completed call If a call receives an event (such as incoming data), the call gets placed on the socket's queue and a thread in recvmsg can be awakened to go and process it. Once the thread has picked up the call off of the queue, further events will cause it to be requeued, and once the socket lock is dropped (recvmsg uses call->user_mutex to allow the socket to be used in parallel), a second thread can come in and its recvmsg can pop the call off the socket queue again. In such a case, the first thread will be receiving stuff from the call and the second thread will be blocked on call->user_mutex. The first thread can, at this point, process both the event that it picked call for and the event that the second thread picked the call for and may see the call terminate - in which case the call will be "released", decoupling the call from the user call ID assigned to it (RXRPC_USER_CALL_ID in the control message). The first thread will return okay, but then the second thread will wake up holding the user_mutex and, if it sees that the call has been released by the first thread, it will BUG thusly: kernel BUG at net/rxrpc/recvmsg.c:474! Fix this by just dequeuing the call and ignoring it if it is seen to be already released. We can't tell userspace about it anyway as the user call ID has become stale.
In the Linux kernel, the following vulnerability has been resolved: io-wq: check for wq exit after adding new worker task_work We check IO_WQ_BIT_EXIT before attempting to create a new worker, and wq exit cancels pending work if we have any. But it's possible to have a race between the two, where creation checks exit finding it not set, but we're in the process of exiting. The exit side will cancel pending creation task_work, but there's a gap where we add task_work after we've canceled existing creations at exit time. Fix this by checking the EXIT bit post adding the creation task_work. If it's set, run the same cancelation that exit does.
In the Linux kernel, the following vulnerability has been resolved: userfaultfd: fix a race between writeprotect and exit_mmap() A race is possible when a process exits, its VMAs are removed by exit_mmap() and at the same time userfaultfd_writeprotect() is called. The race was detected by KASAN on a development kernel, but it appears to be possible on vanilla kernels as well. Use mmget_not_zero() to prevent the race as done in other userfaultfd operations.
In the Linux kernel, the following vulnerability has been resolved: net/smc: fix kernel panic caused by race of smc_sock A crash occurs when smc_cdc_tx_handler() tries to access smc_sock but smc_release() has already freed it. [ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88 [ 4570.696048] #PF: supervisor write access in kernel mode [ 4570.696728] #PF: error_code(0x0002) - not-present page [ 4570.697401] PGD 0 P4D 0 [ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111 [ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0 [ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30 <...> [ 4570.711446] Call Trace: [ 4570.711746] <IRQ> [ 4570.711992] smc_cdc_tx_handler+0x41/0xc0 [ 4570.712470] smc_wr_tx_tasklet_fn+0x213/0x560 [ 4570.712981] ? smc_cdc_tx_dismisser+0x10/0x10 [ 4570.713489] tasklet_action_common.isra.17+0x66/0x140 [ 4570.714083] __do_softirq+0x123/0x2f4 [ 4570.714521] irq_exit_rcu+0xc4/0xf0 [ 4570.714934] common_interrupt+0xba/0xe0 Though smc_cdc_tx_handler() checked the existence of smc connection, smc_release() may have already dismissed and released the smc socket before smc_cdc_tx_handler() further visits it. smc_cdc_tx_handler() |smc_release() if (!conn) | | |smc_cdc_tx_dismiss_slots() | smc_cdc_tx_dismisser() | |sock_put(&smc->sk) <- last sock_put, | smc_sock freed bh_lock_sock(&smc->sk) (panic) | To make sure we won't receive any CDC messages after we free the smc_sock, add a refcount on the smc_connection for inflight CDC message(posted to the QP but haven't received related CQE), and don't release the smc_connection until all the inflight CDC messages haven been done, for both success or failed ones. Using refcount on CDC messages brings another problem: when the link is going to be destroyed, smcr_link_clear() will reset the QP, which then remove all the pending CQEs related to the QP in the CQ. To make sure all the CQEs will always come back so the refcount on the smc_connection can always reach 0, smc_ib_modify_qp_reset() was replaced by smc_ib_modify_qp_error(). And remove the timeout in smc_wr_tx_wait_no_pending_sends() since we need to wait for all pending WQEs done, or we may encounter use-after- free when handling CQEs. For IB device removal routine, we need to wait for all the QPs on that device been destroyed before we can destroy CQs on the device, or the refcount on smc_connection won't reach 0 and smc_sock cannot be released.
In the Linux kernel, the following vulnerability has been resolved: ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but we may have scheduled task work via io_uring_cmd_complete_in_task() for dispatching request, then kernel crash can be triggered. Fix it by not trying to canceling the command if ublk block request is started.
In the Linux kernel, the following vulnerability has been resolved: s390/qeth: fix deadlock during failing recovery Commit 0b9902c1fcc5 ("s390/qeth: fix deadlock during recovery") removed taking discipline_mutex inside qeth_do_reset(), fixing potential deadlocks. An error path was missed though, that still takes discipline_mutex and thus has the original deadlock potential. Intermittent deadlocks were seen when a qeth channel path is configured offline, causing a race between qeth_do_reset and ccwgroup_remove. Call qeth_set_offline() directly in the qeth_do_reset() error case and then a new variant of ccwgroup_set_offline(), without taking discipline_mutex.
In the Linux kernel, the following vulnerability has been resolved: udp: fix race between close() and udp_abort() Kaustubh reported and diagnosed a panic in udp_lib_lookup(). The root cause is udp_abort() racing with close(). Both racing functions acquire the socket lock, but udp{v6}_destroy_sock() release it before performing destructive actions. We can't easily extend the socket lock scope to avoid the race, instead use the SOCK_DEAD flag to prevent udp_abort from doing any action when the critical race happens. Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org>
In the Linux kernel, the following vulnerability has been resolved: ocfs2: fix race between searching chunks and release journal_head from buffer_head Encountered a race between ocfs2_test_bg_bit_allocatable() and jbd2_journal_put_journal_head() resulting in the below vmcore. PID: 106879 TASK: ffff880244ba9c00 CPU: 2 COMMAND: "loop3" Call trace: panic oops_end no_context __bad_area_nosemaphore bad_area_nosemaphore __do_page_fault do_page_fault page_fault [exception RIP: ocfs2_block_group_find_clear_bits+316] ocfs2_block_group_find_clear_bits [ocfs2] ocfs2_cluster_group_search [ocfs2] ocfs2_search_chain [ocfs2] ocfs2_claim_suballoc_bits [ocfs2] __ocfs2_claim_clusters [ocfs2] ocfs2_claim_clusters [ocfs2] ocfs2_local_alloc_slide_window [ocfs2] ocfs2_reserve_local_alloc_bits [ocfs2] ocfs2_reserve_clusters_with_limit [ocfs2] ocfs2_reserve_clusters [ocfs2] ocfs2_lock_refcount_allocators [ocfs2] ocfs2_make_clusters_writable [ocfs2] ocfs2_replace_cow [ocfs2] ocfs2_refcount_cow [ocfs2] ocfs2_file_write_iter [ocfs2] lo_rw_aio loop_queue_work kthread_worker_fn kthread ret_from_fork When ocfs2_test_bg_bit_allocatable() called bh2jh(bg_bh), the bg_bh->b_private NULL as jbd2_journal_put_journal_head() raced and released the jounal head from the buffer head. Needed to take bit lock for the bit 'BH_JournalHead' to fix this race.
In the Linux kernel, the following vulnerability has been resolved: btrfs: fix race in read_extent_buffer_pages() There are reports from tree-checker that detects corrupted nodes, without any obvious pattern so possibly an overwrite in memory. After some debugging it turns out there's a race when reading an extent buffer the uptodate status can be missed. To prevent concurrent reads for the same extent buffer, read_extent_buffer_pages() performs these checks: /* (1) */ if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) return 0; /* (2) */ if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags)) goto done; At this point, it seems safe to start the actual read operation. Once that completes, end_bbio_meta_read() does /* (3) */ set_extent_buffer_uptodate(eb); /* (4) */ clear_bit(EXTENT_BUFFER_READING, &eb->bflags); Normally, this is enough to ensure only one read happens, and all other callers wait for it to finish before returning. Unfortunately, there is a racey interleaving: Thread A | Thread B | Thread C ---------+----------+--------- (1) | | | (1) | (2) | | (3) | | (4) | | | (2) | | | (1) When this happens, thread B kicks of an unnecessary read. Worse, thread C will see UPTODATE set and return immediately, while the read from thread B is still in progress. This race could result in tree-checker errors like this as the extent buffer is concurrently modified: BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360] Fix it by testing UPTODATE again after setting the READING bit, and if it's been set, skip the unnecessary read. [ minor update of changelog ]
In the Linux kernel, the following vulnerability has been resolved: dmaengine: dw-edma: eDMA: Add sync read before starting the DMA transfer in remote setup The Linked list element and pointer are not stored in the same memory as the eDMA controller register. If the doorbell register is toggled before the full write of the linked list a race condition error will occur. In remote setup we can only use a readl to the memory to assure the full write has occurred.
In the Linux kernel, the following vulnerability has been resolved: quota: Fix potential NULL pointer dereference Below race may cause NULL pointer dereference P1 P2 dquot_free_inode quota_off drop_dquot_ref remove_dquot_ref dquots = i_dquot(inode) dquots = i_dquot(inode) srcu_read_lock dquots[cnt]) != NULL (1) dquots[type] = NULL (2) spin_lock(&dquots[cnt]->dq_dqb_lock) (3) .... If dquot_free_inode(or other routines) checks inode's quota pointers (1) before quota_off sets it to NULL(2) and use it (3) after that, NULL pointer dereference will be triggered. So let's fix it by using a temporary pointer to avoid this issue.
In the Linux kernel, the following vulnerability has been resolved: nvme-pci: fix race condition between reset and nvme_dev_disable() nvme_dev_disable() modifies the dev->online_queues field, therefore nvme_pci_update_nr_queues() should avoid racing against it, otherwise we could end up passing invalid values to blk_mq_update_nr_hw_queues(). WARNING: CPU: 39 PID: 61303 at drivers/pci/msi/api.c:347 pci_irq_get_affinity+0x187/0x210 Workqueue: nvme-reset-wq nvme_reset_work [nvme] RIP: 0010:pci_irq_get_affinity+0x187/0x210 Call Trace: <TASK> ? blk_mq_pci_map_queues+0x87/0x3c0 ? pci_irq_get_affinity+0x187/0x210 blk_mq_pci_map_queues+0x87/0x3c0 nvme_pci_map_queues+0x189/0x460 [nvme] blk_mq_update_nr_hw_queues+0x2a/0x40 nvme_reset_work+0x1be/0x2a0 [nvme] Fix the bug by locking the shutdown_lock mutex before using dev->online_queues. Give up if nvme_dev_disable() is running or if it has been executed already.
A race condition was found in the Linux kernel's sound/hda device driver in snd_hdac_regmap_sync() function. This can result in a null pointer dereference issue, possibly leading to a kernel panic or denial of service issue.
A flaw was found in the Netfilter subsystem of the Linux kernel. A race condition between IPSET_CMD_ADD and IPSET_CMD_SWAP can lead to a kernel panic due to the invocation of `__ip_set_put` on a wrong `set`. This issue may allow a local user to crash the system.
In the Linux kernel, the following vulnerability has been resolved: netfs: Fix race between cache write completion and ALL_QUEUED being set When netfslib is issuing subrequests, the subrequests start processing immediately and may complete before we reach the end of the issuing function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED to indicate to the collector that we aren't going to issue any more subreqs and that it can do the final notifications and cleanup. Now, this isn't a problem if the request is synchronous (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be done in-thread and we're guaranteed an opportunity to run the collector. However, if the request is asynchronous, collection is primarily triggered by the termination of subrequests queuing it on a workqueue. Now, a race can occur here if the app thread sets ALL_QUEUED after the last subrequest terminates. This can happen most easily with the copy2cache code (as used by Ceph) where, in the collection routine of a read request, an asynchronous write request is spawned to copy data to the cache. Folios are added to the write request as they're unlocked, but there may be a delay before ALL_QUEUED is set as the write subrequests may complete before we get there. If all the write subreqs have finished by the ALL_QUEUED point, no further events happen and the collection never happens, leaving the request hanging. Fix this by queuing the collector after setting ALL_QUEUED. This is a bit heavy-handed and it may be sufficient to do it only if there are no extant subreqs. Also add a tracepoint to cross-reference both requests in a copy-to-request operation and add a trace to the netfs_rreq tracepoint to indicate the setting of ALL_QUEUED.
A use-after-free vulnerability was found in network namespaces code affecting the Linux kernel before 4.14.11. The function get_net_ns_by_id() in net/core/net_namespace.c does not check for the net::count value after it has found a peer network in netns_ids idr, which could lead to double free and memory corruption. This vulnerability could allow an unprivileged local user to induce kernel memory corruption on the system, leading to a crash. Due to the nature of the flaw, privilege escalation cannot be fully ruled out, although it is thought to be unlikely.
In the Linux kernel, the following vulnerability has been resolved: fsnotify: clear PARENT_WATCHED flags lazily In some setups directories can have many (usually negative) dentries. Hence __fsnotify_update_child_dentry_flags() function can take a significant amount of time. Since the bulk of this function happens under inode->i_lock this causes a significant contention on the lock when we remove the watch from the directory as the __fsnotify_update_child_dentry_flags() call from fsnotify_recalc_mask() races with __fsnotify_update_child_dentry_flags() calls from __fsnotify_parent() happening on children. This can lead upto softlockup reports reported by users. Fix the problem by calling fsnotify_update_children_dentry_flags() to set PARENT_WATCHED flags only when parent starts watching children. When parent stops watching children, clear false positive PARENT_WATCHED flags lazily in __fsnotify_parent() for each accessed child.
In the Linux kernel, the following vulnerability has been resolved: drm/panthor: Fix race condition when gathering fdinfo group samples Commit e16635d88fa0 ("drm/panthor: add DRM fdinfo support") failed to protect access to groups with an xarray lock, which could lead to use-after-free errors.
In the Linux kernel, the following vulnerability has been resolved: mm: userfaultfd: fix race of userfaultfd_move and swap cache This commit fixes two kinds of races, they may have different results: Barry reported a BUG_ON in commit c50f8e6053b0, we may see the same BUG_ON if the filemap lookup returned NULL and folio is added to swap cache after that. If another kind of race is triggered (folio changed after lookup) we may see RSS counter is corrupted: [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1 [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1 Because the folio is being accounted to the wrong VMA. I'm not sure if there will be any data corruption though, seems no. The issues above are critical already. On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache lookup, and tries to move the found folio to the faulting vma. Currently, it relies on checking the PTE value to ensure that the moved folio still belongs to the src swap entry and that no new folio has been added to the swap cache, which turns out to be unreliable. While working and reviewing the swap table series with Barry, following existing races are observed and reproduced [1]: In the example below, move_pages_pte is moving src_pte to dst_pte, where src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the swap cache: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1 ... < interrupted> ... <swapin src_pte, alloc and use folio A> // folio A is a new allocated folio // and get installed into src_pte <frees swap entry S1> // src_pte now points to folio A, S1 // has swap count == 0, it can be freed // by folio_swap_swap or swap // allocator's reclaim. <try to swap out another folio B> // folio B is a folio in another VMA. <put folio B to swap cache using S1 > // S1 is freed, folio B can use it // for swap out with no problem. ... folio = filemap_get_folio(S1) // Got folio B here !!! ... < interrupted again> ... <swapin folio B and free S1> // Now S1 is free to be used again. <swapout src_pte & folio A using S1> // Now src_pte is a swap entry PTE // holding S1 again. folio_trylock(folio) move_swap_pte double_pt_lock is_pte_pages_stable // Check passed because src_pte == S1 folio_move_anon_rmap(...) // Moved invalid folio B here !!! The race window is very short and requires multiple collisions of multiple rare events, so it's very unlikely to happen, but with a deliberately constructed reproducer and increased time window, it can be reproduced easily. This can be fixed by checking if the folio returned by filemap is the valid swap cache folio after acquiring the folio lock. Another similar race is possible: filemap_get_folio may return NULL, but folio (A) could be swapped in and then swapped out again using the same swap entry after the lookup. In such a case, folio (A) may remain in the swap cache, so it must be moved too: CPU1 CPU2 userfaultfd_move move_pages_pte() entry = pte_to_swp_entry(orig_src_pte); // Here it got entry = S1, and S1 is not in swap cache folio = filemap_get ---truncated---
In the Linux kernel, the following vulnerability has been resolved: mm/smaps: fix race between smaps_hugetlb_range and migration smaps_hugetlb_range() handles the pte without holdling ptl, and may be concurrenct with migration, leaing to BUG_ON in pfn_swap_entry_to_page(). The race is as follows. smaps_hugetlb_range migrate_pages huge_ptep_get remove_migration_ptes folio_unlock pfn_swap_entry_folio BUG_ON To fix it, hold ptl lock in smaps_hugetlb_range().
In the Linux kernel, the following vulnerability has been resolved: sched/fair: Fix fault in reweight_entity Syzbot found a GPF in reweight_entity. This has been bisected to commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") There is a race between sched_post_fork() and setpriority(PRIO_PGRP) within a thread group that causes a null-ptr-deref in reweight_entity() in CFS. The scenario is that the main process spawns number of new threads, which then call setpriority(PRIO_PGRP, 0, -20), wait, and exit. For each of the new threads the copy_process() gets invoked, which adds the new task_struct and calls sched_post_fork() for it. In the above scenario there is a possibility that setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread in the group that is just being created by copy_process(), and for which the sched_post_fork() has not been executed yet. This will trigger a null pointer dereference in reweight_entity(), as it will try to access the run queue pointer, which hasn't been set. Before the mentioned change the cfs_rq pointer for the task has been set in sched_fork(), which is called much earlier in copy_process(), before the new task is added to the thread_group. Now it is done in the sched_post_fork(), which is called after that. To fix the issue the remove the update_load param from the update_load param() function and call reweight_task() only if the task flag doesn't have the TASK_NEW flag set.
In the Linux kernel, the following vulnerability has been resolved: smb: client: fix race with concurrent opens in rename(2) Besides sending the rename request to the server, the rename process also involves closing any deferred close, waiting for outstanding I/O to complete as well as marking all existing open handles as deleted to prevent them from deferring closes, which increases the race window for potential concurrent opens on the target file. Fix this by unhashing the dentry in advance to prevent any concurrent opens on the target.
In the Linux kernel, the following vulnerability has been resolved: net/sched: sch_qfq: Fix race condition on qfq_aggregate A race condition can occur when 'agg' is modified in qfq_change_agg (called during qfq_enqueue) while other threads access it concurrently. For example, qfq_dump_class may trigger a NULL dereference, and qfq_delete_class may cause a use-after-free. This patch addresses the issue by: 1. Moved qfq_destroy_class into the critical section. 2. Added sch_tree_lock protection to qfq_dump_class and qfq_dump_class_stats.
In the Linux kernel, the following vulnerability has been resolved: wireguard: receive: annotate data-race around receiving_counter.counter Syzkaller with KCSAN identified a data-race issue when accessing keypair->receiving_counter.counter. Use READ_ONCE() and WRITE_ONCE() annotations to mark the data race as intentional. BUG: KCSAN: data-race in wg_packet_decrypt_worker / wg_packet_rx_poll write to 0xffff888107765888 of 8 bytes by interrupt on cpu 0: counter_validate drivers/net/wireguard/receive.c:321 [inline] wg_packet_rx_poll+0x3ac/0xf00 drivers/net/wireguard/receive.c:461 __napi_poll+0x60/0x3b0 net/core/dev.c:6536 napi_poll net/core/dev.c:6605 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6738 __do_softirq+0xc4/0x279 kernel/softirq.c:553 do_softirq+0x5e/0x90 kernel/softirq.c:454 __local_bh_enable_ip+0x64/0x70 kernel/softirq.c:381 __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:167 [inline] _raw_spin_unlock_bh+0x36/0x40 kernel/locking/spinlock.c:210 spin_unlock_bh include/linux/spinlock.h:396 [inline] ptr_ring_consume_bh include/linux/ptr_ring.h:367 [inline] wg_packet_decrypt_worker+0x6c5/0x700 drivers/net/wireguard/receive.c:499 process_one_work kernel/workqueue.c:2633 [inline] ... read to 0xffff888107765888 of 8 bytes by task 3196 on cpu 1: decrypt_packet drivers/net/wireguard/receive.c:252 [inline] wg_packet_decrypt_worker+0x220/0x700 drivers/net/wireguard/receive.c:501 process_one_work kernel/workqueue.c:2633 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706 worker_thread+0x525/0x730 kernel/workqueue.c:2787 ...
A flaw was found in the subsequent get_user_pages_fast in the Linux kernel’s interface for symmetric key cipher algorithms in the skcipher_recvmsg of crypto/algif_skcipher.c function. This flaw allows a local user to crash the system.
An issue was discovered in drivers/bluetooth/hci_ldisc.c in the Linux kernel 6.2. In hci_uart_tty_ioctl, there is a race condition between HCIUARTSETPROTO and HCIUARTGETPROTO. HCI_UART_PROTO_SET is set before hu->proto is set. A NULL pointer dereference may occur.
In the Linux kernel, the following vulnerability has been resolved: coresight: tmc-etr: Fix race condition between sysfs and perf mode When trying to run perf and sysfs mode simultaneously, the WARN_ON() in tmc_etr_enable_hw() is triggered sometimes: WARNING: CPU: 42 PID: 3911571 at drivers/hwtracing/coresight/coresight-tmc-etr.c:1060 tmc_etr_enable_hw+0xc0/0xd8 [coresight_tmc] [..snip..] Call trace: tmc_etr_enable_hw+0xc0/0xd8 [coresight_tmc] (P) tmc_enable_etr_sink+0x11c/0x250 [coresight_tmc] (L) tmc_enable_etr_sink+0x11c/0x250 [coresight_tmc] coresight_enable_path+0x1c8/0x218 [coresight] coresight_enable_sysfs+0xa4/0x228 [coresight] enable_source_store+0x58/0xa8 [coresight] dev_attr_store+0x20/0x40 sysfs_kf_write+0x4c/0x68 kernfs_fop_write_iter+0x120/0x1b8 vfs_write+0x2c8/0x388 ksys_write+0x74/0x108 __arm64_sys_write+0x24/0x38 el0_svc_common.constprop.0+0x64/0x148 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x130 el0t_64_sync_handler+0xc8/0xd0 el0t_64_sync+0x1ac/0x1b0 ---[ end trace 0000000000000000 ]--- Since the enablement of sysfs mode is separeted into two critical regions, one for sysfs buffer allocation and another for hardware enablement, it's possible to race with the perf mode. Fix this by double check whether the perf mode's been used before enabling the hardware in sysfs mode. mode: [sysfs mode] [perf mode] tmc_etr_get_sysfs_buffer() spin_lock(&drvdata->spinlock) [sysfs buffer allocation] spin_unlock(&drvdata->spinlock) spin_lock(&drvdata->spinlock) tmc_etr_enable_hw() drvdata->etr_buf = etr_perf->etr_buf spin_unlock(&drvdata->spinlock) spin_lock(&drvdata->spinlock) tmc_etr_enable_hw() WARN_ON(drvdata->etr_buf) // WARN sicne etr_buf initialized at the perf side spin_unlock(&drvdata->spinlock) With this fix, we retain the check for CS_MODE_PERF in get_etr_sysfs_buf. This ensures we verify whether the perf mode's already running before we actually allocate the buffer. Then we can save the time of allocating/freeing the sysfs buffer if race with the perf mode.
In the Linux kernel, the following vulnerability has been resolved: m68k: Fix spinlock race in kernel thread creation Context switching does take care to retain the correct lock owner across the switch from 'prev' to 'next' tasks. This does rely on interrupts remaining disabled for the entire duration of the switch. This condition is guaranteed for normal process creation and context switching between already running processes, because both 'prev' and 'next' already have interrupts disabled in their saved copies of the status register. The situation is different for newly created kernel threads. The status register is set to PS_S in copy_thread(), which does leave the IPL at 0. Upon restoring the 'next' thread's status register in switch_to() aka resume(), interrupts then become enabled prematurely. resume() then returns via ret_from_kernel_thread() and schedule_tail() where run queue lock is released (see finish_task_switch() and finish_lock_switch()). A timer interrupt calling scheduler_tick() before the lock is released in finish_task_switch() will find the lock already taken, with the current task as lock owner. This causes a spinlock recursion warning as reported by Guenter Roeck. As far as I can ascertain, this race has been opened in commit 533e6903bea0 ("m68k: split ret_from_fork(), simplify kernel_thread()") but I haven't done a detailed study of kernel history so it may well predate that commit. Interrupts cannot be disabled in the saved status register copy for kernel threads (init will complain about interrupts disabled when finally starting user space). Disable interrupts temporarily when switching the tasks' register sets in resume(). Note that a simple oriw 0x700,%sr after restoring sr is not enough here - this leaves enough of a race for the 'spinlock recursion' warning to still be observed. Tested on ARAnyM and qemu (Quadra 800 emulation).
In the Linux kernel, the following vulnerability has been resolved: ath11k: fix netdev open race Make sure to allocate resources needed before registering the device. This specifically avoids having a racing open() trigger a BUG_ON() in mod_timer() when ath11k_mac_op_start() is called before the mon_reap_timer as been set up. I did not see this issue with next-20220310, but I hit it on every probe with next-20220511. Perhaps some timing changed in between. Here's the backtrace: [ 51.346947] kernel BUG at kernel/time/timer.c:990! [ 51.346958] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ... [ 51.578225] Call trace: [ 51.583293] __mod_timer+0x298/0x390 [ 51.589518] mod_timer+0x14/0x20 [ 51.595368] ath11k_mac_op_start+0x41c/0x4a0 [ath11k] [ 51.603165] drv_start+0x38/0x60 [mac80211] [ 51.610110] ieee80211_do_open+0x29c/0x7d0 [mac80211] [ 51.617945] ieee80211_open+0x60/0xb0 [mac80211] [ 51.625311] __dev_open+0x100/0x1c0 [ 51.631420] __dev_change_flags+0x194/0x210 [ 51.638214] dev_change_flags+0x24/0x70 [ 51.644646] do_setlink+0x228/0xdb0 [ 51.650723] __rtnl_newlink+0x460/0x830 [ 51.657162] rtnl_newlink+0x4c/0x80 [ 51.663229] rtnetlink_rcv_msg+0x124/0x390 [ 51.669917] netlink_rcv_skb+0x58/0x130 [ 51.676314] rtnetlink_rcv+0x18/0x30 [ 51.682460] netlink_unicast+0x250/0x310 [ 51.688960] netlink_sendmsg+0x19c/0x3e0 [ 51.695458] ____sys_sendmsg+0x220/0x290 [ 51.701938] ___sys_sendmsg+0x7c/0xc0 [ 51.708148] __sys_sendmsg+0x68/0xd0 [ 51.714254] __arm64_sys_sendmsg+0x28/0x40 [ 51.720900] invoke_syscall+0x48/0x120 Tested-on: WCN6855 hw2.0 PCI WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3
In the Linux kernel, the following vulnerability has been resolved: nvme-pci: Fix race bug in nvme_poll_irqdisable() In the following scenario, pdev can be disabled between (1) and (3) by (2). This sets pdev->msix_enabled = 0. Then, pci_irq_vector() will return MSI-X IRQ(>15) for (1) whereas return INTx IRQ(<=15) for (2). This causes IRQ warning because it tries to enable INTx IRQ that has never been disabled before. To fix this, save IRQ number into a local variable and ensure disable_irq() and enable_irq() operate on the same IRQ number. Even if pci_free_irq_vectors() frees the IRQ concurrently, disable_irq() and enable_irq() on a stale IRQ number is still valid and safe, and the depth accounting reamins balanced. task 1: nvme_poll_irqdisable() disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector)) ...(1) enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector)) ...(3) task 2: nvme_reset_work() nvme_dev_disable() pdev->msix_enable = 0; ...(2) crash log: ------------[ cut here ]------------ Unbalanced enable for IRQ 10 WARNING: kernel/irq/manage.c:753 at __enable_irq+0x102/0x190 kernel/irq/manage.c:753, CPU#1: kworker/1:0H/26 Modules linked in: CPU: 1 UID: 0 PID: 26 Comm: kworker/1:0H Not tainted 6.19.0-dirty #9 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Workqueue: kblockd blk_mq_timeout_work RIP: 0010:__enable_irq+0x107/0x190 kernel/irq/manage.c:753 Code: ff df 48 89 fa 48 c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 79 48 8d 3d 2e 7a 3f 05 41 8b 74 24 2c <67> 48 0f b9 3a e8 ef b9 21 00 5b 41 5c 5d e9 46 54 66 03 e8 e1 b9 RSP: 0018:ffffc900001bf550 EFLAGS: 00010046 RAX: 0000000000000007 RBX: 0000000000000000 RCX: ffffffffb20c0e90 RDX: 0000000000000000 RSI: 000000000000000a RDI: ffffffffb74b88f0 RBP: ffffc900001bf560 R08: ffff88800197cf00 R09: 0000000000000001 R10: 0000000000000003 R11: 0000000000000003 R12: ffff8880012a6000 R13: 1ffff92000037eae R14: 000000000000000a R15: 0000000000000293 FS: 0000000000000000(0000) GS:ffff8880b49f7000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555da4a25fa8 CR3: 00000000208e8000 CR4: 00000000000006f0 Call Trace: <TASK> enable_irq+0x121/0x1e0 kernel/irq/manage.c:797 nvme_poll_irqdisable+0x162/0x1c0 drivers/nvme/host/pci.c:1494 nvme_timeout+0x965/0x14b0 drivers/nvme/host/pci.c:1744 blk_mq_rq_timed_out block/blk-mq.c:1653 [inline] blk_mq_handle_expired+0x227/0x2d0 block/blk-mq.c:1721 bt_iter+0x2fc/0x3a0 block/blk-mq-tag.c:292 __sbitmap_for_each_set include/linux/sbitmap.h:269 [inline] sbitmap_for_each_set include/linux/sbitmap.h:290 [inline] bt_for_each block/blk-mq-tag.c:324 [inline] blk_mq_queue_tag_busy_iter+0x969/0x1e80 block/blk-mq-tag.c:536 blk_mq_timeout_work+0x627/0x870 block/blk-mq.c:1763 process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257 process_scheduled_works kernel/workqueue.c:3340 [inline] worker_thread+0x65c/0xe60 kernel/workqueue.c:3421 kthread+0x41a/0x930 kernel/kthread.c:463 ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> irq event stamp: 74478 hardirqs last enabled at (74477): [<ffffffffb5720a9c>] __raw_spin_unlock_irq include/linux/spinlock_api_smp.h:159 [inline] hardirqs last enabled at (74477): [<ffffffffb5720a9c>] _raw_spin_unlock_irq+0x2c/0x60 kernel/locking/spinlock.c:202 hardirqs last disabled at (74478): [<ffffffffb57207b5>] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline] hardirqs last disabled at (74478): [<ffffffffb57207b5>] _raw_spin_lock_irqsave+0x85/0xa0 kernel/locking/spinlock.c:162 softirqs last enabled at (74304): [<ffffffffb1e9466c>] __do_softirq kernel/softirq.c:656 [inline] softirqs last enabled at (74304): [<ffffffffb1e9466c>] invoke_softirq kernel/softirq.c:496 [inline] softirqs last enabled at (74304): [<ffffffffb1e9466c>] __irq_exit_rcu+0xdc/0x120 ---truncated---
In the Linux kernel, the following vulnerability has been resolved: scsi: ufs: core: Flush exception handling work when RPM level is zero Ensure that the exception event handling work is explicitly flushed during suspend when the runtime power management level is set to UFS_PM_LVL_0. When the RPM level is zero, the device power mode and link state both remain active. Previously, the UFS core driver bypassed flushing exception event handling jobs in this configuration. This created a race condition where the driver could attempt to access the host controller to handle an exception after the system had already entered a deep power-down state, resulting in a system crash. Explicitly flush this work and disable auto BKOPs before the suspend callback proceeds. This guarantees that pending exception tasks complete and prevents illegal hardware access during the power-down sequence.
In the Linux kernel, the following vulnerability has been resolved: usb: yurex: fix race in probe The bbu member of the descriptor must be set to the value standing for uninitialized values before the URB whose completion handler sets bbu is submitted. Otherwise there is a window during which probing can overwrite already retrieved data.
In the Linux kernel, the following vulnerability has been resolved: ipv4: Fix uninit-value access in __ip_make_skb() KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb() tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL while __ip_make_skb() is running, the function will access icmphdr in the skb even if it is not included. This causes the issue reported by KMSAN. Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL on the socket. Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These are union in struct flowi4 and are implicitly initialized by flowi4_init_output(), but we should not rely on specific union layout. Initialize these explicitly in raw_sendmsg(). [1] BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 ip_finish_skb include/net/ip.h:243 [inline] ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508 raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was created at: slab_post_alloc_hook mm/slub.c:3804 [inline] slab_alloc_node mm/slub.c:3845 [inline] kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888 kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577 __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668 alloc_skb include/linux/skbuff.h:1318 [inline] __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128 ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365 raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
In the Linux kernel, the following vulnerability has been resolved: md/bitmap: fix GPF in write_page caused by resize race A General Protection Fault occurs in write_page() during array resize: RIP: 0010:write_page+0x22b/0x3c0 [md_mod] This is a use-after-free race between bitmap_daemon_work() and __bitmap_resize(). The daemon iterates over `bitmap->storage.filemap` without locking, while the resize path frees that storage via md_bitmap_file_unmap(). `quiesce()` does not stop the md thread, allowing concurrent access to freed pages. Fix by holding `mddev->bitmap_info.mutex` during the bitmap update.