Logo
-

Byte Open Security

(ByteOS Network)

Log In

Sign Up

ByteOS

Security
Vulnerability Details
Registries
Custom Views
Weaknesses
Attack Patterns
Filters & Tools
Vulnerability Details :

CVE-2026-46223

Summary
Assigner-Linux
Assigner Org ID-416baaa9-dc9f-4396-8d5f-8c081fb06d67
Published At-28 May, 2026 | 09:40
Updated At-28 May, 2026 | 09:40
Rejected At-
Credits

cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

In the Linux kernel, the following vulnerability has been resolved: cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup. [1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") [2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") [3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") [4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition") [5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context") [1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously. The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug. The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained. Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that. cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger. This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead. v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there.

Vendors
-
Not available
Products
-
Metrics (CVSS)
VersionBase scoreBase severityVector
Weaknesses
Attack Patterns
Solution/Workaround
References
HyperlinkResource Type
EPSS History
Score
Latest Score
-
N/A
No data available for selected date range
Percentile
Latest Percentile
-
N/A
No data available for selected date range
Stakeholder-Specific Vulnerability Categorization (SSVC)
▼Common Vulnerabilities and Exposures (CVE)
cve.org
Assigner:Linux
Assigner Org ID:416baaa9-dc9f-4396-8d5f-8c081fb06d67
Published At:28 May, 2026 | 09:40
Updated At:28 May, 2026 | 09:40
Rejected At:
▼CVE Numbering Authority (CNA)
cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

In the Linux kernel, the following vulnerability has been resolved: cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup. [1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") [2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") [3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") [4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition") [5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context") [1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously. The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug. The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained. Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that. cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger. This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead. v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there.

Affected Products
Vendor
Linux Kernel Organization, IncLinux
Product
Linux
Repo
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Program Files
  • include/linux/cgroup-defs.h
  • kernel/cgroup/cgroup.c
Default Status
unaffected
Versions
Affected
  • From 1b164b876c36c3eb5561dd9b37702b04401b0166 before 33fa2e6b1507a0a377a151a8826438bedad1d0b0 (git)
  • From 1b164b876c36c3eb5561dd9b37702b04401b0166 before 93618edf753838a727dbff63c7c291dee22d656b (git)
  • 78c72bce4a87819126211c0d24e18350010604fb (git)
  • From 6.19.12 before 6.20 (semver)
Vendor
Linux Kernel Organization, IncLinux
Product
Linux
Repo
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Program Files
  • include/linux/cgroup-defs.h
  • kernel/cgroup/cgroup.c
Default Status
affected
Versions
Affected
  • 7.0
Unaffected
  • From 0 before 7.0 (semver)
  • From 7.0.9 through 7.0.* (semver)
  • From 7.1-rc3 through * (original_commit_for_fix)
Metrics
VersionBase scoreBase severityVector
Metrics Other Info
Impacts
CAPEC IDDescription
Solutions

Configurations

Workarounds

Exploits

Credits

Timeline
EventDate
Replaced By

Rejected Reason

References
HyperlinkResource
https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0
N/A
https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b
N/A
Hyperlink: https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0
Resource: N/A
Hyperlink: https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b
Resource: N/A
Information is not available yet
▼National Vulnerability Database (NVD)
nvd.nist.gov
Source:416baaa9-dc9f-4396-8d5f-8c081fb06d67
Published At:28 May, 2026 | 10:16
Updated At:28 May, 2026 | 10:16

In the Linux kernel, the following vulnerability has been resolved: cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup. [1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") [2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") [3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") [4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition") [5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context") [1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously. The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug. The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained. Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that. cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger. This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead. v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there.

CISA Catalog
Date AddedDue DateVulnerability NameRequired Action
N/A
Date Added: N/A
Due Date: N/A
Vulnerability Name: N/A
Required Action: N/A
Metrics
TypeVersionBase scoreBase severityVector
CPE Matches

Evaluator Description

Evaluator Impact

Evaluator Solution

Vendor Statements

References
HyperlinkSourceResource
https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0416baaa9-dc9f-4396-8d5f-8c081fb06d67
N/A
https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b416baaa9-dc9f-4396-8d5f-8c081fb06d67
N/A
Hyperlink: https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0
Source: 416baaa9-dc9f-4396-8d5f-8c081fb06d67
Resource: N/A
Hyperlink: https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b
Source: 416baaa9-dc9f-4396-8d5f-8c081fb06d67
Resource: N/A

Change History

0
Information is not available yet

Similar CVEs

0Records found

Details not found