5eede802d4
BugLink: https://bugs.launchpad.net/bugs/2060704

On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> That said - I think as a proof of concept and "look, with this we get
>> the expected scheduling event counts", that patch is perfect. I think
>> you more than proved the concept.
>
> There is certainly quite some analysis work to do to make this a one to
> one replacement.
>
> With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> is pretty much on par with the current mainline variants (NONE/FULL),
> but the memtier benchmark makes a massive dent.
>
> It sports a whopping 10% regression with the LAZY mode versus the mainline
> NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>
> That benchmark is really sensitive to the preemption model. With current
> mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20%
> performance drop versus preempt=NONE.

That 20% was a tired pilot error. The real number is in the 5% ballpark.

> I have no clue what's going on there yet, but that shows that there is
> obviously quite some work ahead to get this sorted.

It took some head scratching to figure that out. The initial fix broke
the handling of the hog issue, i.e. the problem that Ankur tried to
solve, but I hacked up a "solution" for that too.

With that the memtier benchmark is roughly back to the mainline numbers,
but my throughput benchmark know-how is pretty close to zero, so that
should be looked at by people who actually understand these things.

Likewise the hog prevention is just at the PoC level and clearly beyond
my knowledge of scheduler details: It unconditionally forces a
reschedule when the looping task is not responding to a lazy reschedule
request before the next tick. IOW it forces a reschedule on the second
tick, which is obviously different from the cond_resched()/might_sleep()
behaviour.

The changes vs. the original PoC aside from the bug and thinko fixes:

    1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
       lazy preempt bit as the trace_entry::flags field is full already.

       That obviously breaks the tracer ABI, but if we go there then
       this needs to be fixed. Steven?

    2) debugfs file to validate that loops can be force preempted w/o
       cond_resched()

       The usage is:

       # taskset -c 1 bash
       # echo 1 > /sys/kernel/debug/sched/hog &
       # echo 1 > /sys/kernel/debug/sched/hog &
       # echo 1 > /sys/kernel/debug/sched/hog &

       top shows ~33% CPU for each of the hogs and tracing confirms that
       the crude hack in the scheduler tick works:

            bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
            bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
            bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
            bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
            bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr

       The 'l' instead of the usual 'N' reflects that the lazy resched
       bit is set. That makes __update_curr() invoke resched_curr()
       instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
       and folds it into preempt_count so that preemption happens at the
       next possible point, i.e. either in return from interrupt or at
       the next preempt_enable().

That's as much as I wanted to demonstrate and I'm not going to spend more
cycles on it as I already have too many other things in flight and the
resulting scheduler woes are clearly outside of my expertise.

Though definitely I'm putting a permanent NAK in place for any attempts
to duct tape the preempt=NONE model any further by sprinkling more
cond*() and whatever warts around.

Thanks,

        tglx

[tglx: s@CONFIG_PREEMPT_AUTO@CONFIG_PREEMPT_BUILD_AUTO@ ]

Link: https://lore.kernel.org/all/87jzshhexi.ffs@tglx/
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kevin Becker <kevin.becker@canonical.com>
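The hog prevention described in the message (a lazy request on one tick, a forced reschedule if it is still pending on the next) can be sketched as a small user-space model. This is only an illustration of the described behaviour, not the kernel code: `struct task_sim`, `tick_sim()` and `ticks_until_forced()` are hypothetical names, and the two booleans merely stand in for the lazy and the regular need-resched bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two reschedule bits of a task. */
struct task_sim {
	bool need_resched_lazy;	/* lazy request, normally honoured on return to user space */
	bool need_resched;	/* forced request, honoured at the next preemption point */
};

/*
 * Model of one scheduler tick hitting a task that should yield the CPU.
 * Returns true if this tick escalated to a forced reschedule.
 */
static bool tick_sim(struct task_sim *t)
{
	assert(t != NULL);

	if (t->need_resched_lazy) {
		/* The lazy request from an earlier tick was ignored: force it. */
		t->need_resched = true;
		return true;
	}

	/* First tick: only ask politely. */
	t->need_resched_lazy = true;
	return false;
}

/* Count the ticks a task looping in the kernel survives before being forced off. */
static int ticks_until_forced(void)
{
	struct task_sim hog = { false, false };
	int ticks = 0;

	do {
		ticks++;
	} while (!tick_sim(&hog));

	return ticks;
}
```

In this model `ticks_until_forced()` returns 2, matching the "forces a reschedule on the second tick" description; a task that reaches a natural scheduling point in between would clear the lazy bit and never hit the forced path.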
152 lines
5.1 KiB
Plaintext
# SPDX-License-Identifier: GPL-2.0-only

config PREEMPT_NONE_BUILD
	bool

config PREEMPT_VOLUNTARY_BUILD
	bool

config PREEMPT_BUILD
	bool
	select PREEMPTION
	select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK

config PREEMPT_BUILD_AUTO
	bool
	select PREEMPT_BUILD

config HAVE_PREEMPT_AUTO
	bool

choice
	prompt "Preemption Model"
	default PREEMPT_NONE

config PREEMPT_NONE
	bool "No Forced Preemption (Server)"
	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
	help
	  This is the traditional Linux preemption model, geared towards
	  throughput. It will still provide good latencies most of the
	  time, but there are no guarantees and occasional longer delays
	  are possible.

	  Select this option if you are building a kernel for a server or
	  scientific/computation system, or if you want to maximize the
	  raw processing power of the kernel, irrespective of scheduling
	  latencies.

config PREEMPT_VOLUNTARY
	bool "Voluntary Kernel Preemption (Desktop)"
	depends on !ARCH_NO_PREEMPT
	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
	help
	  This option reduces the latency of the kernel by adding more
	  "explicit preemption points" to the kernel code. These new
	  preemption points have been selected to reduce the maximum
	  latency of rescheduling, providing faster application reactions,
	  at the cost of slightly lower throughput.

	  This allows reaction to interactive events by allowing a
	  low priority process to voluntarily preempt itself even if it
	  is in kernel mode executing a system call. This allows
	  applications to run more 'smoothly' even when the system is
	  under load.

	  Select this if you are building a kernel for a desktop system.

config PREEMPT
	bool "Preemptible Kernel (Low-Latency Desktop)"
	depends on !ARCH_NO_PREEMPT
	select PREEMPT_BUILD
	help
	  This option reduces the latency of the kernel by making
	  all kernel code (that is not executing in a critical section)
	  preemptible. This allows reaction to interactive events by
	  permitting a low priority process to be preempted involuntarily
	  even if it is in kernel mode executing a system call and would
	  otherwise not be about to reach a natural preemption point.
	  This allows applications to run more 'smoothly' even when the
	  system is under load, at the cost of slightly lower throughput
	  and a slight runtime overhead to kernel code.

	  Select this if you are building a kernel for a desktop or
	  embedded system with latency requirements in the milliseconds
	  range.

config PREEMPT_AUTO
	bool "Automagic preemption mode with runtime tweaking support"
	depends on HAVE_PREEMPT_AUTO
	select PREEMPT_BUILD_AUTO
	help
	  Add some sensible blurb here

config PREEMPT_RT
	bool "Fully Preemptible Kernel (Real-Time)"
	depends on EXPERT && ARCH_SUPPORTS_RT
	select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
	select PREEMPTION
	help
	  This option turns the kernel into a real-time kernel by replacing
	  various locking primitives (spinlocks, rwlocks, etc.) with
	  preemptible priority-inheritance aware variants, enforcing
	  interrupt threading and introducing mechanisms to break up long
	  non-preemptible sections. This makes the kernel, except for very
	  low level and critical code paths (entry code, scheduler, low
	  level interrupt handling) fully preemptible and brings most
	  execution contexts under scheduler control.

	  Select this if you are building a kernel for systems which
	  require real-time guarantees.

endchoice

config PREEMPT_COUNT
	bool

config PREEMPTION
	bool
	select PREEMPT_COUNT

config PREEMPT_DYNAMIC
	bool "Preemption behaviour defined on boot"
	depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
	select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
	select PREEMPT_BUILD
	default y if HAVE_PREEMPT_DYNAMIC_CALL
	help
	  This option allows the preemption model to be selected with a
	  kernel command line parameter, overriding the default preemption
	  model chosen at compile time.

	  The feature is primarily interesting for Linux distributions which
	  provide a pre-built kernel binary, as it reduces the number of
	  kernel flavors they offer while still covering different use cases.

	  The runtime overhead is negligible with HAVE_STATIC_CALL_INLINE enabled
	  but if runtime patching is not available for the specific architecture
	  then the potential overhead should be considered.

	  Select this if you want the same pre-built kernel to be used for
	  both Server and Desktop workloads.

config SCHED_CORE
	bool "Core Scheduling for SMT"
	depends on SCHED_SMT
	help
	  This option permits Core Scheduling, a means of coordinated task
	  selection across SMT siblings. When enabled -- see
	  prctl(PR_SCHED_CORE) -- task selection ensures that all SMT siblings
	  will execute a task from the same 'core group', forcing idle when no
	  matching task is found.

	  Uses of this feature include:
	   - mitigation of some (not all) SMT side channels;
	   - limiting SMT interference to improve determinism and/or performance.

	  SCHED_CORE is disabled by default. When it is enabled and unused,
	  which is the likely usage by Linux distributions, there should
	  be no measurable impact on performance.
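The SCHED_CORE help text points at prctl(PR_SCHED_CORE) as the user-space interface. A minimal sketch of how a thread might place itself in a new 'core group' and read back its cookie follows. The helper names `core_sched_create_self()` and `core_sched_get_self()` are hypothetical; the fallback constant values are taken from the kernel's uapi prctl header and are only defined here in case the installed headers are older. The call is expected to fail (for example with EINVAL or ENODEV) on kernels built without SCHED_CORE or on machines without SMT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values as in the kernel's uapi prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE			62
#define PR_SCHED_CORE_GET		0
#define PR_SCHED_CORE_CREATE		1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD	0
#endif

/*
 * Hypothetical helper: give the calling thread a fresh core-scheduling
 * cookie. Returns 0 on success or a negative errno value.
 */
static int core_sched_create_self(void)
{
	/* pid 0 means "the calling task"; the last argument must be 0. */
	if (prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_CREATE, 0UL,
		  (unsigned long)PR_SCHED_CORE_SCOPE_THREAD, 0UL) == -1)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: read the calling thread's cookie into *cookie.
 * A cookie of 0 means the thread is not in any core group.
 */
static int core_sched_get_self(unsigned long long *cookie)
{
	assert(cookie != NULL);

	if (prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_GET, 0UL,
		  (unsigned long)PR_SCHED_CORE_SCOPE_THREAD,
		  (unsigned long)cookie) == -1)
		return -errno;
	return 0;
}
```

A caller would typically invoke `core_sched_create_self()` once before starting untrusted work and treat a negative return value as "core scheduling unavailable" rather than as a fatal error.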