Files
Thomas Gleixner 5eede802d4 sched: define TIF_ALLOW_RESCHED
BugLink: https://bugs.launchpad.net/bugs/2060704

On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
>> That said - I think as a proof of concept and "look, with this we get
>> the expected scheduling event counts", that patch is perfect. I think
>> you more than proved the concept.
>
> There is certainly quite some analyis work to do to make this a one to
> one replacement.
>
> With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> is pretty much on par with the current mainline variants (NONE/FULL),
> but the memtier benchmark makes a massive dent.
>
> It sports a whopping 10% regression with the LAZY mode versus the mainline
> NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
>
> That benchmark is really sensitive to the preemption model. With current
> mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20%
> performance drop versus preempt=NONE.

That 20% was a tired pilot error. The real number is in the 5% ballpark.

> I have no clue what's going on there yet, but that shows that there is
> obviously quite some work ahead to get this sorted.

It took some head scratching to figure that out. The initial fix broke
the handling of the hog issue, i.e. the problem that Ankur tried to
solve, but I hacked up a "solution" for that too.

With that the memtier benchmark is roughly back to the mainline numbers,
but my throughput benchmark know how is pretty close to zero, so that
should be looked at by people who actually understand these things.

Likewise the hog prevention is just at the PoC level and clearly beyond
my knowledge of scheduler details: It unconditionally forces a
reschedule when the looping task is not responding to a lazy reschedule
request before the next tick. IOW it forces a reschedule on the second
tick, which is obviously different from the cond_resched()/might_sleep()
behaviour.

The changes vs. the original PoC aside of the bug and thinko fixes:

    1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
       lazy preempt bit as the trace_entry::flags field is full already.

       That obviously breaks the tracer ABI, but if we go there then
       this needs to be fixed. Steven?

    2) debugfs file to validate that loops can be force preempted w/o
       cond_resched()

       The usage is:

       # taskset -c 1 bash
       # echo 1 > /sys/kernel/debug/sched/hog &
       # echo 1 > /sys/kernel/debug/sched/hog &
       # echo 1 > /sys/kernel/debug/sched/hog &

       top shows ~33% CPU for each of the hogs and tracing confirms that
       the crude hack in the scheduler tick works:

            bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
            bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
            bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
            bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
            bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
            bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr

       The 'l' instead of the usual 'N' reflects that the lazy resched
       bit is set. That makes __update_curr() invoke resched_curr()
       instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
       and folds it into preempt_count so that preemption happens at the
       next possible point, i.e. either in return from interrupt or at
       the next preempt_enable().

That's as much as I wanted to demonstrate and I'm not going to spend
more cycles on it as I have already too many other things on flight and
the resulting scheduler woes are clearly outside of my expertice.

Though definitely I'm putting a permanent NAK in place for any attempts
to duct tape the preempt=NONE model any further by sprinkling more
cond*() and whatever warts around.

Thanks,

        tglx

[tglx: s@CONFIG_PREEMPT_AUTO@CONFIG_PREEMPT_BUILD_AUTO@ ]

Link: https://lore.kernel.org/all/87jzshhexi.ffs@tglx/
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Kevin Becker <kevin.becker@canonical.com>
2025-06-25 11:17:16 -04:00

152 lines
5.1 KiB
Plaintext

# SPDX-License-Identifier: GPL-2.0-only
config PREEMPT_NONE_BUILD
bool
config PREEMPT_VOLUNTARY_BUILD
bool
config PREEMPT_BUILD
bool
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
config PREEMPT_BUILD_AUTO
bool
select PREEMPT_BUILD
config HAVE_PREEMPT_AUTO
bool
choice
prompt "Preemption Model"
default PREEMPT_NONE
config PREEMPT_NONE
bool "No Forced Preemption (Server)"
select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
help
This is the traditional Linux preemption model, geared towards
throughput. It will still provide good latencies most of the
time, but there are no guarantees and occasional longer delays
are possible.
Select this option if you are building a kernel for a server or
scientific/computation system, or if you want to maximize the
raw processing power of the kernel, irrespective of scheduling
latencies.
config PREEMPT_VOLUNTARY
bool "Voluntary Kernel Preemption (Desktop)"
depends on !ARCH_NO_PREEMPT
select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
help
This option reduces the latency of the kernel by adding more
"explicit preemption points" to the kernel code. These new
preemption points have been selected to reduce the maximum
latency of rescheduling, providing faster application reactions,
at the cost of slightly lower throughput.
This allows reaction to interactive events by allowing a
low priority process to voluntarily preempt itself even if it
is in kernel mode executing a system call. This allows
applications to run more 'smoothly' even when the system is
under load.
Select this if you are building a kernel for a desktop system.
config PREEMPT
bool "Preemptible Kernel (Low-Latency Desktop)"
depends on !ARCH_NO_PREEMPT
select PREEMPT_BUILD
help
This option reduces the latency of the kernel by making
all kernel code (that is not executing in a critical section)
preemptible. This allows reaction to interactive events by
permitting a low priority process to be preempted involuntarily
even if it is in kernel mode executing a system call and would
otherwise not be about to reach a natural preemption point.
This allows applications to run more 'smoothly' even when the
system is under load, at the cost of slightly lower throughput
and a slight runtime overhead to kernel code.
Select this if you are building a kernel for a desktop or
embedded system with latency requirements in the milliseconds
range.
config PREEMPT_AUTO
bool "Automagic preemption mode with runtime tweaking support"
depends on HAVE_PREEMPT_AUTO
select PREEMPT_BUILD_AUTO
help
Add some sensible blurb here
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO
select PREEMPTION
help
This option turns the kernel into a real-time kernel by replacing
various locking primitives (spinlocks, rwlocks, etc.) with
preemptible priority-inheritance aware variants, enforcing
interrupt threading and introducing mechanisms to break up long
non-preemptible sections. This makes the kernel, except for very
low level and critical code paths (entry code, scheduler, low
level interrupt handling) fully preemptible and brings most
execution contexts under scheduler control.
Select this if you are building a kernel for systems which
require real-time guarantees.
endchoice
config PREEMPT_COUNT
bool
config PREEMPTION
bool
select PREEMPT_COUNT
config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
select PREEMPT_BUILD
default y if HAVE_PREEMPT_DYNAMIC_CALL
help
This option allows to define the preemption model on the kernel
command line parameter and thus override the default preemption
model defined during compile time.
The feature is primarily interesting for Linux distributions which
provide a pre-built kernel binary to reduce the number of kernel
flavors they offer while still offering different usecases.
The runtime overhead is negligible with HAVE_STATIC_CALL_INLINE enabled
but if runtime patching is not available for the specific architecture
then the potential overhead should be considered.
Interesting if you want the same pre-built kernel should be used for
both Server and Desktop workloads.
config SCHED_CORE
bool "Core Scheduling for SMT"
depends on SCHED_SMT
help
This option permits Core Scheduling, a means of coordinated task
selection across SMT siblings. When enabled -- see
prctl(PR_SCHED_CORE) -- task selection ensures that all SMT siblings
will execute a task from the same 'core group', forcing idle when no
matching task is found.
Use of this feature includes:
- mitigation of some (not all) SMT side channels;
- limiting SMT interference to improve determinism and/or performance.
SCHED_CORE is default disabled. When it is enabled and unused,
which is the likely usage by Linux distributions, there should
be no measurable impact on performance.