From 4282494a20cdcaf38d553f2c2ff6f252084f979c Mon Sep 17 00:00:00 2001 From: Guo Ren Date: Wed, 4 Jan 2023 21:19:52 -0500 Subject: [PATCH 1/8] locking/qspinlock: Micro-optimize pending state waiting for unlock When we're pending, we only care about lock value. The xchg_tail wouldn't affect the pending state. That means the hardware thread could stay in a sleep state and leaves the rest execution units' resources of pipeline to other hardware threads. This situation is the SMT scenarios in the same core. Not an entering low-power state situation. Of course, the granularity between cores is "cacheline", but the granularity between SMT hw threads of the same core could be "byte" which internal LSU handles. For example, when a hw-thread yields the resources of the core to other hw-threads, this patch could help the hw-thread stay in the sleep state and prevent it from being woken up by other hw-threads xchg_tail. Signed-off-by: Guo Ren Signed-off-by: Guo Ren Signed-off-by: Ingo Molnar Acked-by: Waiman Long Link: https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org Cc: Peter Zijlstra --- kernel/locking/qspinlock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 2b23378775fe..ebe6b8ec7cb3 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) /* * We're pending, wait for the owner to go away. * - * 0,1,1 -> 0,1,0 + * 0,1,1 -> *,1,0 * * this wait loop must be a load-acquire such that we match the * store-release that clears the locked bit and create lock @@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) * barriers. */ if (val & _Q_LOCKED_MASK) - atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK)); + smp_cond_load_acquire(&lock->locked, !VAL); /* * take ownership and clear the pending bit. From b613c7f31476c44316bfac1af7cac714b7d6bef9 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Wed, 25 Jan 2023 19:36:25 -0500 Subject: [PATCH 2/8] locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath A non-first waiter can potentially spin in the for loop of rwsem_down_write_slowpath() without sleeping but fail to acquire the lock even if the rwsem is free if the following sequence happens: Non-first RT waiter First waiter Lock holder ------------------- ------------ ----------- Acquire wait_lock rwsem_try_write_lock(): Set handoff bit if RT or wait too long Set waiter->handoff_set Release wait_lock Acquire wait_lock Inherit waiter->handoff_set Release wait_lock Clear owner Release lock if (waiter.handoff_set) { rwsem_spin_on_owner((); if (OWNER_NULL) goto trylock_again; } trylock_again: Acquire wait_lock rwsem_try_write_lock(): if (first->handoff_set && (waiter != first)) return false; Release wait_lock A non-first waiter cannot really acquire the rwsem even if it mistakenly believes that it can spin on OWNER_NULL value. If that waiter happens to be an RT task running on the same CPU as the first waiter, it can block the first waiter from acquiring the rwsem leading to live lock. Fix this problem by making sure that a non-first waiter cannot spin in the slowpath loop without sleeping. Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent") Signed-off-by: Waiman Long Signed-off-by: Ingo Molnar Tested-by: Mukesh Ojha Reviewed-by: Mukesh Ojha Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230126003628.365092-2-longman@redhat.com --- kernel/locking/rwsem.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 44873594de03..be2df9ea7c30 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -624,18 +624,16 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem, */ if (first->handoff_set && (waiter != first)) return false; - - /* - * First waiter can inherit a previously set handoff - * bit and spin on rwsem if lock acquisition fails. - */ - if (waiter == first) - waiter->handoff_set = true; } new = count; if (count & RWSEM_LOCK_MASK) { + /* + * A waiter (first or not) can set the handoff bit + * if it is an RT task or wait in the wait queue + * for too long. + */ if (has_handoff || (!rt_task(waiter->task) && !time_after(jiffies, waiter->timeout))) return false; @@ -651,11 +649,12 @@ static inline bool rwsem_try_write_lock(struct rw_semaphore *sem, } while (!atomic_long_try_cmpxchg_acquire(&sem->count, &count, new)); /* - * We have either acquired the lock with handoff bit cleared or - * set the handoff bit. + * We have either acquired the lock with handoff bit cleared or set + * the handoff bit. Only the first waiter can have its handoff_set + * set here to enable optimistic spinning in slowpath loop. */ if (new & RWSEM_FLAG_HANDOFF) { - waiter->handoff_set = true; + first->handoff_set = true; lockevent_inc(rwsem_wlock_handoff); return false; } From 3f5245538a1964ae186ab7e1636020a41aa63143 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Wed, 25 Jan 2023 19:36:26 -0500 Subject: [PATCH 3/8] locking/rwsem: Disable preemption in all down_read*() and up_read() code paths Commit: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") ... assumes that when the owner field is changed to NULL, the lock will become free soon. But commit: 48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem lock") ... disabled preemption when acquiring rwsem for write. However, preemption has not yet been disabled when acquiring a read lock on a rwsem. So a reader can add a RWSEM_READER_BIAS to count without setting owner to signal a reader, got preempted out by a RT task which then spins in the writer slowpath as owner remains NULL leading to live lock. One easy way to fix this problem is to disable preemption at all the down_read*() and up_read() code paths as implemented in this patch. Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") Reported-by: Mukesh Ojha Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20230126003628.365092-3-longman@redhat.com --- kernel/locking/rwsem.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index be2df9ea7c30..84d5b649b95f 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -1091,7 +1091,7 @@ queue: /* Ordered by sem->wait_lock against rwsem_mark_wake(). */ break; } - schedule(); + schedule_preempt_disabled(); lockevent_inc(rwsem_sleep_reader); } @@ -1253,14 +1253,20 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) */ static inline int __down_read_common(struct rw_semaphore *sem, int state) { + int ret = 0; long count; + preempt_disable(); if (!rwsem_read_trylock(sem, &count)) { - if (IS_ERR(rwsem_down_read_slowpath(sem, count, state))) - return -EINTR; + if (IS_ERR(rwsem_down_read_slowpath(sem, count, state))) { + ret = -EINTR; + goto out; + } DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem); } - return 0; +out: + preempt_enable(); + return ret; } static inline void __down_read(struct rw_semaphore *sem) @@ -1280,19 +1286,23 @@ static inline int __down_read_killable(struct rw_semaphore *sem) static inline int __down_read_trylock(struct rw_semaphore *sem) { + int ret = 0; long tmp; DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem); + preempt_disable(); tmp = atomic_long_read(&sem->count); while (!(tmp & RWSEM_READ_FAILED_MASK)) { if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, tmp + RWSEM_READER_BIAS)) { rwsem_set_reader_owned(sem); - return 1; + ret = 1; + break; } } - return 0; + preempt_enable(); + return ret; } /* @@ -1334,6 +1344,7 @@ static inline void __up_read(struct rw_semaphore *sem) DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem); DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem); + preempt_disable(); rwsem_clear_reader_owned(sem); tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count); DEBUG_RWSEMS_WARN_ON(tmp < 0, sem); @@ -1342,6 +1353,7 @@ static inline void __up_read(struct rw_semaphore *sem) clear_nonspinnable(sem); rwsem_wake(sem); } + preempt_enable(); } /* @@ -1661,6 +1673,12 @@ void down_read_non_owner(struct rw_semaphore *sem) { might_sleep(); __down_read(sem); + /* + * The owner value for a reader-owned lock is mostly for debugging + * purpose only and is not critical to the correct functioning of + * rwsem. So it is perfectly fine to set it in a preempt-enabled + * context here. + */ __rwsem_set_reader_owned(sem, NULL); } EXPORT_SYMBOL(down_read_non_owner); From 1d61659ced6bd8881cf2fb5cbcb28f9541fc7430 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Wed, 25 Jan 2023 19:36:27 -0500 Subject: [PATCH 4/8] locking/rwsem: Disable preemption in all down_write*() and up_write() code paths The previous patch has disabled preemption in all the down_read() and up_read() code paths. For symmetry, this patch extends commit: 48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem lock") ... to have preemption disabled in all the down_write() and up_write() code paths, including downgrade_write(). Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20230126003628.365092-4-longman@redhat.com --- kernel/locking/rwsem.c | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 84d5b649b95f..acb5a50309a1 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -256,16 +256,13 @@ static inline bool rwsem_read_trylock(struct rw_semaphore *sem, long *cntp) static inline bool rwsem_write_trylock(struct rw_semaphore *sem) { long tmp = RWSEM_UNLOCKED_VALUE; - bool ret = false; - preempt_disable(); if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp, RWSEM_WRITER_LOCKED)) { rwsem_set_owner(sem); - ret = true; + return true; } - preempt_enable(); - return ret; + return false; } /* @@ -716,7 +713,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) return false; } - preempt_disable(); /* * Disable preemption is equal to the RCU read-side crital section, * thus the task_strcut structure won't go away. @@ -728,7 +724,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) if ((flags & RWSEM_NONSPINNABLE) || (owner && !(flags & RWSEM_READER_OWNED) && !owner_on_cpu(owner))) ret = false; - preempt_enable(); lockevent_cond_inc(rwsem_opt_fail, !ret); return ret; @@ -828,8 +823,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) int loop = 0; u64 rspin_threshold = 0; - preempt_disable(); - /* sem->wait_lock should not be held when doing optimistic spinning */ if (!osq_lock(&sem->osq)) goto done; @@ -937,7 +930,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) } osq_unlock(&sem->osq); done: - preempt_enable(); lockevent_cond_inc(rwsem_opt_fail, !taken); return taken; } @@ -1178,15 +1170,12 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state) if (waiter.handoff_set) { enum owner_state owner_state; - preempt_disable(); owner_state = rwsem_spin_on_owner(sem); - preempt_enable(); - if (owner_state == OWNER_NULL) goto trylock_again; } - schedule(); + schedule_preempt_disabled(); lockevent_inc(rwsem_sleep_writer); set_current_state(state); trylock_again: @@ -1310,12 +1299,15 @@ static inline int __down_read_trylock(struct rw_semaphore *sem) */ static inline int __down_write_common(struct rw_semaphore *sem, int state) { + int ret = 0; + + preempt_disable(); if (unlikely(!rwsem_write_trylock(sem))) { if (IS_ERR(rwsem_down_write_slowpath(sem, state))) - return -EINTR; + ret = -EINTR; } - - return 0; + preempt_enable(); + return ret; } static inline void __down_write(struct rw_semaphore *sem) @@ -1330,8 +1322,14 @@ static inline int __down_write_killable(struct rw_semaphore *sem) static inline int __down_write_trylock(struct rw_semaphore *sem) { + int ret; + + preempt_disable(); DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem); - return rwsem_write_trylock(sem); + ret = rwsem_write_trylock(sem); + preempt_enable(); + + return ret; } /* @@ -1374,9 +1372,9 @@ static inline void __up_write(struct rw_semaphore *sem) preempt_disable(); rwsem_clear_owner(sem); tmp = atomic_long_fetch_add_release(-RWSEM_WRITER_LOCKED, &sem->count); - preempt_enable(); if (unlikely(tmp & RWSEM_FLAG_WAITERS)) rwsem_wake(sem); + preempt_enable(); } /* @@ -1394,11 +1392,13 @@ static inline void __downgrade_write(struct rw_semaphore *sem) * write side. As such, rely on RELEASE semantics. */ DEBUG_RWSEMS_WARN_ON(rwsem_owner(sem) != current, sem); + preempt_disable(); tmp = atomic_long_fetch_add_release( -RWSEM_WRITER_LOCKED+RWSEM_READER_BIAS, &sem->count); rwsem_set_reader_owned(sem); if (tmp & RWSEM_FLAG_WAITERS) rwsem_downgrade_wake(sem); + preempt_enable(); } #else /* !CONFIG_PREEMPT_RT */ From 50fd4d5e6914b60ea6f89c2cbff7a07799414c62 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Mon, 16 Jan 2023 17:34:46 +0100 Subject: [PATCH 5/8] x86/PAT: Use try_cmpxchg() in set_page_memtype() Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in set_page_memtype. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20230116163446.4734-1-ubizjak@gmail.com --- arch/x86/mm/pat/memtype.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index fb4b1b5e0dea..6d1ba2dda35d 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -159,10 +159,10 @@ static inline void set_page_memtype(struct page *pg, break; } + old_flags = READ_ONCE(pg->flags); do { - old_flags = pg->flags; new_flags = (old_flags & _PGMT_CLEAR_MASK) | memtype_flags; - } while (cmpxchg(&pg->flags, old_flags, new_flags) != old_flags); + } while (!try_cmpxchg(&pg->flags, &old_flags, new_flags)); } #else static inline enum page_cache_mode get_page_memtype(struct page *pg) From 890a0794b34f89fcd90e94ec970ad2bc18b70e73 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Mon, 16 Jan 2023 17:25:22 +0100 Subject: [PATCH 6/8] x86/ACPI/boot: Use try_cmpxchg() in __acpi_{acquire,release}_global_lock() Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in __acpi_{acquire,release}_global_lock(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after CMPXCHG (and related MOV instruction in front of CMPXCHG). Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when CMPXCHG fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE() to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak Signed-off-by: Ingo Molnar Acked-by: Rafael J. Wysocki Link: https://lore.kernel.org/r/20230116162522.4072-1-ubizjak@gmail.com --- arch/x86/kernel/acpi/boot.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 907cc98b1938..4177577c173b 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -1840,23 +1840,23 @@ early_param("acpi_sci", setup_acpi_sci); int __acpi_acquire_global_lock(unsigned int *lock) { - unsigned int old, new, val; + unsigned int old, new; + + old = READ_ONCE(*lock); do { - old = *lock; new = (((old & ~0x3) + 2) + ((old >> 1) & 0x1)); - val = cmpxchg(lock, old, new); - } while (unlikely (val != old)); + } while (!try_cmpxchg(lock, &old, new)); return ((new & 0x3) < 3) ? -1 : 0; } int __acpi_release_global_lock(unsigned int *lock) { - unsigned int old, new, val; + unsigned int old, new; + + old = READ_ONCE(*lock); do { - old = *lock; new = old & ~0x3; - val = cmpxchg(lock, old, new); - } while (unlikely (val != old)); + } while (!try_cmpxchg(lock, &old, new)); return old & 0x1; } From 2edcedcd1dcb542cb09acabda1732de47dd82e68 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Thu, 9 Jun 2022 17:36:33 +0200 Subject: [PATCH 7/8] locking/lockdep: Remove lockdep_init_map_crosslock. The cross-release bits have been removed, lockdep_init_map_crosslock() is a leftover. Remove lockdep_init_map_crosslock. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Reviewed-by: Waiman Long Acked-by: Waiman Long Link: https://lore.kernel.org/r/20220311164457.46461-1-bigeasy@linutronix.de Link: https://lore.kernel.org/r/YqITgY+2aPITu96z@linutronix.de --- include/linux/lockdep.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 1f1099dac3f0..1023f349af71 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -435,7 +435,6 @@ enum xhlock_context_t { XHLOCK_CTX_NR, }; -#define lockdep_init_map_crosslock(m, n, k, s) do {} while (0) /* * To initialize a lockdep_map statically use this macro. * Note that _name must not be NULL. From 3b4863fa5b7dd50dab1b10abbed938efd203752f Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 26 Oct 2022 15:44:07 +0200 Subject: [PATCH 8/8] vduse: Remove include of rwlock.h rwlock.h should not be included directly. Instead linux/splinlock.h should be included. Including it directly will break the RT build. Remove the rwlock.h include. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Acked-by: Michael S. Tsirkin --- drivers/vdpa/vdpa_user/iova_domain.h | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h index 4e0e50e7ac15..173e979b84a9 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.h +++ b/drivers/vdpa/vdpa_user/iova_domain.h @@ -14,7 +14,6 @@ #include #include #include -#include #define IOVA_START_PFN 1