Commit Graph

1285 Commits

Damien Le Moal
5d1966b61d FROMGIT: block: Introduce bio_needs_zone_write_plugging()
In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio) == true) operation directed at a zoned block device
using zone write plugging, that is, a block device with a disk that has
a zone write plug hash table.

This helper allows simplifying the check on entry to blk_zone_plug_bio()
and is used to protect calls to it for blk-mq devices and DM devices.
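The intended check can be sketched as a small userspace model in C. The names mirror the commit message; the real helper operates on struct bio and struct gendisk in the kernel, so this is only an illustrative model:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace model of the check described above; the real
 * helper works on struct bio / struct gendisk in the kernel. */
struct model_disk {
    bool has_zone_wplugs;   /* disk has a zone write plug hash table */
};

struct model_bio {
    bool is_write;          /* op_is_write(bio) */
    struct model_disk *disk;
};

/* Returns true when the BIO must be handled by blk_zone_plug_bio(). */
static bool bio_needs_zone_write_plugging(const struct model_bio *bio)
{
    return bio->is_write && bio->disk->has_zone_wplugs;
}
```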

Fixes: f211268ed1 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 417517944
Change-Id: I9628b14d4fe0e1f964d4036178fbc6ee49b3be78
(cherry picked from commit bf7a8b5cbbb2d531f3336be2186af0c5590d157c git://git.kernel.dk/linux-block for-next)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2025-06-26 14:10:08 -07:00
Pankaj Raghav
30ce6652ee ANDROID: block: Support npo2 zone sizes
Checking if a given sector is aligned to the start of a zone is a common
operation performed for zoned devices. Add the bdev_is_zone_start()
helper to check for this instead of open-coding it everywhere.

Convert the calculations on zone size to be generic instead of relying on
power-of-2 (po2) arithmetic in the block layer, using the helpers
wherever possible.

The only hot path affected by this change for zoned devices with a po2
zone size is blk_check_zone_append(), but the bdev_is_zone_start()
helper is used to optimize the calculation for po2 zone sizes.

Finally, allow zoned devices with non po2 zone sizes provided that their
zone capacity and zone size are equal. The main motivation to allow
zoned devices with non po2 zone size is to remove the unmapped LBA
between zone capacity and zone size for devices that cannot have a po2
zone capacity.
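The generic alignment check described above can be sketched in userspace C. This is an illustrative model of the po2 fast path versus the generic modulo, not the kernel helper itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

static bool is_power_of_2(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Sketch of the generic zone-start check: a plain modulo works for any
 * zone size, while po2 zone sizes keep the cheap mask-based test.
 * Helper name is modeled on the commit message. */
static bool bdev_is_zone_start(sector_t sector, sector_t zone_sectors)
{
    if (is_power_of_2(zone_sectors))
        return (sector & (zone_sectors - 1)) == 0;  /* po2 fast path */
    return sector % zone_sectors == 0;              /* generic npo2 path */
}
```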

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Bug: 269471019
Bug: 415836627
Link: https://lore.kernel.org/linux-block/20220923173618.6899-4-p.raghav@samsung.com/
Change-Id: I2ecc186d7b14f5508b6abfe9821526d39a21d7e4
[ bvanassche: ported this patch to kernel 6.12 ]
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2025-05-16 15:54:49 -07:00
Christoph Hellwig
574e0848d2 UPSTREAM: block: add a queue_limits_commit_update_frozen helper
Add a helper that freezes the queue, updates the queue limits, and
unfreezes the queue, and convert all open-coded versions of that pattern
to the new helper.
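The freeze/update/unfreeze pattern the helper folds together can be modeled in userspace C. This is a sketch under the assumption of a single-threaded caller; the real API operates on struct request_queue and struct queue_limits:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the pattern: freeze, apply limits, unfreeze. */
struct mq { bool frozen; unsigned max_sectors; };

static void freeze(struct mq *q)   { q->frozen = true; }
static void unfreeze(struct mq *q) { q->frozen = false; }

static void commit_update(struct mq *q, unsigned max_sectors)
{
    /* limits may only change while no I/O is in flight */
    assert(q->frozen);
    q->max_sectors = max_sectors;
}

/* The new helper folds the three open-coded steps into one call. */
static void commit_update_frozen(struct mq *q, unsigned max_sectors)
{
    freeze(q);
    commit_update(q, max_sectors);
    unfreeze(q);
}
```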

Change-Id: I38b3dae3012fdbeaccbc12d17e1b19c7f31db8fa
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250110054726.1499538-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit aa427d7b73b196f657d6d2cf0e94eff6b883fdef)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2025-05-16 15:54:49 -07:00
Ming Lei
26febb7cde UPSTREAM: block: track queue dying state automatically for modeling queue freeze lockdep
Now we only verify the outermost freeze & unfreeze in the current
context in the case that !q->mq_freeze_depth, so it is reliable to save
the queue dying state when we want to lock the freeze queue, since the
state is now a per-task variable.

Change-Id: Ic11e09d92c00c4b5080fbe4cd7cfa50e808096f7
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241127135133.3952153-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 415836627
(cherry picked from commit f6661b1d0525f3764596a1b65eeed9e75aecafa7)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2025-05-16 12:17:58 +00:00
Ming Lei
752dff69ae UPSTREAM: block: track disk DEAD state automatically for modeling queue freeze lockdep
Now we only verify the outermost freeze & unfreeze in the current
context in the case that !q->mq_freeze_depth, so it is reliable to save
the disk DEAD state when we want to lock the freeze queue, since the
state is now a per-task variable.

Doing it this way eliminates many false positives seen when the freeze
queue is called before adding the disk [1].

[1] https://lore.kernel.org/linux-block/6741f6b2.050a0220.1cc393.0017.GAE@google.com/

Change-Id: I1b0331f5863865d05ac2d719cd314addfed23838
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241127135133.3952153-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 415836627
(cherry picked from commit 6f491a8d4b92d1a840fd9209cba783c84437d0b7)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2025-05-16 12:17:57 +00:00
Christoph Hellwig
24f685a927 UPSTREAM: block: add a rq_list type
Replace the semi-open-coded request list helpers with a proper rq_list
type that mirrors bio_list and has head and tail pointers.  Besides
better type safety, this allows inserting at the tail of the
list, which will be useful soon.
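A minimal userspace sketch of a head/tail list of this shape follows; generic nodes stand in for struct request, and the function names are modeled on the description rather than taken from the kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of an rq_list-style container with head and tail pointers,
 * allowing O(1) insertion at both ends. */
struct node { struct node *next; int val; };
struct rq_list { struct node *head, *tail; };

static void rq_list_add_tail(struct rq_list *l, struct node *n)
{
    n->next = NULL;
    if (l->tail)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
}

static void rq_list_add_head(struct rq_list *l, struct node *n)
{
    n->next = l->head;
    l->head = n;
    if (!l->tail)
        l->tail = n;
}
```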

Change-Id: Ia470736d0468c265f5b61cb9d8a0e5544b6b7b0d
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 415836627
(cherry picked from commit a3396b99990d8b4e5797e7b16fdeb64c15ae97bb)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2025-05-16 12:17:57 +00:00
Greg Kroah-Hartman
75adb09e2f ANDROID: GKI: the "reusachtig" padding sync with android16-6.12
Add the initial set of ABI padding fields in android16-6.12 based on what
is in the android15-6.6 branch.

Bug: 151154716
Change-Id: Icdb394863b2911389bfdced0fd1ea20236ca4ce1
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
2025-05-16 12:17:56 +00:00
Bart Van Assche
996a35040a FROMLIST: dm-zone: Use bdev_*() helper functions where applicable
Improve code readability by using bdev_is_zone_aligned() and
bdev_offset_from_zone_start() where applicable. No functionality
has been changed.

This patch is a reworked version of a patch from Pankaj Raghav.

See also https://lore.kernel.org/linux-block/20220923173618.6899-11-p.raghav@samsung.com/.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Change-Id: Iddeca794a7b695a414cffdaf7442e5595523792f
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Bug: 415836627
Link: https://lore.kernel.org/dm-devel/20250514205033.2108129-1-bvanassche@acm.org/
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2025-05-15 19:45:33 -07:00
Wang Jianzheng
51d17a187f ANDROID: Block: add OEM data to struct gendisk
In order to alleviate the priority inversion issue caused by lock_page,
add oem_data to struct gendisk to store a pointer to its
struct block_device. This allows us to check its priority through a
customized scheduler hook when lock_folio is called.

Bug: 338959088
Bug: 407947260

Change-Id: I118ef11cb89a3fad9a15a2c3b8383d42be0fded4
Signed-off-by: Wang Jianzheng <11134417@vivo.corp-partner.google.com>
(cherry picked from commit feb92ccf10bce90739b5f51cc33d1bd6f16d7fab)
Signed-off-by: ying zuxin <11154159@vivo.com>
2025-04-16 13:39:52 +00:00
Ming Lei
aa91035a5f BACKPORT: block: make segment size limit workable for > 4K PAGE_SIZE
Using PAGE_SIZE as the minimum expected DMA segment size, in
consideration of devices which have a max DMA segment size of < 64k,
means that on 64k PAGE_SIZE systems devices such as eMMC and the Exynos
UFS controller [0] [1] are not able to probe; you can end up with a
probe failure as follows:

WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0

Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new min segment
size when max segment size is < PAGE_SIZE for 16k and 64k base page size systems.

If anyone needs to backport this patch, it depends on the following commits:

	commit 6aeb4f836480 ("block: remove bio_add_pc_page")
	commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep")
	commit b7175e24d6ac ("block: add a dma mapping iterator")
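The new minimum-segment-size rule described above can be sketched in userspace C. The constants and the helper name are illustrative, not taken from the kernel source:

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Sketch of the limit-validation change: when a device's max segment
 * size is below the page size, clamp the minimum expected segment size
 * to min(max_seg_size, seg_boundary_mask + 1) instead of PAGE_SIZE. */
static unsigned long min_segment_size(unsigned long page_size,
                                      unsigned long max_seg_size,
                                      unsigned long seg_boundary_mask)
{
    if (max_seg_size < page_size)
        return MIN(max_seg_size, seg_boundary_mask + 1);
    return page_size;
}
```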

Bug: 399192075

Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Link: https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/ # [0]
Link: https://lore.kernel.org/linux-block/1d55e942-5150-de4c-3a02-c3d066f87028@acm.org/ # [1]
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Keith Busch <kbusch@kernel.org>
Tested-by: Paul Bunyan <pbunyan@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250225022141.2154581-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 889c57066ceee5e9172232da0608a8ac053bb6e5)
Signed-off-by: Sandeep Dhavale <dhavale@google.com>

[dhavale: resolved minor conflict in block/blk.h]

Change-Id: I5fe54dd8c73621259cbd9720b77253d8a2af29c7
2025-03-19 13:28:29 +00:00
Sandeep Dhavale
d76170f735 Revert "ANDROID: block: Support configuring limits below the page size"
This reverts commit 6e8ff6954a.

Bug: 399192075

Changes related to ANDROID's implementation of small segment size for
block device are being reverted in order to backport upstream accepted
solution at
https://lore.kernel.org/linux-block/20250225022141.2154581-1-ming.lei@redhat.com/

Change-Id: If6954197beb13ddac565392c540bdf1ca6795acf
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
2025-03-19 13:28:28 +00:00
Damien Le Moal
4095790470 UPSTREAM: block: Remove zone write plugs when handling native zone append writes
commit a6aa36e957a1bfb5341986dec32d013d23228fe1 upstream.

For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.

However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
   written using zone append or regular write operations, because the
   write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
   will always be failed by blk_zone_wplug_prepare_bio() as the write
   pointer alignment check will fail, even if the user correctly
   accounted for the zone append operations and issued the regular
   writes with a correct sector.

Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular writes using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path will never see again the zone write plug as
disk_get_zone_wplug() will return NULL). Rate-limited warnings are added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.

Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().

To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.
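The fast-path counter check described above can be modeled in userspace C. This is a simplified sketch: the kernel uses an atomic counter and a real hash table, while a fixed array stands in here:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the fast path: with no hashed zone write plugs,
 * the per-disk counter lets the lookup bail out before touching the
 * hash table at all. */
struct zone_wplug { unsigned int zone_no; };

struct model_disk {
    int nr_zone_wplugs;            /* atomic_t in the kernel */
    struct zone_wplug *table[16];  /* stand-in for the hash table */
};

static struct zone_wplug *disk_get_zone_wplug(struct model_disk *d,
                                              unsigned int zone_no)
{
    if (d->nr_zone_wplugs == 0)   /* zone-append-only workload */
        return NULL;
    return d->table[zone_no % 16];
}
```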

Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Change-Id: If7a37be9828e0d59ff68c7b7db4f30a9a10ede89
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2f572c42bb)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2025-03-19 13:28:26 +00:00
Greg Kroah-Hartman
b5fd1cdaf6 Revert "block: Remove zone write plugs when handling native zone append writes"
This reverts commit 2f572c42bb which is
commit a6aa36e957a1bfb5341986dec32d013d23228fe1 upstream.

It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.

Bug: 161946584
Change-Id: I48f47a48084edfbca1f6e07fdde108f9c164aacf
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2025-03-10 15:17:11 +00:00
Greg Kroah-Hartman
0a0ca652b6 Merge 6.12.18 into android16-6.12
GKI (arm64) relevant 37 out of 149 changes, affecting 60 files +390/-338
  659bfea591 scsi: ufs: core: Fix ufshcd_is_ufs_dev_busy() and ufshcd_eh_timed_out() [1 file, +4/-4]
  3594aad97e ovl: fix UAF in ovl_dentry_update_reval by moving dput() in ovl_link_up [1 file, +1/-1]
  a3ae6a60ba SUNRPC: Prevent looping due to rpc_signal_task() races [3 files, +2/-6]
  b5038504da scsi: core: Clear driver private data when retrying request [1 file, +7/-7]
  465a814323 scsi: ufs: core: Set default runtime/system PM levels before ufshcd_hba_init() [1 file, +15/-15]
  ee5d6cb5cc ALSA: usb-audio: Avoid dropping MIDI events at closing multiple ports [1 file, +1/-1]
  5c9921f1da Bluetooth: L2CAP: Fix L2CAP_ECRED_CONN_RSP response [1 file, +7/-2]
  f22df335b2 net: loopback: Avoid sending IP packets without an Ethernet header [1 file, +14/-0]
  915d64a78f net: set the minimum for net_hotdata.netdev_budget_usecs [1 file, +2/-1]
  db8b2a613d ipv4: Convert icmp_route_lookup() to dscp_t. [1 file, +9/-10]
  97c455c3c2 ipv4: Convert ip_route_input() to dscp_t. [6 files, +18/-9]
  8ffd0390fc ipvs: Always clear ipvs_property flag in skb_scrub_packet() [1 file, +1/-1]
  c417b1e4d8 tcp: devmem: don't write truncated dmabuf CMSGs to userspace [3 files, +22/-16]
  33d782e38d tcp: Defer ts_recent changes until req is owned [1 file, +4/-6]
  902d576296 net: Clear old fragment checksum value in napi_reuse_skb [1 file, +1/-0]
  806437d047 thermal: gov_power_allocator: Fix incorrect calculation in divvy_up_power() [1 file, +1/-1]
  7d582eb6e4 perf/core: Order the PMU list to fix warning about unordered pmu_ctx_list [1 file, +9/-2]
  13cca2b73e uprobes: Reject the shared zeropage in uprobe_write_opcode() [1 file, +5/-0]
  07a82c78d8 thermal: of: Simplify thermal_of_should_bind with scoped for each OF child [1 file, +2/-3]
  e11df3bffd thermal/of: Fix cdev lookup in thermal_of_should_bind() [1 file, +29/-21]
  19cd2dc4d4 thermal: core: Move lists of thermal instances to trip descriptors [7 files, +62/-64]
  27a144c3be thermal: gov_power_allocator: Update total_weight on bind and cdev updates [1 file, +22/-8]
  546c19eb69 io_uring/net: save msg_control for compat [1 file, +3/-1]
  8cc451444c unreachable: Unify [2 files, +7/-15]
  2cfd0e5084 objtool: Remove annotate_{,un}reachable() [2 files, +2/-68]
  a00e900c9b objtool: Fix C jump table annotations for Clang [3 files, +6/-5]
  435d2964af tracing: Fix bad hist from corrupting named_triggers list [1 file, +15/-15]
  8e31d9fb2f ALSA: usb-audio: Re-add sample rate quirk for Pioneer DJM-900NXS2 [1 file, +1/-0]
  b9de147b2c KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2 [3 files, +14/-21]
  a2475ccad6 perf/core: Add RCU read lock protection to perf_iterate_ctx() [1 file, +2/-1]
  322cb23e24 perf/core: Fix low freq setting via IOC_PERIOD [1 file, +9/-8]
  8f6369c3cd arm64/mm: Fix Boot panic on Ampere Altra [1 file, +1/-6]
  2f572c42bb block: Remove zone write plugs when handling native zone append writes [2 files, +73/-10]
  29b6d5ad3e rcuref: Plug slowpath race in rcuref_put() [2 files, +8/-6]
  0362847c52 sched/core: Prevent rescheduling when interrupts are disabled [1 file, +1/-1]
  59455f968c scsi: ufs: core: bsg: Fix crash when arpmb command fails [1 file, +4/-2]
  72cbaf8b41 thermal: gov_power_allocator: Add missing NULL pointer check [1 file, +6/-1]

Changes in 6.12.18
	RDMA/mlx5: Fix the recovery flow of the UMR QP
	IB/mlx5: Set and get correct qp_num for a DCT QP
	RDMA/mlx5: Fix a race for DMABUF MR which can lead to CQE with error
	RDMA/mlx5: Fix a WARN during dereg_mr for DM type
	RDMA/mana_ib: Allocate PAGE aligned doorbell index
	RDMA/hns: Fix mbox timing out by adding retry mechanism
	RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved
	RDMA/bnxt_re: Refactor NQ allocation
	RDMA/bnxt_re: Cache MSIx info to a local structure
	RDMA/bnxt_re: Add sanity checks on rdev validity
	RDMA/bnxt_re: Allocate dev_attr information dynamically
	RDMA/bnxt_re: Fix the statistics for Gen P7 VF
	landlock: Fix non-TCP sockets restriction
	scsi: ufs: core: Fix ufshcd_is_ufs_dev_busy() and ufshcd_eh_timed_out()
	ovl: fix UAF in ovl_dentry_update_reval by moving dput() in ovl_link_up
	NFS: O_DIRECT writes must check and adjust the file length
	NFS: Adjust delegated timestamps for O_DIRECT reads and writes
	SUNRPC: Prevent looping due to rpc_signal_task() races
	NFSv4: Fix a deadlock when recovering state on a sillyrenamed file
	SUNRPC: Handle -ETIMEDOUT return from tlshd
	RDMA/mlx5: Fix implicit ODP hang on parent deregistration
	RDMA/mlx5: Fix AH static rate parsing
	scsi: core: Clear driver private data when retrying request
	scsi: ufs: core: Set default runtime/system PM levels before ufshcd_hba_init()
	RDMA/mlx5: Fix bind QP error cleanup flow
	RDMA/bnxt_re: Fix the page details for the srq created by kernel consumers
	sunrpc: suppress warnings for unused procfs functions
	ALSA: usb-audio: Avoid dropping MIDI events at closing multiple ports
	Bluetooth: L2CAP: Fix L2CAP_ECRED_CONN_RSP response
	rxrpc: rxperf: Fix missing decoding of terminal magic cookie
	afs: Fix the server_list to unuse a displaced server rather than putting it
	afs: Give an afs_server object a ref on the afs_cell object it points to
	net: loopback: Avoid sending IP packets without an Ethernet header
	net: set the minimum for net_hotdata.netdev_budget_usecs
	ipv4: Convert icmp_route_lookup() to dscp_t.
	ipv4: Convert ip_route_input() to dscp_t.
	ipvlan: Prepare ipvlan_process_v4_outbound() to future .flowi4_tos conversion.
	ipvlan: ensure network headers are in skb linear part
	net: cadence: macb: Synchronize stats calculations
	net: dsa: rtl8366rb: Fix compilation problem
	ASoC: es8328: fix route from DAC to output
	ASoC: fsl: Rename stream name of SAI DAI driver
	ipvs: Always clear ipvs_property flag in skb_scrub_packet()
	drm/xe/oa: Signal output fences
	drm/xe/oa: Move functions up so they can be reused for config ioctl
	drm/xe/oa: Add syncs support to OA config ioctl
	drm/xe/oa: Allow only certain property changes from config
	drm/xe/oa: Allow oa_exponent value of 0
	firmware: cs_dsp: Remove async regmap writes
	ASoC: cs35l56: Prevent races when soft-resetting using SPI control
	ALSA: hda/realtek: Fix wrong mic setup for ASUS VivoBook 15
	net: ethernet: ti: am65-cpsw: select PAGE_POOL
	tcp: devmem: don't write truncated dmabuf CMSGs to userspace
	ice: add E830 HW VF mailbox message limit support
	ice: Fix deinitializing VF in error path
	ice: Avoid setting default Rx VSI twice in switchdev setup
	tcp: Defer ts_recent changes until req is owned
	net: Clear old fragment checksum value in napi_reuse_skb
	net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination.
	net/mlx5: IRQ, Fix null string in debug print
	net: ipv6: fix dst ref loop on input in seg6 lwt
	net: ipv6: fix dst ref loop on input in rpl lwt
	selftests: drv-net: Check if combined-count exists
	idpf: fix checksums set in idpf_rx_rsc()
	net: ti: icss-iep: Reject perout generation request
	thermal: gov_power_allocator: Fix incorrect calculation in divvy_up_power()
	perf/core: Order the PMU list to fix warning about unordered pmu_ctx_list
	uprobes: Reject the shared zeropage in uprobe_write_opcode()
	thermal: of: Simplify thermal_of_should_bind with scoped for each OF child
	thermal/of: Fix cdev lookup in thermal_of_should_bind()
	thermal: core: Move lists of thermal instances to trip descriptors
	thermal: gov_power_allocator: Update total_weight on bind and cdev updates
	io_uring/net: save msg_control for compat
	unreachable: Unify
	objtool: Remove annotate_{,un}reachable()
	objtool: Fix C jump table annotations for Clang
	x86/CPU: Fix warm boot hang regression on AMD SC1100 SoC systems
	phy: rockchip: fix Kconfig dependency more
	phy: rockchip: naneng-combphy: compatible reset with old DT
	riscv: KVM: Fix hart suspend status check
	riscv: KVM: Fix hart suspend_type use
	riscv: KVM: Fix SBI IPI error generation
	riscv: KVM: Fix SBI TIME error generation
	tracing: Fix bad hist from corrupting named_triggers list
	ftrace: Avoid potential division by zero in function_stat_show()
	ALSA: usb-audio: Re-add sample rate quirk for Pioneer DJM-900NXS2
	ALSA: hda/realtek: Fix microphone regression on ASUS N705UD
	KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
	perf/core: Add RCU read lock protection to perf_iterate_ctx()
	perf/x86: Fix low freqency setting issue
	perf/core: Fix low freq setting via IOC_PERIOD
	drm/xe/regs: remove a duplicate definition for RING_CTL_SIZE(size)
	drm/xe/userptr: restore invalidation list on error
	drm/xe/userptr: fix EFAULT handling
	drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd
	drm/amdgpu: disable BAR resize on Dell G5 SE
	drm/amdgpu: init return value in amdgpu_ttm_clear_buffer
	drm/amd/display: Disable PSR-SU on eDP panels
	drm/amd/display: add a quirk to enable eDP0 on DP1
	drm/amd/display: Fix HPD after gpu reset
	arm64/mm: Fix Boot panic on Ampere Altra
	block: Remove zone write plugs when handling native zone append writes
	i2c: npcm: disable interrupt enable bit before devm_request_irq
	i2c: ls2x: Fix frequency division register access
	usbnet: gl620a: fix endpoint checking in genelink_bind()
	net: stmmac: dwmac-loongson: Add fix_soc_reset() callback
	net: phy: qcom: qca807x fix condition for DAC_DSP_BIAS_CURRENT
	net: enetc: fix the off-by-one issue in enetc_map_tx_buffs()
	net: enetc: keep track of correct Tx BD count in enetc_map_tx_tso_buffs()
	net: enetc: VFs do not support HWTSTAMP_TX_ONESTEP_SYNC
	net: enetc: update UDP checksum when updating originTimestamp field
	net: enetc: correct the xdp_tx statistics
	net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs()
	phy: tegra: xusb: reset VBUS & ID OVERRIDE
	phy: exynos5-usbdrd: fix MPLL_MULTIPLIER and SSC_REFCLKSEL masks in refclk
	phy: exynos5-usbdrd: gs101: ensure power is gated to SS phy in phy_exit()
	iommu/vt-d: Remove device comparison in context_setup_pass_through_cb
	iommu/vt-d: Fix suspicious RCU usage
	intel_idle: Handle older CPUs, which stop the TSC in deeper C states, correctly
	mptcp: always handle address removal under msk socket lock
	mptcp: reset when MPTCP opts are dropped after join
	selftests/landlock: Test that MPTCP actions are not restricted
	vmlinux.lds: Ensure that const vars with relocations are mapped R/O
	rcuref: Plug slowpath race in rcuref_put()
	sched/core: Prevent rescheduling when interrupts are disabled
	sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
	selftests/landlock: Test TCP accesses with protocol=IPPROTO_TCP
	dm-integrity: Avoid divide by zero in table status in Inline mode
	dm vdo: add missing spin_lock_init
	ima: Reset IMA_NONACTION_RULE_FLAGS after post_setattr
	scsi: ufs: core: bsg: Fix crash when arpmb command fails
	rseq/selftests: Fix riscv rseq_offset_deref_addv inline asm
	riscv/futex: sign extend compare value in atomic cmpxchg
	riscv: signal: fix signal frame size
	riscv: cacheinfo: Use of_property_present() for non-boolean properties
	riscv: signal: fix signal_minsigstksz
	riscv: cpufeature: use bitmap_equal() instead of memcmp()
	efi: Don't map the entire mokvar table to determine its size
	amdgpu/pm/legacy: fix suspend/resume issues
	x86/microcode/AMD: Return bool from find_blobs_in_containers()
	x86/microcode/AMD: Have __apply_microcode_amd() return bool
	x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signature
	x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarations
	x86/microcode/AMD: Merge early_apply_microcode() into its single callsite
	x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declaration
	x86/microcode/AMD: Add get_patch_level()
	x86/microcode/AMD: Load only SHA256-checksummed patches
	thermal: gov_power_allocator: Add missing NULL pointer check
	Linux 6.12.18

Change-Id: Id06a9c751e3315bfd1a6e642b2c0f276edb46319
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2025-03-10 13:05:41 +00:00
Damien Le Moal
2f572c42bb block: Remove zone write plugs when handling native zone append writes
commit a6aa36e957a1bfb5341986dec32d013d23228fe1 upstream.

For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.

However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
   written using zone append or regular write operations, because the
   write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
   will always be failed by blk_zone_wplug_prepare_bio() as the write
   pointer alignment check will fail, even if the user correctly
   accounted for the zone append operations and issued the regular
   writes with a correct sector.

Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular writes using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path will never see again the zone write plug as
disk_get_zone_wplug() will return NULL). Rate-limited warnings are added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.

Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().

To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.

Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07 18:25:40 +01:00
Sandeep Dhavale
6e8ff6954a ANDROID: block: Support configuring limits below the page size
Allow block drivers to configure the following:
* Maximum number of hardware sectors values smaller than
  PAGE_SIZE >> SECTOR_SHIFT. For PAGE_SIZE = 4096 this means that values
  below 8 become supported.
* A maximum segment size below the page size. This is most useful
  for page sizes above 4096 bytes.

The blk_sub_page_segments static branch will be used in later patches to
prevent that performance of block drivers that support segments >=
PAGE_SIZE and max_hw_sectors >= PAGE_SIZE >> SECTOR_SHIFT would be affected.

This patch may change the behavior of existing block drivers from not
working into working: without it, an attempt to configure a limit below
what is supported by the block layer causes the block layer to select a
larger value. If that value is not supported by the block driver, this
may cause other data to be transferred than requested, a kernel crash,
or other undesirable behavior.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Sandeep Dhavale <dhavale@google.com>

Bug: 346870006

Link: https://lore.kernel.org/all/20230612203314.17820-4-bvanassche@acm.org/

[dhavale: the current patch is based on the FROMLIST patch sent to the
kernel mailing list. As the queue config functions are removed, the
logic has been adapted in the analogous function blk_validate_limits().

Block maintainers have rejected all our previous attempts to land
patches which support sub-page segment sizes. But we have decided that
these patches are necessary to have the 16KB page size kernel work with
hardware which supports a maximum 4KB segment size.
]

Change-Id: I3faa20be1e83d1501d0f25f549b40301443d0df4
2025-01-24 13:45:40 -08:00
Weichao Guo
a2d282588e ANDROID: GKI: add ANDROID_OEM_DATA() in struct request_queue
Add ANDROID_OEM_DATA(1) to struct request_queue to support additional
request queue state for the extended copy feature.

Bug: 283021230

Change-Id: Ic946fd08dcebed708f03749557d9289ddb3696b8
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Weichao Guo <guoweichao@oppo.corp-partner.google.com>
(cherry picked from commit d7b3d8d1e527dc41fe8faeb68cef879290db379c)
(cherry picked from commit b169eba61f7301995db4e5753b4bd9806c0afab5)
2025-01-14 10:39:13 -08:00
Damien Le Moal
7fa80134cf block: Prevent potential deadlocks in zone write plug error recovery
commit fe0418eb9bd69a19a948b297c8de815e05f3cde1 upstream.

Zone write plugging, which handles writes to zones of a zoned block
device, always executes a zone report whenever a write BIO to a zone
fails. The intent is to ensure that the tracking of a zone write
pointer is always correct, so that the alignment of write BIOs to a
zone write pointer can be checked on submission and zone append
operations can always be correctly emulated using regular write
BIOs.

However, this error recovery scheme introduces a potential deadlock if a
device queue freeze is initiated while BIOs are still plugged in a zone
write plug and one of these write operation fails. In such case, the
disk zone write plug error recovery work is scheduled and executes a
report zone. This in turn can result in a request allocation in the
underlying driver to issue the report zones command to the device. But
with the device queue freeze already started, this allocation will
block, preventing the report zone execution and the continuation of the
processing of the plugged BIOs. As plugged BIOs hold a queue usage
reference, the queue freeze itself will never complete, resulting in a
deadlock.

Avoid this problem by completely removing from the zone write plugging
code the use of report zones operations after a failed write operation,
instead relying on the device user to either execute a report zones,
reset the zone, finish the zone, or give up writing to the device (which
is a fairly common pattern for file systems which degrade to read-only
after write failures). This is not an unreasonable requirement as all
well-behaved applications, FSes and device mapper targets already use
report zones to recover from write errors whenever possible, by comparing
the current position of a zone write pointer with what they assume that
position to be.

The changes to remove the automatic error recovery are as follows:
 - Completely remove the error recovery work and its associated
   resources (zone write plug list head, disk error list, and disk
   zone_wplugs_work work struct). This also removes the functions
   disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().

 - Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
   BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write
   plug whenever a write operation targeting the zone of the zone write
   plug fails. This flag indicates that the zone write pointer offset is
   not reliable and that it must be updated when the next report zone,
   reset zone, finish zone or disk revalidation is executed.

 - Modify blk_zone_write_plug_bio_endio() to set the
   BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
   write BIO.

 - Modify the function disk_zone_wplug_set_wp_offset() to clear this
   new flag, thus implementing recovery of a correct write pointer
   offset with the reset (all) zone and finish zone operations.

 - Modify blkdev_report_zones() to always use the disk_report_zones_cb()
   callback so that disk_zone_wplug_sync_wp_offset() can be called for
   any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
   This implements recovery of a correct write pointer offset for zone
   write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within
   the range of the report zones operation executed by the user.

 - Modify blk_revalidate_seq_zone() to call
   disk_zone_wplug_sync_wp_offset() for all sequential write required
   zones when a zoned block device is revalidated, thus always resolving
   any inconsistency between the write pointer offset of zone write
   plugs and the actual write pointer position of sequential zones.

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-19 18:13:00 +01:00
Damien Le Moal
a4b656ea1b dm: Fix dm-zoned-reclaim zone write pointer alignment
commit b76b840fd93374240b59825f1ab8e2f5c9907acb upstream.

The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming another zone, to write the valid data blocks from the
zone being reclaimed at the same position relative to the zone start in
the reclaim target zone.

The first call to blkdev_issue_zeroout() will try to use hardware
offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
non-zero max_write_zeroes_sectors queue limit. If this operation fails
because of the lack of hardware support, blkdev_issue_zeroout() falls
back to using a regular write operation with the zero-page as buffer.
Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
the block layer zone write plugging code which will execute a report
zones operation to ensure that the write pointer of the target zone of
the failed operation has not changed and to "rewind" the zone write
pointer offset of the target zone as it was advanced when the write zero
operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
cause any issue and blkdev_issue_zeroout() works as expected.

However, since the automatic recovery of zone write pointers by the zone
write plugging code can potentially cause deadlocks with queue freeze
operations, a different recovery must be implemented in preparation for
the removal of zone write plugging report zones based recovery.

Do this by introducing the new function blk_zone_issue_zeroout(). This
function first calls blkdev_issue_zeroout() with the flag
BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution
which attempt to use the device hardware offload with the
REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone
operation is issued to restore the zone write pointer offset of the
target zone to the correct position and blkdev_issue_zeroout() is called
again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
operation performing this recovery is implemented using the helper
function disk_zone_sync_wp_offset() which calls the gendisk report_zones
file operation with the callback disk_report_zones_cb(). This callback
updates the target write pointer offset of the target zone using the new
function disk_zone_wplug_sync_wp_offset().

dmz_reclaim_align_wp() is modified to change its call to
blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any
other change needed, as the two functions are functionally equivalent.

Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-19 18:13:00 +01:00
Damien Le Moal
493326c4f1 block: RCU protect disk->conv_zones_bitmap
[ Upstream commit d7cb6d7414ea1b33536fa6d11805cb8dceec1f97 ]

Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.

disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.

disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14 20:03:35 +01:00
Ming Lei
b12cfcae8a block: always verify unfreeze lock on the owner task
commit 6a78699838a0ddeed3620ddf50c1521f1fe1e811 upstream.

commit f1be1788a32e ("block: model freeze & enter queue as lock for
supporting lockdep") tries to apply lockdep for verifying freeze &
unfreeze. However, the verification is only done for the outermost freeze
and unfreeze. This is actually not correct because q->mq_freeze_depth may
still drop to zero on a task other than the freeze owner task.

Fix this issue by always verifying the last unfreeze lock in the owner
task context, and make sure both the outermost freeze & unfreeze are
verified in the current task.

Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05 14:03:10 +01:00
Christoph Hellwig
5e15cc7a1d block: return unsigned int from bdev_io_min
[ Upstream commit 46fd48ab3ea3eb3bb215684bd66ea3d260b091a9 ]

The underlying limit is defined as an unsigned int, so return that from
bdev_io_min as well.

Fixes: ac481c20ef ("block: Topology ioctls")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241119072602.1059488-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05 14:03:06 +01:00
Ming Lei
a6fc2ba1c7 block: model freeze & enter queue as lock for supporting lockdep
[ Upstream commit f1be1788a32e8fa63416ad4518bbd1a85a825c9d ]

Recently we got several deadlock report[1][2][3] caused by
blk_mq_freeze_queue and blk_enter_queue().

Turns out the two are just like acquiring read/write lock, so model them
as read/write lock for supporting lockdep:

1) model q->q_usage_counter as two locks(io and queue lock)

- queue lock covers sync with blk_enter_queue()

- io lock covers sync with bio_enter_queue()

2) make the lockdep class/key as per-queue:

- different subsystems have very different lock usage patterns; a shared
  lock class easily causes false positives

- freeze_queue degrades to no lock in case the disk state becomes DEAD,
  because bio_enter_queue() won't be blocked any more

- freeze_queue degrades to no lock in case the request queue becomes dying,
  because blk_enter_queue() won't be blocked any more

3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is exclusive lock, so dependency with blk_enter_queue() is covered

- it is trylock because blk_mq_freeze_queue() are allowed to run
  concurrently

4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() are allowed

- dependency with blk_mq_freeze_queue() is covered

- blk_queue_exit() is often called from other contexts (such as irq), and
it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(); this way still covers as many cases as possible

With lockdep support, such kind of reports may be reported asap and
needn't wait until the real deadlock is triggered.

For example, lockdep report can be triggered in the report[3] with this
patch applied.

[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166

[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/

[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 3802f73bd807 ("block: fix uaf for flush rq while iterating tags")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05 14:03:05 +01:00
Dr. David Alan Gilbert
9ba5dcc722 block: Remove unused blk_limits_io_{min,opt}
blk_limits_io_min and blk_limits_io_opt are unused since the
recent commit
  0a94a469a4 ("dm: stop using blk_limits_io_{min,opt}")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20240920004817.676216-1-linux@treblig.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-20 00:19:48 -06:00
Jens Axboe
42b16d3ac3 Merge tag 'v6.11' into for-6.12/block
Merge in 6.11 final to get the fix for preventing deadlocks on an
elevator switch, as there's a fixup for that patch.

* tag 'v6.11': (1788 commits)
  Linux 6.11
  Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
  pinctrl: pinctrl-cy8c95x0: Fix regcache
  cifs: Fix signature miscalculation
  mm: avoid leaving partial pfn mappings around in error case
  drm/xe/client: add missing bo locking in show_meminfo()
  drm/xe/client: fix deadlock in show_meminfo()
  drm/xe/oa: Enable Xe2+ PES disaggregation
  drm/xe/display: fix compat IS_DISPLAY_STEP() range end
  drm/xe: Fix access_ok check in user_fence_create
  drm/xe: Fix possible UAF in guc_exec_queue_process_msg
  drm/xe: Remove fence check from send_tlb_invalidation
  drm/xe/gt: Remove double include
  net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
  PCI: Fix potential deadlock in pcim_intx()
  workqueue: Clear worker->pool in the worker thread context
  net: tighten bad gso csum offset check in virtio_net_hdr
  netlink: specs: mptcp: fix port endianness
  net: dpaa: Pad packets to ETH_ZLEN
  mptcp: pm: Fix uaf in __timer_delete_sync
  ...
2024-09-17 08:32:53 -06:00
Christoph Hellwig
379b122a3e block: constify the lim argument to queue_limits_max_zone_append_sectors
queue_limits_max_zone_append_sectors doesn't change the lim argument,
so mark it as const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29 04:32:32 -06:00
John Garry
81475beb1b block: Drop NULL check in bdev_write_zeroes_sectors()
Function bdev_get_queue() must not return NULL, so drop the check in
bdev_write_zeroes_sectors().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Link: https://lore.kernel.org/r/20240815163228.216051-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-19 09:48:59 -06:00
Linus Torvalds
7d080fa867 Merge tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:

 - MD fixes via Song:
     - md-cluster fixes (Heming Zhao)
     - raid1 fix (Mateusz Jończyk)

 - s390/dasd module description (Jeff)

 - Series cleaning up and hardening the blk-mq debugfs flag handling
   (John, Christoph)

 - blk-cgroup cleanup (Xiu)

 - Error polled IO attempts if backend doesn't support it (hexue)

 - Fix for an sbitmap hang (Yang)

* tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits)
  blk-cgroup: move congestion_count to struct blkcg
  sbitmap: fix io hung due to race on sbitmap_word::cleared
  block: avoid polling configuration errors
  block: Catch possible entries missing from rqf_name[]
  block: Simplify definition of RQF_NAME()
  block: Use enum to define RQF_x bit indexes
  block: Catch possible entries missing from cmd_flag_name[]
  block: Catch possible entries missing from alloc_policy_name[]
  block: Catch possible entries missing from hctx_flag_name[]
  block: Catch possible entries missing from hctx_state_name[]
  block: Catch possible entries missing from blk_queue_flag_name[]
  block: Make QUEUE_FLAG_x as an enum
  block: Relocate BLK_MQ_MAX_DEPTH
  block: Relocate BLK_MQ_CPU_WORK_BATCH
  block: remove QUEUE_FLAG_STOPPED
  block: Add missing entry to hctx_flag_name[]
  block: Add zone write plugging entry to rqf_name[]
  block: Add missing entries from cmd_flag_name[]
  s390/dasd: fix error checks in dasd_copy_pair_store()
  s390/dasd: add missing MODULE_DESCRIPTION() macros
  ...
2024-07-22 11:32:05 -07:00
John Garry
55177adf18 block: Make QUEUE_FLAG_x as an enum
This will allow us better keep in sync with blk_queue_flag_name[].

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-8-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:49 -06:00
Christoph Hellwig
c8f51feee1 block: remove QUEUE_FLAG_STOPPED
QUEUE_FLAG_STOPPED is entirely unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240719112912.3830443-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19 09:32:48 -06:00
Linus Torvalds
3e78198862 Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
     - Device initialization memory leak fixes (Keith)
     - More constants defined (Weiwen)
     - Target debugfs support (Hannes)
     - PCIe subsystem reset enhancements (Keith)
     - Queue-depth multipath policy (Redhat and PureStorage)
     - Implement get_unique_id (Christoph)
     - Authentication error fixes (Gaosheng)

 - MD updates via Song
     - sync_action fix and refactoring (Yu Kuai)
     - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
       Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)

 - Fix loop detach/open race (Gulam)

 - Fix lower control limit for blk-throttle (Yu)

 - Add module descriptions to various drivers (Jeff)

 - Add support for atomic writes for block devices, and statx reporting
   for same. Includes SCSI and NVMe (John, Prasad, Alan)

 - Add IO priority information to block trace points (Dongliang)

 - Various zone improvements and tweaks (Damien)

 - mq-deadline tag reservation improvements (Bart)

 - Ignore direct reclaim swap writes in writeback throttling (Baokun)

 - Block integrity improvements and fixes (Anuj)

 - Add basic support for rust based block drivers. Has a dummy null_blk
   variant for now (Andreas)

 - Series converting driver settings to queue limits, and cleanups and
   fixes related to that (Christoph)

 - Cleanup for poking too deeply into the bvec internals, in preparation
   for DMA mapping API changes (Christoph)

 - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
   Ming, Zhu, Damien, Christophe, Chaitanya)

* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
  floppy: add missing MODULE_DESCRIPTION() macro
  loop: add missing MODULE_DESCRIPTION() macro
  ublk_drv: add missing MODULE_DESCRIPTION() macro
  xen/blkback: add missing MODULE_DESCRIPTION() macro
  block/rnbd: Constify struct kobj_type
  block: take offset into account in blk_bvec_map_sg again
  block: fix get_max_segment_size() warning
  loop: Don't bother validating blocksize
  virtio_blk: Don't bother validating blocksize
  null_blk: Don't bother validating blocksize
  block: Validate logical block size in blk_validate_limits()
  virtio_blk: Fix default logical block size fallback
  nvmet-auth: fix nvmet_auth hash error handling
  nvme: implement ->get_unique_id
  block: pass a phys_addr_t to get_max_segment_size
  block: add a bvec_phys helper
  blk-lib: check for kill signal in ioctl BLKZEROOUT
  block: limit the Write Zeroes to manually writing zeroes fallback
  block: refactor blkdev_issue_zeroout
  block: move read-only and supported checks into (__)blkdev_issue_zeroout
  ...
2024-07-15 14:20:22 -07:00
John Garry
fe3d508ba9 block: Validate logical block size in blk_validate_limits()
Some drivers validate their own logical block size. There is no harm in
always doing this, so validate in blk_validate_limits().

This allows us to remove the validation in most of those drivers.

Add a comment to blk_validate_block_size() to inform users that self-
validation of LBS is usually unnecessary.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240708091651.177447-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-09 00:00:17 -06:00
Christoph Hellwig
bf86bcdb40 blk-lib: check for kill signal in ioctl BLKZEROOUT
Zeroout can access a significant capacity and take longer than the user
expected.  A user may change their mind about wanting to run that
command and attempt to kill the process and do something else with their
device. But since the task is uninterruptible, they have to wait for it
to finish, which could be many hours.

Add a new BLKDEV_ZERO_KILLABLE flag for blkdev_issue_zeroout that checks
for a fatal signal at each iteration so the user doesn't have to wait for
their regretted operation to complete naturally.

Heavily based on an earlier patch from Keith Busch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:53:15 -06:00
Damien Le Moal
f2a7bea237 block: Remove REQ_OP_ZONE_RESET_ALL emulation
Now that device mapper can handle resetting all zones of a mapped zoned
device using REQ_OP_ZONE_RESET_ALL, all zoned block device drivers
support this operation. With this, the request queue feature
BLK_FEAT_ZONE_RESETALL is not necessary and the emulation code in
blk-zone.c can be removed.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240704052816.623865-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-05 00:42:04 -06:00
Christoph Hellwig
5476394aa9 block: simplify queue_logical_block_size
queue_logical_block_size is never called with a 0 queue, and the
logical_block_size field in queue_limits is always initialized for
a live queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240627111407.476276-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28 15:06:16 -06:00
John Garry
63db4a1f79 block: Delete blk_queue_flag_test_and_set()
Since commit 70200574cc ("block: remove QUEUE_FLAG_DISCARD"),
blk_queue_flag_test_and_set() has not been used, so delete it.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20240627160735.842189-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-27 12:43:27 -06:00
Christoph Hellwig
e94b45d08b block: move dma_pad_mask into queue_limits
dma_pad_mask is a queue_limits field by all ways of looking at it, so move it
there and set it through the atomic queue limits APIs.

Add a little helper that takes the alignment and pad into account to
simplify the code that is touched a bit.

Note that there never was any need for the > check in
blk_queue_update_dma_pad; this probably was just copied and pasted from
blk_queue_update_dma_alignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
abfc9d8109 block: remove the fallback case in queue_dma_alignment
Now that all updates go through blk_validate_limits the default of 511
is set at initialization time.  Also remove the unused NULL check as
calling this helper on a NULL queue can't happen (and doesn't make
much sense to start with).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
73781b3b81 block: remove disk_update_readahead
Mark blk_apply_bdi_limits non-static and open code disk_update_readahead
in the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
fcf865e357 block: convert features and flags to __bitwise types
... and let sparse help us catch mismatches or abuses.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
ec9b1cf0b0 block: rename BLK_FEAT_MISALIGNED
This is a flag for ->flags and not a feature for ->features.  And fix the
one place that actually incorrectly cleared it from ->features.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26 09:37:35 -06:00
Christoph Hellwig
44348870de block: fix the blk_queue_nonrot polarity
Take care of the inverse polarity of the BLK_FEAT_ROTATIONAL flag
vs the old nonrot helper.

Fixes: bd4a633b6f ("block: move the nonrot flag to queue_limits")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240624173835.76753-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24 13:06:12 -06:00
Damien Le Moal
caaf7101c0 block: Cleanup block device zone helpers
There is no need to conditionally define on CONFIG_BLK_DEV_ZONED the
inline helper functions bdev_nr_zones(), bdev_max_open_zones(),
bdev_max_active_zones() and disk_zone_no(), as these functions will return
the correct value in all cases (zoned device or not, including when
CONFIG_BLK_DEV_ZONED is not set). Furthermore, disk_nr_zones()
definition can be simplified as disk->nr_zones is always 0 for regular
block devices.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240621031506.759397-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-21 08:26:36 -06:00
Damien Le Moal
b6cfe2287d block: Define bdev_nr_zones() as an inline function
There is no need for bdev_nr_zones() to be an exported function
calculating the number of zones of a block device. Instead, given that
all callers use this helper with a fully initialized block device that
has a gendisk, we can redefine this function as an inline helper in
blkdev.h.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240621031506.759397-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-21 08:26:35 -06:00
Prasad Singamsetty
9abcfbd235 block: Add atomic write support for statx
Extend the statx system call to return additional info for atomic write
support if the specified file is a block device.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-7-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
John Garry
9da3d1e912 block: Add core atomic write support
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag

New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
  of an atomic write which the device may support. It is not
  necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
  max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
  and atomic_write_max_sectors would be the limit on a merged atomic write
  request size. This value is not capped at max_sectors, as the value in
  max_sectors can be controlled from userspace, and it would only cause
  trouble if userspace could limit atomic_write_unit_max_bytes and the
  other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
  min/max length of an atomic write unit which the device may support. They
  both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
  the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
  atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
  Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
  indicates an LBA space boundary; an atomic write that straddles it is no
  longer executed atomically by the disk. The value must be a
  power-of-2. Note that it would be acceptable to enforce a rule that
  atomic_write_hw_boundary_sectors is a multiple of
  atomic_write_hw_unit_max, but the resultant code would be more
  complicated.

All atomic write limits are set to 0 by default to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.

An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.

New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
				bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
				bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
				bytes
- atomic_write_max_bytes      - same as atomic_write_max_sectors in bytes

Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
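The two merge conditions above compose into one eligibility check. This is a
hedged sketch with hypothetical names, operating on the would-be merged
range rather than on real request structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: may two atomic writes merge into a single request
 * of total_sects sectors starting at 'sector'? Both limits are expressed
 * in sectors; boundary_sects == 0 means no boundary. */
static bool atomic_writes_mergeable(uint64_t sector, uint32_t total_sects,
                                    uint32_t max_sects,
                                    uint32_t boundary_sects)
{
    /* Condition 1: total resultant length within the atomic max. */
    if (total_sects > max_sects)
        return false;
    /* Condition 2: the merged range must not straddle a boundary. */
    if (boundary_sects) {
        uint64_t mask = ~((uint64_t)boundary_sects - 1);

        if ((sector & mask) != ((sector + total_sects - 1) & mask))
            return false;
    }
    return true;
}
```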

Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
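The partition condition can be sketched as two alignment tests on the
partition's start sector. Names are hypothetical; the real
bdev_can_atomic_write() operates on a struct block_device:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: a partition may accept atomic writes only if its
 * start sector is aligned to both the minimum atomic unit and the
 * boundary (when a boundary is set). */
static bool part_supports_atomic_write(uint64_t part_start_sect,
                                       uint32_t unit_min_sects,
                                       uint32_t boundary_sects)
{
    if (part_start_sect % unit_min_sects)
        return false;
    if (boundary_sects && (part_start_sect % boundary_sects))
        return false;
    return true;
}
```

A partition starting at sector 64 with an 8-sector minimum unit but a
128-sector boundary fails the second test, since atomic writes near its
start would land misaligned relative to the device-wide boundary.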

FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write of invalid size to be
rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid-size BIO.
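A plausible shape for that validation, assuming (as an illustration, not a
statement of the kernel's exact checks) that a valid atomic write must not
exceed the max unit and must be a multiple of the min unit:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EINVAL 22  /* assumption: mirrors errno EINVAL */

/* Hypothetical sketch of blk_validate_atomic_write_op_size(): returns 0
 * for a valid size, -EINVAL otherwise (standing in for BLK_STS_INVAL). */
static int validate_atomic_write_size(uint32_t size_bytes,
                                      uint32_t unit_min_bytes,
                                      uint32_t unit_max_bytes)
{
    if (size_bytes > unit_max_bytes)
        return -SKETCH_EINVAL;
    if (size_bytes % unit_min_bytes)
        return -SKETCH_EINVAL;
    return 0;
}
```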

Flag REQ_ATOMIC is used for indicating an atomic write.

Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
John Garry
f70167a7a6 block: Generalize chunk_sectors support as boundary support
The purpose of the chunk_sectors limit is to ensure that a mergeable request
fits within the boundary of the chunk_sectors value.

Such a feature will be useful for other request_queue boundary limits, so
generalize the chunk_sectors merge code.
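The core of any such boundary limit is the same computation: given a
power-of-2 boundary in sectors, how far may a request grow from a given
sector before crossing the next boundary? A minimal sketch with a
hypothetical name (chunk_sectors merging behaves like this):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: number of sectors from 'sector' up to (and
 * including) the last sector before the next power-of-2 boundary.
 * boundary_sects == 0 means no boundary, i.e. effectively unlimited. */
static uint32_t sectors_to_boundary(uint64_t sector, uint32_t boundary_sects)
{
    if (!boundary_sects)
        return UINT32_MAX;
    return boundary_sects - (uint32_t)(sector & (boundary_sects - 1));
}
```

For a 128-sector boundary, a request at sector 120 may grow by at most 8
sectors, while one starting exactly on a boundary may grow by a full 128.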

This idea was proposed by Hannes Reinecke.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 15:19:17 -06:00
Jens Axboe
e821bcecdf Merge branch 'for-6.11/block-limits' into for-6.11/block
Merge in queue limits cleanups.

* for-6.11/block-limits:
  block: move the raid_partial_stripes_expensive flag into the features field
  block: remove the discard_alignment flag
  block: move the misaligned flag into the features field
  block: renumber and rename the cache disabled flag
  block: fix spelling and grammar for in writeback_cache_control.rst
  block: remove the unused blk_bounce enum
2024-06-20 06:55:20 -06:00
Christoph Hellwig
7d4dec525f block: move the raid_partial_stripes_expensive flag into the features field
Move the raid_partial_stripes_expensive flags into the features field to
reclaim a little bit of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240619154623.450048-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 06:53:15 -06:00
Christoph Hellwig
4cac3d3a71 block: remove the discard_alignment flag
queue_limits.discard_alignment is never read except in the places
where it is stacked into another limit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240619154623.450048-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20 06:53:14 -06:00