In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio) == true) operation directed at a zoned block device
using zone write plugging, that is, a block device with a disk that has
a zone write plug hash table.
This helper allows simplifying the check on entry to blk_zone_plug_bio()
and is used to guard calls to it for blk-mq devices and DM devices.
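As a rough userspace sketch of the logic described above (struct layouts and the REQ_OP_* encoding are simplified stand-ins, not the kernel's definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures: a zoned disk using
 * zone write plugging has a non-NULL zone write plug hash table. */
struct gendisk {
	void *zone_wplugs_hash;
};

struct bio {
	unsigned int opf;	/* operation and flags; bit 0 set for writes */
	struct gendisk *disk;
};

#define REQ_OP_READ	0u
#define REQ_OP_WRITE	1u

static bool op_is_write(unsigned int opf)
{
	return opf & 1;
}

/* The helper: true only for a write operation directed at a disk that
 * has a zone write plug hash table. */
static bool bio_needs_zone_write_plugging(struct bio *bio)
{
	return op_is_write(bio->opf) && bio->disk->zone_wplugs_hash != NULL;
}
```

Callers can then guard blk_zone_plug_bio() with this single test instead of repeating both conditions at every call site.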
Fixes: f211268ed1 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 417517944
Change-Id: I9628b14d4fe0e1f964d4036178fbc6ee49b3be78
(cherry picked from commit bf7a8b5cbbb2d531f3336be2186af0c5590d157c git://git.kernel.dk/linux-block for-next)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Checking if a given sector is aligned to a zone is a common operation
that is performed for zoned devices. Add a bdev_is_zone_start() helper
to check for this instead of open-coding it everywhere.
Convert the calculations on zone size to be generic instead of relying on
power-of-2(po2) based arithmetic in the block layer using the helpers
wherever possible.
The only hot path affected by this change for zoned devices with a po2
zone size is in blk_check_zone_append(), where the bdev_is_zone_start()
helper is used to keep the calculation optimized for po2 zone sizes.
Finally, allow zoned devices with non po2 zone sizes provided that their
zone capacity and zone size are equal. The main motivation to allow
zoned devices with non po2 zone size is to remove the unmapped LBA
between zone capacity and zone size for devices that cannot have a po2
zone capacity.
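The generic zone-start check, with the fast-path mask optimization for po2 zone sizes alluded to above, can be sketched in plain C (names are illustrative, not the exact kernel helpers):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long sector_t;

static bool is_power_of_2(sector_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Generic zone-start check: works for any zone size. For po2 zone
 * sizes the modulo reduces to a cheap mask in the hot path. */
static bool bdev_is_zone_start(sector_t zone_sectors, sector_t sector)
{
	if (is_power_of_2(zone_sectors))
		return (sector & (zone_sectors - 1)) == 0;
	return sector % zone_sectors == 0;
}
```

With a non-po2 zone size (say 96 sectors, purely for illustration), sector 192 is a zone start while sector 100 is not; the bitmask shortcut would give wrong answers there, hence the modulo in the generic path.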
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Bug: 269471019
Bug: 415836627
Link: https://lore.kernel.org/linux-block/20220923173618.6899-4-p.raghav@samsung.com/
Change-Id: I2ecc186d7b14f5508b6abfe9821526d39a21d7e4
[ bvanassche: ported this patch to kernel 6.12 ]
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Now we only verify the outermost freeze and unfreeze in the current
context when !q->mq_freeze_depth, so it is reliable to save the queue's
freeze-locking state when we want to lock the freeze queue, since that
state is now a per-task variable.
Change-Id: Ic11e09d92c00c4b5080fbe4cd7cfa50e808096f7
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241127135133.3952153-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 415836627
(cherry picked from commit f6661b1d0525f3764596a1b65eeed9e75aecafa7)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Replace the semi-open-coded request list helpers with a proper rq_list
type that mirrors bio_list and has head and tail pointers. Besides
better type safety, this actually allows inserting at the tail of the
list, which will be useful soon.
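A minimal model of such a head/tail list (illustrative struct layouts, not the kernel's exact rq_list definition):

```c
#include <assert.h>
#include <stddef.h>

struct request {
	struct request *rq_next;
	int tag;
};

/* Like bio_list: head for FIFO removal, tail for O(1) appends. */
struct rq_list {
	struct request *head;
	struct request *tail;
};

static void rq_list_add_tail(struct rq_list *l, struct request *rq)
{
	rq->rq_next = NULL;
	if (l->tail)
		l->tail->rq_next = rq;
	else
		l->head = rq;
	l->tail = rq;
}

static struct request *rq_list_pop(struct rq_list *l)
{
	struct request *rq = l->head;

	if (rq) {
		l->head = rq->rq_next;
		if (!l->head)
			l->tail = NULL;
	}
	return rq;
}
```

Tracking the tail pointer is what makes tail insertion O(1); a single head pointer, as in the semi-open-coded helpers, would need a full list walk to append.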
Change-Id: Ia470736d0468c265f5b61cb9d8a0e5544b6b7b0d
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 415836627
(cherry picked from commit a3396b99990d8b4e5797e7b16fdeb64c15ae97bb)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Add the initial set of ABI padding fields in android16-6.12 based on what
is in the android15-6.6 branch.
Bug: 151154716
Change-Id: Icdb394863b2911389bfdced0fd1ea20236ca4ce1
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
To alleviate the priority inversion caused by lock_page(), add
oem_data to struct gendisk to store a pointer to its struct
block_device. This will allow us to check its priority through a
customized scheduler hook when locking a folio.
Bug: 338959088
Bug: 407947260
Change-Id: I118ef11cb89a3fad9a15a2c3b8383d42be0fded4
Signed-off-by: Wang Jianzheng <11134417@vivo.corp-partner.google.com>
(cherry picked from commit feb92ccf10bce90739b5f51cc33d1bd6f16d7fab)
Signed-off-by: ying zuxin <11154159@vivo.com>
Using PAGE_SIZE as the minimum expected DMA segment size breaks
devices that have a max DMA segment size of < 64k when used on 64k
PAGE_SIZE systems: such devices, for example eMMC and the Exynos UFS
controller [0] [1], fail to probe as follows:
WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0
Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new
minimum segment size when the max segment size is < PAGE_SIZE, for 16k
and 64k base page size systems.
If anyone needs to backport this patch, the following commits are
dependencies:
commit 6aeb4f836480 ("block: remove bio_add_pc_page")
commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep")
commit b7175e24d6ac ("block: add a dma mapping iterator")
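The clamping rule above can be modeled in userspace C (the function name and the explicit page_size parameter are illustrative; the kernel's check lives around blk_validate_limits()):

```c
#include <assert.h>

/* Sketch of the clamped minimum-segment-size computation described
 * above. Historically the minimum was PAGE_SIZE; when the device's max
 * segment size is smaller (e.g. a 4K-segment device on a 64K-page
 * kernel), clamp to min(max_seg_size, seg_boundary_mask + 1) instead. */
static unsigned long min_segment_size(unsigned long max_seg_size,
				      unsigned long seg_boundary_mask,
				      unsigned long page_size)
{
	if (max_seg_size < page_size) {
		unsigned long boundary_len = seg_boundary_mask + 1;

		return max_seg_size < boundary_len ? max_seg_size
						   : boundary_len;
	}
	return page_size;
}
```

This keeps the minimum at PAGE_SIZE for devices that can handle it, while letting small-segment controllers pass blk_validate_limits() instead of tripping the WARNING quoted above.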
Bug: 399192075
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Link: https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/ # [0]
Link: https://lore.kernel.org/linux-block/1d55e942-5150-de4c-3a02-c3d066f87028@acm.org/ # [1]
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Keith Busch <kbusch@kernel.org>
Tested-by: Paul Bunyan <pbunyan@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250225022141.2154581-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 889c57066ceee5e9172232da0608a8ac053bb6e5)
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
[dhavale: resolved minor conflict in block/blk.h]
Change-Id: I5fe54dd8c73621259cbd9720b77253d8a2af29c7
commit a6aa36e957a1bfb5341986dec32d013d23228fe1 upstream.
For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.
However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
written using zone append or regular write operations, because the
write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
will always be failed by blk_zone_wplug_prepare_bio() as the write
pointer alignment check will fail, even if the user correctly
accounted for the zone append operations and issued the regular
writes with a correct sector.
Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular writes using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path would never see the zone write plug again
as disk_get_zone_wplug() will return NULL). Rate-limited warnings are
added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.
Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().
To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.
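A heavily simplified model of the fast path and plug removal described above (plain ints instead of atomics, a single plug standing in for the hash table; purely illustrative):

```c
#include <assert.h>
#include <stddef.h>

struct zone_wplug {
	int nr_plugged;		/* plugged regular write BIOs */
};

struct gendisk {
	int nr_zone_wplugs;	/* atomic_t in the kernel */
	struct zone_wplug *plug;	/* stand-in for the hash table */
};

/* On a native zone append: if no zone write plugs are hashed, return
 * immediately; otherwise abort the plugged writes and remove the plug
 * so that write pointer tracking never goes stale. */
static void handle_native_zone_append(struct gendisk *disk)
{
	if (!disk->nr_zone_wplugs)
		return;		/* hot path: no hash lookup needed */
	if (disk->plug) {
		disk->plug->nr_plugged = 0;	/* abort plugged writes */
		disk->plug = NULL;		/* remove from the table */
		disk->nr_zone_wplugs--;
	}
}
```

The counter check up front is the whole point of the optimization: a user issuing only zone append operations keeps the counter at zero and never pays for a hash table lookup.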
Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Change-Id: If7a37be9828e0d59ff68c7b7db4f30a9a10ede89
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2f572c42bb)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit 2f572c42bb which is
commit a6aa36e957a1bfb5341986dec32d013d23228fe1 upstream.
It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.
Bug: 161946584
Change-Id: I48f47a48084edfbca1f6e07fdde108f9c164aacf
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
GKI (arm64) relevant 37 out of 149 changes, affecting 60 files +390/-338
659bfea591 scsi: ufs: core: Fix ufshcd_is_ufs_dev_busy() and ufshcd_eh_timed_out() [1 file, +4/-4]
3594aad97e ovl: fix UAF in ovl_dentry_update_reval by moving dput() in ovl_link_up [1 file, +1/-1]
a3ae6a60ba SUNRPC: Prevent looping due to rpc_signal_task() races [3 files, +2/-6]
b5038504da scsi: core: Clear driver private data when retrying request [1 file, +7/-7]
465a814323 scsi: ufs: core: Set default runtime/system PM levels before ufshcd_hba_init() [1 file, +15/-15]
ee5d6cb5cc ALSA: usb-audio: Avoid dropping MIDI events at closing multiple ports [1 file, +1/-1]
5c9921f1da Bluetooth: L2CAP: Fix L2CAP_ECRED_CONN_RSP response [1 file, +7/-2]
f22df335b2 net: loopback: Avoid sending IP packets without an Ethernet header [1 file, +14/-0]
915d64a78f net: set the minimum for net_hotdata.netdev_budget_usecs [1 file, +2/-1]
db8b2a613d ipv4: Convert icmp_route_lookup() to dscp_t. [1 file, +9/-10]
97c455c3c2 ipv4: Convert ip_route_input() to dscp_t. [6 files, +18/-9]
8ffd0390fc ipvs: Always clear ipvs_property flag in skb_scrub_packet() [1 file, +1/-1]
c417b1e4d8 tcp: devmem: don't write truncated dmabuf CMSGs to userspace [3 files, +22/-16]
33d782e38d tcp: Defer ts_recent changes until req is owned [1 file, +4/-6]
902d576296 net: Clear old fragment checksum value in napi_reuse_skb [1 file, +1/-0]
806437d047 thermal: gov_power_allocator: Fix incorrect calculation in divvy_up_power() [1 file, +1/-1]
7d582eb6e4 perf/core: Order the PMU list to fix warning about unordered pmu_ctx_list [1 file, +9/-2]
13cca2b73e uprobes: Reject the shared zeropage in uprobe_write_opcode() [1 file, +5/-0]
07a82c78d8 thermal: of: Simplify thermal_of_should_bind with scoped for each OF child [1 file, +2/-3]
e11df3bffd thermal/of: Fix cdev lookup in thermal_of_should_bind() [1 file, +29/-21]
19cd2dc4d4 thermal: core: Move lists of thermal instances to trip descriptors [7 files, +62/-64]
27a144c3be thermal: gov_power_allocator: Update total_weight on bind and cdev updates [1 file, +22/-8]
546c19eb69 io_uring/net: save msg_control for compat [1 file, +3/-1]
8cc451444c unreachable: Unify [2 files, +7/-15]
2cfd0e5084 objtool: Remove annotate_{,un}reachable() [2 files, +2/-68]
a00e900c9b objtool: Fix C jump table annotations for Clang [3 files, +6/-5]
435d2964af tracing: Fix bad hist from corrupting named_triggers list [1 file, +15/-15]
8e31d9fb2f ALSA: usb-audio: Re-add sample rate quirk for Pioneer DJM-900NXS2 [1 file, +1/-0]
b9de147b2c KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2 [3 files, +14/-21]
a2475ccad6 perf/core: Add RCU read lock protection to perf_iterate_ctx() [1 file, +2/-1]
322cb23e24 perf/core: Fix low freq setting via IOC_PERIOD [1 file, +9/-8]
8f6369c3cd arm64/mm: Fix Boot panic on Ampere Altra [1 file, +1/-6]
2f572c42bb block: Remove zone write plugs when handling native zone append writes [2 files, +73/-10]
29b6d5ad3e rcuref: Plug slowpath race in rcuref_put() [2 files, +8/-6]
0362847c52 sched/core: Prevent rescheduling when interrupts are disabled [1 file, +1/-1]
59455f968c scsi: ufs: core: bsg: Fix crash when arpmb command fails [1 file, +4/-2]
72cbaf8b41 thermal: gov_power_allocator: Add missing NULL pointer check [1 file, +6/-1]
Changes in 6.12.18
RDMA/mlx5: Fix the recovery flow of the UMR QP
IB/mlx5: Set and get correct qp_num for a DCT QP
RDMA/mlx5: Fix a race for DMABUF MR which can lead to CQE with error
RDMA/mlx5: Fix a WARN during dereg_mr for DM type
RDMA/mana_ib: Allocate PAGE aligned doorbell index
RDMA/hns: Fix mbox timing out by adding retry mechanism
RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved
RDMA/bnxt_re: Refactor NQ allocation
RDMA/bnxt_re: Cache MSIx info to a local structure
RDMA/bnxt_re: Add sanity checks on rdev validity
RDMA/bnxt_re: Allocate dev_attr information dynamically
RDMA/bnxt_re: Fix the statistics for Gen P7 VF
landlock: Fix non-TCP sockets restriction
scsi: ufs: core: Fix ufshcd_is_ufs_dev_busy() and ufshcd_eh_timed_out()
ovl: fix UAF in ovl_dentry_update_reval by moving dput() in ovl_link_up
NFS: O_DIRECT writes must check and adjust the file length
NFS: Adjust delegated timestamps for O_DIRECT reads and writes
SUNRPC: Prevent looping due to rpc_signal_task() races
NFSv4: Fix a deadlock when recovering state on a sillyrenamed file
SUNRPC: Handle -ETIMEDOUT return from tlshd
RDMA/mlx5: Fix implicit ODP hang on parent deregistration
RDMA/mlx5: Fix AH static rate parsing
scsi: core: Clear driver private data when retrying request
scsi: ufs: core: Set default runtime/system PM levels before ufshcd_hba_init()
RDMA/mlx5: Fix bind QP error cleanup flow
RDMA/bnxt_re: Fix the page details for the srq created by kernel consumers
sunrpc: suppress warnings for unused procfs functions
ALSA: usb-audio: Avoid dropping MIDI events at closing multiple ports
Bluetooth: L2CAP: Fix L2CAP_ECRED_CONN_RSP response
rxrpc: rxperf: Fix missing decoding of terminal magic cookie
afs: Fix the server_list to unuse a displaced server rather than putting it
afs: Give an afs_server object a ref on the afs_cell object it points to
net: loopback: Avoid sending IP packets without an Ethernet header
net: set the minimum for net_hotdata.netdev_budget_usecs
ipv4: Convert icmp_route_lookup() to dscp_t.
ipv4: Convert ip_route_input() to dscp_t.
ipvlan: Prepare ipvlan_process_v4_outbound() to future .flowi4_tos conversion.
ipvlan: ensure network headers are in skb linear part
net: cadence: macb: Synchronize stats calculations
net: dsa: rtl8366rb: Fix compilation problem
ASoC: es8328: fix route from DAC to output
ASoC: fsl: Rename stream name of SAI DAI driver
ipvs: Always clear ipvs_property flag in skb_scrub_packet()
drm/xe/oa: Signal output fences
drm/xe/oa: Move functions up so they can be reused for config ioctl
drm/xe/oa: Add syncs support to OA config ioctl
drm/xe/oa: Allow only certain property changes from config
drm/xe/oa: Allow oa_exponent value of 0
firmware: cs_dsp: Remove async regmap writes
ASoC: cs35l56: Prevent races when soft-resetting using SPI control
ALSA: hda/realtek: Fix wrong mic setup for ASUS VivoBook 15
net: ethernet: ti: am65-cpsw: select PAGE_POOL
tcp: devmem: don't write truncated dmabuf CMSGs to userspace
ice: add E830 HW VF mailbox message limit support
ice: Fix deinitializing VF in error path
ice: Avoid setting default Rx VSI twice in switchdev setup
tcp: Defer ts_recent changes until req is owned
net: Clear old fragment checksum value in napi_reuse_skb
net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination.
net/mlx5: IRQ, Fix null string in debug print
net: ipv6: fix dst ref loop on input in seg6 lwt
net: ipv6: fix dst ref loop on input in rpl lwt
selftests: drv-net: Check if combined-count exists
idpf: fix checksums set in idpf_rx_rsc()
net: ti: icss-iep: Reject perout generation request
thermal: gov_power_allocator: Fix incorrect calculation in divvy_up_power()
perf/core: Order the PMU list to fix warning about unordered pmu_ctx_list
uprobes: Reject the shared zeropage in uprobe_write_opcode()
thermal: of: Simplify thermal_of_should_bind with scoped for each OF child
thermal/of: Fix cdev lookup in thermal_of_should_bind()
thermal: core: Move lists of thermal instances to trip descriptors
thermal: gov_power_allocator: Update total_weight on bind and cdev updates
io_uring/net: save msg_control for compat
unreachable: Unify
objtool: Remove annotate_{,un}reachable()
objtool: Fix C jump table annotations for Clang
x86/CPU: Fix warm boot hang regression on AMD SC1100 SoC systems
phy: rockchip: fix Kconfig dependency more
phy: rockchip: naneng-combphy: compatible reset with old DT
riscv: KVM: Fix hart suspend status check
riscv: KVM: Fix hart suspend_type use
riscv: KVM: Fix SBI IPI error generation
riscv: KVM: Fix SBI TIME error generation
tracing: Fix bad hist from corrupting named_triggers list
ftrace: Avoid potential division by zero in function_stat_show()
ALSA: usb-audio: Re-add sample rate quirk for Pioneer DJM-900NXS2
ALSA: hda/realtek: Fix microphone regression on ASUS N705UD
KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
perf/core: Add RCU read lock protection to perf_iterate_ctx()
perf/x86: Fix low freqency setting issue
perf/core: Fix low freq setting via IOC_PERIOD
drm/xe/regs: remove a duplicate definition for RING_CTL_SIZE(size)
drm/xe/userptr: restore invalidation list on error
drm/xe/userptr: fix EFAULT handling
drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd
drm/amdgpu: disable BAR resize on Dell G5 SE
drm/amdgpu: init return value in amdgpu_ttm_clear_buffer
drm/amd/display: Disable PSR-SU on eDP panels
drm/amd/display: add a quirk to enable eDP0 on DP1
drm/amd/display: Fix HPD after gpu reset
arm64/mm: Fix Boot panic on Ampere Altra
block: Remove zone write plugs when handling native zone append writes
i2c: npcm: disable interrupt enable bit before devm_request_irq
i2c: ls2x: Fix frequency division register access
usbnet: gl620a: fix endpoint checking in genelink_bind()
net: stmmac: dwmac-loongson: Add fix_soc_reset() callback
net: phy: qcom: qca807x fix condition for DAC_DSP_BIAS_CURRENT
net: enetc: fix the off-by-one issue in enetc_map_tx_buffs()
net: enetc: keep track of correct Tx BD count in enetc_map_tx_tso_buffs()
net: enetc: VFs do not support HWTSTAMP_TX_ONESTEP_SYNC
net: enetc: update UDP checksum when updating originTimestamp field
net: enetc: correct the xdp_tx statistics
net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs()
phy: tegra: xusb: reset VBUS & ID OVERRIDE
phy: exynos5-usbdrd: fix MPLL_MULTIPLIER and SSC_REFCLKSEL masks in refclk
phy: exynos5-usbdrd: gs101: ensure power is gated to SS phy in phy_exit()
iommu/vt-d: Remove device comparison in context_setup_pass_through_cb
iommu/vt-d: Fix suspicious RCU usage
intel_idle: Handle older CPUs, which stop the TSC in deeper C states, correctly
mptcp: always handle address removal under msk socket lock
mptcp: reset when MPTCP opts are dropped after join
selftests/landlock: Test that MPTCP actions are not restricted
vmlinux.lds: Ensure that const vars with relocations are mapped R/O
rcuref: Plug slowpath race in rcuref_put()
sched/core: Prevent rescheduling when interrupts are disabled
sched_ext: Fix pick_task_scx() picking non-queued tasks when it's called without balance()
selftests/landlock: Test TCP accesses with protocol=IPPROTO_TCP
dm-integrity: Avoid divide by zero in table status in Inline mode
dm vdo: add missing spin_lock_init
ima: Reset IMA_NONACTION_RULE_FLAGS after post_setattr
scsi: ufs: core: bsg: Fix crash when arpmb command fails
rseq/selftests: Fix riscv rseq_offset_deref_addv inline asm
riscv/futex: sign extend compare value in atomic cmpxchg
riscv: signal: fix signal frame size
riscv: cacheinfo: Use of_property_present() for non-boolean properties
riscv: signal: fix signal_minsigstksz
riscv: cpufeature: use bitmap_equal() instead of memcmp()
efi: Don't map the entire mokvar table to determine its size
amdgpu/pm/legacy: fix suspend/resume issues
x86/microcode/AMD: Return bool from find_blobs_in_containers()
x86/microcode/AMD: Have __apply_microcode_amd() return bool
x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signature
x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarations
x86/microcode/AMD: Merge early_apply_microcode() into its single callsite
x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declaration
x86/microcode/AMD: Add get_patch_level()
x86/microcode/AMD: Load only SHA256-checksummed patches
thermal: gov_power_allocator: Add missing NULL pointer check
Linux 6.12.18
Change-Id: Id06a9c751e3315bfd1a6e642b2c0f276edb46319
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit a6aa36e957a1bfb5341986dec32d013d23228fe1 upstream.
For devices that natively support zone append operations,
REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
and are immediately issued to the zoned device. This means that there is
no write pointer offset tracking done for these operations and that a
zone write plug is not necessary.
However, when receiving a zone append BIO, we may already have a zone
write plug for the target zone if that zone was previously partially
written using regular write operations. In such case, since the write
pointer offset of the zone write plug is not incremented by the amount
of sectors appended to the zone, 2 issues arise:
1) we risk leaving the plug in the disk hash table if the zone is fully
written using zone append or regular write operations, because the
write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations
will always be failed by blk_zone_wplug_prepare_bio() as the write
pointer alignment check will fail, even if the user correctly
accounted for the zone append operations and issued the regular
writes with a correct sector.
Avoid these issues by immediately removing the zone write plug of zones
that are the target of zone append operations when blk_zone_plug_bio()
is called. The new function blk_zone_wplug_handle_native_zone_append()
implements this for devices that natively support zone append. The
removal of the zone write plug using disk_remove_zone_wplug() requires
aborting all plugged regular writes using disk_zone_wplug_abort() as
otherwise the plugged write BIOs would never be executed (with the plug
removed, the completion path would never see the zone write plug again
as disk_get_zone_wplug() will return NULL). Rate-limited warnings are
added
to blk_zone_wplug_handle_native_zone_append() and to
disk_zone_wplug_abort() to signal this.
Since blk_zone_wplug_handle_native_zone_append() is called in the hot
path for operations that will not be plugged, disk_get_zone_wplug() is
optimized under the assumption that a user issuing zone append
operations is not at the same time issuing regular writes and that there
are no hashed zone write plugs. The struct gendisk atomic counter
nr_zone_wplugs is added to check this, with this counter incremented in
disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().
To be consistent with this fix, we do not need to fill the zone write
plug hash table with zone write plugs for zones that are partially
written for a device that supports native zone append operations.
So modify blk_revalidate_seq_zone() to return early to avoid allocating
and inserting a zone write plug for partially written sequential zones
if the device natively supports zone append.
Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Allow block drivers to configure the following:
* Maximum number of hardware sectors values smaller than
PAGE_SIZE >> SECTOR_SHIFT. For PAGE_SIZE = 4096 this means that values
below 8 become supported.
* A maximum segment size below the page size. This is most useful
for page sizes above 4096 bytes.
The blk_sub_page_segments static branch will be used in later patches to
prevent that performance of block drivers that support segments >=
PAGE_SIZE and max_hw_sectors >= PAGE_SIZE >> SECTOR_SHIFT would be affected.
This patch may change the behavior of existing block drivers from not
working into working. An attempt to configure a limit below what the
block layer supports causes the block layer to select a larger value.
If that value is not supported by the block driver, this may cause data
other than what was requested to be transferred, a kernel crash, or
other undesirable behavior.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Bug: 346870006
Link: https://lore.kernel.org/all/20230612203314.17820-4-bvanassche@acm.org/
[dhavale: the current patch is based on the FROMLIST patch sent to the
kernel mailing list. As the queue config functions were removed, the
logic has been adapted in the analogous function blk_validate_limits().
Block maintainers have rejected all our previous attempts to land
patches which support sub-page segment sizes, but we have decided that
these patches are necessary to make 16KB page size kernels work with
hardware which supports a maximum 4KB segment size.
]
Change-Id: I3faa20be1e83d1501d0f25f549b40301443d0df4
Add ANDROID_OEM_DATA(1) in struct request_queue to support more
request queue's status for extend copy feature.
Bug: 283021230
Change-Id: Ic946fd08dcebed708f03749557d9289ddb3696b8
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Weichao Guo <guoweichao@oppo.corp-partner.google.com>
(cherry picked from commit d7b3d8d1e527dc41fe8faeb68cef879290db379c)
(cherry picked from commit b169eba61f7301995db4e5753b4bd9806c0afab5)
commit fe0418eb9bd69a19a948b297c8de815e05f3cde1 upstream.
Zone write plugging for handling writes to zones of a zoned block
device always execute a zone report whenever a write BIO to a zone
fails. The intent of this is to ensure that the tracking of a zone write
pointer is always correct to ensure that the alignment to a zone write
pointer of write BIOs can be checked on submission and that we can
always correctly emulate zone append operations using regular write
BIOs.
However, this error recovery scheme introduces a potential deadlock if a
device queue freeze is initiated while BIOs are still plugged in a zone
write plug and one of these write operation fails. In such case, the
disk zone write plug error recovery work is scheduled and executes a
report zone. This in turn can result in a request allocation in the
underlying driver to issue the report zones command to the device. But
with the device queue freeze already started, this allocation will
block, preventing the report zone execution and the continuation of the
processing of the plugged BIOs. As plugged BIOs hold a queue usage
reference, the queue freeze itself will never complete, resulting in a
deadlock.
Avoid this problem by completely removing from the zone write plugging
code the use of report zones operations after a failed write operation,
instead relying on the device user to either execute a report zones,
reset the zone, finish the zone, or give up writing to the device (which
is a fairly common pattern for file systems which degrade to read-only
after write failures). This is not an unreasonnable requirement as all
well-behaved applications, FSes and device mapper already use report
zones to recover from write errors whenever possible by comparing the
current position of a zone write pointer with what their assumption
about the position is.
The changes to remove the automatic error recovery are as follows:
- Completely remove the error recovery work and its associated
resources (zone write plug list head, disk error list, and disk
zone_wplugs_work work struct). This also removes the functions
disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
- Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write
plug whenever a write operation targeting the zone of the zone write
plug fails. This flag indicates that the zone write pointer offset is
not reliable and that it must be updated when the next report zone,
reset zone, finish zone or disk revalidation is executed.
- Modify blk_zone_write_plug_bio_endio() to set the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
write BIO.
- Modify the function disk_zone_wplug_set_wp_offset() to clear this
new flag, thus implementing recovery of a correct write pointer
offset with the reset (all) zone and finish zone operations.
- Modify blkdev_report_zones() to always use the disk_report_zones_cb()
callback so that disk_zone_wplug_sync_wp_offset() can be called for
any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
This implements recovery of a correct write pointer offset for zone
write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within
the range of the report zones operation executed by the user.
- Modify blk_revalidate_seq_zone() to call
disk_zone_wplug_sync_wp_offset() for all sequential write required
zones when a zoned block device is revalidated, thus always resolving
any inconsistency between the write pointer offset of zone write
plugs and the actual write pointer position of sequential zones.
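The new flag's lifecycle summarized in the list above can be sketched as follows (flag value and function names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define WPLUG_NEED_WP_UPDATE 0x1	/* stand-in for the kernel flag */

struct zone_wplug {
	unsigned int flags;
	unsigned long long wp_offset;	/* tracked write pointer offset */
};

/* A failed write marks the plug's write pointer offset as stale. */
static void write_bio_endio(struct zone_wplug *zwplug, bool failed)
{
	if (failed)
		zwplug->flags |= WPLUG_NEED_WP_UPDATE;
}

/* The next report zones, zone reset/finish, or disk revalidation
 * resynchronizes the offset and clears the flag. */
static void sync_wp_offset(struct zone_wplug *zwplug,
			   unsigned long long reported_wp)
{
	if (zwplug->flags & WPLUG_NEED_WP_UPDATE) {
		zwplug->wp_offset = reported_wp;
		zwplug->flags &= ~WPLUG_NEED_WP_UPDATE;
	}
}
```

The key property is that recovery is driven by operations the device user issues anyway, so no work item ever allocates a request behind a queue freeze.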
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b76b840fd93374240b59825f1ab8e2f5c9907acb upstream.
The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming another zone, to write the valid data blocks from the
zone being reclaimed at the same position relative to the zone start in
the reclaim target zone.
The first call to blkdev_issue_zeroout() will try to use hardware
offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
non-zero max_write_zeroes_sectors queue limit. If this operation fails
because of the lack of hardware support, blkdev_issue_zeroout() falls
back to using a regular write operation with the zero-page as buffer.
Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
the block layer zone write plugging code which will execute a report
zones operation to ensure that the write pointer of the target zone of
the failed operation has not changed and to "rewind" the zone write
pointer offset of the target zone as it was advanced when the write zero
operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
cause any issue and blkdev_issue_zeroout() works as expected.
However, since the automatic recovery of zone write pointers by the zone
write plugging code can potentially cause deadlocks with queue freeze
operations, a different recovery must be implemented in preparation for
the removal of zone write plugging report zones based recovery.
Do this by introducing the new function blk_zone_issue_zeroout(). This
function first calls blkdev_issue_zeroout() with the flag
BLKDEV_ZERO_NOFALLBACK to intercept failures of the first execution,
which attempts to use the device hardware offload with the
REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zones
operation is issued to restore the zone write pointer offset of the
target zone to the correct position and blkdev_issue_zeroout() is called
again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
operation performing this recovery is implemented using the helper
function disk_zone_sync_wp_offset() which calls the gendisk report_zones
file operation with the callback disk_report_zones_cb(). This callback
updates the write pointer offset of the target zone using the new
function disk_zone_wplug_sync_wp_offset().
dmz_reclaim_align_wp() is modified to change its call to
blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout(), with no
other change needed as the two functions are functionally equivalent.
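The two-step flow above can be sketched as a small userspace model. The stubs and function names below are illustrative assumptions, not the kernel implementation; only the control flow (try the hardware offload first, resync the write pointer on failure, then retry with the fallback allowed) mirrors the description:

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace sketch of the two-step zeroout flow; names and stubs are
 * illustrative, not the kernel implementation. */
typedef int (*zeroout_fn)(bool nofallback);

static int wp_sync_calls;

/* Stands in for disk_zone_sync_wp_offset(): a report zones pass that
 * restores the zone write plug's write pointer offset. */
static void sync_wp_offset_stub(void)
{
	wp_sync_calls++;
}

/* Device without REQ_OP_WRITE_ZEROES support: the NOFALLBACK attempt
 * fails, the zero-page write fallback succeeds. */
static int no_offload_dev(bool nofallback)
{
	return nofallback ? -EOPNOTSUPP : 0;
}

static int model_blk_zone_issue_zeroout(zeroout_fn zeroout)
{
	/* First attempt: hardware offload only (BLKDEV_ZERO_NOFALLBACK). */
	int ret = zeroout(true);

	if (ret != -EOPNOTSUPP)
		return ret;
	/* Offload failed: restore the zone write pointer offset, then
	 * retry allowing the regular fallback. */
	sync_wp_offset_stub();
	return zeroout(false);
}
```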
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d7cb6d7414ea1b33536fa6d11805cb8dceec1f97 ]
Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.
disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.
disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6a78699838a0ddeed3620ddf50c1521f1fe1e811 upstream.
commit f1be1788a32e ("block: model freeze & enter queue as lock for
supporting lockdep") tries to apply lockdep for verifying freeze &
unfreeze. However, the verification is only done for the outermost freeze
and unfreeze. This is actually not correct because q->mq_freeze_depth
may still drop to zero on another task instead of the freeze owner task.
Fix this issue by always verifying the last unfreeze lock on the owner
task context, and make sure both the outermost freeze & unfreeze are
verified in the current task.
Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f1be1788a32e8fa63416ad4518bbd1a85a825c9d ]
Recently we got several deadlock report[1][2][3] caused by
blk_mq_freeze_queue and blk_enter_queue().
Turns out the two are just like acquiring read/write lock, so model them
as read/write lock for supporting lockdep:
1) model q->q_usage_counter as two locks (io and queue lock)
- queue lock covers sync with blk_enter_queue()
- io lock covers sync with bio_enter_queue()
2) make the lockdep class/key as per-queue:
- different subsystem has very different lock use pattern, shared lock
class causes false positive easily
- freeze_queue degrades to no lock in case that disk state becomes DEAD
because bio_enter_queue() won't be blocked any more
- freeze_queue degrades to no lock in case that request queue becomes dying
because blk_enter_queue() won't be blocked any more
3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is exclusive lock, so dependency with blk_enter_queue() is covered
- it is trylock because blk_mq_freeze_queue() are allowed to run
concurrently
4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() are allowed
- dependency with blk_mq_freeze_queue() is covered
- blk_queue_exit() is often called from other contexts (such as irq), and
it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(), this way still covered cases as many as possible
With lockdep support, such kind of reports may be reported asap and
needn't wait until the real deadlock is triggered.
For example, lockdep report can be triggered in the report[3] with this
patch applied.
[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166
[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
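The read/write-lock model described in points 1) to 4) can be sketched as a toy counter-based model: blk_enter_queue() behaves like a shared (read) lock that nests, and blk_mq_freeze_queue() like an exclusive trylock that must wait while any reader is inside. This illustrates the lockdep model only; all names are illustrative:

```c
#include <stdbool.h>

/* Toy model of the q_usage_counter locking rules: shared, nestable
 * "enter" vs. exclusive "freeze". Not the kernel code. */
static int model_readers;
static bool model_frozen;

static bool model_enter_queue(void)
{
	if (model_frozen)
		return false;	/* would block until unfreeze */
	model_readers++;	/* nested enters are allowed */
	return true;
}

static void model_exit_queue(void)
{
	model_readers--;
}

static bool model_try_freeze_queue(void)
{
	if (model_readers)
		return false;	/* exclusive: must wait for all readers */
	model_frozen = true;
	return true;
}
```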
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 3802f73bd807 ("block: fix uaf for flush rq while iterating tags")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Merge in 6.11 final to get the fix for preventing deadlocks on an
elevator switch, as there's a fixup for that patch.
* tag 'v6.11': (1788 commits)
Linux 6.11
Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
pinctrl: pinctrl-cy8c95x0: Fix regcache
cifs: Fix signature miscalculation
mm: avoid leaving partial pfn mappings around in error case
drm/xe/client: add missing bo locking in show_meminfo()
drm/xe/client: fix deadlock in show_meminfo()
drm/xe/oa: Enable Xe2+ PES disaggregation
drm/xe/display: fix compat IS_DISPLAY_STEP() range end
drm/xe: Fix access_ok check in user_fence_create
drm/xe: Fix possible UAF in guc_exec_queue_process_msg
drm/xe: Remove fence check from send_tlb_invalidation
drm/xe/gt: Remove double include
net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init()
PCI: Fix potential deadlock in pcim_intx()
workqueue: Clear worker->pool in the worker thread context
net: tighten bad gso csum offset check in virtio_net_hdr
netlink: specs: mptcp: fix port endianness
net: dpaa: Pad packets to ETH_ZLEN
mptcp: pm: Fix uaf in __timer_delete_sync
...
Pull more block updates from Jens Axboe:
- MD fixes via Song:
- md-cluster fixes (Heming Zhao)
- raid1 fix (Mateusz Jończyk)
- s390/dasd module description (Jeff)
- Series cleaning up and hardening the blk-mq debugfs flag handling
(John, Christoph)
- blk-cgroup cleanup (Xiu)
- Error polled IO attempts if backend doesn't support it (hexue)
- Fix for an sbitmap hang (Yang)
* tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits)
blk-cgroup: move congestion_count to struct blkcg
sbitmap: fix io hung due to race on sbitmap_word::cleared
block: avoid polling configuration errors
block: Catch possible entries missing from rqf_name[]
block: Simplify definition of RQF_NAME()
block: Use enum to define RQF_x bit indexes
block: Catch possible entries missing from cmd_flag_name[]
block: Catch possible entries missing from alloc_policy_name[]
block: Catch possible entries missing from hctx_flag_name[]
block: Catch possible entries missing from hctx_state_name[]
block: Catch possible entries missing from blk_queue_flag_name[]
block: Make QUEUE_FLAG_x as an enum
block: Relocate BLK_MQ_MAX_DEPTH
block: Relocate BLK_MQ_CPU_WORK_BATCH
block: remove QUEUE_FLAG_STOPPED
block: Add missing entry to hctx_flag_name[]
block: Add zone write plugging entry to rqf_name[]
block: Add missing entries from cmd_flag_name[]
s390/dasd: fix error checks in dasd_copy_pair_store()
s390/dasd: add missing MODULE_DESCRIPTION() macros
...
Pull block updates from Jens Axboe:
- NVMe updates via Keith:
- Device initialization memory leak fixes (Keith)
- More constants defined (Weiwen)
- Target debugfs support (Hannes)
- PCIe subsystem reset enhancements (Keith)
- Queue-depth multipath policy (Redhat and PureStorage)
- Implement get_unique_id (Christoph)
- Authentication error fixes (Gaosheng)
- MD updates via Song
- sync_action fix and refactoring (Yu Kuai)
- Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu
Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li)
- Fix loop detach/open race (Gulam)
- Fix lower control limit for blk-throttle (Yu)
- Add module descriptions to various drivers (Jeff)
- Add support for atomic writes for block devices, and statx reporting
for same. Includes SCSI and NVMe (John, Prasad, Alan)
- Add IO priority information to block trace points (Dongliang)
- Various zone improvements and tweaks (Damien)
- mq-deadline tag reservation improvements (Bart)
- Ignore direct reclaim swap writes in writeback throttling (Baokun)
- Block integrity improvements and fixes (Anuj)
- Add basic support for rust based block drivers. Has a dummy null_blk
variant for now (Andreas)
- Series converting driver settings to queue limits, and cleanups and
fixes related to that (Christoph)
- Cleanup for poking too deeply into the bvec internals, in preparation
for DMA mapping API changes (Christoph)
- Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas,
Ming, Zhu, Damien, Christophe, Chaitanya)
* tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits)
floppy: add missing MODULE_DESCRIPTION() macro
loop: add missing MODULE_DESCRIPTION() macro
ublk_drv: add missing MODULE_DESCRIPTION() macro
xen/blkback: add missing MODULE_DESCRIPTION() macro
block/rnbd: Constify struct kobj_type
block: take offset into account in blk_bvec_map_sg again
block: fix get_max_segment_size() warning
loop: Don't bother validating blocksize
virtio_blk: Don't bother validating blocksize
null_blk: Don't bother validating blocksize
block: Validate logical block size in blk_validate_limits()
virtio_blk: Fix default logical block size fallback
nvmet-auth: fix nvmet_auth hash error handling
nvme: implement ->get_unique_id
block: pass a phys_addr_t to get_max_segment_size
block: add a bvec_phys helper
blk-lib: check for kill signal in ioctl BLKZEROOUT
block: limit the Write Zeroes to manually writing zeroes fallback
block: refactor blkdev_issue_zeroout
block: move read-only and supported checks into (__)blkdev_issue_zeroout
...
Some drivers validate their own logical block size. There is no harm in
always doing this, so validate it in blk_validate_limits().
This allows us to remove the validation in most of those drivers.
Add a comment to blk_validate_block_size() to inform users that self-
validation of LBS is usually unnecessary.
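The centralized check can be sketched as follows: a logical block size must be a power-of-2 between 512 bytes and the page size, mirroring what blk_validate_block_size() enforces. PAGE_SIZE is assumed to be 4096 here for illustration:

```c
#include <stdbool.h>

/* Sketch of the logical block size validation centralized in
 * blk_validate_limits(); PAGE_SIZE is assumed 4096 in this model. */
#define MODEL_PAGE_SIZE 4096u

static bool model_valid_logical_block_size(unsigned int bsize)
{
	return bsize >= 512 && bsize <= MODEL_PAGE_SIZE &&
	       (bsize & (bsize - 1)) == 0;	/* power-of-2 check */
}
```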
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240708091651.177447-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zeroout can access a significant capacity and take longer than the user
expected. A user may change their mind about wanting to run that
command and attempt to kill the process and do something else with their
device. But since the task is uninterruptible, they have to wait for it
to finish, which could be many hours.
Add a new BLKDEV_ZERO_KILLABLE flag for blkdev_issue_zeroout that checks
for a fatal signal at each iteration so the user doesn't have to wait for
their regretted operation to complete naturally.
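The per-iteration check can be sketched as below; the signal predicate (fatal_signal_pending() in the kernel) and the chunk submission are stubbed for illustration, and the error code is an assumption:

```c
#include <errno.h>
#include <stdbool.h>

/* Sketch: bail out of the zeroout loop as soon as a fatal signal is
 * pending, instead of completing the whole range. */
static int model_zeroout(unsigned long nr_chunks,
			 bool (*fatal_signal_pending)(void))
{
	for (unsigned long i = 0; i < nr_chunks; i++) {
		if (fatal_signal_pending())
			return -EINTR;
		/* submit one chunk of zeroes here */
	}
	return 0;
}

static bool no_signal(void)   { return false; }
static bool signal_now(void)  { return true; }
```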
Heavily based on an earlier patch from Keith Busch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240701165219.1571322-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
dma_pad_mask is a queue_limits by all ways of looking at it, so move it
there and set it through the atomic queue limits APIs.
Add a little helper that takes the alignment and pad into account to
simplify the code that is touched a bit.
Note that there never was any need for the > check in
blk_queue_update_dma_pad; this probably was just copy and paste from
blk_queue_update_dma_alignment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240626142637.300624-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to conditionally define on CONFIG_BLK_DEV_ZONED the
inline helper functions bdev_nr_zones(), bdev_max_open_zones(),
bdev_max_active_zones() and disk_zone_no(), as these functions will return
the correct value in all cases (zoned device or not, including when
CONFIG_BLK_DEV_ZONED is not set). Furthermore, the disk_nr_zones()
definition can be simplified as disk->nr_zones is always 0 for regular
block devices.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240621031506.759397-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need for bdev_nr_zones() to be an exported function
calculating the number of zones of a block device. Instead, given that
all callers use this helper with a fully initialized block device that
has a gendisk, we can redefine this function as an inline helper in
blkdev.h.
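What such an inline helper computes from a fully initialized disk can be modeled in userspace as below; the struct fields and helper name are illustrative assumptions, not the blkdev.h definition:

```c
/* Userspace model of a bdev_nr_zones()-style helper: the number of
 * zones of a device, rounding up so a partial last zone counts. */
struct model_disk {
	unsigned long long nr_sectors;
	unsigned int zone_sectors;	/* 0 for a regular device */
};

static unsigned int model_bdev_nr_zones(const struct model_disk *d)
{
	if (!d->zone_sectors)
		return 0;	/* not a zoned device */
	return (d->nr_sectors + d->zone_sectors - 1) / d->zone_sectors;
}
```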
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240621031506.759397-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary; an atomic write that straddles this
boundary is no longer executed atomically by the disk. The value must be
a power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic write limits are set to 0 by default to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
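These two merge conditions can be sketched as a pure byte-offset check; the parameter names model the queue limits (max bytes and a power-of-2 boundary) rather than kernel structs:

```c
#include <stdbool.h>

/* Sketch of the atomic-write merge rules: the merged request must fit
 * in atomic_write_max_bytes and must not straddle the hardware
 * boundary (when one is set). Offsets and lengths are in bytes. */
static bool model_atomic_merge_ok(unsigned long long start,
				  unsigned long long len,
				  unsigned long long max_bytes,
				  unsigned long long boundary)
{
	if (len == 0 || len > max_bytes)
		return false;
	/* first and last byte must fall in the same boundary window */
	if (boundary &&
	    (start / boundary) != ((start + len - 1) / boundary))
		return false;
	return true;
}
```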
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in queue limits cleanups.
* for-6.11/block-limits:
block: move the raid_partial_stripes_expensive flag into the features field
block: remove the discard_alignment flag
block: move the misaligned flag into the features field
block: renumber and rename the cache disabled flag
block: fix spelling and grammar for in writeback_cache_control.rst
block: remove the unused blk_bounce enum