diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst index dd7b1c0c21da..f4d2662871ab 100644 --- a/Documentation/PCI/pci.rst +++ b/Documentation/PCI/pci.rst @@ -52,7 +52,7 @@ driver generally needs to perform the following initialization: - Enable DMA/processing engines When done using the device, and perhaps the module needs to be unloaded, -the driver needs to take the follow steps: +the driver needs to take the following steps: - Disable the device from generating IRQs - Release the IRQ (free_irq()) diff --git a/Documentation/accel/qaic/qaic.rst b/Documentation/accel/qaic/qaic.rst index efb7771273bb..628bf2f7a416 100644 --- a/Documentation/accel/qaic/qaic.rst +++ b/Documentation/accel/qaic/qaic.rst @@ -93,7 +93,7 @@ commands (does not impact QAIC). uAPI ==== -QAIC creates an accel device per phsyical PCIe device. This accel device exists +QAIC creates an accel device per physical PCIe device. This accel device exists for as long as the PCIe device is known to Linux. The PCIe device may not be in the state to accept requests from userspace at diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst index 325c5d0ed34a..585630d14581 100644 --- a/Documentation/admin-guide/bug-bisect.rst +++ b/Documentation/admin-guide/bug-bisect.rst @@ -1,76 +1,144 @@ -Bisecting a bug -+++++++++++++++ +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) +.. [see the bottom of this file for redistribution information] -Last updated: 28 October 2016 +====================== +Bisecting a regression +====================== -Introduction -============ +This document describes how to use a ``git bisect`` to find the source code +change that broke something -- for example when some functionality stopped +working after upgrading from Linux 6.0 to 6.1. -Always try the latest kernel from kernel.org and build from source. If you are -not confident in doing that please report the bug to your distribution vendor -instead of to a kernel developer. +The text focuses on the gist of the process. If you are new to bisecting the +kernel, better follow Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +instead: it depicts everything from start to finish while covering multiple +aspects even kernel developers occasionally forget. This includes detecting +situations early where a bisection would be a waste of time, as nobody would +care about the result -- for example, because the problem happens after the +kernel marked itself as 'tainted', occurs in an abandoned version, was already +fixed, or is caused by a .config change you or your Linux distributor performed. -Finding bugs is not always easy. Have a go though. If you can't find it don't -give up. Report as much as you have found to the relevant maintainer. See -MAINTAINERS for who that is for the subsystem you have worked on. +Finding the change causing a kernel issue using a bisection +=========================================================== -Before you submit a bug report read -'Documentation/admin-guide/reporting-issues.rst'. +*Note: the following process assumes you prepared everything for a bisection. +This includes having a Git clone with the appropriate sources, installing the +software required to build and install kernels, as well as a .config file stored +in a safe place (the following example assumes '~/prepared_kernel_.config') to +use as pristine base at each bisection step; ideally, you have also worked out +a fully reliable and straight-forward way to reproduce the regression, too.* -Devices not appearing -===================== +* Preparation: start the bisection and tell Git about the points in the history + you consider to be working and broken, which Git calls 'good' and 'bad':: -Often this is caused by udev/systemd. Check that first before blaming it -on the kernel. + git bisect start + git bisect good v6.0 + git bisect bad v6.1 -Finding patch that caused a bug -=============================== + Instead of Git tags like 'v6.0' and 'v6.1' you can specify commit-ids, too. -Using the provided tools with ``git`` makes finding bugs easy provided the bug -is reproducible. +1. Copy your prepared .config into the build directory and adjust it to the + needs of the codebase Git checked out for testing:: -Steps to do it: + cp ~/prepared_kernel_.config .config + make olddefconfig -- build the Kernel from its git source -- start bisect with [#f1]_:: +2. Now build, install, and boot a kernel. This might fail for unrelated reasons, + for example, when a compile error happens at the current stage of the + bisection a later change resolves. In such cases run ``git bisect skip`` and + go back to step 1. - $ git bisect start +3. Check if the functionality that regressed works in the kernel you just built. -- mark the broken changeset with:: + If it works, execute:: - $ git bisect bad [commit] + git bisect good -- mark a changeset where the code is known to work with:: + If it is broken, run:: - $ git bisect good [commit] + git bisect bad -- rebuild the Kernel and test -- interact with git bisect by using either:: + Note, getting this wrong just once will send the rest of the bisection + totally off course. To prevent having to start anew later you thus want to + ensure what you tell Git is correct; it is thus often wise to spend a few + minutes more on testing in case your reproducer is unreliable. - $ git bisect good + After issuing one of these two commands, Git will usually check out another + bisection point and print something like 'Bisecting: 675 revisions left to + test after this (roughly 10 steps)'. In that case go back to step 1. - or:: + If Git instead prints something like 'cafecaca0c0dacafecaca0c0dacafecaca0c0da + is the first bad commit', then you have finished the bisection. In that case + move to the next point below. Note, right after displaying that line Git will + show some details about the culprit including its patch description; this can + easily fill your terminal, so you might need to scroll up to see the message + mentioning the culprit's commit-id. - $ git bisect bad + In case you missed Git's output, you can always run ``git bisect log`` to + print the status: it will show how many steps remain or mention the result of + the bisection. - depending if the bug happened on the changeset you're testing -- After some interactions, git bisect will give you the changeset that - likely caused the bug. +* Recommended complementary task: put the bisection log and the current .config + file aside for the bug report; furthermore tell Git to reset the sources to + the state before the bisection:: -- For example, if you know that the current version is bad, and version - 4.8 is good, you could do:: + git bisect log > ~/bisection-log + cp .config ~/bisection-config-culprit + git bisect reset - $ git bisect start - $ git bisect bad # Current version is bad - $ git bisect good v4.8 +* Recommended optional task: try reverting the culprit on top of the latest + codebase and check if that fixes your bug; if that is the case, it validates + the bisection and enables developers to resolve the regression through a + revert. + + To try this, update your clone and check out latest mainline. Then tell Git + to revert the change by specifying its commit-id:: + + git revert --no-edit cafec0cacaca0 + + Git might reject this, for example when the bisection landed on a merge + commit. In that case, abandon the attempt. Do the same, if Git fails to revert + the culprit on its own because later changes depend on it -- at least unless + you bisected a stable or longterm kernel series, in which case you want to + check out its latest codebase and try a revert there. + + If a revert succeeds, build and test another kernel to check if reverting + resolved your regression. + +With that the process is complete. Now report the regression as described by +Documentation/admin-guide/reporting-issues.rst. -.. [#f1] You can, optionally, provide both good and bad arguments at git - start with ``git bisect start [BAD] [GOOD]`` +Additional reading material +--------------------------- -For further references, please read: +* The `man page for 'git bisect' `_ and + `fighting regressions with 'git bisect' `_ + in the Git documentation. +* `Working with git bisect `_ + from kernel developer Nathan Chancellor. +* `Using Git bisect to figure out when brokenness was introduced `_. +* `Fully automated bisecting with 'git bisect run' `_. -- The man page for ``git-bisect`` -- `Fighting regressions with git bisect `_ -- `Fully automated bisecting with "git bisect run" `_ -- `Using Git bisect to figure out when brokenness was introduced `_ +.. + end-of-content +.. + This document is maintained by Thorsten Leemhuis . If + you spot a typo or small mistake, feel free to let him know directly and + he'll fix it. You are free to do the same in a mostly informal way if you + want to contribute changes to the text -- but for copyright reasons please CC + linux-doc@vger.kernel.org and 'sign-off' your contribution as + Documentation/process/submitting-patches.rst explains in the section 'Sign + your work - the Developer's Certificate of Origin'. +.. + This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top + of the file. If you want to distribute this text under CC-BY-4.0 only, + please use 'The Linux kernel development community' for author attribution + and link this as source: + https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/bug-bisect.rst + +.. + Note: Only the content of this RST file as found in the Linux kernel sources + is available under CC-BY-4.0, as versions of this text that were processed + (for example by the kernel's build system) might contain content taken from + files which use a more restrictive license. diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 95299b08c405..1d0f8ceb3075 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -244,14 +244,14 @@ Reporting the bug Once you find where the bug happened, by inspecting its location, you could either try to fix it yourself or report it upstream. -In order to report it upstream, you should identify the mailing list -used for the development of the affected code. This can be done by using -the ``get_maintainer.pl`` script. +In order to report it upstream, you should identify the bug tracker, if any, or +mailing list used for the development of the affected code. This can be done by +using the ``get_maintainer.pl`` script. For example, if you find a bug at the gspca's sonixj.c file, you can get its maintainers with:: - $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c + $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c Hans Verkuil (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Mauro Carvalho Chehab (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) Tejun Heo (commit_signer:1/1=100%) @@ -267,11 +267,12 @@ Please notice that it will point to: - The driver maintainer (Hans Verkuil); - The subsystem maintainer (Mauro Carvalho Chehab); - The driver and/or subsystem mailing list (linux-media@vger.kernel.org); -- the Linux Kernel mailing list (linux-kernel@vger.kernel.org). +- The Linux Kernel mailing list (linux-kernel@vger.kernel.org); +- The bug reporting URIs for the driver/subsystem (none in the above example). -Usually, the fastest way to have your bug fixed is to report it to mailing -list used for the development of the code (linux-media ML) copying the -driver maintainer (Hans). +If the listing contains bug reporting URIs at the end, please prefer them over +email. Otherwise, please report bugs to the mailing list used for the +development of the code (linux-media ML) copying the driver maintainer (Hans). If you are totally stumped as to whom to send the report, and ``get_maintainer.pl`` didn't provide you anything useful, send it to diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index 552c9155165d..48a48bd09372 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -162,13 +162,18 @@ iv_large_sectors Module parameters:: - max_read_size - max_write_size - Maximum size of read or write requests. When a request larger than this size + Maximum size of read requests. When a request larger than this size is received, dm-crypt will split the request. The splitting improves concurrency (the split requests could be encrypted in parallel by multiple - cores), but it also causes overhead. The user should tune these parameters to + cores), but it also causes overhead. The user should tune this parameters to + fit the actual workload. + + max_write_size + Maximum size of write requests. When a request larger than this size + is received, dm-crypt will split the request. The splitting improves + concurrency (the split requests could be encrypted in parallel by multiple + cores), but it also causes overhead. The user should tune this parameters to fit the actual workload. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 778df5efaaf3..fc987d761fff 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -517,6 +517,18 @@ Format: ,, See header of drivers/net/hamradio/baycom_ser_hdx.c. + bdev_allow_write_mounted= + Format: + Control the ability to open a mounted block device + for writing, i.e., allow / disallow writes that bypass + the FS. This was implemented as a means to prevent + fuzzers from crashing the kernel by overwriting the + metadata underneath a mounted FS without its awareness. + This also prevents destructive formatting of mounted + filesystems by naive storage tooling that don't use + O_EXCL. Default is Y and can be changed through the + Kconfig option CONFIG_BLK_DEV_WRITE_MOUNTED. + bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index f92551539e8a..700aa72eecb1 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -182,3 +182,5 @@ More detailed explanation for tainting produce extremely unusual kernel structure layouts (even performance pathological ones), which is important to know when debugging. Set at build time. + + 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. diff --git a/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst b/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst index 2945e0e33104..301aa30890ae 100644 --- a/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst +++ b/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst @@ -359,7 +359,7 @@ Driver updates for STM32 DMA-MDMA chaining support in foo driver descriptor you want a callback to be called at the end of the transfer (dmaengine_prep_slave_sg()) or the period (dmaengine_prep_dma_cyclic()). Depending on the direction, set the callback on the descriptor that finishes - the overal transfer: + the overall transfer: * DMA_DEV_TO_MEM: set the callback on the "MDMA" descriptor * DMA_MEM_TO_DEV: set the callback on the "DMA" descriptor @@ -371,7 +371,7 @@ Driver updates for STM32 DMA-MDMA chaining support in foo driver As STM32 MDMA channel transfer is triggered by STM32 DMA, you must issue STM32 MDMA channel before STM32 DMA channel. - If any, your callback will be called to warn you about the end of the overal + If any, your callback will be called to warn you about the end of the overall transfer or the period completion. Don't forget to terminate both channels. STM32 DMA channel is configured in diff --git a/Documentation/arch/arm64/cpu-hotplug.rst b/Documentation/arch/arm64/cpu-hotplug.rst index 76ba8d932c72..8fb438bf7781 100644 --- a/Documentation/arch/arm64/cpu-hotplug.rst +++ b/Documentation/arch/arm64/cpu-hotplug.rst @@ -26,7 +26,7 @@ There are no systems that support the physical addition (or removal) of CPUs while the system is running, and ACPI is not able to sufficiently describe them. -e.g. New CPUs come with new caches, but the platform's cache toplogy is +e.g. New CPUs come with new caches, but the platform's cache topology is described in a static table, the PPTT. How caches are shared between CPUs is not discoverable, and must be described by firmware. diff --git a/Documentation/arch/powerpc/ultravisor.rst b/Documentation/arch/powerpc/ultravisor.rst index ba6b1bf1cc44..6d0407b2f5a1 100644 --- a/Documentation/arch/powerpc/ultravisor.rst +++ b/Documentation/arch/powerpc/ultravisor.rst @@ -134,7 +134,7 @@ Hardware * PTCR and partition table entries (partition table is in secure memory). An attempt to write to PTCR will cause a Hypervisor - Emulation Assitance interrupt. + Emulation Assistance interrupt. * LDBAR (LD Base Address Register) and IMC (In-Memory Collection) non-architected registers. An attempt to write to them will cause a diff --git a/Documentation/arch/riscv/vector.rst b/Documentation/arch/riscv/vector.rst index 75dd88a62e1d..3987f5f76a9d 100644 --- a/Documentation/arch/riscv/vector.rst +++ b/Documentation/arch/riscv/vector.rst @@ -15,7 +15,7 @@ status for the use of Vector in userspace. The intended usage guideline for these interfaces is to give init systems a way to modify the availability of V for processes running under its domain. Calling these interfaces is not recommended in libraries routines because libraries should not override policies -configured from the parant process. Also, users must noted that these interfaces +configured from the parent process. Also, users must note that these interfaces are not portable to non-Linux, nor non-RISC-V environments, so it is discourage to use in a portable code. To get the availability of V in an ELF program, please read :c:macro:`COMPAT_HWCAP_ISA_V` bit of :c:macro:`ELF_HWCAP` in the diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst index c58c72362911..5a2e6c0ef04a 100644 --- a/Documentation/arch/x86/mds.rst +++ b/Documentation/arch/x86/mds.rst @@ -162,7 +162,7 @@ Mitigation points 3. It would take a large number of these precisely-timed NMIs to mount an actual attack. There's presumably not enough bandwidth. 4. The NMI in question occurs after a VERW, i.e. when user state is - restored and most interesting data is already scrubbed. Whats left + restored and most interesting data is already scrubbed. What's left is only the data that NMI touches, and that may or may not be of any interest. diff --git a/Documentation/arch/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst index 50960e09e1f6..d07e445dac5c 100644 --- a/Documentation/arch/x86/x86_64/fsgs.rst +++ b/Documentation/arch/x86/x86_64/fsgs.rst @@ -125,7 +125,7 @@ FSGSBASE instructions enablement FSGSBASE instructions compiler support ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE +GCC version 4.6.4 and newer provide intrinsics for the FSGSBASE instructions. Clang 5 supports them as well. =================== =========================== @@ -135,7 +135,7 @@ instructions. Clang 5 supports them as well. _writegsbase_u64() Write the GS base register =================== =========================== -To utilize these instrinsics must be included in the source +To utilize these intrinsics must be included in the source code and the compiler option -mfsgsbase has to be added. Compiler support for FS/GS based addressing diff --git a/Documentation/block/bfq-iosched.rst b/Documentation/block/bfq-iosched.rst index df3a8a47f58c..a0ff0eb11e7f 100644 --- a/Documentation/block/bfq-iosched.rst +++ b/Documentation/block/bfq-iosched.rst @@ -9,7 +9,7 @@ controllers), BFQ's main features are: - BFQ guarantees a high system and application responsiveness, and a low latency for time-sensitive applications, such as audio or video players; -- BFQ distributes bandwidth, and not just time, among processes or +- BFQ distributes bandwidth, not just time, among processes or groups (switching back to time distribution when needed to keep throughput high). @@ -111,7 +111,7 @@ Higher speed for code-development tasks If some additional workload happens to be executed in parallel, then BFQ executes the I/O-related components of typical code-development -tasks (compilation, checkout, merge, ...) much more quickly than CFQ, +tasks (compilation, checkout, merge, etc.) much more quickly than CFQ, NOOP or DEADLINE. High throughput @@ -127,9 +127,9 @@ Strong fairness, bandwidth and delay guarantees ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ BFQ distributes the device throughput, and not just the device time, -among I/O-bound applications in proportion their weights, with any +among I/O-bound applications in proportion to their weights, with any workload and regardless of the device parameters. From these bandwidth -guarantees, it is possible to compute tight per-I/O-request delay +guarantees, it is possible to compute a tight per-I/O-request delay guarantees by a simple formula. If not configured for strict service guarantees, BFQ switches to time-based resource sharing (only) for applications that would otherwise cause a throughput loss. @@ -199,7 +199,7 @@ plus a lot of code, are borrowed from CFQ. - On flash-based storage with internal queueing of commands (typically NCQ), device idling happens to be always detrimental - for throughput. So, with these devices, BFQ performs idling + to throughput. So, with these devices, BFQ performs idling only when strictly needed for service guarantees, i.e., for guaranteeing low latency or fairness. In these cases, overall throughput may be sub-optimal. No solution currently exists to @@ -212,7 +212,7 @@ plus a lot of code, are borrowed from CFQ. and to reduce their latency. The most important action taken to achieve this goal is to give to the queues associated with these applications more than their fair share of the device - throughput. For brevity, we call just "weight-raising" the whole + throughput. For brevity, we call it just "weight-raising" the whole sets of actions taken by BFQ to privilege these queues. In particular, BFQ provides a milder form of weight-raising for interactive applications, and a stronger form for soft real-time @@ -231,7 +231,7 @@ plus a lot of code, are borrowed from CFQ. responsive in detecting interleaved I/O (cooperating processes), that it enables BFQ to achieve a high throughput, by queue merging, even for queues for which CFQ needs a different - mechanism, preemption, to get a high throughput. As such EQM is a + mechanism, preemption, to get a high throughput. As such, EQM is a unified mechanism to achieve a high throughput with interleaved I/O. @@ -254,7 +254,7 @@ plus a lot of code, are borrowed from CFQ. - First, with any proportional-share scheduler, the maximum deviation with respect to an ideal service is proportional to the maximum budget (slice) assigned to queues. As a consequence, - BFQ can keep this deviation tight not only because of the + BFQ can keep this deviation tight, not only because of the accurate service of B-WF2Q+, but also because BFQ *does not* need to assign a larger budget to a queue to let the queue receive a higher fraction of the device throughput. @@ -327,7 +327,7 @@ applications. Unset this tunable if you need/want to control weights. slice_idle ---------- -This parameter specifies how long BFQ should idle for next I/O +This parameter specifies how long BFQ should idle for the next I/O request, when certain sync BFQ queues become empty. By default slice_idle is a non-zero value. Idling has a double purpose: boosting throughput and making sure that the desired throughput distribution is @@ -365,7 +365,7 @@ terms of I/O-request dispatches. To guarantee that the actual service order then corresponds to the dispatch order, the strict_guarantees tunable must be set too. -There is an important flipside for idling: apart from the above cases +There is an important flip side to idling: apart from the above cases where it is beneficial also for throughput, idling can severely impact throughput. One important case is random workload. Because of this issue, BFQ tends to avoid idling as much as possible, when it is not @@ -475,7 +475,7 @@ max_budget Maximum amount of service, measured in sectors, that can be provided to a BFQ queue once it is set in service (of course within the limits -of the above timeout). According to what said in the description of +of the above timeout). According to what was said in the description of the algorithm, larger values increase the throughput in proportion to the percentage of sequential I/O requests issued. The price of larger values is that they coarsen the granularity of short-term bandwidth diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst index 8b84eb4bdae7..0f19dd524323 100644 --- a/Documentation/core-api/memory-allocation.rst +++ b/Documentation/core-api/memory-allocation.rst @@ -45,8 +45,9 @@ here we briefly outline their recommended usage: * If the allocation is performed from an atomic context, e.g interrupt handler, use ``GFP_NOWAIT``. This flag prevents direct reclaim and IO or filesystem operations. Consequently, under memory pressure - ``GFP_NOWAIT`` allocation is likely to fail. Allocations which - have a reasonable fallback should be using ``GFP_NOWARN``. + ``GFP_NOWAIT`` allocation is likely to fail. Users of this flag need + to provide a suitable fallback to cope with such failures where + appropriate. * If you think that accessing memory reserves is justified and the kernel will be stressed unless allocation succeeds, you may use ``GFP_ATOMIC``. * Untrusted allocations triggered from userspace should be a subject diff --git a/Documentation/dev-tools/kcsan.rst b/Documentation/dev-tools/kcsan.rst index 02143f060b22..d81c42d1063e 100644 --- a/Documentation/dev-tools/kcsan.rst +++ b/Documentation/dev-tools/kcsan.rst @@ -361,7 +361,8 @@ Alternatives Considered ----------------------- An alternative data race detection approach for the kernel can be found in the -`Kernel Thread Sanitizer (KTSAN) `_. +`Kernel Thread Sanitizer (KTSAN) +`_. KTSAN is a happens-before data race detector, which explicitly establishes the happens-before order between memory operations, which can then be used to determine data races as defined in `Data Races`_. diff --git a/Documentation/doc-guide/checktransupdate.rst b/Documentation/doc-guide/checktransupdate.rst new file mode 100644 index 000000000000..dfaf9d373747 --- /dev/null +++ b/Documentation/doc-guide/checktransupdate.rst @@ -0,0 +1,54 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Checking for needed translation updates +======================================= + +This script helps track the translation status of the documentation in +different locales, i.e., whether the documentation is up-to-date with +the English counterpart. + +How it works +------------ + +It uses ``git log`` command to track the latest English commit from the +translation commit (order by author date) and the latest English commits +from HEAD. If any differences occur, the file is considered as out-of-date, +then commits that need to be updated will be collected and reported. + +Features implemented + +- check all files in a certain locale +- check a single file or a set of files +- provide options to change output format +- track the translation status of files that have no translation + +Usage +----- + +:: + + ./scripts/checktransupdate.py --help + +Please refer to the output of argument parser for usage details. + +Samples + +- ``./scripts/checktransupdate.py -l zh_CN`` + This will print all the files that need to be updated in the zh_CN locale. +- ``./scripts/checktransupdate.py Documentation/translations/zh_CN/dev-tools/testing-overview.rst`` + This will only print the status of the specified file. + +Then the output is something like: + +:: + + Documentation/dev-tools/kfence.rst + No translation in the locale of zh_CN + + Documentation/translations/zh_CN/dev-tools/testing-overview.rst + commit 42fb9cfd5b18 ("Documentation: dev-tools: Add link to RV docs") + 1 commits needs resolving in total + +Features to be implemented + +- files can be a folder instead of only a file diff --git a/Documentation/doc-guide/index.rst b/Documentation/doc-guide/index.rst index 7c7d97784626..24d058faa75c 100644 --- a/Documentation/doc-guide/index.rst +++ b/Documentation/doc-guide/index.rst @@ -12,6 +12,7 @@ How to write kernel documentation parse-headers contributing maintainer-profile + checktransupdate .. only:: subproject and html diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 3c399f132e2d..94b3492dc301 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -262,7 +262,7 @@ vsyscall_32.lds wanxlfw.inc uImage unifdef -utf8data.h +utf8data.c wakeup.bin wakeup.elf wakeup.lds diff --git a/Documentation/driver-api/driver-model/devres.rst b/Documentation/driver-api/driver-model/devres.rst index ac9ee7441887..5f2ee8d717b1 100644 --- a/Documentation/driver-api/driver-model/devres.rst +++ b/Documentation/driver-api/driver-model/devres.rst @@ -391,7 +391,7 @@ PCI devm_pci_remap_cfgspace() : ioremap PCI configuration space devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource - pcim_enable_device() : after success, all PCI ops become managed + pcim_enable_device() : after success, some PCI ops become managed pcim_iomap() : do iomap() on a single BAR pcim_iomap_regions() : do request_region() and iomap() on multiple BARs pcim_iomap_regions_request_all() : do request_region() on all and iomap() on multiple BARs diff --git a/Documentation/driver-api/iio/buffers.rst b/Documentation/driver-api/iio/buffers.rst index e83026aebe97..63f364e862d1 100644 --- a/Documentation/driver-api/iio/buffers.rst +++ b/Documentation/driver-api/iio/buffers.rst @@ -15,8 +15,8 @@ trigger source. Multiple data channels can be read at once from IIO buffer sysfs interface ========================== An IIO buffer has an associated attributes directory under -:file:`/sys/bus/iio/iio:device{X}/buffer/*`. Here are some of the existing -attributes: +:file:`/sys/bus/iio/devices/iio:device{X}/buffer/*`. Here are some of the +existing attributes: * :file:`length`, the total number of data samples (capacity) that can be stored by the buffer. @@ -28,8 +28,8 @@ IIO buffer setup The meta information associated with a channel reading placed in a buffer is called a scan element. The important bits configuring scan elements are exposed to userspace applications via the -:file:`/sys/bus/iio/iio:device{X}/scan_elements/` directory. This directory contains -attributes of the following form: +:file:`/sys/bus/iio/devices/iio:device{X}/scan_elements/` directory. This +directory contains attributes of the following form: * :file:`enable`, used for enabling a channel. If and only if its attribute is non *zero*, then a triggered capture will contain data samples for this diff --git a/Documentation/driver-api/iio/core.rst b/Documentation/driver-api/iio/core.rst index 715cf29482a1..dfe438dc91a7 100644 --- a/Documentation/driver-api/iio/core.rst +++ b/Documentation/driver-api/iio/core.rst @@ -24,7 +24,7 @@ then we will show how a device driver makes use of an IIO device. There are two ways for a user space application to interact with an IIO driver. -1. :file:`/sys/bus/iio/iio:device{X}/`, this represents a hardware sensor +1. :file:`/sys/bus/iio/devices/iio:device{X}/`, this represents a hardware sensor and groups together the data channels of the same chip. 2. :file:`/dev/iio:device{X}`, character device node interface used for buffered data transfer and for events information retrieval. @@ -51,8 +51,8 @@ IIO device sysfs interface Attributes are sysfs files used to expose chip info and also allowing applications to set various configuration parameters. For device with -index X, attributes can be found under /sys/bus/iio/iio:deviceX/ directory. -Common attributes are: +index X, attributes can be found under /sys/bus/iio/devices/iio:deviceX/ +directory. Common attributes are: * :file:`name`, description of the physical chip. * :file:`dev`, shows the major:minor pair associated with @@ -140,16 +140,16 @@ Here is how we can make use of the channel's modifiers:: This channel's definition will generate two separate sysfs files for raw data retrieval: -* :file:`/sys/bus/iio/iio:device{X}/in_intensity_ir_raw` -* :file:`/sys/bus/iio/iio:device{X}/in_intensity_both_raw` +* :file:`/sys/bus/iio/devices/iio:device{X}/in_intensity_ir_raw` +* :file:`/sys/bus/iio/devices/iio:device{X}/in_intensity_both_raw` one file for processed data: -* :file:`/sys/bus/iio/iio:device{X}/in_illuminance_input` +* :file:`/sys/bus/iio/devices/iio:device{X}/in_illuminance_input` and one shared sysfs file for sampling frequency: -* :file:`/sys/bus/iio/iio:device{X}/sampling_frequency`. +* :file:`/sys/bus/iio/devices/iio:device{X}/sampling_frequency`. Here is how we can make use of the channel's indexing:: diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst index 70380a2a01b4..8b8aeea71c68 100644 --- a/Documentation/fault-injection/fault-injection.rst +++ b/Documentation/fault-injection/fault-injection.rst @@ -141,6 +141,14 @@ configuration of fault-injection capabilities. default is 'Y', setting it to 'N' will also inject failures into highmem/user allocations (__GFP_HIGHMEM allocations). +- /sys/kernel/debug/failslab/cache-filter + Format: { 'Y' | 'N' } + + default is 'N', setting it to 'Y' will only inject failures when + objects are requests from certain caches. + + Select the cache by writing '1' to /sys/kernel/slab//failslab: + - /sys/kernel/debug/failslab/ignore-gfp-wait: - /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait: @@ -283,7 +291,7 @@ kernel may crash because it may not be able to handle the error. There are 4 types of errors defined in include/asm-generic/error-injection.h EI_ETYPE_NULL - This function will return `NULL` if it fails. e.g. return an allocateed + This function will return `NULL` if it fails. e.g. return an allocated object address. EI_ETYPE_ERRNO @@ -459,6 +467,18 @@ Application Examples losetup -d $DEVICE rm testfile.img +------------------------------------------------------------------------------ + +- Inject only skbuff allocation failures :: + + # mark skbuff_head_cache as faulty + echo 1 > /sys/kernel/slab/skbuff_head_cache/failslab + # Turn on cache filter (off by default) + echo 1 > /sys/kernel/debug/failslab/cache-filter + # Turn on fault injection + echo 1 > /sys/kernel/debug/failslab/times + echo 1 > /sys/kernel/debug/failslab/probability + Tool to run command with failslab or fail_page_alloc ---------------------------------------------------- diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst index 1e0e0bb6fdf9..d270a0aa8d55 100644 --- a/Documentation/filesystems/9p.rst +++ b/Documentation/filesystems/9p.rst @@ -31,7 +31,7 @@ Other applications are described in the following papers: * PROSE I/O: Using 9p to enable Application Partitions http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf * VirtFS: A Virtualization Aware File System pass-through - http://goo.gl/3WPDg + https://kernel.org/doc/ols/2010/ols2010-pages-109-120.pdf Usage ===== diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst index 3b6e38e646cd..1ac576458c69 100644 --- a/Documentation/filesystems/autofs.rst +++ b/Documentation/filesystems/autofs.rst @@ -18,7 +18,7 @@ key advantages: 2. The names and locations of filesystems can be stored in a remote database and can change at any time. The content - in that data base at the time of access will be used to provide + in that database at the time of access will be used to provide a target for the access. The interpretation of names in the filesystem can even be programmatic rather than database-backed, allowing wildcards for example, and can vary based on the user who @@ -423,7 +423,7 @@ The available ioctl commands are: and objects are expired if the are not in use. **AUTOFS_EXP_FORCED** causes the in use status to be ignored - and objects are expired ieven if they are in use. This assumes + and objects are expired even if they are in use. This assumes that the daemon has requested this because it is capable of performing the umount. diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst index e18f90ffc6fd..0254f7d57429 100644 --- a/Documentation/filesystems/journalling.rst +++ b/Documentation/filesystems/journalling.rst @@ -137,7 +137,7 @@ Fast commits JBD2 to also allows you to perform file-system specific delta commits known as fast commits. In order to use fast commits, you will need to set following -callbacks that perform correspodning work: +callbacks that perform corresponding work: `journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and fast commit. @@ -149,7 +149,7 @@ File system is free to perform fast commits as and when it wants as long as it gets permission from JBD2 to do so by calling the function :c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client file system should tell JBD2 about it by calling -:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full +:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full commit immediately after stopping the fast commit it can do so by calling :c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation fails for some reason and the only way to guarantee consistency is for JBD2 to @@ -199,7 +199,7 @@ Journal Level .. kernel-doc:: fs/jbd2/recovery.c :internal: -Transasction Level +Transaction Level ~~~~~~~~~~~~~~~~~~ .. kernel-doc:: fs/jbd2/transaction.c diff --git a/Documentation/gpu/komeda-kms.rst b/Documentation/gpu/komeda-kms.rst index 633a016563ae..eaea40eb725b 100644 --- a/Documentation/gpu/komeda-kms.rst +++ b/Documentation/gpu/komeda-kms.rst @@ -86,7 +86,7 @@ types of working mode: - Single display mode Two pipelines work together to drive only one display output. - On this mode, pipeline_B doesn't work indenpendently, but outputs its + On this mode, pipeline_B doesn't work independently, but outputs its composition result into pipeline_A, and its pixel timing also derived from pipeline_A.timing_ctrlr. The pipeline_B works just like a "slave" of pipeline_A(master) diff --git a/Documentation/leds/leds-mlxcpld.rst b/Documentation/leds/leds-mlxcpld.rst index 528582429e0b..c520a134d91e 100644 --- a/Documentation/leds/leds-mlxcpld.rst +++ b/Documentation/leds/leds-mlxcpld.rst @@ -115,4 +115,4 @@ Driver provides the following LEDs for the system "msn2100": - [1,1,1,1] = Blue blink 6Hz Driver supports HW blinking at 3Hz and 6Hz frequency (50% duty cycle). -For 3Hz duty cylce is about 167 msec, for 6Hz is about 83 msec. +For 3Hz duty cycle is about 167 msec, for 6Hz is about 83 msec. diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst index 0595098a74d9..f6d53c37a2ca 100644 --- a/Documentation/mm/hmm.rst +++ b/Documentation/mm/hmm.rst @@ -66,7 +66,7 @@ combinatorial explosion in the library entry points. Finally, with the advance of high level language constructs (in C++ but in other languages too) it is now possible for the compiler to leverage GPUs and other devices without programmer knowledge. Some compiler identified patterns -are only do-able with a shared address space. It is also more reasonable to use +are only doable with a shared address space. It is also more reasonable to use a shared address space for all other patterns. @@ -267,7 +267,7 @@ functions are designed to make drivers easier to write and to centralize common code across drivers. Before migrating pages to device private memory, special device private -``struct page`` need to be created. These will be used as special "swap" +``struct page`` needs to be created. These will be used as special "swap" page table entries so that a CPU process will fault if it tries to access a page that has been migrated to device private memory. @@ -322,7 +322,7 @@ between device driver specific code and shared common code: The ``invalidate_range_start()`` callback is passed a ``struct mmu_notifier_range`` with the ``event`` field set to ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to - the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is + the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This allows the device driver to skip the invalidation callback and only invalidate device private MMU mappings that are actually migrating. This is explained more in the next section. @@ -405,7 +405,7 @@ can be used to make a memory range inaccessible from userspace. This replaces all mappings for pages in the given range with special swap entries. Any attempt to access the swap entry results in a fault which is -resovled by replacing the entry with the original mapping. A driver gets +resolved by replacing the entry with the original mapping. A driver gets notified that the mapping has been changed by MMU notifiers, after which point it will no longer have exclusive access to the page. Exclusive access is guaranteed to last until the driver drops the page lock and page reference, at @@ -431,7 +431,7 @@ Same decision was made for memory cgroup. Device memory pages are accounted against same memory cgroup a regular page would be accounted to. This does simplify migration to and from device memory. This also means that migration back from device memory to regular memory cannot fail because it would -go above memory cgroup limit. We might revisit this choice latter on once we +go above memory cgroup limit. We might revisit this choice later on once we get more experience in how device memory is used and its impact on memory resource control. diff --git a/Documentation/mm/vmalloced-kernel-stacks.rst b/Documentation/mm/vmalloced-kernel-stacks.rst index 4edca515bfd7..5bc0f7ceea89 100644 --- a/Documentation/mm/vmalloced-kernel-stacks.rst +++ b/Documentation/mm/vmalloced-kernel-stacks.rst @@ -110,7 +110,7 @@ Bulk of the code is in: `kernel/fork.c `. stack_vm_area pointer in task_struct keeps track of the virtually allocated -stack and a non-null stack_vm_area pointer serves as a indication that the +stack and a non-null stack_vm_area pointer serves as an indication that the virtually mapped kernel stacks are enabled. :: @@ -120,8 +120,8 @@ virtually mapped kernel stacks are enabled. Stack overflow handling ----------------------- -Leading and trailing guard pages help detect stack overflows. When stack -overflows into the guard pages, handlers have to be careful not overflow +Leading and trailing guard pages help detect stack overflows. When the stack +overflows into the guard pages, handlers have to be careful not to overflow the stack again. When handlers are called, it is likely that very little stack space is left. @@ -148,6 +148,6 @@ Conclusions - THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and simply embed the thread_info (containing only flags) and 'int cpu' into task_struct. -- The thread stack can be free'ed as soon as the task is dead (without +- The thread stack can be freed as soon as the task is dead (without waiting for RCU) and then, if vmapped stacks are in use, cache the entire stack for reuse on the same cpu. diff --git a/Documentation/nvme/feature-and-quirk-policy.rst b/Documentation/nvme/feature-and-quirk-policy.rst index c01d836d8e41..e21966bf20a8 100644 --- a/Documentation/nvme/feature-and-quirk-policy.rst +++ b/Documentation/nvme/feature-and-quirk-policy.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -======================================= -Linux NVMe feature and and quirk policy -======================================= +=================================== +Linux NVMe feature and quirk policy +=================================== This file explains the policy used to decide what is supported by the Linux NVMe driver and what is not. diff --git a/Documentation/process/backporting.rst b/Documentation/process/backporting.rst index e1a6ea0a1e8a..a71480fcf3b4 100644 --- a/Documentation/process/backporting.rst +++ b/Documentation/process/backporting.rst @@ -73,7 +73,7 @@ Once you have the patch in git, you can go ahead and cherry-pick it into your source tree. Don't forget to cherry-pick with ``-x`` if you want a written record of where the patch came from! -Note that if you are submiting a patch for stable, the format is +Note that if you are submitting a patch for stable, the format is slightly different; the first line after the subject line needs tobe either:: @@ -147,7 +147,7 @@ divergence. It's important to always identify the commit or commits that caused the conflict, as otherwise you cannot be confident in the correctness of your resolution. As an added bonus, especially if the patch is in an -area you're not that famliar with, the changelogs of these commits will +area you're not that familiar with, the changelogs of these commits will often give you the context to understand the code and potential problems or pitfalls with your conflict resolution. @@ -197,7 +197,7 @@ git blame Another way to find prerequisite commits (albeit only the most recent one for a given conflict) is to run ``git blame``. In this case, you need to run it against the parent commit of the patch you are -cherry-picking and the file where the conflict appared, i.e.:: +cherry-picking and the file where the conflict appeared, i.e.:: git blame ^ -- diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index 8e30c8f7697d..19d2ed47ff79 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -986,7 +986,7 @@ that can go into these 5 milliseconds. A reasonable rule of thumb is to not put inline at functions that have more than 3 lines of code in them. An exception to this rule are the cases where -a parameter is known to be a compiletime constant, and as a result of this +a parameter is known to be a compile time constant, and as a result of this constantness you *know* the compiler will be able to optimize most of your function away at compile time. For a good example of this later case, see the kmalloc() inline function. diff --git a/Documentation/process/email-clients.rst b/Documentation/process/email-clients.rst index dd22c46d1d02..e6b9173a1845 100644 --- a/Documentation/process/email-clients.rst +++ b/Documentation/process/email-clients.rst @@ -216,7 +216,7 @@ Mutt is highly customizable. Here is a minimum configuration to start using Mutt to send patches through Gmail:: # .muttrc - # ================ IMAP ==================== + # ================ IMAP ==================== set imap_user = 'yourusername@gmail.com' set imap_pass = 'yourpassword' set spoolfile = imaps://imap.gmail.com/INBOX diff --git a/Documentation/process/maintainer-tip.rst b/Documentation/process/maintainer-tip.rst index ba312345d030..349a27a53343 100644 --- a/Documentation/process/maintainer-tip.rst +++ b/Documentation/process/maintainer-tip.rst @@ -154,7 +154,7 @@ Examples for illustration: We modify the hot cpu handling to cancel the delayed work on the dying cpu and run the worker immediately on a different cpu in same domain. We - donot flush the worker because the MBM overflow worker reschedules the + do not flush the worker because the MBM overflow worker reschedules the worker on same CPU and scans the domain->cpu_mask to get the domain pointer. diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst index f310f2f36666..1518bd57adab 100644 --- a/Documentation/process/submitting-patches.rst +++ b/Documentation/process/submitting-patches.rst @@ -842,6 +842,14 @@ Make sure that base commit is in an official maintainer/mainline tree and not in some internal, accessible only to you tree - otherwise it would be worthless. +Tooling +------- + +Many of the technical aspects of this process can be automated using +b4, documented at . This can +help with things like tracking dependencies, running checkpatch and +with formatting and sending mails. + References ---------- diff --git a/Documentation/scheduler/completion.rst b/Documentation/scheduler/completion.rst index f19aca2062bd..adf0c0a56d02 100644 --- a/Documentation/scheduler/completion.rst +++ b/Documentation/scheduler/completion.rst @@ -51,7 +51,7 @@ which has only two fields:: struct completion { unsigned int done; - wait_queue_head_t wait; + struct swait_queue_head wait; }; This provides the ->wait waitqueue to place tasks on for waiting (if any), and diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst index 43bd8a145b7a..1f2942c4d14b 100644 --- a/Documentation/scheduler/index.rst +++ b/Documentation/scheduler/index.rst @@ -12,6 +12,7 @@ Scheduler sched-bwc sched-deadline sched-design-CFS + sched-eevdf sched-domains sched-capacity sched-energy diff --git a/Documentation/scheduler/sched-design-CFS.rst b/Documentation/scheduler/sched-design-CFS.rst index bc1e507269c6..8786f219fc73 100644 --- a/Documentation/scheduler/sched-design-CFS.rst +++ b/Documentation/scheduler/sched-design-CFS.rst @@ -8,10 +8,12 @@ CFS Scheduler 1. OVERVIEW ============ -CFS stands for "Completely Fair Scheduler," and is the new "desktop" process -scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the -replacement for the previous vanilla scheduler's SCHED_OTHER interactivity -code. +CFS stands for "Completely Fair Scheduler," and is the "desktop" process +scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. When +originally merged, it was the replacement for the previous vanilla +scheduler's SCHED_OTHER interactivity code. Nowadays, CFS is making room +for EEVDF, for which documentation can be found in +Documentation/scheduler/sched-eevdf.rst. 80% of CFS's design can be summed up in a single sentence: CFS basically models an "ideal, precise multi-tasking CPU" on real hardware. diff --git a/Documentation/scheduler/sched-eevdf.rst b/Documentation/scheduler/sched-eevdf.rst new file mode 100644 index 000000000000..83efe7c0a30d --- /dev/null +++ b/Documentation/scheduler/sched-eevdf.rst @@ -0,0 +1,43 @@ +=============== +EEVDF Scheduler +=============== + +The "Earliest Eligible Virtual Deadline First" (EEVDF) was first introduced +in a scientific publication in 1995 [1]. The Linux kernel began +transitioning to EEVDF in version 6.6 (as a new option in 2024), moving +away from the earlier Completely Fair Scheduler (CFS) in favor of a version +of EEVDF proposed by Peter Zijlstra in 2023 [2-4]. More information +regarding CFS can be found in +Documentation/scheduler/sched-design-CFS.rst. + +Similarly to CFS, EEVDF aims to distribute CPU time equally among all +runnable tasks with the same priority. To do so, it assigns a virtual run +time to each task, creating a "lag" value that can be used to determine +whether a task has received its fair share of CPU time. In this way, a task +with a positive lag is owed CPU time, while a negative lag means the task +has exceeded its portion. EEVDF picks tasks with lag greater or equal to +zero and calculates a virtual deadline (VD) for each, selecting the task +with the earliest VD to execute next. It's important to note that this +allows latency-sensitive tasks with shorter time slices to be prioritized, +which helps with their responsiveness. + +There are ongoing discussions on how to manage lag, especially for sleeping +tasks; but at the time of writing EEVDF uses a "decaying" mechanism based +on virtual run time (VRT). This prevents tasks from exploiting the system +by sleeping briefly to reset their negative lag: when a task sleeps, it +remains on the run queue but marked for "deferred dequeue," allowing its +lag to decay over VRT. Hence, long-sleeping tasks eventually have their lag +reset. Finally, tasks can preempt others if their VD is earlier, and tasks +can request specific time slices using the new sched_setattr() system call, +which further facilitates the job of latency-sensitive applications. + +REFERENCES +========== + +[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564 + +[2] https://lore.kernel.org/lkml/a79014e6-ea83-b316-1e12-2ae056bda6fa@linux.vnet.ibm.com/ + +[3] https://lwn.net/Articles/969062/ + +[4] https://lwn.net/Articles/925371/ diff --git a/Documentation/sphinx/kerneldoc-preamble.sty b/Documentation/sphinx/kerneldoc-preamble.sty index d479cfa73658..5d68395539fe 100644 --- a/Documentation/sphinx/kerneldoc-preamble.sty +++ b/Documentation/sphinx/kerneldoc-preamble.sty @@ -199,6 +199,8 @@ % Inactivate CJK after tableofcontents \apptocmd{\sphinxtableofcontents}{\kerneldocCJKoff}{}{} \xeCJKsetup{CJKspace = true}% For inter-phrase space of Korean TOC + % Suppress extra white space at latin .. non-latin in literal blocks + \AtBeginEnvironment{sphinxVerbatim}{\CJKsetecglue{}} }{ % Don't enable CJK % Custom macros to on/off CJK and switch CJK fonts (Dummy) \newcommand{\kerneldocCJKon}{} diff --git a/Documentation/translations/ko_KR/core-api/wrappers/memory-barriers.rst b/Documentation/translations/ko_KR/core-api/wrappers/memory-barriers.rst new file mode 100644 index 000000000000..526ae534dd86 --- /dev/null +++ b/Documentation/translations/ko_KR/core-api/wrappers/memory-barriers.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + This is a simple wrapper to bring memory-barriers.txt into the RST world + until such a time as that file can be converted directly. + +========================= +리눅스 커널 메모리 배리어 +========================= + +.. raw:: latex + + \footnotesize + +.. include:: ../../memory-barriers.txt + :literal: + +.. raw:: latex + + \normalsize diff --git a/Documentation/translations/ko_KR/index.rst b/Documentation/translations/ko_KR/index.rst index 4add6b2fe1f2..a20772f9d61c 100644 --- a/Documentation/translations/ko_KR/index.rst +++ b/Documentation/translations/ko_KR/index.rst @@ -11,19 +11,9 @@ .. toctree:: :maxdepth: 1 - howto - - -리눅스 커널 메모리 배리어 -------------------------- + process/howto + core-api/wrappers/memory-barriers.rst .. raw:: latex - \footnotesize - -.. include:: ./memory-barriers.txt - :literal: - -.. raw:: latex - - }\kerneldocEndKR + }\kerneldocEndKR diff --git a/Documentation/translations/ko_KR/howto.rst b/Documentation/translations/ko_KR/process/howto.rst similarity index 100% rename from Documentation/translations/ko_KR/howto.rst rename to Documentation/translations/ko_KR/process/howto.rst diff --git a/Documentation/translations/sp_SP/scheduler/index.rst b/Documentation/translations/sp_SP/scheduler/index.rst index 768488d6f001..32f9fd7517b2 100644 --- a/Documentation/translations/sp_SP/scheduler/index.rst +++ b/Documentation/translations/sp_SP/scheduler/index.rst @@ -6,3 +6,4 @@ :maxdepth: 1 sched-design-CFS + sched-eevdf diff --git a/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst b/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst index 731c266beb1a..dc728c739e28 100644 --- a/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst +++ b/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst @@ -14,10 +14,10 @@ Gestor de tareas CFS CFS viene de las siglas en inglés de "Gestor de tareas totalmente justo" ("Completely Fair Scheduler"), y es el nuevo gestor de tareas de escritorio -implementado por Ingo Molnar e integrado en Linux 2.6.23. Es el sustituto de -el previo gestor de tareas SCHED_OTHER. - -Nota: El planificador EEVDF fue incorporado más recientemente al kernel. +implementado por Ingo Molnar e integrado en Linux 2.6.23. Es el sustituto +del previo gestor de tareas SCHED_OTHER. Hoy en día se está abriendo camino +para el gestor de tareas EEVDF, cuya documentación se puede ver en +Documentation/scheduler/sched-eevdf.rst El 80% del diseño de CFS puede ser resumido en una única frase: CFS básicamente modela una "CPU ideal, precisa y multi-tarea" sobre hardware diff --git a/Documentation/translations/sp_SP/scheduler/sched-eevdf.rst b/Documentation/translations/sp_SP/scheduler/sched-eevdf.rst new file mode 100644 index 000000000000..d54736f297b8 --- /dev/null +++ b/Documentation/translations/sp_SP/scheduler/sched-eevdf.rst @@ -0,0 +1,58 @@ + +.. include:: ../disclaimer-sp.rst + +:Original: Documentation/scheduler/sched-eevdf.rst +:Translator: Sergio González Collado + +====================== +Gestor de tareas EEVDF +====================== + +El gestor de tareas EEVDF, del inglés: "Earliest Eligible Virtual Deadline +First", fue presentado por primera vez en una publicación científica en +1995 [1]. El kernel de Linux comenzó a transicionar hacia EEVPF en la +versión 6.6 (y como una nueva opción en 2024), alejándose del gestor +de tareas CFS, en favor de una versión de EEVDF propuesta por Peter +Zijlstra en 2023 [2-4]. Más información relativa a CFS puede encontrarse +en Documentation/scheduler/sched-design-CFS.rst. + +De forma parecida a CFS, EEVDF intenta distribuir el tiempo de ejecución +de la CPU de forma equitativa entre todas las tareas que tengan la misma +prioridad y puedan ser ejecutables. Para eso, asigna un tiempo de +ejecución virtual a cada tarea, creando un "retraso" que puede ser usado +para determinar si una tarea ha recibido su cantidad justa de tiempo +de ejecución en la CPU. De esta manera, una tarea con un "retraso" +positivo, es porque se le debe tiempo de ejecución, mientras que una +con "retraso" negativo implica que la tarea ha excedido su cuota de +tiempo. EEVDF elige las tareas con un "retraso" mayor igual a cero y +calcula un tiempo límite de ejecución virtual (VD, del inglés: virtual +deadline) para cada una, eligiendo la tarea con la VD más próxima para +ser ejecutada a continuación. Es importante darse cuenta que esto permite +que la tareas que sean sensibles a la latencia que tengan porciones de +tiempos de ejecución de CPU más cortos ser priorizadas, lo cual ayuda con +su menor tiempo de respuesta. + +Ahora mismo se está discutiendo cómo gestionar esos "retrasos", especialmente +en tareas que estén en un estado durmiente; pero en el momento en el que +se escribe este texto EEVDF usa un mecanismo de "decaimiento" basado en el +tiempo virtual de ejecución (VRT, del inglés: virtual run time). Esto previene +a las tareas de abusar del sistema simplemente durmiendo brevemente para +reajustar su retraso negativo: cuando una tarea duerme, esta permanece en +la cola de ejecución pero marcada para "desencolado diferido", permitiendo +a su retraso decaer a lo largo de VRT. Por tanto, las tareas que duerman +por más tiempo eventualmente eliminarán su retraso. Finalmente, las tareas +pueden adelantarse a otras si su VD es más próximo en el tiempo, y las +tareas podrán pedir porciones de tiempo específicas con la nueva llamada +del sistema sched_setattr(), todo esto facilitara el trabajo de las aplicaciones +que sean sensibles a las latencias. + +REFERENCIAS +=========== + +[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564 + +[2] https://lore.kernel.org/lkml/a79014e6-ea83-b316-1e12-2ae056bda6fa@linux.vnet.ibm.com/ + +[3] https://lwn.net/Articles/969062/ + +[4] https://lwn.net/Articles/925371/ diff --git a/Documentation/translations/zh_CN/admin-guide/index.rst b/Documentation/translations/zh_CN/admin-guide/index.rst index 0db80ab830a0..15d9ab5993a7 100644 --- a/Documentation/translations/zh_CN/admin-guide/index.rst +++ b/Documentation/translations/zh_CN/admin-guide/index.rst @@ -37,7 +37,6 @@ Todolist: reporting-issues reporting-regressions - security-bugs bug-hunting bug-bisect tainted-kernels diff --git a/Documentation/translations/zh_CN/admin-guide/reporting-issues.rst b/Documentation/translations/zh_CN/admin-guide/reporting-issues.rst index 59e51e3539b4..9ff4ba94391d 100644 --- a/Documentation/translations/zh_CN/admin-guide/reporting-issues.rst +++ b/Documentation/translations/zh_CN/admin-guide/reporting-issues.rst @@ -300,7 +300,7 @@ Documentation/admin-guide/reporting-regressions.rst 对此进行了更详细的 添加到回归跟踪列表中,以确保它不会被忽略。 什么是安全问题留给您自己判断。在继续之前,请考虑阅读 -Documentation/translations/zh_CN/admin-guide/security-bugs.rst , +Documentation/translations/zh_CN/process/security-bugs.rst , 因为它提供了如何最恰当地处理安全问题的额外细节。 当发生了完全无法接受的糟糕事情时,此问题就是一个“非常严重的问题”。例如, @@ -983,7 +983,7 @@ Documentation/admin-guide/reporting-regressions.rst ;它还提供了大量其 报告,请将报告的文本转发到这些地址;但请在报告的顶部加上注释,表明您提交了 报告,并附上工单链接。 -更多信息请参见 Documentation/translations/zh_CN/admin-guide/security-bugs.rst 。 +更多信息请参见 Documentation/translations/zh_CN/process/security-bugs.rst 。 发布报告后的责任 diff --git a/Documentation/translations/zh_CN/dev-tools/index.rst b/Documentation/translations/zh_CN/dev-tools/index.rst index c540e4a7d5db..6a8c637c0be1 100644 --- a/Documentation/translations/zh_CN/dev-tools/index.rst +++ b/Documentation/translations/zh_CN/dev-tools/index.rst @@ -21,6 +21,7 @@ Documentation/translations/zh_CN/dev-tools/testing-overview.rst testing-overview sparse kcov + kcsan gcov kasan ubsan @@ -32,7 +33,6 @@ Todolist: - checkpatch - coccinelle - kmsan - - kcsan - kfence - kgdb - kselftest diff --git a/Documentation/translations/zh_CN/dev-tools/kcsan.rst b/Documentation/translations/zh_CN/dev-tools/kcsan.rst new file mode 100644 index 000000000000..8c495c17f109 --- /dev/null +++ b/Documentation/translations/zh_CN/dev-tools/kcsan.rst @@ -0,0 +1,320 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: ../disclaimer-zh_CN.rst + +:Original: Documentation/dev-tools/kcsan.rst +:Translator: 刘浩阳 Haoyang Liu + +内核并发消毒剂(KCSAN) +===================== + +内核并发消毒剂(KCSAN)是一个动态竞争检测器,依赖编译时插桩,并且使用基于观察 +点的采样方法来检测竞争。KCSAN 的主要目的是检测 `数据竞争`_。 + +使用 +---- + +KCSAN 受 GCC 和 Clang 支持。使用 GCC 需要版本 11 或更高,使用 Clang 也需要 +版本 11 或更高。 + +为了启用 KCSAN,用如下参数配置内核:: + + CONFIG_KCSAN = y + +KCSAN 提供了几个其他的配置选项来自定义行为(见 ``lib/Kconfig.kcsan`` 中的各自的 +帮助文档以获取更多信息)。 + +错误报告 +~~~~~~~~ + +一个典型数据竞争的报告如下所示:: + + ================================================================== + BUG: KCSAN: data-race in test_kernel_read / test_kernel_write + + write to 0xffffffffc009a628 of 8 bytes by task 487 on cpu 0: + test_kernel_write+0x1d/0x30 + access_thread+0x89/0xd0 + kthread+0x23e/0x260 + ret_from_fork+0x22/0x30 + + read to 0xffffffffc009a628 of 8 bytes by task 488 on cpu 6: + test_kernel_read+0x10/0x20 + access_thread+0x89/0xd0 + kthread+0x23e/0x260 + ret_from_fork+0x22/0x30 + + value changed: 0x00000000000009a6 -> 0x00000000000009b2 + + Reported by Kernel Concurrency Sanitizer on: + CPU: 6 PID: 488 Comm: access_thread Not tainted 5.12.0-rc2+ #1 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 + ================================================================== + +报告的头部提供了一个关于竞争中涉及到的函数的简短总结。随后是竞争中的两个线程的 +访问类型和堆栈信息。如果 KCSAN 发现了一个值的变化,那么那个值的旧值和新值会在 +“value changed”这一行单独显示。 + +另一个不太常见的数据竞争类型的报告如下所示:: + + ================================================================== + BUG: KCSAN: data-race in test_kernel_rmw_array+0x71/0xd0 + + race at unknown origin, with read to 0xffffffffc009bdb0 of 8 bytes by task 515 on cpu 2: + test_kernel_rmw_array+0x71/0xd0 + access_thread+0x89/0xd0 + kthread+0x23e/0x260 + ret_from_fork+0x22/0x30 + + value changed: 0x0000000000002328 -> 0x0000000000002329 + + Reported by Kernel Concurrency Sanitizer on: + CPU: 2 PID: 515 Comm: access_thread Not tainted 5.12.0-rc2+ #1 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 + ================================================================== + +这个报告是当另一个竞争线程不可能被发现,但是可以从观测的内存地址的值改变而推断 +出来的时候生成的。这类报告总是会带有“value changed”行。这类报告的出现通常是因 +为在竞争线程中缺少插桩,也可能是因为其他原因,比如 DMA 访问。这类报告只会在 +设置了内核参数 ``CONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=y`` 时才会出现,而这 +个参数是默认启用的。 + +选择性分析 +~~~~~~~~~~ + +对于一些特定的访问,函数,编译单元或者整个子系统,可能需要禁用数据竞争检测。 +对于静态黑名单,有如下可用的参数: + +* KCSAN 支持使用 ``data_race(expr)`` 注解,这个注解告诉 KCSAN 任何由访问 + ``expr`` 所引起的数据竞争都应该被忽略,其产生的行为后果被认为是安全的。请查阅 + `在 LKMM 中 "标记共享内存访问"`_ 获得更多信息。 + +* 与 ``data_race(...)`` 相似,可以使用类型限定符 ``__data_racy`` 来标记一个变量 + ,所有访问该变量而导致的数据竞争都是故意为之并且应该被 KCSAN 忽略:: + + struct foo { + ... + int __data_racy stats_counter; + ... + }; + +* 使用函数属性 ``__no_kcsan`` 可以对整个函数禁用数据竞争检测:: + + __no_kcsan + void foo(void) { + ... + + 为了动态限制该为哪些函数生成报告,查阅 `Debug 文件系统接口`_ 黑名单/白名单特性。 + +* 为特定的编译单元禁用数据竞争检测,将下列参数加入到 ``Makefile`` 中:: + + KCSAN_SANITIZE_file.o := n + +* 为 ``Makefile`` 中的所有编译单元禁用数据竞争检测,将下列参数添加到相应的 + ``Makefile`` 中:: + + KCSAN_SANITIZE := n + +.. _在 LKMM 中 "标记共享内存访问": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt + +此外,KCSAN 可以根据偏好设置显示或隐藏整个类别的数据竞争。可以使用如下 +Kconfig 参数进行更改: + +* ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY``: 如果启用了该参数并且通过观测点 + (watchpoint) 观测到一个有冲突的写操作,但是对应的内存地址中存储的值没有改变, + 则不会报告这起数据竞争。 + +* ``CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC``: 假设默认情况下,不超过字大小的简 + 单对齐写入操作是原子的。假设这些写入操作不会受到不安全的编译器优化影响,从而导 + 致数据竞争。该选项使 KCSAN 不报告仅由不超过字大小的简单对齐写入操作引起 + 的冲突所导致的数据竞争。 + +* ``CONFIG_KCSAN_PERMISSIVE``: 启用额外的宽松规则来忽略某些常见类型的数据竞争。 + 与上面的规则不同,这条规则更加复杂,涉及到值改变模式,访问类型和地址。这个 + 选项依赖编译选项 ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=y``。请查看 + ``kernel/kcsan/permissive.h`` 获取更多细节。对于只侧重于特定子系统而不是整个 + 内核报告的测试者和维护者,建议禁用该选项。 + +要使用尽可能严格的规则,选择 ``CONFIG_KCSAN_STRICT=y``,这将配置 KCSAN 尽可 +能紧密地遵循 Linux 内核内存一致性模型(LKMM)。 + +Debug 文件系统接口 +~~~~~~~~~~~~~~~~~~ + +文件 ``/sys/kernel/debug/kcsan`` 提供了如下接口: + +* 读 ``/sys/kernel/debug/kcsan`` 返回不同的运行时统计数据。 + +* 将 ``on`` 或 ``off`` 写入 ``/sys/kernel/debug/kcsan`` 允许打开或关闭 KCSAN。 + +* 将 ``!some_func_name`` 写入 ``/sys/kernel/debug/kcsan`` 会将 + ``some_func_name`` 添加到报告过滤列表中,该列表(默认)会将数据竞争报告中的顶 + 层堆栈帧是列表中函数的情况列入黑名单。 + +* 将 ``blacklist`` 或 ``whitelist`` 写入 ``/sys/kernel/debug/kcsan`` 会改变报告 + 过滤行为。例如,黑名单的特性可以用来过滤掉经常发生的数据竞争。白名单特性可以帮 + 助复现和修复测试。 + +性能调优 +~~~~~~~~ + +影响 KCSAN 整体的性能和 bug 检测能力的核心参数是作为内核命令行参数公开的,其默认 +值也可以通过相应的 Kconfig 选项更改。 + +* ``kcsan.skip_watch`` (``CONFIG_KCSAN_SKIP_WATCH``): 在另一个观测点设置之前每 + 个 CPU 要跳过的内存操作次数。更加频繁的设置观测点将增加观察到竞争情况的可能性 + 。这个参数对系统整体的性能和竞争检测能力影响最显著。 + +* ``kcsan.udelay_task`` (``CONFIG_KCSAN_UDELAY_TASK``): 对于任务,观测点设置之 + 后暂停执行的微秒延迟。值越大,检测到竞争情况的可能性越高。 + +* ``kcsan.udelay_interrupt`` (``CONFIG_KCSAN_UDELAY_INTERRUPT``): 对于中断, + 观测点设置之后暂停执行的微秒延迟。中断对于延迟的要求更加严格,其延迟通常应该小 + 于为任务选择的延迟。 + +它们可以通过 ``/sys/module/kcsan/parameters/`` 在运行时进行调整。 + +数据竞争 +-------- + +在一次执行中,如果两个内存访问存在 *冲突*,在不同的线程中并发执行,并且至少 +有一个访问是 *简单访问*,则它们就形成了 *数据竞争*。如果它们访问了同一个内存地址并且 +至少有一个是写操作,则称它们存在 *冲突*。有关更详细的讨论和定义,见 +`LKMM 中的 "简单访问和数据竞争"`_。 + +.. _LKMM 中的 "简单访问和数据竞争": https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1922 + +与 Linux 内核内存一致性模型(LKMM)的关系 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +LKMM 定义了各种内存操作的传播和排序规则,让开发者可以推理并发代码。最终这允许确 +定并发代码可能的执行情况并判断这些代码是否存在数据竞争。 + +KCSAN 可以识别 *被标记的原子操作* ( ``READ_ONCE``, ``WRITE_ONCE`` , ``atomic_*`` +等),以及内存屏障所隐含的一部分顺序保证。启用 ``CONFIG_KCSAN_WEAK_MEMORY=y`` +配置,KCSAN 会对加载或存储缓冲区进行建模,并可以检测遗漏的 +``smp_mb()``, ``smp_wmb()``, ``smp_rmb()``, ``smp_store_release()``,以及所有的 +具有等效隐含内存屏障的 ``atomic_*`` 操作。 + +请注意,KCSAN 不会报告所有由于缺失内存顺序而导致的数据竞争,特别是在需要内存屏障 +来禁止后续内存操作在屏障之前重新排序的情况下。因此,开发人员应该仔细考虑那些未 +被检查的内存顺序要求。 + +数据竞争以外的竞争检测 +--------------------------- + +对于有着复杂并发设计的代码,竞争状况不总是表现为数据竞争。如果并发操作引起了意 +料之外的系统行为,则认为发生了竞争状况。另一方面,数据竞争是在 C 语言层面定义 +的。内核定义了一些宏定义用来检测非数据竞争的漏洞并发代码的属性。 + +.. note:: + 为了不引入新的文档编译警告,这里不展示宏定义的具体内容,如果想查看具体 + 宏定义可以结合原文(Documentation/dev-tools/kcsan.rst)阅读。 + +实现细节 +-------- + +KCSAN 需要观测两个并发访问。特别重要的是,我们想要(a)增加观测到竞争的机会(尤 +其是很少发生的竞争),以及(b)能够实际观测到这些竞争。我们可以通过(a)注入 +不同的延迟,以及(b)使用地址观测点(或断点)来实现。 + +如果我们在设置了地址观察点的情况下故意延迟一个内存访问,然后观察到观察点被触发 +,那么两个对同一地址的访问就发生了竞争。使用硬件观察点,这是 `DataCollider +`_ 中采用 +的方法。与 DataCollider 不同,KCSAN 不使用硬件观察点,而是依赖于编译器插桩和“软 +观测点”。 + +在 KCSAN 中,观察点是通过一种高效的编码实现的,该编码将访问类型、大小和地址存储 +在一个长整型变量中;使用“软观察点”的好处是具有可移植性和更大的灵活性。然后, +KCSAN依赖于编译器对普通访问的插桩。对于每个插桩的普通访问: + +1. 检测是否存在一个符合的观测点,如果存在,并且至少有一个操作是写操作,则我们发 + 现了一个竞争访问。 + +2. 如果不存在匹配的观察点,则定期的设置一个观测点并随机延迟一小段时间。 + +3. 在延迟前检查数据值,并在延迟后重新检查数据值;如果值不匹配,我们推测存在一个 + 未知来源的竞争状况。 + +为了检测普通访问和标记访问之间的数据竞争,KCSAN 也对标记访问进行标记,但仅用于 +检查是否存在观察点;即 KCSAN 不会在标记访问上设置观察点。通过不在标记操作上设 +置观察点,如果对一个变量的所有并发访问都被正确标记,KCSAN 将永远不会触发观察点 +,因此也不会报告这些访问。 + +弱内存建模 +~~~~~~~~~~ + +KCSAN 通过建模访问重新排序(使用 ``CONFIG_KCSAN_WEAK_MEMORY=y``)来检测由于缺少 +内存屏障而导致的数据竞争。每个设置了观察点的普通内存访问也会被选择在其函数范围 +内进行模拟重新排序(最多一个正在进行的访问)。 + +一旦某个访问被选择用于重新排序,它将在函数范围内与每个其他访问进行检查。如果遇 +到适当的内存屏障,该访问将不再被考虑进行模拟重新排序。 + +当内存操作的结果应该由屏障排序时,KCSAN 可以检测到仅由于缺失屏障而导致的冲突的 +数据竞争。考虑下面的例子:: + + int x, flag; + void T1(void) + { + x = 1; // data race! + WRITE_ONCE(flag, 1); // correct: smp_store_release(&flag, 1) + } + void T2(void) + { + while (!READ_ONCE(flag)); // correct: smp_load_acquire(&flag) + ... = x; // data race! + } + +当启用了弱内存建模,KCSAN 将考虑对 ``T1`` 中的 ``x`` 进行模拟重新排序。在写入 +``flag`` 之后,x再次被检查是否有并发访问:因为 ``T2`` 可以在写入 +``flag`` 之后继续进行,因此检测到数据竞争。如果遇到了正确的屏障, ``x`` 在正确 +释放 ``flag`` 后将不会被考虑重新排序,因此不会检测到数据竞争。 + +在复杂性上的权衡以及实际的限制意味着只能检测到一部分由于缺失内存屏障而导致的数 +据竞争。由于当前可用的编译器支持,KCSAN 的实现仅限于建模“缓冲”(延迟访问)的 +效果,因为运行时不能“预取”访问。同时要注意,观测点只设置在普通访问上,这是唯 +一一个 KCSAN 会模拟重新排序的访问类型。这意味着标记访问的重新排序不会被建模。 + +上述情况的一个后果是获取 (acquire) 操作不需要屏障插桩(不需要预取)。此外,引 +入地址或控制依赖的标记访问不需要特殊处理(标记访问不能重新排序,后续依赖的访问 +不能被预取)。 + +关键属性 +~~~~~~~~ + +1. **内存开销**:整体的内存开销只有几 MiB,取决于配置。当前的实现是使用一个小长 + 整型数组来编码观测点信息,几乎可以忽略不计。 + +2. **性能开销**:KCSAN 的运行时旨在性能开销最小化,使用一个高效的观测点编码,在 + 快速路径中不需要获取任何锁。在拥有 8 个 CPU 的系统上的内核启动来说: + + - 使用默认 KCSAN 配置时,性能下降 5 倍; + - 仅因运行时快速路径开销导致性能下降 2.8 倍(设置非常大的 + ``KCSAN_SKIP_WATCH`` 并取消设置 ``KCSAN_SKIP_WATCH_RANDOMIZE``)。 + +3. **注解开销**:KCSAN 运行时之外需要的注释很少。因此,随着内核的发展维护的开 + 销也很小。 + +4. **检测设备的竞争写入**:由于设置观测点时会检查数据值,设备的竞争写入也可以 + 被检测到。 + +5. **内存排序**:KCSAN 只了解一部分 LKMM 排序规则;这可能会导致漏报数据竞争( + 假阴性)。 + +6. **分析准确率**: 对于观察到的执行,由于使用采样策略,分析是 *不健全* 的 + (可能有假阴性),但期望得到完整的分析(没有假阳性)。 + +考虑的替代方案 +-------------- + +一个内核数据竞争检测的替代方法是 `Kernel Thread Sanitizer (KTSAN) +`_。KTSAN 是一 +个基于先行发生关系(happens-before)的数据竞争检测器,它显式建立内存操作之间的先 +后发生顺序,这可以用来确定 `数据竞争`_ 中定义的数据竞争。 + +为了建立正确的先行发生关系,KTSAN 必须了解 LKMM 的所有排序规则和同步原语。不幸 +的是,任何遗漏都会导致大量的假阳性,这在包含众多自定义同步机制的内核上下文中特 +别有害。为了跟踪前因后果关系,KTSAN 的实现需要为每个内存位置提供元数据(影子内 +存),这意味着每页内存对应 4 页影子内存,在大型系统上可能会带来数十 GiB 的开销 +。 diff --git a/Documentation/translations/zh_CN/doc-guide/checktransupdate.rst b/Documentation/translations/zh_CN/doc-guide/checktransupdate.rst new file mode 100644 index 000000000000..d20b4ce66b9f --- /dev/null +++ b/Documentation/translations/zh_CN/doc-guide/checktransupdate.rst @@ -0,0 +1,55 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: ../disclaimer-zh_CN.rst + +:Original: Documentation/doc-guide/checktransupdate.rst + +:译者: 慕冬亮 Dongliang Mu + +检查翻译更新 + +这个脚本帮助跟踪不同语言的文档翻译状态,即文档是否与对应的英文版本保持更新。 + +工作原理 +------------ + +它使用 ``git log`` 命令来跟踪翻译提交的最新英文提交(按作者日期排序)和英文文档的 +最新提交。如果有任何差异,则该文件被认为是过期的,然后需要更新的提交将被收集并报告。 + +实现的功能 + +- 检查特定语言中的所有文件 +- 检查单个文件或一组文件 +- 提供更改输出格式的选项 +- 跟踪没有翻译过的文件的翻译状态 + +用法 +----- + +:: + + ./scripts/checktransupdate.py --help + +具体用法请参考参数解析器的输出 + +示例 + +- ``./scripts/checktransupdate.py -l zh_CN`` + 这将打印 zh_CN 语言中需要更新的所有文件。 +- ``./scripts/checktransupdate.py Documentation/translations/zh_CN/dev-tools/testing-overview.rst`` + 这将只打印指定文件的状态。 + +然后输出类似如下的内容: + +:: + + Documentation/dev-tools/kfence.rst + No translation in the locale of zh_CN + + Documentation/translations/zh_CN/dev-tools/testing-overview.rst + commit 42fb9cfd5b18 ("Documentation: dev-tools: Add link to RV docs") + 1 commits needs resolving in total + +待实现的功能 + +- 文件参数可以是文件夹而不仅仅是单个文件 diff --git a/Documentation/translations/zh_CN/doc-guide/index.rst b/Documentation/translations/zh_CN/doc-guide/index.rst index 78c2e9a1697f..0ac1fc9315ea 100644 --- a/Documentation/translations/zh_CN/doc-guide/index.rst +++ b/Documentation/translations/zh_CN/doc-guide/index.rst @@ -18,6 +18,7 @@ parse-headers contributing maintainer-profile + checktransupdate .. only:: subproject and html diff --git a/Documentation/translations/zh_CN/index.rst b/Documentation/translations/zh_CN/index.rst index 20b9d4270d1f..7574e1673180 100644 --- a/Documentation/translations/zh_CN/index.rst +++ b/Documentation/translations/zh_CN/index.rst @@ -89,10 +89,10 @@ TODOList: admin-guide/index admin-guide/reporting-issues.rst userspace-api/index + 内核构建系统 TODOList: -* 内核构建系统 * 用户空间工具 也可参考独立于内核文档的 `Linux 手册页 `_ 。 diff --git a/Documentation/translations/zh_CN/kbuild/gcc-plugins.rst b/Documentation/translations/zh_CN/kbuild/gcc-plugins.rst new file mode 100644 index 000000000000..67a8abbf5887 --- /dev/null +++ b/Documentation/translations/zh_CN/kbuild/gcc-plugins.rst @@ -0,0 +1,126 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: ../disclaimer-zh_CN.rst + +:Original: Documentation/kbuild/gcc-plugins.rst +:Translator: 慕冬亮 Dongliang Mu + +================ +GCC 插件基础设施 +================ + + +介绍 +==== + +GCC 插件是为编译器提供额外功能的可加载模块 [1]_。它们对于运行时插装和静态分析非常有用。 +我们可以在编译过程中通过回调 [2]_,GIMPLE [3]_,IPA [4]_ 和 RTL Passes [5]_ +(译者注:Pass 是编译器所采用的一种结构化技术,用于完成编译对象的分析、优化或转换等功能) +来分析、修改和添加更多的代码。 + +内核的 GCC 插件基础设施支持构建树外模块、交叉编译和在单独的目录中构建。插件源文件必须由 +C++ 编译器编译。 + +目前 GCC 插件基础设施只支持一些架构。搜索 "select HAVE_GCC_PLUGINS" 来查找支持 +GCC 插件的架构。 + +这个基础设施是从 grsecurity [6]_ 和 PaX [7]_ 移植过来的。 + +-- + +.. [1] https://gcc.gnu.org/onlinedocs/gccint/Plugins.html +.. [2] https://gcc.gnu.org/onlinedocs/gccint/Plugin-API.html#Plugin-API +.. [3] https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html +.. [4] https://gcc.gnu.org/onlinedocs/gccint/IPA.html +.. [5] https://gcc.gnu.org/onlinedocs/gccint/RTL.html +.. [6] https://grsecurity.net/ +.. [7] https://pax.grsecurity.net/ + + +目的 +==== + +GCC 插件的设计目的是提供一个用于试验 GCC 或 Clang 上游没有的潜在编译器功能的场所。 +一旦它们的实用性得到验证,这些功能将被添加到 GCC(和 Clang)的上游。随后,在所有 +支持的 GCC 版本都支持这些功能后,它们会被从内核中移除。 + +具体来说,新插件应该只实现上游编译器(GCC 和 Clang)不支持的功能。 + +当 Clang 中存在 GCC 中不存在的某项功能时,应努力将该功能做到 GCC 上游(而不仅仅 +是作为内核专用的 GCC 插件),以使整个生态都能从中受益。 + +类似的,如果 GCC 插件提供的功能在 Clang 中 **不** 存在,但该功能被证明是有用的,也应 +努力将该功能上传到 GCC(和 Clang)。 + +在上游 GCC 提供了某项功能后,该插件将无法在相应的 GCC 版本(以及更高版本)下编译。 +一旦所有内核支持的 GCC 版本都提供了该功能,该插件将从内核中移除。 + + +文件 +==== + +**$(src)/scripts/gcc-plugins** + + 这是 GCC 插件的目录。 + +**$(src)/scripts/gcc-plugins/gcc-common.h** + + 这是 GCC 插件的兼容性头文件。 + 应始终包含它,而不是单独的 GCC 头文件。 + +**$(src)/scripts/gcc-plugins/gcc-generate-gimple-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-ipa-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-simple_ipa-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-rtl-pass.h** + + 这些头文件可以自动生成 GIMPLE、SIMPLE_IPA、IPA 和 RTL passes 的注册结构。 + 与手动创建结构相比,它们更受欢迎。 + + +用法 +==== + +你必须为你的 GCC 版本安装 GCC 插件头文件,以 Ubuntu 上的 gcc-10 为例:: + + apt-get install gcc-10-plugin-dev + +或者在 Fedora 上:: + + dnf install gcc-plugin-devel libmpc-devel + +或者在 Fedora 上使用包含插件的交叉编译器时:: + + dnf install libmpc-devel + +在内核配置中启用 GCC 插件基础设施与一些你想使用的插件:: + + CONFIG_GCC_PLUGINS=y + CONFIG_GCC_PLUGIN_LATENT_ENTROPY=y + ... + +运行 gcc(本地或交叉编译器),确保能够检测到插件头文件:: + + gcc -print-file-name=plugin + CROSS_COMPILE=arm-linux-gnu- ${CROSS_COMPILE}gcc -print-file-name=plugin + +"plugin" 这个词意味着它们没有被检测到:: + + plugin + +完整的路径则表示插件已经被检测到:: + + /usr/lib/gcc/x86_64-redhat-linux/12/plugin + +编译包括插件在内的最小工具集:: + + make scripts + +或者直接在内核中运行 make,使用循环复杂性 GCC 插件编译整个内核。 + + +4. 如何添加新的 GCC 插件 +======================== + +GCC 插件位于 scripts/gcc-plugins/。你需要将插件源文件放在 scripts/gcc-plugins/ 目录下。 +子目录创建并不支持,你必须添加在 scripts/gcc-plugins/Makefile、scripts/Makefile.gcc-plugins +和相关的 Kconfig 文件中。 diff --git a/Documentation/translations/zh_CN/kbuild/headers_install.rst b/Documentation/translations/zh_CN/kbuild/headers_install.rst new file mode 100644 index 000000000000..02cb8896e555 --- /dev/null +++ b/Documentation/translations/zh_CN/kbuild/headers_install.rst @@ -0,0 +1,39 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: ../disclaimer-zh_CN.rst + +:Original: Documentation/kbuild/headers_install.rst +:Translator: 慕冬亮 Dongliang Mu + +============================ +导出内核头文件供用户空间使用 +============================ + +"make headers_install" 命令以适合于用户空间程序的形式导出内核头文件。 + +Linux 内核导出的头文件描述了用户空间程序尝试使用内核服务的 API。这些内核 +头文件被系统的 C 库(例如 glibc 和 uClibc)用于定义可用的系统调用,以及 +与这些系统调用一起使用的常量和结构。C 库的头文件包括来自 linux 子目录的 +内核头文件。系统的 libc 头文件通常被安装在默认位置 /usr/include,而内核 +头文件在该位置的子目录中(主要是 /usr/include/linux 和 /usr/include/asm)。 + +内核头文件向后兼容,但不向前兼容。这意味着使用旧内核头文件的 C 库构建的程序 +可以在新内核上运行(尽管它可能无法访问新特性),但使用新内核头文件构建的程序 +可能无法在旧内核上运行。 + +"make headers_install" 命令可以在内核源代码的顶层目录中运行(或使用标准 +的树外构建)。它接受两个可选参数:: + + make headers_install ARCH=i386 INSTALL_HDR_PATH=/usr + +ARCH 表明为其生成头文件的架构,默认为当前架构。导出内核头文件的 linux/asm +目录是基于特定平台的,要查看支持架构的完整列表,使用以下命令:: + + ls -d include/asm-* | sed 's/.*-//' + +INSTALL_HDR_PATH 表明头文件的安装位置,默认为 "./usr"。 + +该命令会在 INSTALL_HDR_PATH 中自动创建创建一个 'include' 目录,而头文件 +会被安装在 INSTALL_HDR_PATH/include 中。 + +内核头文件导出的基础设施由 David Woodhouse 维护。 diff --git a/Documentation/translations/zh_CN/kbuild/index.rst b/Documentation/translations/zh_CN/kbuild/index.rst new file mode 100644 index 000000000000..b51655d981f6 --- /dev/null +++ b/Documentation/translations/zh_CN/kbuild/index.rst @@ -0,0 +1,35 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: ../disclaimer-zh_CN.rst + +:Original: Documentation/kbuild/index.rst +:Translator: 慕冬亮 Dongliang Mu + +============ +内核编译系统 +============ + +.. toctree:: + :maxdepth: 1 + + headers_install + gcc-plugins + +TODO: + +- kconfig-language +- kconfig-macro-language +- kbuild +- kconfig +- makefiles +- modules +- issues +- reproducible-builds +- llvm + +.. only:: subproject and html + + 目录 + ===== + + * :ref:`genindex` diff --git a/Documentation/translations/zh_CN/process/index.rst b/Documentation/translations/zh_CN/process/index.rst index 5a5cd7c01c62..3bcb3bdaf533 100644 --- a/Documentation/translations/zh_CN/process/index.rst +++ b/Documentation/translations/zh_CN/process/index.rst @@ -49,10 +49,11 @@ TODOLIST: embargoed-hardware-issues cve + security-bugs TODOLIST: -* security-bugs +* handling-regressions 其它大多数开发人员感兴趣的社区指南: diff --git a/Documentation/translations/zh_CN/admin-guide/security-bugs.rst b/Documentation/translations/zh_CN/process/security-bugs.rst similarity index 57% rename from Documentation/translations/zh_CN/admin-guide/security-bugs.rst rename to Documentation/translations/zh_CN/process/security-bugs.rst index d6b8f8a4e7f6..a8f5fcbfadc9 100644 --- a/Documentation/translations/zh_CN/admin-guide/security-bugs.rst +++ b/Documentation/translations/zh_CN/process/security-bugs.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + .. include:: ../disclaimer-zh_CN.rst :Original: :doc:`../../../process/security-bugs` @@ -5,6 +7,7 @@ :译者: 吴想成 Wu XiangCheng + 慕冬亮 Dongliang Mu 安全缺陷 ========= @@ -17,13 +20,13 @@ Linux内核开发人员非常重视安全性。因此我们想知道何时发现 可以通过电子邮件联系Linux内核安全团队。这是一个安全人员 的私有列表,他们将帮助验证错误报告并开发和发布修复程序。如果您已经有了一个 -修复,请将其包含在您的报告中,这样可以大大加快进程。安全团队可能会从区域维护 +修复,请将其包含在您的报告中,这样可以大大加快处理进程。安全团队可能会从区域维护 人员那里获得额外的帮助,以理解和修复安全漏洞。 与任何缺陷一样,提供的信息越多,诊断和修复就越容易。如果您不清楚哪些信息有用, 请查看“Documentation/translations/zh_CN/admin-guide/reporting-issues.rst”中 -概述的步骤。任何利用漏洞的攻击代码都非常有用,未经报告者同意不会对外发布,除 -非已经公开。 +概述的步骤。任何利用漏洞的攻击代码都非常有用,未经报告者同意不会对外发布, +除非已经公开。 请尽可能发送无附件的纯文本电子邮件。如果所有的细节都藏在附件里,那么就很难对 一个复杂的问题进行上下文引用的讨论。把它想象成一个 @@ -49,24 +52,31 @@ Linux内核开发人员非常重视安全性。因此我们想知道何时发现 换句话说,我们唯一感兴趣的是修复缺陷。提交给安全列表的所有其他资料以及对报告 的任何后续讨论,即使在解除限制之后,也将永久保密。 -协调 ------- +与其他团队协调 +-------------- -对敏感缺陷(例如那些可能导致权限提升的缺陷)的修复可能需要与私有邮件列表 -进行协调,以便分发供应商做好准备,在公开披露 -上游补丁时发布一个已修复的内核。发行版将需要一些时间来测试建议的补丁,通常 -会要求至少几天的限制,而供应商更新发布更倾向于周二至周四。若合适,安全团队 -可以协助这种协调,或者报告者可以从一开始就包括linux发行版。在这种情况下,请 -记住在电子邮件主题行前面加上“[vs]”,如linux发行版wiki中所述: -。 +虽然内核安全团队仅关注修复漏洞,但还有其他组织关注修复发行版上的安全问题以及协调 +操作系统厂商的漏洞披露。协调通常由 "linux-distros" 邮件列表处理,而披露则由 +公共 "oss-security" 邮件列表进行。两者紧密关联且被展示在 linux-distros 维基: + + +请注意,这三个列表的各自政策和规则是不同的,因为它们追求不同的目标。内核安全团队 +与其他团队之间的协调很困难,因为对于内核安全团队,保密期(即最大允许天数)是从补丁 +可用时开始,而 "linux-distros" 则从首次发布到列表时开始计算,无论是否存在补丁。 + +因此,内核安全团队强烈建议,作为一位潜在安全问题的报告者,在受影响代码的维护者 +接受补丁之前,且在您阅读上述发行版维基页面并完全理解联系 "linux-distros" +邮件列表会对您和内核社区施加的要求之前,不要联系 "linux-distros" 邮件列表。 +这也意味着通常情况下不要同时抄送两个邮件列表,除非在协调时有已接受但尚未合并的补丁。 +换句话说,在补丁被接受之前,不要抄送 "linux-distros";在修复程序被合并之后, +不要抄送内核安全团队。 CVE分配 -------- -安全团队通常不分配CVE,我们也不需要它们来进行报告或修复,因为这会使过程不必 -要的复杂化,并可能耽误缺陷处理。如果报告者希望在公开披露之前分配一个CVE编号, -他们需要联系上述的私有linux-distros列表。当在提供补丁之前已有这样的CVE编号时, -如报告者愿意,最好在提交消息中提及它。 +安全团队不分配 CVE,同时我们也不需要 CVE 来报告或修复漏洞,因为这会使过程不必要 +的复杂化,并可能延误漏洞处理。如果报告者希望为确认的问题分配一个 CVE 编号, +可以联系 :doc:`内核 CVE 分配团队 <../process/cve>` 获取。 保密协议 --------- diff --git a/Documentation/translations/zh_CN/process/submitting-patches.rst b/Documentation/translations/zh_CN/process/submitting-patches.rst index 7864107e60a8..7ca16bda3709 100644 --- a/Documentation/translations/zh_CN/process/submitting-patches.rst +++ b/Documentation/translations/zh_CN/process/submitting-patches.rst @@ -208,7 +208,7 @@ torvalds@linux-foundation.org 。他收到的邮件很多,所以一般来说 如果您有修复可利用安全漏洞的补丁,请将该补丁发送到 security@kernel.org 。对于 严重的bug,可以考虑短期禁令以允许分销商(有时间)向用户发布补丁;在这种情况下, 显然不应将补丁发送到任何公共列表。 -参见 Documentation/translations/zh_CN/admin-guide/security-bugs.rst 。 +参见 Documentation/translations/zh_CN/process/security-bugs.rst 。 修复已发布内核中严重错误的补丁程序应该抄送给稳定版维护人员,方法是把以下列行 放进补丁的签准区(注意,不是电子邮件收件人):: diff --git a/Documentation/translations/zh_TW/admin-guide/reporting-issues.rst b/Documentation/translations/zh_TW/admin-guide/reporting-issues.rst index bc132b25f2ae..1d4e4c7a6750 100644 --- a/Documentation/translations/zh_TW/admin-guide/reporting-issues.rst +++ b/Documentation/translations/zh_TW/admin-guide/reporting-issues.rst @@ -301,7 +301,7 @@ Documentation/admin-guide/reporting-regressions.rst 對此進行了更詳細的 添加到迴歸跟蹤列表中,以確保它不會被忽略。 什麼是安全問題留給您自己判斷。在繼續之前,請考慮閱讀 -Documentation/translations/zh_CN/admin-guide/security-bugs.rst , +Documentation/translations/zh_CN/process/security-bugs.rst , 因爲它提供瞭如何最恰當地處理安全問題的額外細節。 當發生了完全無法接受的糟糕事情時,此問題就是一個“非常嚴重的問題”。例如, @@ -984,7 +984,7 @@ Documentation/admin-guide/reporting-regressions.rst ;它還提供了大量其 報告,請將報告的文本轉發到這些地址;但請在報告的頂部加上註釋,表明您提交了 報告,並附上工單鏈接。 -更多信息請參見 Documentation/translations/zh_CN/admin-guide/security-bugs.rst 。 +更多信息請參見 Documentation/translations/zh_CN/process/security-bugs.rst 。 發佈報告後的責任 diff --git a/Documentation/translations/zh_TW/process/submitting-patches.rst b/Documentation/translations/zh_TW/process/submitting-patches.rst index f12f2f193f85..64de92c07906 100644 --- a/Documentation/translations/zh_TW/process/submitting-patches.rst +++ b/Documentation/translations/zh_TW/process/submitting-patches.rst @@ -209,7 +209,7 @@ torvalds@linux-foundation.org 。他收到的郵件很多,所以一般來說 如果您有修復可利用安全漏洞的補丁,請將該補丁發送到 security@kernel.org 。對於 嚴重的bug,可以考慮短期禁令以允許分銷商(有時間)向用戶發佈補丁;在這種情況下, 顯然不應將補丁發送到任何公共列表。 -參見 Documentation/translations/zh_CN/admin-guide/security-bugs.rst 。 +參見 Documentation/translations/zh_CN/process/security-bugs.rst 。 修復已發佈內核中嚴重錯誤的補丁程序應該抄送給穩定版維護人員,方法是把以下列行 放進補丁的籤準區(注意,不是電子郵件收件人):: diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index e91c0376ee59..e4be1378ba26 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -78,6 +78,7 @@ Code Seq# Include File Comments 0x03 all linux/hdreg.h 0x04 D2-DC linux/umsdos_fs.h Dead since 2.6.11, but don't reuse these. 0x06 all linux/lp.h +0x07 9F-D0 linux/vmw_vmci_defs.h, uapi/linux/vm_sockets.h 0x09 all linux/raid/md_u.h 0x10 00-0F drivers/char/s390/vmcp.h 0x10 10-1F arch/s390/include/uapi/sclp_ctl.h @@ -292,6 +293,7 @@ Code Seq# Include File Comments 't' 80-8F linux/isdn_ppp.h 't' 90-91 linux/toshiba.h toshiba and toshiba_acpi SMM 'u' 00-1F linux/smb_fs.h gone +'u' 00-2F linux/ublk_cmd.h conflict! 'u' 20-3F linux/uvcvideo.h USB video class host driver 'u' 40-4f linux/udmabuf.h userspace dma-buf misc device 'v' 00-1F linux/ext2_fs.h conflict! diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst index ad13ec55ddfe..9ca5a45c2140 100644 --- a/Documentation/virt/kvm/index.rst +++ b/Documentation/virt/kvm/index.rst @@ -14,6 +14,7 @@ KVM s390/index ppc-pv x86/index + loongarch/index locking vcpu-requests diff --git a/Documentation/virt/kvm/loongarch/hypercalls.rst b/Documentation/virt/kvm/loongarch/hypercalls.rst new file mode 100644 index 000000000000..2d6b94031f1b --- /dev/null +++ b/Documentation/virt/kvm/loongarch/hypercalls.rst @@ -0,0 +1,89 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +The LoongArch paravirtual interface +=================================== + +KVM hypercalls use the HVCL instruction with code 0x100 and the hypercall +number is put in a0. Up to five arguments may be placed in registers a1 - a5. +The return value is placed in v0 (an alias of a0). + +Source code for this interface can be found in arch/loongarch/kvm*. + +Querying for existence +====================== + +To determine if the host is running on KVM, we can utilize the cpucfg() +function at index CPUCFG_KVM_BASE (0x40000000). + +The CPUCFG_KVM_BASE range, spanning from 0x40000000 to 0x400000FF, The +CPUCFG_KVM_BASE range between 0x40000000 - 0x400000FF is marked as reserved. +Consequently, all current and future processors will not implement any +feature within this range. + +On a KVM-virtualized Linux system, a read operation on cpucfg() at index +CPUCFG_KVM_BASE (0x40000000) returns the magic string 'KVM\0'. + +Once you have determined that your host is running on a paravirtualization- +capable KVM, you may now use hypercalls as described below. + +KVM hypercall ABI +================= + +The KVM hypercall ABI is simple, with one scratch register a0 (v0) and at most +five generic registers (a1 - a5) used as input parameters. The FP (Floating- +point) and vector registers are not utilized as input registers and must +remain unmodified during a hypercall. + +Hypercall functions can be inlined as it only uses one scratch register. + +The parameters are as follows: + + ======== ================= ================ + Register IN OUT + ======== ================= ================ + a0 function number Return code + a1 1st parameter - + a2 2nd parameter - + a3 3rd parameter - + a4 4th parameter - + a5 5th parameter - + ======== ================= ================ + +The return codes may be one of the following: + + ==== ========================= + Code Meaning + ==== ========================= + 0 Success + -1 Hypercall not implemented + -2 Bad Hypercall parameter + ==== ========================= + +KVM Hypercalls Documentation +============================ + +The template for each hypercall is as follows: + +1. Hypercall name +2. Purpose + +1. KVM_HCALL_FUNC_IPI +------------------------ + +:Purpose: Send IPIs to multiple vCPUs. + +- a0: KVM_HCALL_FUNC_IPI +- a1: Lower part of the bitmap for destination physical CPUIDs +- a2: Higher part of the bitmap for destination physical CPUIDs +- a3: The lowest physical CPUID in the bitmap + +The hypercall lets a guest send multiple IPIs (Inter-Process Interrupts) with +at most 128 destinations per hypercall. The destinations are represented in a +bitmap contained in the first two input registers (a1 and a2). + +Bit 0 of a1 corresponds to the physical CPUID in the third input register (a3) +and bit 1 corresponds to the physical CPUID in a3+1, and so on. + +PV IPI on LoongArch includes both PV IPI multicast sending and PV IPI receiving, +and SWI is used for PV IPI inject since there is no VM-exits accessing SWI registers. diff --git a/Documentation/virt/kvm/loongarch/index.rst b/Documentation/virt/kvm/loongarch/index.rst new file mode 100644 index 000000000000..83387b4c5345 --- /dev/null +++ b/Documentation/virt/kvm/loongarch/index.rst @@ -0,0 +1,10 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +KVM for LoongArch systems +========================= + +.. toctree:: + :maxdepth: 2 + + hypercalls.rst diff --git a/Documentation/watchdog/watchdog-api.rst b/Documentation/watchdog/watchdog-api.rst index 800dcd7586f2..78e228c272cf 100644 --- a/Documentation/watchdog/watchdog-api.rst +++ b/Documentation/watchdog/watchdog-api.rst @@ -249,7 +249,7 @@ Note that not all devices support these two calls, and some only support the GETBOOTSTATUS call. Some drivers can measure the temperature using the GETTEMP ioctl. The -returned value is the temperature in degrees fahrenheit:: +returned value is the temperature in degrees Fahrenheit:: int temperature; ioctl(fd, WDIOC_GETTEMP, &temperature); diff --git a/MAINTAINERS b/MAINTAINERS index 200b15d61d19..514476562a11 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6753,6 +6753,7 @@ DOCUMENTATION PROCESS M: Jonathan Corbet L: workflows@vger.kernel.org S: Maintained +F: Documentation/dev-tools/ F: Documentation/maintainer/ F: Documentation/process/ @@ -6760,6 +6761,7 @@ DOCUMENTATION REPORTING ISSUES M: Thorsten Leemhuis L: linux-doc@vger.kernel.org S: Maintained +F: Documentation/admin-guide/bug-bisect.rst F: Documentation/admin-guide/quickly-build-trimmed-linux.rst F: Documentation/admin-guide/reporting-issues.rst F: Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@ -12367,6 +12369,7 @@ L: kvm@vger.kernel.org L: loongarch@lists.linux.dev S: Maintained T: git git://git.kernel.org/pub/scm/virt/kvm/kvm.git +F: Documentation/virt/kvm/loongarch/ F: arch/loongarch/include/asm/kvm* F: arch/loongarch/include/uapi/asm/kvm* F: arch/loongarch/kvm/ @@ -24922,6 +24925,17 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/core F: Documentation/arch/x86/ F: Documentation/devicetree/bindings/x86/ F: arch/x86/ +F: tools/testing/selftests/x86 + +X86 CPUID DATABASE +M: Borislav Petkov +M: Thomas Gleixner +M: x86@kernel.org +R: Ahmed S. Darwish +L: x86-cpuid@lists.linux.dev +S: Maintained +W: https://x86-cpuid.org +F: tools/arch/x86/kcpuid/cpuid.csv X86 ENTRY CODE M: Andy Lutomirski diff --git a/arch/m68k/configs/amiga_defconfig b/arch/m68k/configs/amiga_defconfig index 9a084943a68a..d01dc47d52ea 100644 --- a/arch/m68k/configs/amiga_defconfig +++ b/arch/m68k/configs/amiga_defconfig @@ -559,7 +559,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -636,7 +635,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/apollo_defconfig b/arch/m68k/configs/apollo_defconfig index 58ec725bf392..46808e581d7b 100644 --- a/arch/m68k/configs/apollo_defconfig +++ b/arch/m68k/configs/apollo_defconfig @@ -516,7 +516,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -593,7 +592,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/atari_defconfig b/arch/m68k/configs/atari_defconfig index 63ab7e892e59..4469a7839c9d 100644 --- a/arch/m68k/configs/atari_defconfig +++ b/arch/m68k/configs/atari_defconfig @@ -536,7 +536,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -613,7 +612,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/bvme6000_defconfig b/arch/m68k/configs/bvme6000_defconfig index ca40744eec60..c0719322c028 100644 --- a/arch/m68k/configs/bvme6000_defconfig +++ b/arch/m68k/configs/bvme6000_defconfig @@ -508,7 +508,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -585,7 +584,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/hp300_defconfig b/arch/m68k/configs/hp300_defconfig index 77bcc6faf468..8d429e63f8f2 100644 --- a/arch/m68k/configs/hp300_defconfig +++ b/arch/m68k/configs/hp300_defconfig @@ -518,7 +518,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -595,7 +594,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/mac_defconfig b/arch/m68k/configs/mac_defconfig index e73d4032b659..bafd33da27c1 100644 --- a/arch/m68k/configs/mac_defconfig +++ b/arch/m68k/configs/mac_defconfig @@ -535,7 +535,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -612,7 +611,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/multi_defconfig b/arch/m68k/configs/multi_defconfig index 638df8442c98..6f5ca3f85ea1 100644 --- a/arch/m68k/configs/multi_defconfig +++ b/arch/m68k/configs/multi_defconfig @@ -621,7 +621,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -698,7 +697,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/mvme147_defconfig b/arch/m68k/configs/mvme147_defconfig index 2248db426081..d16b328c7136 100644 --- a/arch/m68k/configs/mvme147_defconfig +++ b/arch/m68k/configs/mvme147_defconfig @@ -507,7 +507,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -584,7 +583,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/mvme16x_defconfig b/arch/m68k/configs/mvme16x_defconfig index 2975b66521f6..80f6c15a5ed5 100644 --- a/arch/m68k/configs/mvme16x_defconfig +++ b/arch/m68k/configs/mvme16x_defconfig @@ -508,7 +508,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -585,7 +584,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/q40_defconfig b/arch/m68k/configs/q40_defconfig index 0a0e32344033..0e81589f0ee2 100644 --- a/arch/m68k/configs/q40_defconfig +++ b/arch/m68k/configs/q40_defconfig @@ -525,7 +525,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -602,7 +601,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/sun3_defconfig b/arch/m68k/configs/sun3_defconfig index f16f1a99c2ba..8cd785290339 100644 --- a/arch/m68k/configs/sun3_defconfig +++ b/arch/m68k/configs/sun3_defconfig @@ -506,7 +506,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -582,7 +581,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/configs/sun3x_defconfig b/arch/m68k/configs/sun3x_defconfig index 667bd61ae9b3..78035369f60f 100644 --- a/arch/m68k/configs/sun3x_defconfig +++ b/arch/m68k/configs/sun3x_defconfig @@ -506,7 +506,6 @@ CONFIG_CRYPTO_DH=m CONFIG_CRYPTO_ECDH=m CONFIG_CRYPTO_ECDSA=m CONFIG_CRYPTO_ECRDSA=m -CONFIG_CRYPTO_SM2=m CONFIG_CRYPTO_CURVE25519=m CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_TI=m @@ -583,7 +582,6 @@ CONFIG_TEST_RHASHTABLE=m CONFIG_TEST_IDA=m CONFIG_TEST_BITOPS=m CONFIG_TEST_VMALLOC=m -CONFIG_TEST_USER_COPY=m CONFIG_TEST_BPF=m CONFIG_TEST_BLACKHOLE_DEV=m CONFIG_FIND_BIT_BENCHMARK=m diff --git a/arch/m68k/include/asm/cmpxchg.h b/arch/m68k/include/asm/cmpxchg.h index 4ba14f3535fc..71fbe5c5c564 100644 --- a/arch/m68k/include/asm/cmpxchg.h +++ b/arch/m68k/include/asm/cmpxchg.h @@ -3,6 +3,7 @@ #define __ARCH_M68K_CMPXCHG__ #include +#include #define __xg(type, x) ((volatile type *)(x)) @@ -11,25 +12,19 @@ extern unsigned long __invalid_xchg_size(unsigned long, volatile void *, int); #ifndef CONFIG_RMW_INSNS static inline unsigned long __arch_xchg(unsigned long x, volatile void * ptr, int size) { - unsigned long flags, tmp; + unsigned long flags; local_irq_save(flags); switch (size) { case 1: - tmp = *(u8 *)ptr; - *(u8 *)ptr = x; - x = tmp; + swap(*(u8 *)ptr, x); break; case 2: - tmp = *(u16 *)ptr; - *(u16 *)ptr = x; - x = tmp; + swap(*(u16 *)ptr, x); break; case 4: - tmp = *(u32 *)ptr; - *(u32 *)ptr = x; - x = tmp; + swap(*(u32 *)ptr, x); break; default: x = __invalid_xchg_size(x, ptr, size); diff --git a/arch/m68k/kernel/process.c b/arch/m68k/kernel/process.c index 2584e94e2134..fda7eac23f87 100644 --- a/arch/m68k/kernel/process.c +++ b/arch/m68k/kernel/process.c @@ -117,7 +117,7 @@ asmlinkage int m68k_clone(struct pt_regs *regs) { /* regs will be equal to current_pt_regs() */ struct kernel_clone_args args = { - .flags = regs->d1 & ~CSIGNAL, + .flags = (u32)(regs->d1) & ~CSIGNAL, .pidfd = (int __user *)regs->d3, .child_tid = (int __user *)regs->d4, .parent_tid = (int __user *)regs->d3, diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 6f1c31f4b9ab..d1fe732979d4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -107,6 +107,7 @@ config X86 select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_HAVE_EXTRA_ELF_NOTES select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 9327eb00e96d..f21ff1932699 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -18,6 +18,11 @@ #define ARCH_APICTIMER_STOPS_ON_C3 1 +/* Macros for apic_extnmi which controls external NMI masking */ +#define APIC_EXTNMI_BSP 0 /* Default */ +#define APIC_EXTNMI_ALL 1 +#define APIC_EXTNMI_NONE 2 + /* * Debugging macros */ @@ -25,22 +30,22 @@ #define APIC_VERBOSE 1 #define APIC_DEBUG 2 -/* Macros for apic_extnmi which controls external NMI masking */ -#define APIC_EXTNMI_BSP 0 /* Default */ -#define APIC_EXTNMI_ALL 1 -#define APIC_EXTNMI_NONE 2 - /* - * Define the default level of output to be very little - * This can be turned up by using apic=verbose for more - * information and apic=debug for _lots_ of information. - * apic_verbosity is defined in apic.c + * Define the default level of output to be very little This can be turned + * up by using apic=verbose for more information and apic=debug for _lots_ + * of information. apic_verbosity is defined in apic.c */ -#define apic_printk(v, s, a...) do { \ - if ((v) <= apic_verbosity) \ - printk(s, ##a); \ - } while (0) +#define apic_printk(v, s, a...) \ +do { \ + if ((v) <= apic_verbosity) \ + printk(s, ##a); \ +} while (0) +#define apic_pr_verbose(s, a...) apic_printk(APIC_VERBOSE, KERN_INFO s, ##a) +#define apic_pr_debug(s, a...) apic_printk(APIC_DEBUG, KERN_DEBUG s, ##a) +#define apic_pr_debug_cont(s, a...) apic_printk(APIC_DEBUG, KERN_CONT s, ##a) +/* Unconditional debug prints for code which is guarded by apic_verbosity already */ +#define apic_dbg(s, a...) printk(KERN_DEBUG s, ##a) #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32) extern void x86_32_probe_apic(void); @@ -122,8 +127,6 @@ static inline bool apic_is_x2apic_enabled(void) extern void enable_IR_x2apic(void); -extern int get_physical_broadcast(void); - extern int lapic_get_maxlvt(void); extern void clear_local_APIC(void); extern void disconnect_bsp_APIC(int virt_wire_setup); @@ -345,20 +348,12 @@ extern struct apic *apic; * APIC drivers are probed based on how they are listed in the .apicdrivers * section. So the order is important and enforced by the ordering * of different apic driver files in the Makefile. - * - * For the files having two apic drivers, we use apic_drivers() - * to enforce the order with in them. */ #define apic_driver(sym) \ static const struct apic *__apicdrivers_##sym __used \ __aligned(sizeof(struct apic *)) \ __section(".apicdrivers") = { &sym } -#define apic_drivers(sym1, sym2) \ - static struct apic *__apicdrivers_##sym1##sym2[2] __used \ - __aligned(sizeof(struct apic *)) \ - __section(".apicdrivers") = { &sym1, &sym2 } - extern struct apic *__apicdrivers[], *__apicdrivers_end[]; /* @@ -484,7 +479,6 @@ static inline u64 apic_icr_read(void) { return 0; } static inline void apic_icr_write(u32 low, u32 high) { } static inline void apic_wait_icr_idle(void) { } static inline u32 safe_apic_wait_icr_idle(void) { return 0; } -static inline void apic_set_eoi_cb(void (*eoi)(void)) {} static inline void apic_native_eoi(void) { WARN_ON_ONCE(1); } static inline void apic_setup_apic_calls(void) { } @@ -512,8 +506,6 @@ static inline bool is_vector_pending(unsigned int vector) #define TRAMPOLINE_PHYS_LOW 0x467 #define TRAMPOLINE_PHYS_HIGH 0x469 -extern void generic_bigsmp_probe(void); - #ifdef CONFIG_X86_LOCAL_APIC #include @@ -536,8 +528,6 @@ static inline int default_acpi_madt_oem_check(char *a, char *b) { return 0; } static inline void x86_64_probe_apic(void) { } #endif -extern int default_apic_id_valid(u32 apicid); - extern u32 apic_default_calc_apicid(unsigned int cpu); extern u32 apic_flat_calc_apicid(unsigned int cpu); diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h index a3ec87d198ac..806649c7f23d 100644 --- a/arch/x86/include/asm/bug.h +++ b/arch/x86/include/asm/bug.h @@ -13,6 +13,18 @@ #define INSN_UD2 0x0b0f #define LEN_UD2 2 +/* + * In clang we have UD1s reporting UBSAN failures on X86, 64 and 32bit. + */ +#define INSN_ASOP 0x67 +#define OPCODE_ESCAPE 0x0f +#define SECOND_BYTE_OPCODE_UD1 0xb9 +#define SECOND_BYTE_OPCODE_UD2 0x0b + +#define BUG_NONE 0xffff +#define BUG_UD1 0xfffe +#define BUG_UD2 0xfffd + #ifdef CONFIG_GENERIC_BUG #ifdef CONFIG_X86_32 diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h index fb2809b20b0a..77d20555e04d 100644 --- a/arch/x86/include/asm/entry-common.h +++ b/arch/x86/include/asm/entry-common.h @@ -8,6 +8,7 @@ #include #include #include +#include /* Check that the stack and regs on entry from user mode are sane. */ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs) @@ -44,8 +45,7 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs) } #define arch_enter_from_user_mode arch_enter_from_user_mode -static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, - unsigned long ti_work) +static inline void arch_exit_work(unsigned long ti_work) { if (ti_work & _TIF_USER_RETURN_NOTIFY) fire_user_return_notifiers(); @@ -56,6 +56,15 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, fpregs_assert_state_consistent(); if (unlikely(ti_work & _TIF_NEED_FPU_LOAD)) switch_fpu_return(); +} + +static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs, + unsigned long ti_work) +{ + if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) || unlikely(ti_work)) + arch_exit_work(ti_work); + + fred_update_rsp0(); #ifdef CONFIG_COMPAT /* diff --git a/arch/x86/include/asm/extable.h b/arch/x86/include/asm/extable.h index eeed395c3177..a0e0c6b50155 100644 --- a/arch/x86/include/asm/extable.h +++ b/arch/x86/include/asm/extable.h @@ -37,7 +37,6 @@ struct pt_regs; extern int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code, unsigned long fault_addr); -extern int fixup_bug(struct pt_regs *regs, int trapnr); extern int ex_get_fixup_type(unsigned long ip); extern void early_fixup_exception(struct pt_regs *regs, int trapnr); diff --git a/arch/x86/include/asm/fpu/signal.h b/arch/x86/include/asm/fpu/signal.h index 611fa41711af..eccc75bc9c4f 100644 --- a/arch/x86/include/asm/fpu/signal.h +++ b/arch/x86/include/asm/fpu/signal.h @@ -29,7 +29,7 @@ fpu__alloc_mathframe(unsigned long sp, int ia32_frame, unsigned long fpu__get_fpstate_size(void); -extern bool copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int size); +extern bool copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int size, u32 pkru); extern void fpu__clear_user_states(struct fpu *fpu); extern bool fpu__restore_sig(void __user *buf, int ia32_frame); diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h index e86c7ba32435..25ca00bd70e8 100644 --- a/arch/x86/include/asm/fred.h +++ b/arch/x86/include/asm/fred.h @@ -36,6 +36,7 @@ #ifdef CONFIG_X86_FRED #include +#include #include @@ -84,13 +85,33 @@ static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int } void cpu_init_fred_exceptions(void); +void cpu_init_fred_rsps(void); void fred_complete_exception_setup(void); +DECLARE_PER_CPU(unsigned long, fred_rsp0); + +static __always_inline void fred_sync_rsp0(unsigned long rsp0) +{ + __this_cpu_write(fred_rsp0, rsp0); +} + +static __always_inline void fred_update_rsp0(void) +{ + unsigned long rsp0 = (unsigned long) task_stack_page(current) + THREAD_SIZE; + + if (cpu_feature_enabled(X86_FEATURE_FRED) && (__this_cpu_read(fred_rsp0) != rsp0)) { + wrmsrns(MSR_IA32_FRED_RSP0, rsp0); + __this_cpu_write(fred_rsp0, rsp0); + } +} #else /* CONFIG_X86_FRED */ static __always_inline unsigned long fred_event_data(struct pt_regs *regs) { return 0; } static inline void cpu_init_fred_exceptions(void) { } +static inline void cpu_init_fred_rsps(void) { } static inline void fred_complete_exception_setup(void) { } -static __always_inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { } +static inline void fred_entry_from_kvm(unsigned int type, unsigned int vector) { } +static inline void fred_sync_rsp0(unsigned long rsp0) { } +static inline void fred_update_rsp0(void) { } #endif /* CONFIG_X86_FRED */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h index c67fa6ad098a..6ffa8b75f4cd 100644 --- a/arch/x86/include/asm/hardirq.h +++ b/arch/x86/include/asm/hardirq.h @@ -69,7 +69,11 @@ extern u64 arch_irq_stat(void); #define local_softirq_pending_ref pcpu_hot.softirq_pending #if IS_ENABLED(CONFIG_KVM_INTEL) -static inline void kvm_set_cpu_l1tf_flush_l1d(void) +/* + * This function is called from noinstr interrupt contexts + * and must be inlined to not get instrumentation. + */ +static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void) { __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1); } @@ -84,7 +88,7 @@ static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void) return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d); } #else /* !IS_ENABLED(CONFIG_KVM_INTEL) */ -static inline void kvm_set_cpu_l1tf_flush_l1d(void) { } +static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void) { } #endif /* IS_ENABLED(CONFIG_KVM_INTEL) */ #endif /* _ASM_X86_HARDIRQ_H */ diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index d4f24499b256..ad5c68f0509d 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -212,8 +212,8 @@ __visible noinstr void func(struct pt_regs *regs, \ irqentry_state_t state = irqentry_enter(regs); \ u32 vector = (u32)(u8)error_code; \ \ + kvm_set_cpu_l1tf_flush_l1d(); \ instrumentation_begin(); \ - kvm_set_cpu_l1tf_flush_l1d(); \ run_irq_on_irqstack_cond(__##func, regs, vector); \ instrumentation_end(); \ irqentry_exit(regs, state); \ @@ -250,7 +250,6 @@ static void __##func(struct pt_regs *regs); \ \ static __always_inline void instr_##func(struct pt_regs *regs) \ { \ - kvm_set_cpu_l1tf_flush_l1d(); \ run_sysvec_on_irqstack_cond(__##func, regs); \ } \ \ @@ -258,6 +257,7 @@ __visible noinstr void func(struct pt_regs *regs) \ { \ irqentry_state_t state = irqentry_enter(regs); \ \ + kvm_set_cpu_l1tf_flush_l1d(); \ instrumentation_begin(); \ instr_##func (regs); \ instrumentation_end(); \ @@ -288,7 +288,6 @@ static __always_inline void __##func(struct pt_regs *regs); \ static __always_inline void instr_##func(struct pt_regs *regs) \ { \ __irq_enter_raw(); \ - kvm_set_cpu_l1tf_flush_l1d(); \ __##func (regs); \ __irq_exit_raw(); \ } \ @@ -297,6 +296,7 @@ __visible noinstr void func(struct pt_regs *regs) \ { \ irqentry_state_t state = irqentry_enter(regs); \ \ + kvm_set_cpu_l1tf_flush_l1d(); \ instrumentation_begin(); \ instr_##func (regs); \ instrumentation_end(); \ diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h index 13aea8fc3d45..47051871b436 100644 --- a/arch/x86/include/asm/irq_vectors.h +++ b/arch/x86/include/asm/irq_vectors.h @@ -18,8 +18,8 @@ * Vectors 0 ... 31 : system traps and exceptions - hardcoded events * Vectors 32 ... 127 : device interrupts * Vector 128 : legacy int80 syscall interface - * Vectors 129 ... LOCAL_TIMER_VECTOR-1 - * Vectors LOCAL_TIMER_VECTOR ... 255 : special interrupts + * Vectors 129 ... FIRST_SYSTEM_VECTOR-1 : device interrupts + * Vectors FIRST_SYSTEM_VECTOR ... 255 : special interrupts * * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table. * diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 8dac45a2c7fc..19091ebb8633 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -88,7 +88,13 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next) #ifdef CONFIG_ADDRESS_MASKING static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm) { - return mm->context.lam_cr3_mask; + /* + * When switch_mm_irqs_off() is called for a kthread, it may race with + * LAM enablement. switch_mm_irqs_off() uses the LAM mask to do two + * things: populate CR3 and populate 'cpu_tlbstate.lam'. Make sure it + * reads a single value for both. + */ + return READ_ONCE(mm->context.lam_cr3_mask); } static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm) diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index d642037f9ed5..001853541f1e 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -99,19 +99,6 @@ static __always_inline void __wrmsr(unsigned int msr, u32 low, u32 high) : : "c" (msr), "a"(low), "d" (high) : "memory"); } -/* - * WRMSRNS behaves exactly like WRMSR with the only difference being - * that it is not a serializing instruction by default. - */ -static __always_inline void __wrmsrns(u32 msr, u32 low, u32 high) -{ - /* Instruction opcode for WRMSRNS; supported in binutils >= 2.40. */ - asm volatile("1: .byte 0x0f,0x01,0xc6\n" - "2:\n" - _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) - : : "c" (msr), "a"(low), "d" (high)); -} - #define native_rdmsr(msr, val1, val2) \ do { \ u64 __val = __rdmsr((msr)); \ @@ -312,9 +299,19 @@ do { \ #endif /* !CONFIG_PARAVIRT_XXL */ +/* Instruction opcode for WRMSRNS supported in binutils >= 2.40 */ +#define WRMSRNS _ASM_BYTES(0x0f,0x01,0xc6) + +/* Non-serializing WRMSR, when available. Falls back to a serializing WRMSR. */ static __always_inline void wrmsrns(u32 msr, u64 val) { - __wrmsrns(msr, val, val >> 32); + /* + * WRMSR is 2 bytes. WRMSRNS is 3 bytes. Pad WRMSR with a redundant + * DS prefix to avoid a trailing NOP. + */ + asm volatile("1: " ALTERNATIVE("ds wrmsr", WRMSRNS, X86_FEATURE_WRMSRNS) + "2: " _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR) + : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))); } /* diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h index 090d658a85a6..4218248083d9 100644 --- a/arch/x86/include/asm/mtrr.h +++ b/arch/x86/include/asm/mtrr.h @@ -69,7 +69,6 @@ extern int mtrr_add_page(unsigned long base, unsigned long size, unsigned int type, bool increment); extern int mtrr_del(int reg, unsigned long base, unsigned long size); extern int mtrr_del_page(int reg, unsigned long base, unsigned long size); -extern void mtrr_bp_restore(void); extern int mtrr_trim_uncached_memory(unsigned long end_pfn); extern int amd_special_default_mtrr(void); void mtrr_disable(void); @@ -117,7 +116,6 @@ static inline int mtrr_trim_uncached_memory(unsigned long end_pfn) return 0; } #define mtrr_bp_init() do {} while (0) -#define mtrr_bp_restore() do {} while (0) #define mtrr_disable() do {} while (0) #define mtrr_enable() do {} while (0) #define mtrr_generic_set_state() do {} while (0) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 2f321137736c..6f82e75b6149 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -517,8 +517,6 @@ typedef struct page *pgtable_t; extern pteval_t __supported_pte_mask; extern pteval_t __default_kernel_pte_mask; -extern void set_nx(void); -extern int nx_enabled; #define pgprot_writecombine pgprot_writecombine extern pgprot_t pgprot_writecombine(pgprot_t prot); diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 775acbdea1a9..4a686f0e5dbf 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -582,7 +582,8 @@ extern void switch_gdt_and_percpu_base(int); extern void load_direct_gdt(int); extern void load_fixmap_gdt(int); extern void cpu_init(void); -extern void cpu_init_exception_handling(void); +extern void cpu_init_exception_handling(bool boot_cpu); +extern void cpu_init_replace_early_idt(void); extern void cr4_init(void); extern void set_task_blockstep(struct task_struct *task, bool on); diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h index c3bd0c0758c9..75248546403d 100644 --- a/arch/x86/include/asm/switch_to.h +++ b/arch/x86/include/asm/switch_to.h @@ -70,13 +70,9 @@ static inline void update_task_stack(struct task_struct *task) #ifdef CONFIG_X86_32 this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0); #else - if (cpu_feature_enabled(X86_FEATURE_FRED)) { - /* WRMSRNS is a baseline feature for FRED. */ - wrmsrns(MSR_IA32_FRED_RSP0, (unsigned long)task_stack_page(task) + THREAD_SIZE); - } else if (cpu_feature_enabled(X86_FEATURE_XENPV)) { + if (!cpu_feature_enabled(X86_FEATURE_FRED) && cpu_feature_enabled(X86_FEATURE_XENPV)) /* Xen PV enters the kernel on the thread stack. */ load_sp0(task_top_of_stack(task)); - } #endif } diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index 2fc7bc3863ff..7c488ff0c764 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -82,7 +82,12 @@ static inline void syscall_get_arguments(struct task_struct *task, struct pt_regs *regs, unsigned long *args) { - memcpy(args, ®s->bx, 6 * sizeof(args[0])); + args[0] = regs->bx; + args[1] = regs->cx; + args[2] = regs->dx; + args[3] = regs->si; + args[4] = regs->di; + args[5] = regs->bp; } static inline int syscall_get_arch(struct task_struct *task) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 25726893c6f4..69e79fff41b8 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -399,11 +399,10 @@ static inline u64 tlbstate_lam_cr3_mask(void) return lam << X86_CR3_LAM_U57_BIT; } -static inline void set_tlbstate_lam_mode(struct mm_struct *mm) +static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask) { - this_cpu_write(cpu_tlbstate.lam, - mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT); - this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask); + this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT); + this_cpu_write(tlbstate_untag_mask, untag_mask); } #else @@ -413,7 +412,7 @@ static inline u64 tlbstate_lam_cr3_mask(void) return 0; } -static inline void set_tlbstate_lam_mode(struct mm_struct *mm) +static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask) { } #endif diff --git a/arch/x86/include/asm/uv/uv_irq.h b/arch/x86/include/asm/uv/uv_irq.h index d6b17c760622..1876b5edd142 100644 --- a/arch/x86/include/asm/uv/uv_irq.h +++ b/arch/x86/include/asm/uv/uv_irq.h @@ -31,7 +31,6 @@ enum { UV_AFFINITY_CPU }; -extern int uv_irq_2_mmr_info(int, unsigned long *, int *); extern int uv_setup_irq(char *, int, int, unsigned long, int); extern void uv_teardown_irq(unsigned int); diff --git a/arch/x86/include/uapi/asm/elf.h b/arch/x86/include/uapi/asm/elf.h new file mode 100644 index 000000000000..468e135fa285 --- /dev/null +++ b/arch/x86/include/uapi/asm/elf.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_ASM_X86_ELF_H +#define _UAPI_ASM_X86_ELF_H + +#include + +struct x86_xfeat_component { + __u32 type; + __u32 size; + __u32 offset; + __u32 flags; +} __packed; + +_Static_assert(sizeof(struct x86_xfeat_component) % 4 == 0, "x86_xfeat_component is not aligned"); + +#endif /* _UAPI_ASM_X86_ELF_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index a847180836e4..f7918980667a 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -35,6 +35,14 @@ KMSAN_SANITIZE_nmi.o := n # If instrumentation of the following files is enabled, boot hangs during # first second. KCOV_INSTRUMENT_head$(BITS).o := n +# These are called from save_stack_trace() on debug paths, +# and produce large amounts of uninteresting coverage. +KCOV_INSTRUMENT_stacktrace.o := n +KCOV_INSTRUMENT_dumpstack.o := n +KCOV_INSTRUMENT_dumpstack_$(BITS).o := n +KCOV_INSTRUMENT_unwind_orc.o := n +KCOV_INSTRUMENT_unwind_frame.o := n +KCOV_INSTRUMENT_unwind_guess.o := n CFLAGS_irq.o := -I $(src)/../include/asm/trace diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c index 059e5c16af05..dc5d3216af24 100644 --- a/arch/x86/kernel/amd_nb.c +++ b/arch/x86/kernel/amd_nb.c @@ -26,6 +26,7 @@ #define PCI_DEVICE_ID_AMD_19H_M70H_ROOT 0x14e8 #define PCI_DEVICE_ID_AMD_1AH_M00H_ROOT 0x153a #define PCI_DEVICE_ID_AMD_1AH_M20H_ROOT 0x1507 +#define PCI_DEVICE_ID_AMD_1AH_M60H_ROOT 0x1122 #define PCI_DEVICE_ID_AMD_MI200_ROOT 0x14bb #define PCI_DEVICE_ID_AMD_MI300_ROOT 0x14f8 @@ -43,6 +44,8 @@ #define PCI_DEVICE_ID_AMD_19H_M70H_DF_F4 0x14f4 #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F4 0x12fc #define PCI_DEVICE_ID_AMD_1AH_M00H_DF_F4 0x12c4 +#define PCI_DEVICE_ID_AMD_1AH_M60H_DF_F4 0x124c +#define PCI_DEVICE_ID_AMD_1AH_M70H_DF_F4 0x12bc #define PCI_DEVICE_ID_AMD_MI200_DF_F4 0x14d4 #define PCI_DEVICE_ID_AMD_MI300_DF_F4 0x152c @@ -63,6 +66,7 @@ static const struct pci_device_id amd_root_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M70H_ROOT) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M00H_ROOT) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M20H_ROOT) }, + { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M60H_ROOT) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_ROOT) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI300_ROOT) }, {} @@ -95,6 +99,7 @@ static const struct pci_device_id amd_nb_misc_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M00H_DF_F3) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M20H_DF_F3) }, + { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M60H_DF_F3) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M70H_DF_F3) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F3) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI300_DF_F3) }, @@ -122,6 +127,8 @@ static const struct pci_device_id amd_nb_link_ids[] = { { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F4) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CNB17H_F4) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M00H_DF_F4) }, + { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M60H_DF_F4) }, + { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_1AH_M70H_DF_F4) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI200_DF_F4) }, { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_MI300_DF_F4) }, {} diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 373638691cd4..6513c53c9459 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -677,7 +677,7 @@ calibrate_by_pmtimer(u32 deltapm, long *delta, long *deltatsc) return -1; #endif - apic_printk(APIC_VERBOSE, "... PM-Timer delta = %u\n", deltapm); + apic_pr_verbose("... PM-Timer delta = %u\n", deltapm); /* Check, if the PM timer is available */ if (!deltapm) @@ -687,14 +687,14 @@ calibrate_by_pmtimer(u32 deltapm, long *delta, long *deltatsc) if (deltapm > (pm_100ms - pm_thresh) && deltapm < (pm_100ms + pm_thresh)) { - apic_printk(APIC_VERBOSE, "... PM-Timer result ok\n"); + apic_pr_verbose("... PM-Timer result ok\n"); return 0; } res = (((u64)deltapm) * mult) >> 22; do_div(res, 1000000); - pr_warn("APIC calibration not consistent " - "with PM-Timer: %ldms instead of 100ms\n", (long)res); + pr_warn("APIC calibration not consistent with PM-Timer: %ldms instead of 100ms\n", + (long)res); /* Correct the lapic counter value */ res = (((u64)(*delta)) * pm_100ms); @@ -707,9 +707,8 @@ calibrate_by_pmtimer(u32 deltapm, long *delta, long *deltatsc) if (boot_cpu_has(X86_FEATURE_TSC)) { res = (((u64)(*deltatsc)) * pm_100ms); do_div(res, deltapm); - apic_printk(APIC_VERBOSE, "TSC delta adjusted to " - "PM-Timer: %lu (%ld)\n", - (unsigned long)res, *deltatsc); + apic_pr_verbose("TSC delta adjusted to PM-Timer: %lu (%ld)\n", + (unsigned long)res, *deltatsc); *deltatsc = (long)res; } @@ -792,8 +791,7 @@ static int __init calibrate_APIC_clock(void) * in the clockevent structure and return. */ if (!lapic_init_clockevent()) { - apic_printk(APIC_VERBOSE, "lapic timer already calibrated %d\n", - lapic_timer_period); + apic_pr_verbose("lapic timer already calibrated %d\n", lapic_timer_period); /* * Direct calibration methods must have an always running * local APIC timer, no need for broadcast timer. @@ -802,8 +800,7 @@ static int __init calibrate_APIC_clock(void) return 0; } - apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n" - "calibrating APIC timer ...\n"); + apic_pr_verbose("Using local APIC timer interrupts. Calibrating APIC timer ...\n"); /* * There are platforms w/o global clockevent devices. Instead of @@ -866,7 +863,7 @@ static int __init calibrate_APIC_clock(void) /* Build delta t1-t2 as apic timer counts down */ delta = lapic_cal_t1 - lapic_cal_t2; - apic_printk(APIC_VERBOSE, "... lapic delta = %ld\n", delta); + apic_pr_verbose("... lapic delta = %ld\n", delta); deltatsc = (long)(lapic_cal_tsc2 - lapic_cal_tsc1); @@ -877,22 +874,19 @@ static int __init calibrate_APIC_clock(void) lapic_timer_period = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS; lapic_init_clockevent(); - apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta); - apic_printk(APIC_VERBOSE, "..... mult: %u\n", lapic_clockevent.mult); - apic_printk(APIC_VERBOSE, "..... calibration result: %u\n", - lapic_timer_period); + apic_pr_verbose("..... delta %ld\n", delta); + apic_pr_verbose("..... mult: %u\n", lapic_clockevent.mult); + apic_pr_verbose("..... calibration result: %u\n", lapic_timer_period); if (boot_cpu_has(X86_FEATURE_TSC)) { - apic_printk(APIC_VERBOSE, "..... CPU clock speed is " - "%ld.%04ld MHz.\n", - (deltatsc / LAPIC_CAL_LOOPS) / (1000000 / HZ), - (deltatsc / LAPIC_CAL_LOOPS) % (1000000 / HZ)); + apic_pr_verbose("..... CPU clock speed is %ld.%04ld MHz.\n", + (deltatsc / LAPIC_CAL_LOOPS) / (1000000 / HZ), + (deltatsc / LAPIC_CAL_LOOPS) % (1000000 / HZ)); } - apic_printk(APIC_VERBOSE, "..... host bus clock speed is " - "%u.%04u MHz.\n", - lapic_timer_period / (1000000 / HZ), - lapic_timer_period % (1000000 / HZ)); + apic_pr_verbose("..... host bus clock speed is %u.%04u MHz.\n", + lapic_timer_period / (1000000 / HZ), + lapic_timer_period % (1000000 / HZ)); /* * Do a sanity check on the APIC calibration result @@ -911,7 +905,7 @@ static int __init calibrate_APIC_clock(void) * available. */ if (!pm_referenced && global_clock_event) { - apic_printk(APIC_VERBOSE, "... verify APIC timer\n"); + apic_pr_verbose("... verify APIC timer\n"); /* * Setup the apic timer manually @@ -932,11 +926,11 @@ static int __init calibrate_APIC_clock(void) /* Jiffies delta */ deltaj = lapic_cal_j2 - lapic_cal_j1; - apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj); + apic_pr_verbose("... jiffies delta = %lu\n", deltaj); /* Check, if the jiffies result is consistent */ if (deltaj >= LAPIC_CAL_LOOPS-2 && deltaj <= LAPIC_CAL_LOOPS+2) - apic_printk(APIC_VERBOSE, "... jiffies result ok\n"); + apic_pr_verbose("... jiffies result ok\n"); else levt->features |= CLOCK_EVT_FEAT_DUMMY; } @@ -1221,9 +1215,8 @@ void __init sync_Arb_IDs(void) */ apic_wait_icr_idle(); - apic_printk(APIC_DEBUG, "Synchronizing Arb IDs.\n"); - apic_write(APIC_ICR, APIC_DEST_ALLINC | - APIC_INT_LEVELTRIG | APIC_DM_INIT); + apic_pr_debug("Synchronizing Arb IDs.\n"); + apic_write(APIC_ICR, APIC_DEST_ALLINC | APIC_INT_LEVELTRIG | APIC_DM_INIT); } enum apic_intr_mode_id apic_intr_mode __ro_after_init; @@ -1409,10 +1402,10 @@ static void lapic_setup_esr(void) if (maxlvt > 3) apic_write(APIC_ESR, 0); value = apic_read(APIC_ESR); - if (value != oldvalue) - apic_printk(APIC_VERBOSE, "ESR value before enabling " - "vector: 0x%08x after: 0x%08x\n", - oldvalue, value); + if (value != oldvalue) { + apic_pr_verbose("ESR value before enabling vector: 0x%08x after: 0x%08x\n", + oldvalue, value); + } } #define APIC_IR_REGS APIC_ISR_NR @@ -1599,10 +1592,10 @@ static void setup_local_APIC(void) value = apic_read(APIC_LVT0) & APIC_LVT_MASKED; if (!cpu && (pic_mode || !value || ioapic_is_disabled)) { value = APIC_DM_EXTINT; - apic_printk(APIC_VERBOSE, "enabled ExtINT on CPU#%d\n", cpu); + apic_pr_verbose("Enabled ExtINT on CPU#%d\n", cpu); } else { value = APIC_DM_EXTINT | APIC_LVT_MASKED; - apic_printk(APIC_VERBOSE, "masked ExtINT on CPU#%d\n", cpu); + apic_pr_verbose("Masked ExtINT on CPU#%d\n", cpu); } apic_write(APIC_LVT0, value); @@ -2067,8 +2060,7 @@ static __init void apic_set_fixmap(bool read_apic) { set_fixmap_nocache(FIX_APIC_BASE, mp_lapic_addr); apic_mmio_base = APIC_BASE; - apic_printk(APIC_VERBOSE, "mapped APIC to %16lx (%16lx)\n", - apic_mmio_base, mp_lapic_addr); + apic_pr_verbose("Mapped APIC to %16lx (%16lx)\n", apic_mmio_base, mp_lapic_addr); if (read_apic) apic_read_boot_cpu_id(false); } @@ -2171,18 +2163,17 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_error_interrupt) apic_eoi(); atomic_inc(&irq_err_count); - apic_printk(APIC_DEBUG, KERN_DEBUG "APIC error on CPU%d: %02x", - smp_processor_id(), v); + apic_pr_debug("APIC error on CPU%d: %02x", smp_processor_id(), v); v &= 0xff; while (v) { if (v & 0x1) - apic_printk(APIC_DEBUG, KERN_CONT " : %s", error_interrupt_reason[i]); + apic_pr_debug_cont(" : %s", error_interrupt_reason[i]); i++; v >>= 1; } - apic_printk(APIC_DEBUG, KERN_CONT "\n"); + apic_pr_debug_cont("\n"); trace_error_apic_exit(ERROR_APIC_VECTOR); } @@ -2202,8 +2193,7 @@ static void __init connect_bsp_APIC(void) * PIC mode, enable APIC mode in the IMCR, i.e. connect BSP's * local APIC to INT and NMI lines. */ - apic_printk(APIC_VERBOSE, "leaving PIC mode, " - "enabling APIC mode.\n"); + apic_pr_verbose("Leaving PIC mode, enabling APIC mode.\n"); imcr_pic_to_apic(); } #endif @@ -2228,8 +2218,7 @@ void disconnect_bsp_APIC(int virt_wire_setup) * IPIs, won't work beyond this point! The only exception are * INIT IPIs. */ - apic_printk(APIC_VERBOSE, "disabling APIC mode, " - "entering PIC mode.\n"); + apic_pr_verbose("Disabling APIC mode, entering PIC mode.\n"); imcr_apic_to_pic(); return; } diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c index f37ad3392fec..e0308d8c4e6c 100644 --- a/arch/x86/kernel/apic/apic_flat_64.c +++ b/arch/x86/kernel/apic/apic_flat_64.c @@ -8,129 +8,25 @@ * Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and * James Cleverdon. */ -#include #include -#include -#include #include #include "local.h" -static struct apic apic_physflat; -static struct apic apic_flat; - -struct apic *apic __ro_after_init = &apic_flat; -EXPORT_SYMBOL_GPL(apic); - -static int flat_acpi_madt_oem_check(char *oem_id, char *oem_table_id) -{ - return 1; -} - -static void _flat_send_IPI_mask(unsigned long mask, int vector) -{ - unsigned long flags; - - local_irq_save(flags); - __default_send_IPI_dest_field(mask, vector, APIC_DEST_LOGICAL); - local_irq_restore(flags); -} - -static void flat_send_IPI_mask(const struct cpumask *cpumask, int vector) -{ - unsigned long mask = cpumask_bits(cpumask)[0]; - - _flat_send_IPI_mask(mask, vector); -} - -static void -flat_send_IPI_mask_allbutself(const struct cpumask *cpumask, int vector) -{ - unsigned long mask = cpumask_bits(cpumask)[0]; - int cpu = smp_processor_id(); - - if (cpu < BITS_PER_LONG) - __clear_bit(cpu, &mask); - - _flat_send_IPI_mask(mask, vector); -} - -static u32 flat_get_apic_id(u32 x) +static u32 physflat_get_apic_id(u32 x) { return (x >> 24) & 0xFF; } -static int flat_probe(void) +static int physflat_probe(void) { return 1; } -static struct apic apic_flat __ro_after_init = { - .name = "flat", - .probe = flat_probe, - .acpi_madt_oem_check = flat_acpi_madt_oem_check, - - .dest_mode_logical = true, - - .disable_esr = 0, - - .init_apic_ldr = default_init_apic_ldr, - .cpu_present_to_apicid = default_cpu_present_to_apicid, - - .max_apic_id = 0xFE, - .get_apic_id = flat_get_apic_id, - - .calc_dest_apicid = apic_flat_calc_apicid, - - .send_IPI = default_send_IPI_single, - .send_IPI_mask = flat_send_IPI_mask, - .send_IPI_mask_allbutself = flat_send_IPI_mask_allbutself, - .send_IPI_allbutself = default_send_IPI_allbutself, - .send_IPI_all = default_send_IPI_all, - .send_IPI_self = default_send_IPI_self, - .nmi_to_offline_cpu = true, - - .read = native_apic_mem_read, - .write = native_apic_mem_write, - .eoi = native_apic_mem_eoi, - .icr_read = native_apic_icr_read, - .icr_write = native_apic_icr_write, - .wait_icr_idle = apic_mem_wait_icr_idle, - .safe_wait_icr_idle = apic_mem_wait_icr_idle_timeout, -}; - -/* - * Physflat mode is used when there are more than 8 CPUs on a system. - * We cannot use logical delivery in this case because the mask - * overflows, so use physical mode. - */ static int physflat_acpi_madt_oem_check(char *oem_id, char *oem_table_id) { -#ifdef CONFIG_ACPI - /* - * Quirk: some x86_64 machines can only use physical APIC mode - * regardless of how many processors are present (x86_64 ES7000 - * is an example). - */ - if (acpi_gbl_FADT.header.revision >= FADT2_REVISION_ID && - (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL)) { - printk(KERN_DEBUG "system APIC only can use physical flat"); - return 1; - } - - if (!strncmp(oem_id, "IBM", 3) && !strncmp(oem_table_id, "EXA", 3)) { - printk(KERN_DEBUG "IBM Summit detected, will use apic physical"); - return 1; - } -#endif - - return 0; -} - -static int physflat_probe(void) -{ - return apic == &apic_physflat || num_possible_cpus() > 8 || jailhouse_paravirt(); + return 1; } static struct apic apic_physflat __ro_after_init = { @@ -146,7 +42,7 @@ static struct apic apic_physflat __ro_after_init = { .cpu_present_to_apicid = default_cpu_present_to_apicid, .max_apic_id = 0xFE, - .get_apic_id = flat_get_apic_id, + .get_apic_id = physflat_get_apic_id, .calc_dest_apicid = apic_default_calc_apicid, @@ -166,8 +62,7 @@ static struct apic apic_physflat __ro_after_init = { .wait_icr_idle = apic_mem_wait_icr_idle, .safe_wait_icr_idle = apic_mem_wait_icr_idle_timeout, }; +apic_driver(apic_physflat); -/* - * We need to check for physflat first, so this order is important. - */ -apic_drivers(apic_physflat, apic_flat); +struct apic *apic __ro_after_init = &apic_physflat; +EXPORT_SYMBOL_GPL(apic); diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c index 477b740b2f26..1029ea4ac8ba 100644 --- a/arch/x86/kernel/apic/io_apic.c +++ b/arch/x86/kernel/apic/io_apic.c @@ -86,8 +86,8 @@ static unsigned int ioapic_dynirq_base; static int ioapic_initialized; struct irq_pin_list { - struct list_head list; - int apic, pin; + struct list_head list; + int apic, pin; }; struct mp_chip_data { @@ -96,7 +96,7 @@ struct mp_chip_data { bool is_level; bool active_low; bool isa_irq; - u32 count; + u32 count; }; struct mp_ioapic_gsi { @@ -105,21 +105,17 @@ struct mp_ioapic_gsi { }; static struct ioapic { - /* - * # of IRQ routing registers - */ - int nr_registers; - /* - * Saved state during suspend/resume, or while enabling intr-remap. - */ - struct IO_APIC_route_entry *saved_registers; + /* # of IRQ routing registers */ + int nr_registers; + /* Saved state during suspend/resume, or while enabling intr-remap. */ + struct IO_APIC_route_entry *saved_registers; /* I/O APIC config */ - struct mpc_ioapic mp_config; + struct mpc_ioapic mp_config; /* IO APIC gsi routing info */ - struct mp_ioapic_gsi gsi_config; - struct ioapic_domain_cfg irqdomain_cfg; - struct irq_domain *irqdomain; - struct resource *iomem_res; + struct mp_ioapic_gsi gsi_config; + struct ioapic_domain_cfg irqdomain_cfg; + struct irq_domain *irqdomain; + struct resource *iomem_res; } ioapics[MAX_IO_APICS]; #define mpc_ioapic_ver(ioapic_idx) ioapics[ioapic_idx].mp_config.apicver @@ -205,10 +201,9 @@ void mp_save_irq(struct mpc_intsrc *m) { int i; - apic_printk(APIC_VERBOSE, "Int: type %d, pol %d, trig %d, bus %02x," - " IRQ %02x, APIC ID %x, APIC INT %02x\n", - m->irqtype, m->irqflag & 3, (m->irqflag >> 2) & 3, m->srcbus, - m->srcbusirq, m->dstapic, m->dstirq); + apic_pr_verbose("Int: type %d, pol %d, trig %d, bus %02x, IRQ %02x, APIC ID %x, APIC INT %02x\n", + m->irqtype, m->irqflag & 3, (m->irqflag >> 2) & 3, m->srcbus, + m->srcbusirq, m->dstapic, m->dstirq); for (i = 0; i < mp_irq_entries; i++) { if (!memcmp(&mp_irqs[i], m, sizeof(*m))) @@ -269,12 +264,14 @@ static __attribute_const__ struct io_apic __iomem *io_apic_base(int idx) static inline void io_apic_eoi(unsigned int apic, unsigned int vector) { struct io_apic __iomem *io_apic = io_apic_base(apic); + writel(vector, &io_apic->eoi); } unsigned int native_io_apic_read(unsigned int apic, unsigned int reg) { struct io_apic __iomem *io_apic = io_apic_base(apic); + writel(reg, &io_apic->index); return readl(&io_apic->data); } @@ -300,14 +297,8 @@ static struct IO_APIC_route_entry __ioapic_read_entry(int apic, int pin) static struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin) { - struct IO_APIC_route_entry entry; - unsigned long flags; - - raw_spin_lock_irqsave(&ioapic_lock, flags); - entry = __ioapic_read_entry(apic, pin); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); - - return entry; + guard(raw_spinlock_irqsave)(&ioapic_lock); + return __ioapic_read_entry(apic, pin); } /* @@ -324,11 +315,8 @@ static void __ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e static void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e) { - unsigned long flags; - - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); __ioapic_write_entry(apic, pin, e); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); } /* @@ -339,12 +327,10 @@ static void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e) static void ioapic_mask_entry(int apic, int pin) { struct IO_APIC_route_entry e = { .masked = true }; - unsigned long flags; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); io_apic_write(apic, 0x10 + 2*pin, e.w1); io_apic_write(apic, 0x11 + 2*pin, e.w2); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); } /* @@ -352,68 +338,39 @@ static void ioapic_mask_entry(int apic, int pin) * shared ISA-space IRQs, so we have to support them. We are super * fast in the common case, and fast for shared ISA-space IRQs. */ -static int __add_pin_to_irq_node(struct mp_chip_data *data, - int node, int apic, int pin) +static bool add_pin_to_irq_node(struct mp_chip_data *data, int node, int apic, int pin) { struct irq_pin_list *entry; - /* don't allow duplicates */ - for_each_irq_pin(entry, data->irq_2_pin) + /* Don't allow duplicates */ + for_each_irq_pin(entry, data->irq_2_pin) { if (entry->apic == apic && entry->pin == pin) - return 0; + return true; + } entry = kzalloc_node(sizeof(struct irq_pin_list), GFP_ATOMIC, node); if (!entry) { - pr_err("can not alloc irq_pin_list (%d,%d,%d)\n", - node, apic, pin); - return -ENOMEM; + pr_err("Cannot allocate irq_pin_list (%d,%d,%d)\n", node, apic, pin); + return false; } + entry->apic = apic; entry->pin = pin; list_add_tail(&entry->list, &data->irq_2_pin); - - return 0; + return true; } static void __remove_pin_from_irq(struct mp_chip_data *data, int apic, int pin) { struct irq_pin_list *tmp, *entry; - list_for_each_entry_safe(entry, tmp, &data->irq_2_pin, list) + list_for_each_entry_safe(entry, tmp, &data->irq_2_pin, list) { if (entry->apic == apic && entry->pin == pin) { list_del(&entry->list); kfree(entry); return; } -} - -static void add_pin_to_irq_node(struct mp_chip_data *data, - int node, int apic, int pin) -{ - if (__add_pin_to_irq_node(data, node, apic, pin)) - panic("IO-APIC: failed to add irq-pin. Can not proceed\n"); -} - -/* - * Reroute an IRQ to a different pin. - */ -static void __init replace_pin_at_irq_node(struct mp_chip_data *data, int node, - int oldapic, int oldpin, - int newapic, int newpin) -{ - struct irq_pin_list *entry; - - for_each_irq_pin(entry, data->irq_2_pin) { - if (entry->apic == oldapic && entry->pin == oldpin) { - entry->apic = newapic; - entry->pin = newpin; - /* every one is different, right? */ - return; - } } - - /* old apic/pin didn't exist, so just add new ones */ - add_pin_to_irq_node(data, node, newapic, newpin); } static void io_apic_modify_irq(struct mp_chip_data *data, bool masked, @@ -430,12 +387,12 @@ static void io_apic_modify_irq(struct mp_chip_data *data, bool masked, } } +/* + * Synchronize the IO-APIC and the CPU by doing a dummy read from the + * IO-APIC + */ static void io_apic_sync(struct irq_pin_list *entry) { - /* - * Synchronize the IO-APIC and the CPU by doing - * a dummy read from the IO-APIC - */ struct io_apic __iomem *io_apic; io_apic = io_apic_base(entry->apic); @@ -445,11 +402,9 @@ static void io_apic_sync(struct irq_pin_list *entry) static void mask_ioapic_irq(struct irq_data *irq_data) { struct mp_chip_data *data = irq_data->chip_data; - unsigned long flags; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); io_apic_modify_irq(data, true, &io_apic_sync); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); } static void __unmask_ioapic(struct mp_chip_data *data) @@ -460,11 +415,9 @@ static void __unmask_ioapic(struct mp_chip_data *data) static void unmask_ioapic_irq(struct irq_data *irq_data) { struct mp_chip_data *data = irq_data->chip_data; - unsigned long flags; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); __unmask_ioapic(data); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); } /* @@ -492,30 +445,24 @@ static void __eoi_ioapic_pin(int apic, int pin, int vector) entry = entry1 = __ioapic_read_entry(apic, pin); - /* - * Mask the entry and change the trigger mode to edge. - */ + /* Mask the entry and change the trigger mode to edge. */ entry1.masked = true; entry1.is_level = false; __ioapic_write_entry(apic, pin, entry1); - /* - * Restore the previous level triggered entry. - */ + /* Restore the previous level triggered entry. */ __ioapic_write_entry(apic, pin, entry); } } static void eoi_ioapic_pin(int vector, struct mp_chip_data *data) { - unsigned long flags; struct irq_pin_list *entry; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); for_each_irq_pin(entry, data->irq_2_pin) __eoi_ioapic_pin(entry->apic, entry->pin, vector); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); } static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin) @@ -538,8 +485,6 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin) } if (entry.irr) { - unsigned long flags; - /* * Make sure the trigger mode is set to level. Explicit EOI * doesn't clear the remote-IRR if the trigger mode is not @@ -549,9 +494,8 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin) entry.is_level = true; ioapic_write_entry(apic, pin, entry); } - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); __eoi_ioapic_pin(apic, pin, entry.vector); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); } /* @@ -586,28 +530,23 @@ static int pirq_entries[MAX_PIRQS] = { static int __init ioapic_pirq_setup(char *str) { - int i, max; - int ints[MAX_PIRQS+1]; + int i, max, ints[MAX_PIRQS+1]; get_options(str, ARRAY_SIZE(ints), ints); - apic_printk(APIC_VERBOSE, KERN_INFO - "PIRQ redirection, working around broken MP-BIOS.\n"); + apic_pr_verbose("PIRQ redirection, working around broken MP-BIOS.\n"); + max = MAX_PIRQS; if (ints[0] < MAX_PIRQS) max = ints[0]; for (i = 0; i < max; i++) { - apic_printk(APIC_VERBOSE, KERN_DEBUG - "... PIRQ%d -> IRQ %d\n", i, ints[i+1]); - /* - * PIRQs are mapped upside down, usually. - */ + apic_pr_verbose("... PIRQ%d -> IRQ %d\n", i, ints[i + 1]); + /* PIRQs are mapped upside down, usually */ pirq_entries[MAX_PIRQS-i-1] = ints[i+1]; } return 1; } - __setup("pirq=", ioapic_pirq_setup); #endif /* CONFIG_X86_32 */ @@ -626,8 +565,7 @@ int save_ioapic_entries(void) } for_each_pin(apic, pin) - ioapics[apic].saved_registers[pin] = - ioapic_read_entry(apic, pin); + ioapics[apic].saved_registers[pin] = ioapic_read_entry(apic, pin); } return err; @@ -668,8 +606,7 @@ int restore_ioapic_entries(void) continue; for_each_pin(apic, pin) - ioapic_write_entry(apic, pin, - ioapics[apic].saved_registers[pin]); + ioapic_write_entry(apic, pin, ioapics[apic].saved_registers[pin]); } return 0; } @@ -681,12 +618,13 @@ static int find_irq_entry(int ioapic_idx, int pin, int type) { int i; - for (i = 0; i < mp_irq_entries; i++) + for (i = 0; i < mp_irq_entries; i++) { if (mp_irqs[i].irqtype == type && (mp_irqs[i].dstapic == mpc_ioapic_id(ioapic_idx) || mp_irqs[i].dstapic == MP_APIC_ALL) && mp_irqs[i].dstirq == pin) return i; + } return -1; } @@ -701,10 +639,8 @@ static int __init find_isa_irq_pin(int irq, int type) for (i = 0; i < mp_irq_entries; i++) { int lbus = mp_irqs[i].srcbus; - if (test_bit(lbus, mp_bus_not_pci) && - (mp_irqs[i].irqtype == type) && + if (test_bit(lbus, mp_bus_not_pci) && (mp_irqs[i].irqtype == type) && (mp_irqs[i].srcbusirq == irq)) - return mp_irqs[i].dstirq; } return -1; @@ -717,8 +653,7 @@ static int __init find_isa_irq_apic(int irq, int type) for (i = 0; i < mp_irq_entries; i++) { int lbus = mp_irqs[i].srcbus; - if (test_bit(lbus, mp_bus_not_pci) && - (mp_irqs[i].irqtype == type) && + if (test_bit(lbus, mp_bus_not_pci) && (mp_irqs[i].irqtype == type) && (mp_irqs[i].srcbusirq == irq)) break; } @@ -726,9 +661,10 @@ static int __init find_isa_irq_apic(int irq, int type) if (i < mp_irq_entries) { int ioapic_idx; - for_each_ioapic(ioapic_idx) + for_each_ioapic(ioapic_idx) { if (mpc_ioapic_id(ioapic_idx) == mp_irqs[i].dstapic) return ioapic_idx; + } } return -1; @@ -769,8 +705,7 @@ static bool EISA_ELCR(unsigned int irq) unsigned int port = PIC_ELCR1 + (irq >> 3); return (inb(port) >> (irq & 7)) & 1; } - apic_printk(APIC_VERBOSE, KERN_INFO - "Broken MPtable reports ISA irq %d\n", irq); + apic_pr_verbose("Broken MPtable reports ISA irq %d\n", irq); return false; } @@ -947,9 +882,9 @@ static bool mp_check_pin_attr(int irq, struct irq_alloc_info *info) static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi, struct irq_alloc_info *info) { + int type = ioapics[ioapic].irqdomain_cfg.type; bool legacy = false; int irq = -1; - int type = ioapics[ioapic].irqdomain_cfg.type; switch (type) { case IOAPIC_DOMAIN_LEGACY: @@ -971,8 +906,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi, return -1; } - return __irq_domain_alloc_irqs(domain, irq, 1, - ioapic_alloc_attr_node(info), + return __irq_domain_alloc_irqs(domain, irq, 1, ioapic_alloc_attr_node(info), info, legacy, NULL); } @@ -986,13 +920,12 @@ static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi, * PIRQs instead of reprogramming the interrupt routing logic. Thus there may be * multiple pins sharing the same legacy IRQ number when ACPI is disabled. */ -static int alloc_isa_irq_from_domain(struct irq_domain *domain, - int irq, int ioapic, int pin, +static int alloc_isa_irq_from_domain(struct irq_domain *domain, int irq, int ioapic, int pin, struct irq_alloc_info *info) { - struct mp_chip_data *data; struct irq_data *irq_data = irq_get_irq_data(irq); int node = ioapic_alloc_attr_node(info); + struct mp_chip_data *data; /* * Legacy ISA IRQ has already been allocated, just add pin to @@ -1002,13 +935,11 @@ static int alloc_isa_irq_from_domain(struct irq_domain *domain, if (irq_data && irq_data->parent_data) { if (!mp_check_pin_attr(irq, info)) return -EBUSY; - if (__add_pin_to_irq_node(irq_data->chip_data, node, ioapic, - info->ioapic.pin)) + if (!add_pin_to_irq_node(irq_data->chip_data, node, ioapic, info->ioapic.pin)) return -ENOMEM; } else { info->flags |= X86_IRQ_ALLOC_LEGACY; - irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true, - NULL); + irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true, NULL); if (irq >= 0) { irq_data = irq_domain_get_irq_data(domain, irq); data = irq_data->chip_data; @@ -1022,11 +953,11 @@ static int alloc_isa_irq_from_domain(struct irq_domain *domain, static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin, unsigned int flags, struct irq_alloc_info *info) { - int irq; - bool legacy = false; + struct irq_domain *domain = mp_ioapic_irqdomain(ioapic); struct irq_alloc_info tmp; struct mp_chip_data *data; - struct irq_domain *domain = mp_ioapic_irqdomain(ioapic); + bool legacy = false; + int irq; if (!domain) return -ENOSYS; @@ -1046,7 +977,7 @@ static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin, return -EINVAL; } - mutex_lock(&ioapic_mutex); + guard(mutex)(&ioapic_mutex); if (!(flags & IOAPIC_MAP_ALLOC)) { if (!legacy) { irq = irq_find_mapping(domain, pin); @@ -1067,8 +998,6 @@ static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin, data->count++; } } - mutex_unlock(&ioapic_mutex); - return irq; } @@ -1076,26 +1005,20 @@ static int pin_2_irq(int idx, int ioapic, int pin, unsigned int flags) { u32 gsi = mp_pin_to_gsi(ioapic, pin); - /* - * Debugging check, we are in big trouble if this message pops up! - */ + /* Debugging check, we are in big trouble if this message pops up! */ if (mp_irqs[idx].dstirq != pin) pr_err("broken BIOS or MPTABLE parser, ayiee!!\n"); #ifdef CONFIG_X86_32 - /* - * PCI IRQ command line redirection. Yes, limits are hardcoded. - */ + /* PCI IRQ command line redirection. Yes, limits are hardcoded. */ if ((pin >= 16) && (pin <= 23)) { - if (pirq_entries[pin-16] != -1) { - if (!pirq_entries[pin-16]) { - apic_printk(APIC_VERBOSE, KERN_DEBUG - "disabling PIRQ%d\n", pin-16); + if (pirq_entries[pin - 16] != -1) { + if (!pirq_entries[pin - 16]) { + apic_pr_verbose("Disabling PIRQ%d\n", pin - 16); } else { int irq = pirq_entries[pin-16]; - apic_printk(APIC_VERBOSE, KERN_DEBUG - "using PIRQ%d -> IRQ %d\n", - pin-16, irq); + + apic_pr_verbose("Using PIRQ%d -> IRQ %d\n", pin - 16, irq); return irq; } } @@ -1133,10 +1056,9 @@ void mp_unmap_irq(int irq) if (!data || data->isa_irq) return; - mutex_lock(&ioapic_mutex); + guard(mutex)(&ioapic_mutex); if (--data->count == 0) irq_domain_free_irqs(irq, 1); - mutex_unlock(&ioapic_mutex); } /* @@ -1147,12 +1069,10 @@ int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin) { int irq, i, best_ioapic = -1, best_idx = -1; - apic_printk(APIC_DEBUG, - "querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n", - bus, slot, pin); + apic_pr_debug("Querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n", + bus, slot, pin); if (test_bit(bus, mp_bus_not_pci)) { - apic_printk(APIC_VERBOSE, - "PCI BIOS passed nonexistent PCI bus %d!\n", bus); + apic_pr_verbose("PCI BIOS passed nonexistent PCI bus %d!\n", bus); return -1; } @@ -1197,8 +1117,7 @@ int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin) return -1; out: - return pin_2_irq(best_idx, best_ioapic, mp_irqs[best_idx].dstirq, - IOAPIC_MAP_ALLOC); + return pin_2_irq(best_idx, best_ioapic, mp_irqs[best_idx].dstirq, IOAPIC_MAP_ALLOC); } EXPORT_SYMBOL(IO_APIC_get_PCI_irq_vector); @@ -1209,17 +1128,16 @@ static void __init setup_IO_APIC_irqs(void) unsigned int ioapic, pin; int idx; - apic_printk(APIC_VERBOSE, KERN_DEBUG "init IO_APIC IRQs\n"); + apic_pr_verbose("Init IO_APIC IRQs\n"); for_each_ioapic_pin(ioapic, pin) { idx = find_irq_entry(ioapic, pin, mp_INT); - if (idx < 0) - apic_printk(APIC_VERBOSE, - KERN_DEBUG " apic %d pin %d not connected\n", - mpc_ioapic_id(ioapic), pin); - else - pin_2_irq(idx, ioapic, pin, - ioapic ? 0 : IOAPIC_MAP_ALLOC); + if (idx < 0) { + apic_pr_verbose("apic %d pin %d not connected\n", + mpc_ioapic_id(ioapic), pin); + } else { + pin_2_irq(idx, ioapic, pin, ioapic ? 0 : IOAPIC_MAP_ALLOC); + } } } @@ -1234,26 +1152,21 @@ static void io_apic_print_entries(unsigned int apic, unsigned int nr_entries) char buf[256]; int i; - printk(KERN_DEBUG "IOAPIC %d:\n", apic); + apic_dbg("IOAPIC %d:\n", apic); for (i = 0; i <= nr_entries; i++) { entry = ioapic_read_entry(apic, i); - snprintf(buf, sizeof(buf), - " pin%02x, %s, %s, %s, V(%02X), IRR(%1d), S(%1d)", - i, - entry.masked ? "disabled" : "enabled ", + snprintf(buf, sizeof(buf), " pin%02x, %s, %s, %s, V(%02X), IRR(%1d), S(%1d)", + i, entry.masked ? "disabled" : "enabled ", entry.is_level ? "level" : "edge ", entry.active_low ? "low " : "high", entry.vector, entry.irr, entry.delivery_status); if (entry.ir_format) { - printk(KERN_DEBUG "%s, remapped, I(%04X), Z(%X)\n", - buf, - (entry.ir_index_15 << 15) | entry.ir_index_0_14, - entry.ir_zero); + apic_dbg("%s, remapped, I(%04X), Z(%X)\n", buf, + (entry.ir_index_15 << 15) | entry.ir_index_0_14, entry.ir_zero); } else { - printk(KERN_DEBUG "%s, %s, D(%02X%02X), M(%1d)\n", buf, - entry.dest_mode_logical ? "logical " : "physical", - entry.virt_destid_8_14, entry.destid_0_7, - entry.delivery_mode); + apic_dbg("%s, %s, D(%02X%02X), M(%1d)\n", buf, + entry.dest_mode_logical ? "logical " : "physic al", + entry.virt_destid_8_14, entry.destid_0_7, entry.delivery_mode); } } } @@ -1264,30 +1177,25 @@ static void __init print_IO_APIC(int ioapic_idx) union IO_APIC_reg_01 reg_01; union IO_APIC_reg_02 reg_02; union IO_APIC_reg_03 reg_03; - unsigned long flags; - raw_spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(ioapic_idx, 0); - reg_01.raw = io_apic_read(ioapic_idx, 1); - if (reg_01.bits.version >= 0x10) - reg_02.raw = io_apic_read(ioapic_idx, 2); - if (reg_01.bits.version >= 0x20) - reg_03.raw = io_apic_read(ioapic_idx, 3); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + scoped_guard (raw_spinlock_irqsave, &ioapic_lock) { + reg_00.raw = io_apic_read(ioapic_idx, 0); + reg_01.raw = io_apic_read(ioapic_idx, 1); + if (reg_01.bits.version >= 0x10) + reg_02.raw = io_apic_read(ioapic_idx, 2); + if (reg_01.bits.version >= 0x20) + reg_03.raw = io_apic_read(ioapic_idx, 3); + } - printk(KERN_DEBUG "IO APIC #%d......\n", mpc_ioapic_id(ioapic_idx)); - printk(KERN_DEBUG ".... register #00: %08X\n", reg_00.raw); - printk(KERN_DEBUG "....... : physical APIC id: %02X\n", reg_00.bits.ID); - printk(KERN_DEBUG "....... : Delivery Type: %X\n", reg_00.bits.delivery_type); - printk(KERN_DEBUG "....... : LTS : %X\n", reg_00.bits.LTS); - - printk(KERN_DEBUG ".... register #01: %08X\n", *(int *)®_01); - printk(KERN_DEBUG "....... : max redirection entries: %02X\n", - reg_01.bits.entries); - - printk(KERN_DEBUG "....... : PRQ implemented: %X\n", reg_01.bits.PRQ); - printk(KERN_DEBUG "....... : IO APIC version: %02X\n", - reg_01.bits.version); + apic_dbg("IO APIC #%d......\n", mpc_ioapic_id(ioapic_idx)); + apic_dbg(".... register #00: %08X\n", reg_00.raw); + apic_dbg("....... : physical APIC id: %02X\n", reg_00.bits.ID); + apic_dbg("....... : Delivery Type: %X\n", reg_00.bits.delivery_type); + apic_dbg("....... : LTS : %X\n", reg_00.bits.LTS); + apic_dbg(".... register #01: %08X\n", *(int *)®_01); + apic_dbg("....... : max redirection entries: %02X\n", reg_01.bits.entries); + apic_dbg("....... : PRQ implemented: %X\n", reg_01.bits.PRQ); + apic_dbg("....... : IO APIC version: %02X\n", reg_01.bits.version); /* * Some Intel chipsets with IO APIC VERSION of 0x1? don't have reg_02, @@ -1295,8 +1203,8 @@ static void __init print_IO_APIC(int ioapic_idx) * value, so ignore it if reg_02 == reg_01. */ if (reg_01.bits.version >= 0x10 && reg_02.raw != reg_01.raw) { - printk(KERN_DEBUG ".... register #02: %08X\n", reg_02.raw); - printk(KERN_DEBUG "....... : arbitration: %02X\n", reg_02.bits.arbitration); + apic_dbg(".... register #02: %08X\n", reg_02.raw); + apic_dbg("....... : arbitration: %02X\n", reg_02.bits.arbitration); } /* @@ -1306,11 +1214,11 @@ static void __init print_IO_APIC(int ioapic_idx) */ if (reg_01.bits.version >= 0x20 && reg_03.raw != reg_02.raw && reg_03.raw != reg_01.raw) { - printk(KERN_DEBUG ".... register #03: %08X\n", reg_03.raw); - printk(KERN_DEBUG "....... : Boot DT : %X\n", reg_03.bits.boot_DT); + apic_dbg(".... register #03: %08X\n", reg_03.raw); + apic_dbg("....... : Boot DT : %X\n", reg_03.bits.boot_DT); } - printk(KERN_DEBUG ".... IRQ redirection table:\n"); + apic_dbg(".... IRQ redirection table:\n"); io_apic_print_entries(ioapic_idx, reg_01.bits.entries); } @@ -1319,11 +1227,11 @@ void __init print_IO_APICs(void) int ioapic_idx; unsigned int irq; - printk(KERN_DEBUG "number of MP IRQ sources: %d.\n", mp_irq_entries); - for_each_ioapic(ioapic_idx) - printk(KERN_DEBUG "number of IO-APIC #%d registers: %d.\n", - mpc_ioapic_id(ioapic_idx), - ioapics[ioapic_idx].nr_registers); + apic_dbg("number of MP IRQ sources: %d.\n", mp_irq_entries); + for_each_ioapic(ioapic_idx) { + apic_dbg("number of IO-APIC #%d registers: %d.\n", + mpc_ioapic_id(ioapic_idx), ioapics[ioapic_idx].nr_registers); + } /* * We are a bit conservative about what we expect. We have to @@ -1334,7 +1242,7 @@ void __init print_IO_APICs(void) for_each_ioapic(ioapic_idx) print_IO_APIC(ioapic_idx); - printk(KERN_DEBUG "IRQ to pin mappings:\n"); + apic_dbg("IRQ to pin mappings:\n"); for_each_active_irq(irq) { struct irq_pin_list *entry; struct irq_chip *chip; @@ -1349,7 +1257,7 @@ void __init print_IO_APICs(void) if (list_empty(&data->irq_2_pin)) continue; - printk(KERN_DEBUG "IRQ%d ", irq); + apic_dbg("IRQ%d ", irq); for_each_irq_pin(entry, data->irq_2_pin) pr_cont("-> %d:%d", entry->apic, entry->pin); pr_cont("\n"); @@ -1363,8 +1271,7 @@ static struct { int pin, apic; } ioapic_i8259 = { -1, -1 }; void __init enable_IO_APIC(void) { - int i8259_apic, i8259_pin; - int apic, pin; + int i8259_apic, i8259_pin, apic, pin; if (ioapic_is_disabled) nr_ioapics = 0; @@ -1376,19 +1283,21 @@ void __init enable_IO_APIC(void) /* See if any of the pins is in ExtINT mode */ struct IO_APIC_route_entry entry = ioapic_read_entry(apic, pin); - /* If the interrupt line is enabled and in ExtInt mode - * I have found the pin where the i8259 is connected. + /* + * If the interrupt line is enabled and in ExtInt mode I + * have found the pin where the i8259 is connected. */ - if (!entry.masked && - entry.delivery_mode == APIC_DELIVERY_MODE_EXTINT) { + if (!entry.masked && entry.delivery_mode == APIC_DELIVERY_MODE_EXTINT) { ioapic_i8259.apic = apic; ioapic_i8259.pin = pin; - goto found_i8259; + break; } } - found_i8259: - /* Look to see what if the MP table has reported the ExtINT */ - /* If we could not find the appropriate pin by looking at the ioapic + + /* + * Look to see what if the MP table has reported the ExtINT + * + * If we could not find the appropriate pin by looking at the ioapic * the i8259 probably is not connected the ioapic but give the * mptable a chance anyway. */ @@ -1396,29 +1305,24 @@ void __init enable_IO_APIC(void) i8259_apic = find_isa_irq_apic(0, mp_ExtINT); /* Trust the MP table if nothing is setup in the hardware */ if ((ioapic_i8259.pin == -1) && (i8259_pin >= 0)) { - printk(KERN_WARNING "ExtINT not setup in hardware but reported by MP table\n"); + pr_warn("ExtINT not setup in hardware but reported by MP table\n"); ioapic_i8259.pin = i8259_pin; ioapic_i8259.apic = i8259_apic; } /* Complain if the MP table and the hardware disagree */ if (((ioapic_i8259.apic != i8259_apic) || (ioapic_i8259.pin != i8259_pin)) && - (i8259_pin >= 0) && (ioapic_i8259.pin >= 0)) - { - printk(KERN_WARNING "ExtINT in hardware and MP table differ\n"); - } + (i8259_pin >= 0) && (ioapic_i8259.pin >= 0)) + pr_warn("ExtINT in hardware and MP table differ\n"); - /* - * Do not trust the IO-APIC being empty at bootup - */ + /* Do not trust the IO-APIC being empty at bootup */ clear_IO_APIC(); } void native_restore_boot_irq_mode(void) { /* - * If the i8259 is routed through an IOAPIC - * Put that IOAPIC in virtual wire mode - * so legacy interrupts can be delivered. + * If the i8259 is routed through an IOAPIC Put that IOAPIC in + * virtual wire mode so legacy interrupts can be delivered. */ if (ioapic_i8259.pin != -1) { struct IO_APIC_route_entry entry; @@ -1433,9 +1337,7 @@ void native_restore_boot_irq_mode(void) entry.destid_0_7 = apic_id & 0xFF; entry.virt_destid_8_14 = apic_id >> 8; - /* - * Add it to the IO-APIC irq-routing table: - */ + /* Add it to the IO-APIC irq-routing table */ ioapic_write_entry(ioapic_i8259.apic, ioapic_i8259.pin, entry); } @@ -1464,7 +1366,6 @@ static void __init setup_ioapic_ids_from_mpc_nocheck(void) const u32 broadcast_id = 0xF; union IO_APIC_reg_00 reg_00; unsigned char old_id; - unsigned long flags; int ioapic_idx, i; /* @@ -1478,9 +1379,8 @@ static void __init setup_ioapic_ids_from_mpc_nocheck(void) */ for_each_ioapic(ioapic_idx) { /* Read the register 0 value */ - raw_spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(ioapic_idx, 0); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + scoped_guard (raw_spinlock_irqsave, &ioapic_lock) + reg_00.raw = io_apic_read(ioapic_idx, 0); old_id = mpc_ioapic_id(ioapic_idx); @@ -1508,47 +1408,42 @@ static void __init setup_ioapic_ids_from_mpc_nocheck(void) set_bit(i, phys_id_present_map); ioapics[ioapic_idx].mp_config.apicid = i; } else { - apic_printk(APIC_VERBOSE, "Setting %d in the phys_id_present_map\n", - mpc_ioapic_id(ioapic_idx)); + apic_pr_verbose("Setting %d in the phys_id_present_map\n", + mpc_ioapic_id(ioapic_idx)); set_bit(mpc_ioapic_id(ioapic_idx), phys_id_present_map); } /* - * We need to adjust the IRQ routing table - * if the ID changed. + * We need to adjust the IRQ routing table if the ID + * changed. */ - if (old_id != mpc_ioapic_id(ioapic_idx)) - for (i = 0; i < mp_irq_entries; i++) + if (old_id != mpc_ioapic_id(ioapic_idx)) { + for (i = 0; i < mp_irq_entries; i++) { if (mp_irqs[i].dstapic == old_id) - mp_irqs[i].dstapic - = mpc_ioapic_id(ioapic_idx); + mp_irqs[i].dstapic = mpc_ioapic_id(ioapic_idx); + } + } /* - * Update the ID register according to the right value - * from the MPC table if they are different. + * Update the ID register according to the right value from + * the MPC table if they are different. */ if (mpc_ioapic_id(ioapic_idx) == reg_00.bits.ID) continue; - apic_printk(APIC_VERBOSE, KERN_INFO - "...changing IO-APIC physical APIC ID to %d ...", - mpc_ioapic_id(ioapic_idx)); + apic_pr_verbose("...changing IO-APIC physical APIC ID to %d ...", + mpc_ioapic_id(ioapic_idx)); reg_00.bits.ID = mpc_ioapic_id(ioapic_idx); - raw_spin_lock_irqsave(&ioapic_lock, flags); - io_apic_write(ioapic_idx, 0, reg_00.raw); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); - - /* - * Sanity check - */ - raw_spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(ioapic_idx, 0); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + scoped_guard (raw_spinlock_irqsave, &ioapic_lock) { + io_apic_write(ioapic_idx, 0, reg_00.raw); + reg_00.raw = io_apic_read(ioapic_idx, 0); + } + /* Sanity check */ if (reg_00.bits.ID != mpc_ioapic_id(ioapic_idx)) pr_cont("could not set ID!\n"); else - apic_printk(APIC_VERBOSE, " ok.\n"); + apic_pr_verbose(" ok.\n"); } } @@ -1593,8 +1488,7 @@ static void __init delay_with_tsc(void) do { rep_nop(); now = rdtsc(); - } while ((now - start) < 40000000000ULL / HZ && - time_before_eq(jiffies, end)); + } while ((now - start) < 40000000000ULL / HZ && time_before_eq(jiffies, end)); } static void __init delay_without_tsc(void) @@ -1655,36 +1549,29 @@ static int __init timer_irq_works(void) * so we 'resend' these IRQs via IPIs, to the same CPU. It's much * better to do it this way as thus we do not have to be aware of * 'pending' interrupts in the IRQ path, except at this point. - */ -/* - * Edge triggered needs to resend any interrupt - * that was delayed but this is now handled in the device - * independent code. - */ - -/* - * Starting up a edge-triggered IO-APIC interrupt is - * nasty - we need to make sure that we get the edge. - * If it is already asserted for some reason, we need - * return 1 to indicate that is was pending. * - * This is not complete - we should be able to fake - * an edge even if it isn't on the 8259A... + * + * Edge triggered needs to resend any interrupt that was delayed but this + * is now handled in the device independent code. + * + * Starting up a edge-triggered IO-APIC interrupt is nasty - we need to + * make sure that we get the edge. If it is already asserted for some + * reason, we need return 1 to indicate that is was pending. + * + * This is not complete - we should be able to fake an edge even if it + * isn't on the 8259A... */ static unsigned int startup_ioapic_irq(struct irq_data *data) { int was_pending = 0, irq = data->irq; - unsigned long flags; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); if (irq < nr_legacy_irqs()) { legacy_pic->mask(irq); if (legacy_pic->irq_pending(irq)) was_pending = 1; } __unmask_ioapic(data->chip_data); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); - return was_pending; } @@ -1694,9 +1581,8 @@ atomic_t irq_mis_count; static bool io_apic_level_ack_pending(struct mp_chip_data *data) { struct irq_pin_list *entry; - unsigned long flags; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); for_each_irq_pin(entry, data->irq_2_pin) { struct IO_APIC_route_entry e; int pin; @@ -1704,13 +1590,9 @@ static bool io_apic_level_ack_pending(struct mp_chip_data *data) pin = entry->pin; e.w1 = io_apic_read(entry->apic, 0x10 + pin*2); /* Is the remote IRR bit set? */ - if (e.irr) { - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + if (e.irr) return true; - } } - raw_spin_unlock_irqrestore(&ioapic_lock, flags); - return false; } @@ -1728,7 +1610,8 @@ static inline bool ioapic_prepare_move(struct irq_data *data) static inline void ioapic_finish_move(struct irq_data *data, bool moveit) { if (unlikely(moveit)) { - /* Only migrate the irq if the ack has been received. + /* + * Only migrate the irq if the ack has been received. * * On rare occasions the broadcast level triggered ack gets * delayed going to ioapics, and if we reprogram the @@ -1911,18 +1794,16 @@ static void ioapic_configure_entry(struct irq_data *irqd) __ioapic_write_entry(entry->apic, entry->pin, mpd->entry); } -static int ioapic_set_affinity(struct irq_data *irq_data, - const struct cpumask *mask, bool force) +static int ioapic_set_affinity(struct irq_data *irq_data, const struct cpumask *mask, bool force) { struct irq_data *parent = irq_data->parent_data; - unsigned long flags; int ret; ret = parent->chip->irq_set_affinity(parent, mask, force); - raw_spin_lock_irqsave(&ioapic_lock, flags); + + guard(raw_spinlock_irqsave)(&ioapic_lock); if (ret >= 0 && ret != IRQ_SET_MASK_OK_DONE) ioapic_configure_entry(irq_data); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); return ret; } @@ -1941,9 +1822,8 @@ static int ioapic_set_affinity(struct irq_data *irq_data, * * Verify that the corresponding Remote-IRR bits are clear. */ -static int ioapic_irq_get_chip_state(struct irq_data *irqd, - enum irqchip_irq_state which, - bool *state) +static int ioapic_irq_get_chip_state(struct irq_data *irqd, enum irqchip_irq_state which, + bool *state) { struct mp_chip_data *mcd = irqd->chip_data; struct IO_APIC_route_entry rentry; @@ -1953,7 +1833,8 @@ static int ioapic_irq_get_chip_state(struct irq_data *irqd, return -EINVAL; *state = false; - raw_spin_lock(&ioapic_lock); + + guard(raw_spinlock)(&ioapic_lock); for_each_irq_pin(p, mcd->irq_2_pin) { rentry = __ioapic_read_entry(p->apic, p->pin); /* @@ -1967,7 +1848,6 @@ static int ioapic_irq_get_chip_state(struct irq_data *irqd, break; } } - raw_spin_unlock(&ioapic_lock); return 0; } @@ -2008,14 +1888,13 @@ static inline void init_IO_APIC_traps(void) cfg = irq_cfg(irq); if (IO_APIC_IRQ(irq) && cfg && !cfg->vector) { /* - * Hmm.. We don't have an entry for this, - * so default to an old-fashioned 8259 - * interrupt if we can.. + * Hmm.. We don't have an entry for this, so + * default to an old-fashioned 8259 interrupt if we + * can. Otherwise set the dummy interrupt chip. */ if (irq < nr_legacy_irqs()) legacy_pic->make_irq(irq); else - /* Strange. Oh, well.. */ irq_set_chip(irq, &no_irq_chip); } } @@ -2024,20 +1903,17 @@ static inline void init_IO_APIC_traps(void) /* * The local APIC irq-chip implementation: */ - static void mask_lapic_irq(struct irq_data *data) { - unsigned long v; + unsigned long v = apic_read(APIC_LVT0); - v = apic_read(APIC_LVT0); apic_write(APIC_LVT0, v | APIC_LVT_MASKED); } static void unmask_lapic_irq(struct irq_data *data) { - unsigned long v; + unsigned long v = apic_read(APIC_LVT0); - v = apic_read(APIC_LVT0); apic_write(APIC_LVT0, v & ~APIC_LVT_MASKED); } @@ -2056,8 +1932,7 @@ static struct irq_chip lapic_chip __read_mostly = { static void lapic_register_intr(int irq) { irq_clear_status_flags(irq, IRQ_LEVEL); - irq_set_chip_and_handler_name(irq, &lapic_chip, handle_edge_irq, - "edge"); + irq_set_chip_and_handler_name(irq, &lapic_chip, handle_edge_irq, "edge"); } /* @@ -2069,9 +1944,9 @@ static void lapic_register_intr(int irq) */ static inline void __init unlock_ExtINT_logic(void) { - int apic, pin, i; - struct IO_APIC_route_entry entry0, entry1; unsigned char save_control, save_freq_select; + struct IO_APIC_route_entry entry0, entry1; + int apic, pin, i; u32 apic_id; pin = find_isa_irq_pin(8, mp_INT); @@ -2131,10 +2006,10 @@ static int __init disable_timer_pin_setup(char *arg) } early_param("disable_timer_pin_1", disable_timer_pin_setup); -static int mp_alloc_timer_irq(int ioapic, int pin) +static int __init mp_alloc_timer_irq(int ioapic, int pin) { - int irq = -1; struct irq_domain *domain = mp_ioapic_irqdomain(ioapic); + int irq = -1; if (domain) { struct irq_alloc_info info; @@ -2142,21 +2017,36 @@ static int mp_alloc_timer_irq(int ioapic, int pin) ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0); info.devid = mpc_ioapic_id(ioapic); info.ioapic.pin = pin; - mutex_lock(&ioapic_mutex); + guard(mutex)(&ioapic_mutex); irq = alloc_isa_irq_from_domain(domain, 0, ioapic, pin, &info); - mutex_unlock(&ioapic_mutex); } return irq; } +static void __init replace_pin_at_irq_node(struct mp_chip_data *data, int node, + int oldapic, int oldpin, + int newapic, int newpin) +{ + struct irq_pin_list *entry; + + for_each_irq_pin(entry, data->irq_2_pin) { + if (entry->apic == oldapic && entry->pin == oldpin) { + entry->apic = newapic; + entry->pin = newpin; + return; + } + } + + /* Old apic/pin didn't exist, so just add a new one */ + add_pin_to_irq_node(data, node, newapic, newpin); +} + /* * This code may look a bit paranoid, but it's supposed to cooperate with * a wide range of boards and BIOS bugs. Fortunately only the timer IRQ * is so screwy. Thanks to Brian Perkins for testing/hacking this beast * fanatically on his truly buggy board. - * - * FIXME: really need to revamp this for all platforms. */ static inline void __init check_timer(void) { @@ -2194,9 +2084,8 @@ static inline void __init check_timer(void) pin2 = ioapic_i8259.pin; apic2 = ioapic_i8259.apic; - apic_printk(APIC_QUIET, KERN_INFO "..TIMER: vector=0x%02X " - "apic1=%d pin1=%d apic2=%d pin2=%d\n", - cfg->vector, apic1, pin1, apic2, pin2); + pr_info("..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n", + cfg->vector, apic1, pin1, apic2, pin2); /* * Some BIOS writers are clueless and report the ExtINTA @@ -2240,13 +2129,10 @@ static inline void __init check_timer(void) panic_if_irq_remap("timer doesn't work through Interrupt-remapped IO-APIC"); clear_IO_APIC_pin(apic1, pin1); if (!no_pin1) - apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: " - "8254 timer not connected to IO-APIC\n"); + pr_err("..MP-BIOS bug: 8254 timer not connected to IO-APIC\n"); - apic_printk(APIC_QUIET, KERN_INFO "...trying to set up timer " - "(IRQ0) through the 8259A ...\n"); - apic_printk(APIC_QUIET, KERN_INFO - "..... (found apic %d pin %d) ...\n", apic2, pin2); + pr_info("...trying to set up timer (IRQ0) through the 8259A ...\n"); + pr_info("..... (found apic %d pin %d) ...\n", apic2, pin2); /* * legacy devices should be connected to IO APIC #0 */ @@ -2255,7 +2141,7 @@ static inline void __init check_timer(void) irq_domain_activate_irq(irq_data, false); legacy_pic->unmask(0); if (timer_irq_works()) { - apic_printk(APIC_QUIET, KERN_INFO "....... works.\n"); + pr_info("....... works.\n"); goto out; } /* @@ -2263,26 +2149,24 @@ static inline void __init check_timer(void) */ legacy_pic->mask(0); clear_IO_APIC_pin(apic2, pin2); - apic_printk(APIC_QUIET, KERN_INFO "....... failed.\n"); + pr_info("....... failed.\n"); } - apic_printk(APIC_QUIET, KERN_INFO - "...trying to set up timer as Virtual Wire IRQ...\n"); + pr_info("...trying to set up timer as Virtual Wire IRQ...\n"); lapic_register_intr(0); apic_write(APIC_LVT0, APIC_DM_FIXED | cfg->vector); /* Fixed mode */ legacy_pic->unmask(0); if (timer_irq_works()) { - apic_printk(APIC_QUIET, KERN_INFO "..... works.\n"); + pr_info("..... works.\n"); goto out; } legacy_pic->mask(0); apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_FIXED | cfg->vector); - apic_printk(APIC_QUIET, KERN_INFO "..... failed.\n"); + pr_info("..... failed.\n"); - apic_printk(APIC_QUIET, KERN_INFO - "...trying to set up timer as ExtINT IRQ...\n"); + pr_info("...trying to set up timer as ExtINT IRQ...\n"); legacy_pic->init(0); legacy_pic->make_irq(0); @@ -2292,14 +2176,15 @@ static inline void __init check_timer(void) unlock_ExtINT_logic(); if (timer_irq_works()) { - apic_printk(APIC_QUIET, KERN_INFO "..... works.\n"); + pr_info("..... works.\n"); goto out; } - apic_printk(APIC_QUIET, KERN_INFO "..... failed :(.\n"); - if (apic_is_x2apic_enabled()) - apic_printk(APIC_QUIET, KERN_INFO - "Perhaps problem with the pre-enabled x2apic mode\n" - "Try booting with x2apic and interrupt-remapping disabled in the bios.\n"); + + pr_info("..... failed :\n"); + if (apic_is_x2apic_enabled()) { + pr_info("Perhaps problem with the pre-enabled x2apic mode\n" + "Try booting with x2apic and interrupt-remapping disabled in the bios.\n"); + } panic("IO-APIC + timer doesn't work! Boot with apic=debug and send a " "report. Then try booting with the 'noapic' option.\n"); out: @@ -2327,11 +2212,11 @@ out: static int mp_irqdomain_create(int ioapic) { - struct irq_domain *parent; + struct mp_ioapic_gsi *gsi_cfg = mp_ioapic_gsi_routing(ioapic); int hwirqs = mp_ioapic_pin_count(ioapic); struct ioapic *ip = &ioapics[ioapic]; struct ioapic_domain_cfg *cfg = &ip->irqdomain_cfg; - struct mp_ioapic_gsi *gsi_cfg = mp_ioapic_gsi_routing(ioapic); + struct irq_domain *parent; struct fwnode_handle *fn; struct irq_fwspec fwspec; @@ -2367,10 +2252,8 @@ static int mp_irqdomain_create(int ioapic) return -ENOMEM; } - if (cfg->type == IOAPIC_DOMAIN_LEGACY || - cfg->type == IOAPIC_DOMAIN_STRICT) - ioapic_dynirq_base = max(ioapic_dynirq_base, - gsi_cfg->gsi_end + 1); + if (cfg->type == IOAPIC_DOMAIN_LEGACY || cfg->type == IOAPIC_DOMAIN_STRICT) + ioapic_dynirq_base = max(ioapic_dynirq_base, gsi_cfg->gsi_end + 1); return 0; } @@ -2397,13 +2280,11 @@ void __init setup_IO_APIC(void) io_apic_irqs = nr_legacy_irqs() ? ~PIC_IRQS : ~0UL; - apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n"); + apic_pr_verbose("ENABLING IO-APIC IRQs\n"); for_each_ioapic(ioapic) BUG_ON(mp_irqdomain_create(ioapic)); - /* - * Set up IO-APIC IRQ routing. - */ + /* Set up IO-APIC IRQ routing. */ x86_init.mpparse.setup_ioapic_ids(); sync_Arb_IDs(); @@ -2417,16 +2298,14 @@ void __init setup_IO_APIC(void) static void resume_ioapic_id(int ioapic_idx) { - unsigned long flags; union IO_APIC_reg_00 reg_00; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); reg_00.raw = io_apic_read(ioapic_idx, 0); if (reg_00.bits.ID != mpc_ioapic_id(ioapic_idx)) { reg_00.bits.ID = mpc_ioapic_id(ioapic_idx); io_apic_write(ioapic_idx, 0, reg_00.raw); } - raw_spin_unlock_irqrestore(&ioapic_lock, flags); } static void ioapic_resume(void) @@ -2440,8 +2319,8 @@ static void ioapic_resume(void) } static struct syscore_ops ioapic_syscore_ops = { - .suspend = save_ioapic_entries, - .resume = ioapic_resume, + .suspend = save_ioapic_entries, + .resume = ioapic_resume, }; static int __init ioapic_init_ops(void) @@ -2456,15 +2335,13 @@ device_initcall(ioapic_init_ops); static int io_apic_get_redir_entries(int ioapic) { union IO_APIC_reg_01 reg_01; - unsigned long flags; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); reg_01.raw = io_apic_read(ioapic, 1); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); - /* The register returns the maximum index redir index - * supported, which is one less than the total number of redir - * entries. + /* + * The register returns the maximum index redir index supported, + * which is one less than the total number of redir entries. */ return reg_01.bits.entries + 1; } @@ -2494,16 +2371,14 @@ static int io_apic_get_unique_id(int ioapic, int apic_id) static DECLARE_BITMAP(apic_id_map, MAX_LOCAL_APIC); const u32 broadcast_id = 0xF; union IO_APIC_reg_00 reg_00; - unsigned long flags; int i = 0; /* Initialize the ID map */ if (bitmap_empty(apic_id_map, MAX_LOCAL_APIC)) copy_phys_cpu_present_map(apic_id_map); - raw_spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(ioapic, 0); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + scoped_guard (raw_spinlock_irqsave, &ioapic_lock) + reg_00.raw = io_apic_read(ioapic, 0); if (apic_id >= broadcast_id) { pr_warn("IOAPIC[%d]: Invalid apic_id %d, trying %d\n", @@ -2530,21 +2405,19 @@ static int io_apic_get_unique_id(int ioapic, int apic_id) if (reg_00.bits.ID != apic_id) { reg_00.bits.ID = apic_id; - raw_spin_lock_irqsave(&ioapic_lock, flags); - io_apic_write(ioapic, 0, reg_00.raw); - reg_00.raw = io_apic_read(ioapic, 0); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + scoped_guard (raw_spinlock_irqsave, &ioapic_lock) { + io_apic_write(ioapic, 0, reg_00.raw); + reg_00.raw = io_apic_read(ioapic, 0); + } /* Sanity check */ if (reg_00.bits.ID != apic_id) { - pr_err("IOAPIC[%d]: Unable to change apic_id!\n", - ioapic); + pr_err("IOAPIC[%d]: Unable to change apic_id!\n", ioapic); return -1; } } - apic_printk(APIC_VERBOSE, KERN_INFO - "IOAPIC[%d]: Assigned apic_id %d\n", ioapic, apic_id); + apic_pr_verbose("IOAPIC[%d]: Assigned apic_id %d\n", ioapic, apic_id); return apic_id; } @@ -2560,7 +2433,6 @@ static u8 io_apic_unique_id(int idx, u8 id) { union IO_APIC_reg_00 reg_00; DECLARE_BITMAP(used, 256); - unsigned long flags; u8 new_id; int i; @@ -2576,26 +2448,23 @@ static u8 io_apic_unique_id(int idx, u8 id) * Read the current id from the ioapic and keep it if * available. */ - raw_spin_lock_irqsave(&ioapic_lock, flags); - reg_00.raw = io_apic_read(idx, 0); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + scoped_guard (raw_spinlock_irqsave, &ioapic_lock) + reg_00.raw = io_apic_read(idx, 0); + new_id = reg_00.bits.ID; if (!test_bit(new_id, used)) { - apic_printk(APIC_VERBOSE, KERN_INFO - "IOAPIC[%d]: Using reg apic_id %d instead of %d\n", - idx, new_id, id); + apic_pr_verbose("IOAPIC[%d]: Using reg apic_id %d instead of %d\n", + idx, new_id, id); return new_id; } - /* - * Get the next free id and write it to the ioapic. - */ + /* Get the next free id and write it to the ioapic. */ new_id = find_first_zero_bit(used, 256); reg_00.bits.ID = new_id; - raw_spin_lock_irqsave(&ioapic_lock, flags); - io_apic_write(idx, 0, reg_00.raw); - reg_00.raw = io_apic_read(idx, 0); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); + scoped_guard (raw_spinlock_irqsave, &ioapic_lock) { + io_apic_write(idx, 0, reg_00.raw); + reg_00.raw = io_apic_read(idx, 0); + } /* Sanity check */ BUG_ON(reg_00.bits.ID != new_id); @@ -2605,12 +2474,10 @@ static u8 io_apic_unique_id(int idx, u8 id) static int io_apic_get_version(int ioapic) { - union IO_APIC_reg_01 reg_01; - unsigned long flags; + union IO_APIC_reg_01 reg_01; - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); reg_01.raw = io_apic_read(ioapic, 1); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); return reg_01.bits.version; } @@ -2625,8 +2492,8 @@ static struct resource *ioapic_resources; static struct resource * __init ioapic_setup_resources(void) { - unsigned long n; struct resource *res; + unsigned long n; char *mem; int i; @@ -2686,9 +2553,7 @@ void __init io_apic_init_mappings(void) ioapic_phys = mpc_ioapic_addr(i); #ifdef CONFIG_X86_32 if (!ioapic_phys) { - printk(KERN_ERR - "WARNING: bogus zero IO-APIC " - "address found in MPTABLE, " + pr_err("WARNING: bogus zero IO-APIC address found in MPTABLE, " "disabling IO/APIC support!\n"); smp_found_config = 0; ioapic_is_disabled = true; @@ -2707,9 +2572,8 @@ fake_ioapic_page: ioapic_phys = __pa(ioapic_phys); } io_apic_set_fixmap(idx, ioapic_phys); - apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n", - __fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK), - ioapic_phys); + apic_pr_verbose("mapped IOAPIC to %08lx (%08lx)\n", + __fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK), ioapic_phys); idx++; ioapic_res->start = ioapic_phys; @@ -2720,13 +2584,12 @@ fake_ioapic_page: void __init ioapic_insert_resources(void) { - int i; struct resource *r = ioapic_resources; + int i; if (!r) { if (nr_ioapics > 0) - printk(KERN_ERR - "IO APIC resources couldn't be allocated.\n"); + pr_err("IO APIC resources couldn't be allocated.\n"); return; } @@ -2746,11 +2609,12 @@ int mp_find_ioapic(u32 gsi) /* Find the IOAPIC that manages this GSI. */ for_each_ioapic(i) { struct mp_ioapic_gsi *gsi_cfg = mp_ioapic_gsi_routing(i); + if (gsi >= gsi_cfg->gsi_base && gsi <= gsi_cfg->gsi_end) return i; } - printk(KERN_ERR "ERROR: Unable to locate IOAPIC for GSI %d\n", gsi); + pr_err("ERROR: Unable to locate IOAPIC for GSI %d\n", gsi); return -1; } @@ -2789,12 +2653,10 @@ static int bad_ioapic_register(int idx) static int find_free_ioapic_entry(void) { - int idx; - - for (idx = 0; idx < MAX_IO_APICS; idx++) + for (int idx = 0; idx < MAX_IO_APICS; idx++) { if (ioapics[idx].nr_registers == 0) return idx; - + } return MAX_IO_APICS; } @@ -2805,8 +2667,7 @@ static int find_free_ioapic_entry(void) * @gsi_base: base of GSI associated with the IOAPIC * @cfg: configuration information for the IOAPIC */ -int mp_register_ioapic(int id, u32 address, u32 gsi_base, - struct ioapic_domain_cfg *cfg) +int mp_register_ioapic(int id, u32 address, u32 gsi_base, struct ioapic_domain_cfg *cfg) { bool hotplug = !!ioapic_initialized; struct mp_ioapic_gsi *gsi_cfg; @@ -2817,12 +2678,13 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base, pr_warn("Bogus (zero) I/O APIC address found, skipping!\n"); return -EINVAL; } - for_each_ioapic(ioapic) + + for_each_ioapic(ioapic) { if (ioapics[ioapic].mp_config.apicaddr == address) { - pr_warn("address 0x%x conflicts with IOAPIC%d\n", - address, ioapic); + pr_warn("address 0x%x conflicts with IOAPIC%d\n", address, ioapic); return -EEXIST; } + } idx = find_free_ioapic_entry(); if (idx >= MAX_IO_APICS) { @@ -2857,8 +2719,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base, (gsi_end >= gsi_cfg->gsi_base && gsi_end <= gsi_cfg->gsi_end)) { pr_warn("GSI range [%u-%u] for new IOAPIC conflicts with GSI[%u-%u]\n", - gsi_base, gsi_end, - gsi_cfg->gsi_base, gsi_cfg->gsi_end); + gsi_base, gsi_end, gsi_cfg->gsi_base, gsi_cfg->gsi_end); clear_fixmap(FIX_IO_APIC_BASE_0 + idx); return -ENOSPC; } @@ -2892,8 +2753,7 @@ int mp_register_ioapic(int id, u32 address, u32 gsi_base, ioapics[idx].nr_registers = entries; pr_info("IOAPIC[%d]: apic_id %d, version %d, address 0x%x, GSI %d-%d\n", - idx, mpc_ioapic_id(idx), - mpc_ioapic_ver(idx), mpc_ioapic_addr(idx), + idx, mpc_ioapic_id(idx), mpc_ioapic_ver(idx), mpc_ioapic_addr(idx), gsi_cfg->gsi_base, gsi_cfg->gsi_end); return 0; @@ -2904,11 +2764,13 @@ int mp_unregister_ioapic(u32 gsi_base) int ioapic, pin; int found = 0; - for_each_ioapic(ioapic) + for_each_ioapic(ioapic) { if (ioapics[ioapic].gsi_config.gsi_base == gsi_base) { found = 1; break; } + } + if (!found) { pr_warn("can't find IOAPIC for GSI %d\n", gsi_base); return -ENODEV; @@ -2922,8 +2784,7 @@ int mp_unregister_ioapic(u32 gsi_base) if (irq >= 0) { data = irq_get_chip_data(irq); if (data && data->count) { - pr_warn("pin%d on IOAPIC%d is still in use.\n", - pin, ioapic); + pr_warn("pin%d on IOAPIC%d is still in use.\n", pin, ioapic); return -EBUSY; } } @@ -2958,8 +2819,7 @@ static void mp_irqdomain_get_attr(u32 gsi, struct mp_chip_data *data, if (info && info->ioapic.valid) { data->is_level = info->ioapic.is_level; data->active_low = info->ioapic.active_low; - } else if (__acpi_get_override_irq(gsi, &data->is_level, - &data->active_low) < 0) { + } else if (__acpi_get_override_irq(gsi, &data->is_level, &data->active_low) < 0) { /* PCI interrupts are always active low level triggered. */ data->is_level = true; data->active_low = true; @@ -3017,10 +2877,8 @@ int mp_irqdomain_alloc(struct irq_domain *domain, unsigned int virq, return -ENOMEM; ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, info); - if (ret < 0) { - kfree(data); - return ret; - } + if (ret < 0) + goto free_data; INIT_LIST_HEAD(&data->irq_2_pin); irq_data->hwirq = info->ioapic.pin; @@ -3029,7 +2887,10 @@ int mp_irqdomain_alloc(struct irq_domain *domain, unsigned int virq, irq_data->chip_data = data; mp_irqdomain_get_attr(mp_pin_to_gsi(ioapic, pin), data, info); - add_pin_to_irq_node(data, ioapic_alloc_attr_node(info), ioapic, pin); + if (!add_pin_to_irq_node(data, ioapic_alloc_attr_node(info), ioapic, pin)) { + ret = -ENOMEM; + goto free_irqs; + } mp_preconfigure_entry(data); mp_register_handler(virq, data->is_level); @@ -3039,11 +2900,15 @@ int mp_irqdomain_alloc(struct irq_domain *domain, unsigned int virq, legacy_pic->mask(virq); local_irq_restore(flags); - apic_printk(APIC_VERBOSE, KERN_DEBUG - "IOAPIC[%d]: Preconfigured routing entry (%d-%d -> IRQ %d Level:%i ActiveLow:%i)\n", - ioapic, mpc_ioapic_id(ioapic), pin, virq, - data->is_level, data->active_low); + apic_pr_verbose("IOAPIC[%d]: Preconfigured routing entry (%d-%d -> IRQ %d Level:%i ActiveLow:%i)\n", + ioapic, mpc_ioapic_id(ioapic), pin, virq, data->is_level, data->active_low); return 0; + +free_irqs: + irq_domain_free_irqs_parent(domain, virq, nr_irqs); +free_data: + kfree(data); + return ret; } void mp_irqdomain_free(struct irq_domain *domain, unsigned int virq, @@ -3056,22 +2921,17 @@ void mp_irqdomain_free(struct irq_domain *domain, unsigned int virq, irq_data = irq_domain_get_irq_data(domain, virq); if (irq_data && irq_data->chip_data) { data = irq_data->chip_data; - __remove_pin_from_irq(data, mp_irqdomain_ioapic_idx(domain), - (int)irq_data->hwirq); + __remove_pin_from_irq(data, mp_irqdomain_ioapic_idx(domain), (int)irq_data->hwirq); WARN_ON(!list_empty(&data->irq_2_pin)); kfree(irq_data->chip_data); } irq_domain_free_irqs_top(domain, virq, nr_irqs); } -int mp_irqdomain_activate(struct irq_domain *domain, - struct irq_data *irq_data, bool reserve) +int mp_irqdomain_activate(struct irq_domain *domain, struct irq_data *irq_data, bool reserve) { - unsigned long flags; - - raw_spin_lock_irqsave(&ioapic_lock, flags); + guard(raw_spinlock_irqsave)(&ioapic_lock); ioapic_configure_entry(irq_data); - raw_spin_unlock_irqrestore(&ioapic_lock, flags); return 0; } @@ -3079,8 +2939,7 @@ void mp_irqdomain_deactivate(struct irq_domain *domain, struct irq_data *irq_data) { /* It won't be called for IRQ with multiple IOAPIC pins associated */ - ioapic_mask_entry(mp_irqdomain_ioapic_idx(domain), - (int)irq_data->hwirq); + ioapic_mask_entry(mp_irqdomain_ioapic_idx(domain), (int)irq_data->hwirq); } int mp_irqdomain_ioapic_idx(struct irq_domain *domain) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index be307c9ef263..07a34d723505 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1510,6 +1510,11 @@ static void __init cpu_parse_early_param(void) if (cmdline_find_option_bool(boot_command_line, "nousershstk")) setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK); + /* Minimize the gap between FRED is available and available but disabled. */ + arglen = cmdline_find_option(boot_command_line, "fred", arg, sizeof(arg)); + if (arglen != 2 || strncmp(arg, "on", 2)) + setup_clear_cpu_cap(X86_FEATURE_FRED); + arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg)); if (arglen <= 0) return; @@ -2171,7 +2176,7 @@ static inline void tss_setup_io_bitmap(struct tss_struct *tss) * Setup everything needed to handle exceptions from the IDT, including the IST * exceptions which use paranoid_entry(). */ -void cpu_init_exception_handling(void) +void cpu_init_exception_handling(bool boot_cpu) { struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw); int cpu = raw_smp_processor_id(); @@ -2190,10 +2195,23 @@ void cpu_init_exception_handling(void) /* GHCB needs to be setup to handle #VC. */ setup_ghcb(); + if (cpu_feature_enabled(X86_FEATURE_FRED)) { + /* The boot CPU has enabled FRED during early boot */ + if (!boot_cpu) + cpu_init_fred_exceptions(); + + cpu_init_fred_rsps(); + } else { + load_current_idt(); + } +} + +void __init cpu_init_replace_early_idt(void) +{ if (cpu_feature_enabled(X86_FEATURE_FRED)) cpu_init_fred_exceptions(); else - load_current_idt(); + idt_setup_early_pf(); } /* diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c index b7d9f530ae16..8bd84114c2d9 100644 --- a/arch/x86/kernel/cpu/cpuid-deps.c +++ b/arch/x86/kernel/cpu/cpuid-deps.c @@ -83,7 +83,6 @@ static const struct cpuid_dep cpuid_deps[] = { { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD }, { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES }, { X86_FEATURE_FRED, X86_FEATURE_LKGS }, - { X86_FEATURE_FRED, X86_FEATURE_WRMSRNS }, {} }; diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c index 1640ae76548f..4a4118784c13 100644 --- a/arch/x86/kernel/cpu/feat_ctl.c +++ b/arch/x86/kernel/cpu/feat_ctl.c @@ -188,7 +188,7 @@ update_caps: update_sgx: if (!(msr & FEAT_CTL_SGX_ENABLED)) { if (enable_sgx_kvm || enable_sgx_driver) - pr_err_once("SGX disabled by BIOS.\n"); + pr_err_once("SGX disabled or unsupported by BIOS.\n"); clear_cpu_cap(c, X86_FEATURE_SGX); return; } diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index ead967479fa6..d18078834ded 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -16,7 +16,6 @@ #include #include #include -#include #include #include #include @@ -537,16 +536,6 @@ static void __init ms_hyperv_init_platform(void) if (efi_enabled(EFI_BOOT)) x86_platform.get_nmi_reason = hv_get_nmi_reason; - /* - * Hyper-V VMs have a PIT emulation quirk such that zeroing the - * counter register during PIT shutdown restarts the PIT. So it - * continues to interrupt @18.2 HZ. Setting i8253_clear_counter - * to false tells pit_shutdown() not to zero the counter so that - * the PIT really is shutdown. Generation 2 VMs don't have a PIT, - * and setting this value has no effect. - */ - i8253_clear_counter_on_shutdown = false; - #if IS_ENABLED(CONFIG_HYPERV) if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) || ms_hyperv.paravisor_present) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index f3f1461273ee..3a79105455f1 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -733,7 +733,7 @@ out: return 0; } -/** +/* * A section metric is concatenated in a way that @low bits 12-31 define the * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the * metric. diff --git a/arch/x86/kernel/eisa.c b/arch/x86/kernel/eisa.c index 53935b4d62e3..9535a6507db7 100644 --- a/arch/x86/kernel/eisa.c +++ b/arch/x86/kernel/eisa.c @@ -11,15 +11,15 @@ static __init int eisa_bus_probe(void) { - void __iomem *p; + u32 *p; if ((xen_pv_domain() && !xen_initial_domain()) || cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) return 0; - p = ioremap(0x0FFFD9, 4); - if (p && readl(p) == 'E' + ('I' << 8) + ('S' << 16) + ('A' << 24)) + p = memremap(0x0FFFD9, 4, MEMREMAP_WB); + if (p && *p == 'E' + ('I' << 8) + ('S' << 16) + ('A' << 24)) EISA_bus = 1; - iounmap(p); + memunmap(p); return 0; } subsys_initcall(eisa_bus_probe); diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index 247f2225aa9f..1065ab995305 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -63,6 +63,16 @@ setfx: return true; } +/* + * Update the value of PKRU register that was already pushed onto the signal frame. + */ +static inline int update_pkru_in_sigframe(struct xregs_state __user *buf, u32 pkru) +{ + if (unlikely(!cpu_feature_enabled(X86_FEATURE_OSPKE))) + return 0; + return __put_user(pkru, (unsigned int __user *)get_xsave_addr_user(buf, XFEATURE_PKRU)); +} + /* * Signal frame handlers. */ @@ -156,10 +166,17 @@ static inline bool save_xstate_epilog(void __user *buf, int ia32_frame, return !err; } -static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) +static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf, u32 pkru) { - if (use_xsave()) - return xsave_to_user_sigframe(buf); + int err = 0; + + if (use_xsave()) { + err = xsave_to_user_sigframe(buf); + if (!err) + err = update_pkru_in_sigframe(buf, pkru); + return err; + } + if (use_fxsr()) return fxsave_to_user_sigframe((struct fxregs_state __user *) buf); else @@ -185,7 +202,7 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf) * For [f]xsave state, update the SW reserved fields in the [f]xsave frame * indicating the absence/presence of the extended state to the user. */ -bool copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size) +bool copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size, u32 pkru) { struct task_struct *tsk = current; struct fpstate *fpstate = tsk->thread.fpu.fpstate; @@ -228,7 +245,7 @@ retry: fpregs_restore_userregs(); pagefault_disable(); - ret = copy_fpregs_to_sigframe(buf_fx); + ret = copy_fpregs_to_sigframe(buf_fx, pkru); pagefault_enable(); fpregs_unlock(); diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 1339f8328db5..22abb5ee0cf2 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -23,6 +24,8 @@ #include #include +#include + #include "context.h" #include "internal.h" #include "legacy.h" @@ -996,6 +999,19 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr) } EXPORT_SYMBOL_GPL(get_xsave_addr); +/* + * Given an xstate feature nr, calculate where in the xsave buffer the state is. + * The xsave buffer should be in standard format, not compacted (e.g. user mode + * signal frames). + */ +void __user *get_xsave_addr_user(struct xregs_state __user *xsave, int xfeature_nr) +{ + if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr))) + return NULL; + + return (void __user *)xsave + xstate_offsets[xfeature_nr]; +} + #ifdef CONFIG_ARCH_HAS_PKEYS /* @@ -1841,3 +1857,89 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns, return 0; } #endif /* CONFIG_PROC_PID_ARCH_STATUS */ + +#ifdef CONFIG_COREDUMP +static const char owner_name[] = "LINUX"; + +/* + * Dump type, size, offset and flag values for every xfeature that is present. + */ +static int dump_xsave_layout_desc(struct coredump_params *cprm) +{ + int num_records = 0; + int i; + + for_each_extended_xfeature(i, fpu_user_cfg.max_features) { + struct x86_xfeat_component xc = { + .type = i, + .size = xstate_sizes[i], + .offset = xstate_offsets[i], + /* reserved for future use */ + .flags = 0, + }; + + if (!dump_emit(cprm, &xc, sizeof(xc))) + return 0; + + num_records++; + } + return num_records; +} + +static u32 get_xsave_desc_size(void) +{ + u32 cnt = 0; + u32 i; + + for_each_extended_xfeature(i, fpu_user_cfg.max_features) + cnt++; + + return cnt * (sizeof(struct x86_xfeat_component)); +} + +int elf_coredump_extra_notes_write(struct coredump_params *cprm) +{ + int num_records = 0; + struct elf_note en; + + if (!fpu_user_cfg.max_features) + return 0; + + en.n_namesz = sizeof(owner_name); + en.n_descsz = get_xsave_desc_size(); + en.n_type = NT_X86_XSAVE_LAYOUT; + + if (!dump_emit(cprm, &en, sizeof(en))) + return 1; + if (!dump_emit(cprm, owner_name, en.n_namesz)) + return 1; + if (!dump_align(cprm, 4)) + return 1; + + num_records = dump_xsave_layout_desc(cprm); + if (!num_records) + return 1; + + /* Total size should be equal to the number of records */ + if ((sizeof(struct x86_xfeat_component) * num_records) != en.n_descsz) + return 1; + + return 0; +} + +int elf_coredump_extra_notes_size(void) +{ + int size; + + if (!fpu_user_cfg.max_features) + return 0; + + /* .note header */ + size = sizeof(struct elf_note); + /* Name plus alignment to 4 bytes */ + size += roundup(sizeof(owner_name), 4); + size += get_xsave_desc_size(); + + return size; +} +#endif /* CONFIG_COREDUMP */ diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h index afb404cd2059..0b86a5002c84 100644 --- a/arch/x86/kernel/fpu/xstate.h +++ b/arch/x86/kernel/fpu/xstate.h @@ -54,6 +54,8 @@ extern int copy_sigframe_from_user_to_xstate(struct task_struct *tsk, const void extern void fpu__init_cpu_xstate(void); extern void fpu__init_system_xstate(unsigned int legacy_size); +extern void __user *get_xsave_addr_user(struct xregs_state __user *xsave, int xfeature_nr); + static inline u64 xfeatures_mask_supervisor(void) { return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED; diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c index 4bcd8791ad96..8d32c3f48abc 100644 --- a/arch/x86/kernel/fred.c +++ b/arch/x86/kernel/fred.c @@ -21,17 +21,53 @@ #define FRED_STKLVL(vector, lvl) ((lvl) << (2 * (vector))) +DEFINE_PER_CPU(unsigned long, fred_rsp0); +EXPORT_PER_CPU_SYMBOL(fred_rsp0); + void cpu_init_fred_exceptions(void) { /* When FRED is enabled by default, remove this log message */ pr_info("Initialize FRED on CPU%d\n", smp_processor_id()); + /* + * If a kernel event is delivered before a CPU goes to user level for + * the first time, its SS is NULL thus NULL is pushed into the SS field + * of the FRED stack frame. But before ERETS is executed, the CPU may + * context switch to another task and go to user level. Then when the + * CPU comes back to kernel mode, SS is changed to __KERNEL_DS. Later + * when ERETS is executed to return from the kernel event handler, a #GP + * fault is generated because SS doesn't match the SS saved in the FRED + * stack frame. + * + * Initialize SS to __KERNEL_DS when enabling FRED to avoid such #GPs. + */ + loadsegment(ss, __KERNEL_DS); + wrmsrl(MSR_IA32_FRED_CONFIG, /* Reserve for CALL emulation */ FRED_CONFIG_REDZONE | FRED_CONFIG_INT_STKLVL(0) | FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user)); + wrmsrl(MSR_IA32_FRED_STKLVLS, 0); + wrmsrl(MSR_IA32_FRED_RSP0, 0); + wrmsrl(MSR_IA32_FRED_RSP1, 0); + wrmsrl(MSR_IA32_FRED_RSP2, 0); + wrmsrl(MSR_IA32_FRED_RSP3, 0); + + /* Enable FRED */ + cr4_set_bits(X86_CR4_FRED); + /* Any further IDT use is a bug */ + idt_invalidate(); + + /* Use int $0x80 for 32-bit system calls in FRED mode */ + setup_clear_cpu_cap(X86_FEATURE_SYSENTER32); + setup_clear_cpu_cap(X86_FEATURE_SYSCALL32); +} + +/* Must be called after setup_cpu_entry_areas() */ +void cpu_init_fred_rsps(void) +{ /* * The purpose of separate stacks for NMI, #DB and #MC *in the kernel* * (remember that user space faults are always taken on stack level 0) @@ -47,13 +83,4 @@ void cpu_init_fred_exceptions(void) wrmsrl(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB)); wrmsrl(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI)); wrmsrl(MSR_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF)); - - /* Enable FRED */ - cr4_set_bits(X86_CR4_FRED); - /* Any further IDT use is a bug */ - idt_invalidate(); - - /* Use int $0x80 for 32-bit system calls in FRED mode */ - setup_clear_cpu_cap(X86_FEATURE_SYSENTER32); - setup_clear_cpu_cap(X86_FEATURE_SYSCALL32); } diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index a817ed0724d1..4b9d4557fc94 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -559,10 +559,11 @@ void early_setup_idt(void) */ void __head startup_64_setup_gdt_idt(void) { + struct desc_struct *gdt = (void *)(__force unsigned long)init_per_cpu_var(gdt_page.gdt); void *handler = NULL; struct desc_ptr startup_gdt_descr = { - .address = (unsigned long)&RIP_REL_REF(init_per_cpu_var(gdt_page.gdt)), + .address = (unsigned long)&RIP_REL_REF(*gdt), .size = GDT_SIZE - 1, }; diff --git a/arch/x86/kernel/i8253.c b/arch/x86/kernel/i8253.c index 2b7999a1a50a..80e262bb627f 100644 --- a/arch/x86/kernel/i8253.c +++ b/arch/x86/kernel/i8253.c @@ -8,6 +8,7 @@ #include #include +#include #include #include #include @@ -39,9 +40,15 @@ static bool __init use_pit(void) bool __init pit_timer_init(void) { - if (!use_pit()) + if (!use_pit()) { + /* + * Don't just ignore the PIT. Ensure it's stopped, because + * VMMs otherwise steal CPU time just to pointlessly waggle + * the (masked) IRQ. + */ + clockevent_i8253_disable(); return false; - + } clockevent_i8253_init(true); global_clock_event = &i8253_clockevent; return true; diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index cc0f7f70b17b..9c9ac606893e 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -28,6 +28,7 @@ #include #include #include +#include #ifdef CONFIG_ACPI /* @@ -87,6 +88,8 @@ map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p) { #ifdef CONFIG_EFI unsigned long mstart, mend; + void *kaddr; + int ret; if (!efi_enabled(EFI_BOOT)) return 0; @@ -102,6 +105,30 @@ map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p) if (!mstart) return 0; + ret = kernel_ident_mapping_init(info, level4p, mstart, mend); + if (ret) + return ret; + + kaddr = memremap(mstart, mend - mstart, MEMREMAP_WB); + if (!kaddr) { + pr_err("Could not map UEFI system table\n"); + return -ENOMEM; + } + + mstart = efi_config_table; + + if (efi_enabled(EFI_64BIT)) { + efi_system_table_64_t *stbl = (efi_system_table_64_t *)kaddr; + + mend = mstart + sizeof(efi_config_table_64_t) * stbl->nr_tables; + } else { + efi_system_table_32_t *stbl = (efi_system_table_32_t *)kaddr; + + mend = mstart + sizeof(efi_config_table_32_t) * stbl->nr_tables; + } + + memunmap(kaddr); + return kernel_ident_mapping_init(info, level4p, mstart, mend); #endif return 0; diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c index e89171b0347a..4a1b1b28abf9 100644 --- a/arch/x86/kernel/mpparse.c +++ b/arch/x86/kernel/mpparse.c @@ -68,7 +68,7 @@ static void __init mpc_oem_bus_info(struct mpc_bus *m, char *str) { memcpy(str, m->bustype, 6); str[6] = 0; - apic_printk(APIC_VERBOSE, "Bus #%d is %s\n", m->busid, str); + apic_pr_verbose("Bus #%d is %s\n", m->busid, str); } static void __init MP_bus_info(struct mpc_bus *m) @@ -417,7 +417,7 @@ static unsigned long __init get_mpc_size(unsigned long physptr) mpc = early_memremap(physptr, PAGE_SIZE); size = mpc->length; early_memunmap(mpc, PAGE_SIZE); - apic_printk(APIC_VERBOSE, " mpc: %lx-%lx\n", physptr, physptr + size); + apic_pr_verbose(" mpc: %lx-%lx\n", physptr, physptr + size); return size; } @@ -560,8 +560,7 @@ static int __init smp_scan_config(unsigned long base, unsigned long length) struct mpf_intel *mpf; int ret = 0; - apic_printk(APIC_VERBOSE, "Scan for SMP in [mem %#010lx-%#010lx]\n", - base, base + length - 1); + apic_pr_verbose("Scan for SMP in [mem %#010lx-%#010lx]\n", base, base + length - 1); BUILD_BUG_ON(sizeof(*mpf) != 16); while (length > 0) { @@ -683,13 +682,13 @@ static void __init check_irq_src(struct mpc_intsrc *m, int *nr_m_spare) { int i; - apic_printk(APIC_VERBOSE, "OLD "); + apic_pr_verbose("OLD "); print_mp_irq_info(m); i = get_MP_intsrc_index(m); if (i > 0) { memcpy(m, &mp_irqs[i], sizeof(*m)); - apic_printk(APIC_VERBOSE, "NEW "); + apic_pr_verbose("NEW "); print_mp_irq_info(&mp_irqs[i]); return; } @@ -772,7 +771,7 @@ static int __init replace_intsrc_all(struct mpc_table *mpc, continue; if (nr_m_spare > 0) { - apic_printk(APIC_VERBOSE, "*NEW* found\n"); + apic_pr_verbose("*NEW* found\n"); nr_m_spare--; memcpy(m_spare[nr_m_spare], &mp_irqs[i], sizeof(mp_irqs[i])); m_spare[nr_m_spare] = NULL; diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 6d3d20e3e43a..226472332a70 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -798,6 +798,32 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr) #define LAM_U57_BITS 6 +static void enable_lam_func(void *__mm) +{ + struct mm_struct *mm = __mm; + unsigned long lam; + + if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm) { + lam = mm_lam_cr3_mask(mm); + write_cr3(__read_cr3() | lam); + cpu_tlbstate_update_lam(lam, mm_untag_mask(mm)); + } +} + +static void mm_enable_lam(struct mm_struct *mm) +{ + mm->context.lam_cr3_mask = X86_CR3_LAM_U57; + mm->context.untag_mask = ~GENMASK(62, 57); + + /* + * Even though the process must still be single-threaded at this + * point, kernel threads may be using the mm. IPI those kernel + * threads if they exist. + */ + on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true); + set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags); +} + static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits) { if (!cpu_feature_enabled(X86_FEATURE_LAM)) @@ -814,25 +840,21 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits) if (mmap_write_lock_killable(mm)) return -EINTR; + /* + * MM_CONTEXT_LOCK_LAM is set on clone. Prevent LAM from + * being enabled unless the process is single threaded: + */ if (test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags)) { mmap_write_unlock(mm); return -EBUSY; } - if (!nr_bits) { - mmap_write_unlock(mm); - return -EINVAL; - } else if (nr_bits <= LAM_U57_BITS) { - mm->context.lam_cr3_mask = X86_CR3_LAM_U57; - mm->context.untag_mask = ~GENMASK(62, 57); - } else { + if (!nr_bits || nr_bits > LAM_U57_BITS) { mmap_write_unlock(mm); return -EINVAL; } - write_cr3(__read_cr3() | mm->context.lam_cr3_mask); - set_tlbstate_lam_mode(mm); - set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags); + mm_enable_lam(mm); mmap_write_unlock(mm); diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S index 042c9a0334e9..e9e88c342f75 100644 --- a/arch/x86/kernel/relocate_kernel_64.S +++ b/arch/x86/kernel/relocate_kernel_64.S @@ -170,6 +170,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped) wbinvd .Lsme_off: + /* Save the preserve_context to %r11 as swap_pages clobbers %rcx. */ movq %rcx, %r11 call swap_pages @@ -258,7 +259,7 @@ SYM_CODE_END(virtual_mapped) /* Do the copies */ SYM_CODE_START_LOCAL_NOALIGN(swap_pages) UNWIND_HINT_END_OF_STACK - movq %rdi, %rcx /* Put the page_list in %rcx */ + movq %rdi, %rcx /* Put the indirection_page in %rcx */ xorl %edi, %edi xorl %esi, %esi jmp 1f @@ -289,18 +290,21 @@ SYM_CODE_START_LOCAL_NOALIGN(swap_pages) movq %rcx, %rsi /* For ever source page do a copy */ andq $0xfffffffffffff000, %rsi - movq %rdi, %rdx - movq %rsi, %rax + movq %rdi, %rdx /* Save destination page to %rdx */ + movq %rsi, %rax /* Save source page to %rax */ + /* copy source page to swap page */ movq %r10, %rdi movl $512, %ecx rep ; movsq + /* copy destination page to source page */ movq %rax, %rdi movq %rdx, %rsi movl $512, %ecx rep ; movsq + /* copy swap page to destination page */ movq %rdx, %rdi movq %r10, %rsi movl $512, %ecx diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 6129dc2ba784..f1fea506e20f 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1039,7 +1039,12 @@ void __init setup_arch(char **cmdline_p) init_mem_mapping(); - idt_setup_early_pf(); + /* + * init_mem_mapping() relies on the early IDT page fault handling. + * Now either enable FRED or install the real page fault handler + * for 64-bit in the IDT. + */ + cpu_init_replace_early_idt(); /* * Update mmu_cr4_features (and, indirectly, trampoline_cr4_features) diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index 31b6f5dddfc2..5f441039b572 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -60,6 +60,24 @@ static inline int is_x32_frame(struct ksignal *ksig) ksig->ka.sa.sa_flags & SA_X32_ABI; } +/* + * Enable all pkeys temporarily, so as to ensure that both the current + * execution stack as well as the alternate signal stack are writeable. + * The application can use any of the available pkeys to protect the + * alternate signal stack, and we don't know which one it is, so enable + * all. The PKRU register will be reset to init_pkru later in the flow, + * in fpu__clear_user_states(), and it is the application's responsibility + * to enable the appropriate pkey as the first step in the signal handler + * so that the handler does not segfault. + */ +static inline u32 sig_prepare_pkru(void) +{ + u32 orig_pkru = read_pkru(); + + write_pkru(0); + return orig_pkru; +} + /* * Set up a signal frame. */ @@ -84,6 +102,7 @@ get_sigframe(struct ksignal *ksig, struct pt_regs *regs, size_t frame_size, unsigned long math_size = 0; unsigned long sp = regs->sp; unsigned long buf_fx = 0; + u32 pkru; /* redzone */ if (!ia32_frame) @@ -138,9 +157,17 @@ get_sigframe(struct ksignal *ksig, struct pt_regs *regs, size_t frame_size, return (void __user *)-1L; } + /* Update PKRU to enable access to the alternate signal stack. */ + pkru = sig_prepare_pkru(); /* save i387 and extended state */ - if (!copy_fpstate_to_sigframe(*fpstate, (void __user *)buf_fx, math_size)) + if (!copy_fpstate_to_sigframe(*fpstate, (void __user *)buf_fx, math_size, pkru)) { + /* + * Restore PKRU to the original, user-defined value; disable + * extra pkeys enabled for the alternate signal stack, if any. + */ + write_pkru(pkru); return (void __user *)-1L; + } return (void __user *)sp; } diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c index 8a94053c5444..ee9453891901 100644 --- a/arch/x86/kernel/signal_64.c +++ b/arch/x86/kernel/signal_64.c @@ -260,15 +260,15 @@ SYSCALL_DEFINE0(rt_sigreturn) set_current_blocked(&set); + if (restore_altstack(&frame->uc.uc_stack)) + goto badframe; + if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags)) goto badframe; if (restore_signal_shadow_stack()) goto badframe; - if (restore_altstack(&frame->uc.uc_stack)) - goto badframe; - return regs->ax; badframe: diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 0c35207320cb..dc4fff8fccce 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -246,7 +246,7 @@ static void notrace start_secondary(void *unused) __flush_tlb_all(); } - cpu_init_exception_handling(); + cpu_init_exception_handling(false); /* * Load the microcode before reaching the AP alive synchronization diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 4fa0b17e5043..d05392db5d0f 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -91,6 +92,47 @@ __always_inline int is_valid_bugaddr(unsigned long addr) return *(unsigned short *)addr == INSN_UD2; } +/* + * Check for UD1 or UD2, accounting for Address Size Override Prefixes. + * If it's a UD1, get the ModRM byte to pass along to UBSan. + */ +__always_inline int decode_bug(unsigned long addr, u32 *imm) +{ + u8 v; + + if (addr < TASK_SIZE_MAX) + return BUG_NONE; + + v = *(u8 *)(addr++); + if (v == INSN_ASOP) + v = *(u8 *)(addr++); + if (v != OPCODE_ESCAPE) + return BUG_NONE; + + v = *(u8 *)(addr++); + if (v == SECOND_BYTE_OPCODE_UD2) + return BUG_UD2; + + if (!IS_ENABLED(CONFIG_UBSAN_TRAP) || v != SECOND_BYTE_OPCODE_UD1) + return BUG_NONE; + + /* Retrieve the immediate (type value) for the UBSAN UD1 */ + v = *(u8 *)(addr++); + if (X86_MODRM_RM(v) == 4) + addr++; + + *imm = 0; + if (X86_MODRM_MOD(v) == 1) + *imm = *(u8 *)addr; + else if (X86_MODRM_MOD(v) == 2) + *imm = *(u32 *)addr; + else + WARN_ONCE(1, "Unexpected MODRM_MOD: %u\n", X86_MODRM_MOD(v)); + + return BUG_UD1; +} + + static nokprobe_inline int do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str, struct pt_regs *regs, long error_code) @@ -216,6 +258,8 @@ static inline void handle_invalid_op(struct pt_regs *regs) static noinstr bool handle_bug(struct pt_regs *regs) { bool handled = false; + int ud_type; + u32 imm; /* * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug() @@ -223,7 +267,8 @@ static noinstr bool handle_bug(struct pt_regs *regs) * irqentry_enter(). */ kmsan_unpoison_entry_regs(regs); - if (!is_valid_bugaddr(regs->ip)) + ud_type = decode_bug(regs->ip, &imm); + if (ud_type == BUG_NONE) return handled; /* @@ -236,10 +281,14 @@ static noinstr bool handle_bug(struct pt_regs *regs) */ if (regs->flags & X86_EFLAGS_IF) raw_local_irq_enable(); - if (report_bug(regs->ip, regs) == BUG_TRAP_TYPE_WARN || - handle_cfi_failure(regs) == BUG_TRAP_TYPE_WARN) { - regs->ip += LEN_UD2; - handled = true; + if (ud_type == BUG_UD2) { + if (report_bug(regs->ip, regs) == BUG_TRAP_TYPE_WARN || + handle_cfi_failure(regs) == BUG_TRAP_TYPE_WARN) { + regs->ip += LEN_UD2; + handled = true; + } + } else if (IS_ENABLED(CONFIG_UBSAN_TRAP)) { + pr_crit("%s at %pS\n", report_ubsan_failure(regs, imm), (void *)regs->ip); } if (regs->flags & X86_EFLAGS_IF) raw_local_irq_disable(); @@ -1402,34 +1451,8 @@ DEFINE_IDTENTRY_SW(iret_error) } #endif -/* Do not enable FRED by default yet. */ -static bool enable_fred __ro_after_init = false; - -#ifdef CONFIG_X86_FRED -static int __init fred_setup(char *str) -{ - if (!str) - return -EINVAL; - - if (!cpu_feature_enabled(X86_FEATURE_FRED)) - return 0; - - if (!strcmp(str, "on")) - enable_fred = true; - else if (!strcmp(str, "off")) - enable_fred = false; - else - pr_warn("invalid FRED option: 'fred=%s'\n", str); - return 0; -} -early_param("fred", fred_setup); -#endif - void __init trap_init(void) { - if (cpu_feature_enabled(X86_FEATURE_FRED) && !enable_fred) - setup_clear_cpu_cap(X86_FEATURE_FRED); - /* Init cpu_entry_area before IST entries are set up */ setup_cpu_entry_areas(); @@ -1437,7 +1460,7 @@ void __init trap_init(void) sev_es_init_vc_handling(); /* Initialize TSS before setting up traps so ISTs work */ - cpu_init_exception_handling(); + cpu_init_exception_handling(true); /* Setup traps as cpu_init() might #GP */ if (!cpu_feature_enabled(X86_FEATURE_FRED)) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index d4462fb26299..dfe6847fd99e 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -28,6 +28,7 @@ #include #include #include +#include #include unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */ @@ -1253,15 +1254,12 @@ static void __init check_system_tsc_reliable(void) * - TSC which does not stop in C-States * - the TSC_ADJUST register which allows to detect even minimal * modifications - * - not more than two sockets. As the number of sockets cannot be - * evaluated at the early boot stage where this has to be - * invoked, check the number of online memory nodes as a - * fallback solution which is an reasonable estimate. + * - not more than four packages */ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && boot_cpu_has(X86_FEATURE_TSC_ADJUST) && - nr_online_nodes <= 4) + topology_max_packages() <= 4) tsc_disable_clocksource_watchdog(); } @@ -1290,7 +1288,7 @@ int unsynchronized_tsc(void) */ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) { /* assume multi socket systems are not synchronized: */ - if (num_possible_cpus() > 1) + if (topology_max_packages() > 1) return 1; } diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c index e91500a80963..575f863f3c75 100644 --- a/arch/x86/mm/cpu_entry_area.c +++ b/arch/x86/mm/cpu_entry_area.c @@ -164,7 +164,7 @@ static void __init percpu_setup_exception_stacks(unsigned int cpu) } } #else -static inline void percpu_setup_exception_stacks(unsigned int cpu) +static void __init percpu_setup_exception_stacks(unsigned int cpu) { struct cpu_entry_area *cea = get_cpu_entry_area(cpu); diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c index c45127265f2f..437e96fb4977 100644 --- a/arch/x86/mm/ident_map.c +++ b/arch/x86/mm/ident_map.c @@ -99,18 +99,31 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page, for (; addr < end; addr = next) { pud_t *pud = pud_page + pud_index(addr); pmd_t *pmd; + bool use_gbpage; next = (addr & PUD_MASK) + PUD_SIZE; if (next > end) next = end; - if (info->direct_gbpages) { + /* if this is already a gbpage, this portion is already mapped */ + if (pud_leaf(*pud)) + continue; + + /* Is using a gbpage allowed? */ + use_gbpage = info->direct_gbpages; + + /* Don't use gbpage if it maps more than the requested region. */ + /* at the begining: */ + use_gbpage &= ((addr & ~PUD_MASK) == 0); + /* ... or at the end: */ + use_gbpage &= ((next & ~PUD_MASK) == 0); + + /* Never overwrite existing mappings */ + use_gbpage &= !pud_present(*pud); + + if (use_gbpage) { pud_t pudval; - if (pud_present(*pud)) - continue; - - addr &= PUD_MASK; pudval = __pud((addr - info->offset) | info->page_flag); set_pud(pud, pudval); continue; diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index aa7d279321ea..70b02fc61d93 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -457,7 +458,7 @@ void iounmap(volatile void __iomem *addr) { struct vm_struct *p, *o; - if ((void __force *)addr <= high_memory) + if (WARN_ON_ONCE(!is_ioremap_addr((void __force *)addr))) return; /* diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c index 9c52a95937ad..6f8e0f21c710 100644 --- a/arch/x86/mm/srat.c +++ b/arch/x86/mm/srat.c @@ -57,8 +57,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa) } set_apicid_to_node(apic_id, node); node_set(node, numa_nodes_parsed); - printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%04x -> Node %u\n", - pxm, apic_id, node); + pr_debug("SRAT: PXM %u -> APIC 0x%04x -> Node %u\n", pxm, apic_id, node); } /* Callback for Proximity Domain -> LAPIC mapping */ @@ -98,8 +97,7 @@ acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa) set_apicid_to_node(apic_id, node); node_set(node, numa_nodes_parsed); - printk(KERN_INFO "SRAT: PXM %u -> APIC 0x%02x -> Node %u\n", - pxm, apic_id, node); + pr_debug("SRAT: PXM %u -> APIC 0x%02x -> Node %u\n", pxm, apic_id, node); } int __init x86_acpi_numa_init(void) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 44ac64f3a047..86593d1b787d 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include @@ -85,9 +86,6 @@ * */ -/* There are 12 bits of space for ASIDS in CR3 */ -#define CR3_HW_ASID_BITS 12 - /* * When enabled, MITIGATION_PAGE_TABLE_ISOLATION consumes a single bit for * user/kernel switches @@ -160,7 +158,6 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam) unsigned long cr3 = __sme_pa(pgd) | lam; if (static_cpu_has(X86_FEATURE_PCID)) { - VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE); cr3 |= kern_pcid(asid); } else { VM_WARN_ON_ONCE(asid != 0); @@ -503,9 +500,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next, { struct mm_struct *prev = this_cpu_read(cpu_tlbstate.loaded_mm); u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); - unsigned long new_lam = mm_lam_cr3_mask(next); bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy); unsigned cpu = smp_processor_id(); + unsigned long new_lam; u64 next_tlb_gen; bool need_flush; u16 new_asid; @@ -619,9 +616,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next, cpumask_clear_cpu(cpu, mm_cpumask(prev)); } - /* - * Start remote flushes and then read tlb_gen. - */ + /* Start receiving IPIs and then read tlb_gen (and LAM below) */ if (next != &init_mm) cpumask_set_cpu(cpu, mm_cpumask(next)); next_tlb_gen = atomic64_read(&next->context.tlb_gen); @@ -633,7 +628,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next, barrier(); } - set_tlbstate_lam_mode(next); + new_lam = mm_lam_cr3_mask(next); if (need_flush) { this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id); this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen); @@ -652,6 +647,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next, this_cpu_write(cpu_tlbstate.loaded_mm, next); this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid); + cpu_tlbstate_update_lam(new_lam, mm_untag_mask(next)); if (next != prev) { cr4_update_pce_mm(next); @@ -698,6 +694,7 @@ void initialize_tlbstate_and_flush(void) int i; struct mm_struct *mm = this_cpu_read(cpu_tlbstate.loaded_mm); u64 tlb_gen = atomic64_read(&init_mm.context.tlb_gen); + unsigned long lam = mm_lam_cr3_mask(mm); unsigned long cr3 = __read_cr3(); /* Assert that CR3 already references the right mm. */ @@ -705,7 +702,7 @@ void initialize_tlbstate_and_flush(void) /* LAM expected to be disabled */ WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57)); - WARN_ON(mm_lam_cr3_mask(mm)); + WARN_ON(lam); /* * Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization @@ -724,7 +721,7 @@ void initialize_tlbstate_and_flush(void) this_cpu_write(cpu_tlbstate.next_asid, 1); this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id); this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen); - set_tlbstate_lam_mode(mm); + cpu_tlbstate_update_lam(lam, mm_untag_mask(mm)); for (i = 1; i < TLB_NR_DYN_ASIDS; i++) this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0); diff --git a/drivers/clocksource/i8253.c b/drivers/clocksource/i8253.c index d4350bb10b83..39f7c2d736d1 100644 --- a/drivers/clocksource/i8253.c +++ b/drivers/clocksource/i8253.c @@ -20,13 +20,6 @@ DEFINE_RAW_SPINLOCK(i8253_lock); EXPORT_SYMBOL(i8253_lock); -/* - * Handle PIT quirk in pit_shutdown() where zeroing the counter register - * restarts the PIT, negating the shutdown. On platforms with the quirk, - * platform specific code can set this to false. - */ -bool i8253_clear_counter_on_shutdown __ro_after_init = true; - #ifdef CONFIG_CLKSRC_I8253 /* * Since the PIT overflows every tick, its not very useful @@ -108,21 +101,47 @@ int __init clocksource_i8253_init(void) #endif #ifdef CONFIG_CLKEVT_I8253 +void clockevent_i8253_disable(void) +{ + raw_spin_lock(&i8253_lock); + + /* + * Writing the MODE register should stop the counter, according to + * the datasheet. This appears to work on real hardware (well, on + * modern Intel and AMD boxes; I didn't dig the Pegasos out of the + * shed). + * + * However, some virtual implementations differ, and the MODE change + * doesn't have any effect until either the counter is written (KVM + * in-kernel PIT) or the next interrupt (QEMU). And in those cases, + * it may not stop the *count*, only the interrupts. Although in + * the virt case, that probably doesn't matter, as the value of the + * counter will only be calculated on demand if the guest reads it; + * it's the interrupts which cause steal time. + * + * Hyper-V apparently has a bug where even in mode 0, the IRQ keeps + * firing repeatedly if the counter is running. But it *does* do the + * right thing when the MODE register is written. + * + * So: write the MODE and then load the counter, which ensures that + * the IRQ is stopped on those buggy virt implementations. And then + * write the MODE again, which is the right way to stop it. + */ + outb_p(0x30, PIT_MODE); + outb_p(0, PIT_CH0); + outb_p(0, PIT_CH0); + + outb_p(0x30, PIT_MODE); + + raw_spin_unlock(&i8253_lock); +} + static int pit_shutdown(struct clock_event_device *evt) { if (!clockevent_state_oneshot(evt) && !clockevent_state_periodic(evt)) return 0; - raw_spin_lock(&i8253_lock); - - outb_p(0x30, PIT_MODE); - - if (i8253_clear_counter_on_shutdown) { - outb_p(0, PIT_CH0); - outb_p(0, PIT_CH0); - } - - raw_spin_unlock(&i8253_lock); + clockevent_i8253_disable(); return 0; } diff --git a/drivers/hwmon/k10temp.c b/drivers/hwmon/k10temp.c index 543526bac042..f96b91e43312 100644 --- a/drivers/hwmon/k10temp.c +++ b/drivers/hwmon/k10temp.c @@ -548,6 +548,7 @@ static const struct pci_device_id k10temp_id_table[] = { { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) }, { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M00H_DF_F3) }, { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M20H_DF_F3) }, + { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M60H_DF_F3) }, { PCI_VDEVICE(HYGON, PCI_DEVICE_ID_AMD_17H_DF_F3) }, {} }; diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c index e090ca07364b..7a6d188e3bea 100644 --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -1352,12 +1352,11 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data, case X86_IRQ_ALLOC_TYPE_IOAPIC: /* Set source-id of interrupt request */ set_ioapic_sid(irte, info->devid); - apic_printk(APIC_VERBOSE, KERN_DEBUG "IOAPIC[%d]: Set IRTE entry (P:%d FPD:%d Dst_Mode:%d Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X Avail:%X Vector:%02X Dest:%08X SID:%04X SQ:%X SVT:%X)\n", - info->devid, irte->present, irte->fpd, - irte->dst_mode, irte->redir_hint, - irte->trigger_mode, irte->dlvry_mode, - irte->avail, irte->vector, irte->dest_id, - irte->sid, irte->sq, irte->svt); + apic_pr_verbose("IOAPIC[%d]: Set IRTE entry (P:%d FPD:%d Dst_Mode:%d Redir_hint:%d Trig_Mode:%d Dlvry_Mode:%X Avail:%X Vector:%02X Dest:%08X SID:%04X SQ:%X SVT:%X)\n", + info->devid, irte->present, irte->fpd, irte->dst_mode, + irte->redir_hint, irte->trigger_mode, irte->dlvry_mode, + irte->avail, irte->vector, irte->dest_id, irte->sid, + irte->sq, irte->svt); sub_handle = info->ioapic.pin; break; case X86_IRQ_ALLOC_TYPE_HPET: diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 04e748c5955f..12a0fd890659 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -2041,7 +2041,7 @@ static int elf_core_dump(struct coredump_params *cprm) { size_t sz = info.size; - /* For cell spufs */ + /* For cell spufs and x86 xstate */ sz += elf_coredump_extra_notes_size(); phdr4note = kmalloc(sizeof(*phdr4note), GFP_KERNEL); @@ -2105,7 +2105,7 @@ static int elf_core_dump(struct coredump_params *cprm) if (!write_note_info(&info, cprm)) goto end_coredump; - /* For cell spufs */ + /* For cell spufs and x86 xstate */ if (elf_coredump_extra_notes_write(cprm)) goto end_coredump; diff --git a/include/linux/i8253.h b/include/linux/i8253.h index 8336b2f6f834..56c280eb2d4f 100644 --- a/include/linux/i8253.h +++ b/include/linux/i8253.h @@ -21,9 +21,9 @@ #define PIT_LATCH ((PIT_TICK_RATE + HZ/2) / HZ) extern raw_spinlock_t i8253_lock; -extern bool i8253_clear_counter_on_shutdown; extern struct clock_event_device i8253_clockevent; extern void clockevent_i8253_init(bool oneshot); +extern void clockevent_i8253_disable(void); extern void setup_pit_timer(void); diff --git a/include/linux/ioremap.h b/include/linux/ioremap.h index f0e99fc7dd8b..2bd1661fe9ad 100644 --- a/include/linux/ioremap.h +++ b/include/linux/ioremap.h @@ -4,6 +4,7 @@ #include #include +#include #if defined(CONFIG_HAS_IOMEM) || defined(CONFIG_GENERIC_IOREMAP) /* diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index e388c8b1cbc2..91182aa1d2ec 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -580,6 +580,7 @@ #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F3 0x12fb #define PCI_DEVICE_ID_AMD_1AH_M00H_DF_F3 0x12c3 #define PCI_DEVICE_ID_AMD_1AH_M20H_DF_F3 0x16fb +#define PCI_DEVICE_ID_AMD_1AH_M60H_DF_F3 0x124b #define PCI_DEVICE_ID_AMD_1AH_M70H_DF_F3 0x12bb #define PCI_DEVICE_ID_AMD_MI200_DF_F3 0x14d3 #define PCI_DEVICE_ID_AMD_MI300_DF_F3 0x152b diff --git a/include/linux/ubsan.h b/include/linux/ubsan.h index bff7445498de..d8219cbe09ff 100644 --- a/include/linux/ubsan.h +++ b/include/linux/ubsan.h @@ -4,6 +4,11 @@ #ifdef CONFIG_UBSAN_TRAP const char *report_ubsan_failure(struct pt_regs *regs, u32 check_type); +#else +static inline const char *report_ubsan_failure(struct pt_regs *regs, u32 check_type) +{ + return NULL; +} #endif #endif diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index 81762ff3c99e..b9935988da5c 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -411,6 +411,7 @@ typedef struct elf64_shdr { #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */ /* Old binutils treats 0x203 as a CET state */ #define NT_X86_SHSTK 0x204 /* x86 SHSTK state */ +#define NT_X86_XSAVE_LAYOUT 0x205 /* XSAVE layout description */ #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */ #define NT_S390_TIMER 0x301 /* s390 timer register */ #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */ diff --git a/kernel/kcov.c b/kernel/kcov.c index 274b6b7c718d..28a6be6e64fd 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -1067,6 +1068,32 @@ u64 kcov_common_handle(void) } EXPORT_SYMBOL(kcov_common_handle); +#ifdef CONFIG_KCOV_SELFTEST +static void __init selftest(void) +{ + unsigned long start; + + pr_err("running self test\n"); + /* + * Test that interrupts don't produce spurious coverage. + * The coverage callback filters out interrupt code, but only + * after the handler updates preempt count. Some code periodically + * leaks out of that section and leads to spurious coverage. + * It's hard to call the actual interrupt handler directly, + * so we just loop here for a bit waiting for a timer interrupt. + * We set kcov_mode to enable tracing, but don't setup the area, + * so any attempt to trace will crash. Note: we must not call any + * potentially traced functions in this region. + */ + start = jiffies; + current->kcov_mode = KCOV_MODE_TRACE_PC; + while ((jiffies - start) * MSEC_PER_SEC / HZ < 300) + ; + current->kcov_mode = 0; + pr_err("done running self test\n"); +} +#endif + static int __init kcov_init(void) { int cpu; @@ -1086,6 +1113,10 @@ static int __init kcov_init(void) */ debugfs_create_file_unsafe("kcov", 0600, NULL, NULL, &kcov_fops); +#ifdef CONFIG_KCOV_SELFTEST + selftest(); +#endif + return 0; } diff --git a/kernel/module/Makefile b/kernel/module/Makefile index 28b695888a4b..1ad496a776ed 100644 --- a/kernel/module/Makefile +++ b/kernel/module/Makefile @@ -5,7 +5,7 @@ # These are called from save_stack_trace() on slub debug path, # and produce insane amounts of uninteresting coverage. -KCOV_INSTRUMENT_module.o := n +KCOV_INSTRUMENT_main.o := n obj-y += main.o obj-y += strict_rwx.o diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a40aa606cd04..26354671b37d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2173,6 +2173,14 @@ config KCOV_IRQ_AREA_SIZE soft interrupts. This specifies the size of those areas in the number of unsigned long words. +config KCOV_SELFTEST + bool "Perform short selftests on boot" + depends on KCOV + help + Run short KCOV coverage collection selftests on boot. + On test failure, causes the kernel to panic. Recommended to be + enabled, ensuring critical functionality works as intended. + menuconfig RUNTIME_TESTING_MENU bool "Runtime Testing" default y diff --git a/lib/Kconfig.ubsan b/lib/Kconfig.ubsan index bdda600f8dfb..1d4aa7a83b3a 100644 --- a/lib/Kconfig.ubsan +++ b/lib/Kconfig.ubsan @@ -29,8 +29,8 @@ config UBSAN_TRAP Also note that selecting Y will cause your kernel to Oops with an "illegal instruction" error with no further details - when a UBSAN violation occurs. (Except on arm64, which will - report which Sanitizer failed.) This may make it hard to + when a UBSAN violation occurs. (Except on arm64 and x86, which + will report which Sanitizer failed.) This may make it hard to determine whether an Oops was caused by UBSAN or to figure out the details of a UBSAN violation. It makes the kernel log output less useful for bug reports. diff --git a/scripts/checktransupdate.py b/scripts/checktransupdate.py index 5a0fc99e3f93..578c3fecfdfd 100755 --- a/scripts/checktransupdate.py +++ b/scripts/checktransupdate.py @@ -10,31 +10,28 @@ differences occur, report the file and commits that need to be updated. The usage is as follows: - ./scripts/checktransupdate.py -l zh_CN -This will print all the files that need to be updated in the zh_CN locale. +This will print all the files that need to be updated or translated in the zh_CN locale. - ./scripts/checktransupdate.py Documentation/translations/zh_CN/dev-tools/testing-overview.rst This will only print the status of the specified file. The output is something like: -Documentation/translations/zh_CN/dev-tools/testing-overview.rst (1 commits) +Documentation/dev-tools/kfence.rst +No translation in the locale of zh_CN + +Documentation/translations/zh_CN/dev-tools/testing-overview.rst commit 42fb9cfd5b18 ("Documentation: dev-tools: Add link to RV docs") +1 commits needs resolving in total """ import os -from argparse import ArgumentParser, BooleanOptionalAction +import time +import logging +from argparse import ArgumentParser, ArgumentTypeError, BooleanOptionalAction from datetime import datetime -flag_p_c = False -flag_p_uf = False -flag_debug = False - - -def dprint(*args, **kwargs): - if flag_debug: - print("[DEBUG] ", end="") - print(*args, **kwargs) - def get_origin_path(file_path): + """Get the origin path from the translation path""" paths = file_path.split("/") tidx = paths.index("translations") opaths = paths[:tidx] @@ -43,17 +40,16 @@ def get_origin_path(file_path): def get_latest_commit_from(file_path, commit): - command = "git log --pretty=format:%H%n%aD%n%cD%n%n%B {} -1 -- {}".format( - commit, file_path - ) - dprint(command) + """Get the latest commit from the specified commit for the specified file""" + command = f"git log --pretty=format:%H%n%aD%n%cD%n%n%B {commit} -1 -- {file_path}" + logging.debug(command) pipe = os.popen(command) result = pipe.read() result = result.split("\n") if len(result) <= 1: return None - dprint("Result: {}".format(result[0])) + logging.debug("Result: %s", result[0]) return { "hash": result[0], @@ -64,17 +60,19 @@ def get_latest_commit_from(file_path, commit): def get_origin_from_trans(origin_path, t_from_head): + """Get the latest origin commit from the translation commit""" o_from_t = get_latest_commit_from(origin_path, t_from_head["hash"]) while o_from_t is not None and o_from_t["author_date"] > t_from_head["author_date"]: o_from_t = get_latest_commit_from(origin_path, o_from_t["hash"] + "^") if o_from_t is not None: - dprint("tracked origin commit id: {}".format(o_from_t["hash"])) + logging.debug("tracked origin commit id: %s", o_from_t["hash"]) return o_from_t def get_commits_count_between(opath, commit1, commit2): - command = "git log --pretty=format:%H {}...{} -- {}".format(commit1, commit2, opath) - dprint(command) + """Get the commits count between two commits for the specified file""" + command = f"git log --pretty=format:%H {commit1}...{commit2} -- {opath}" + logging.debug(command) pipe = os.popen(command) result = pipe.read().split("\n") # filter out empty lines @@ -83,50 +81,120 @@ def get_commits_count_between(opath, commit1, commit2): def pretty_output(commit): - command = "git log --pretty='format:%h (\"%s\")' -1 {}".format(commit) - dprint(command) + """Pretty print the commit message""" + command = f"git log --pretty='format:%h (\"%s\")' -1 {commit}" + logging.debug(command) pipe = os.popen(command) return pipe.read() +def valid_commit(commit): + """Check if the commit is valid or not""" + msg = pretty_output(commit) + return "Merge tag" not in msg + def check_per_file(file_path): + """Check the translation status for the specified file""" opath = get_origin_path(file_path) if not os.path.isfile(opath): - dprint("Error: Cannot find the origin path for {}".format(file_path)) + logging.error("Cannot find the origin path for {file_path}") return o_from_head = get_latest_commit_from(opath, "HEAD") t_from_head = get_latest_commit_from(file_path, "HEAD") if o_from_head is None or t_from_head is None: - print("Error: Cannot find the latest commit for {}".format(file_path)) + logging.error("Cannot find the latest commit for %s", file_path) return o_from_t = get_origin_from_trans(opath, t_from_head) if o_from_t is None: - print("Error: Cannot find the latest origin commit for {}".format(file_path)) + logging.error("Error: Cannot find the latest origin commit for %s", file_path) return if o_from_head["hash"] == o_from_t["hash"]: - if flag_p_uf: - print("No update needed for {}".format(file_path)) - return + logging.debug("No update needed for %s", file_path) else: - print("{}".format(file_path), end="\t") + logging.info(file_path) commits = get_commits_count_between( opath, o_from_t["hash"], o_from_head["hash"] ) - print("({} commits)".format(len(commits))) - if flag_p_c: - for commit in commits: - msg = pretty_output(commit) - if "Merge tag" not in msg: - print("commit", msg) + count = 0 + for commit in commits: + if valid_commit(commit): + logging.info("commit %s", pretty_output(commit)) + count += 1 + logging.info("%d commits needs resolving in total\n", count) + + +def valid_locales(locale): + """Check if the locale is valid or not""" + script_path = os.path.dirname(os.path.abspath(__file__)) + linux_path = os.path.join(script_path, "..") + if not os.path.isdir(f"{linux_path}/Documentation/translations/{locale}"): + raise ArgumentTypeError("Invalid locale: {locale}") + return locale + + +def list_files_with_excluding_folders(folder, exclude_folders, include_suffix): + """List all files with the specified suffix in the folder and its subfolders""" + files = [] + stack = [folder] + + while stack: + pwd = stack.pop() + # filter out the exclude folders + if os.path.basename(pwd) in exclude_folders: + continue + # list all files and folders + for item in os.listdir(pwd): + ab_item = os.path.join(pwd, item) + if os.path.isdir(ab_item): + stack.append(ab_item) + else: + if ab_item.endswith(include_suffix): + files.append(ab_item) + + return files + + +class DmesgFormatter(logging.Formatter): + """Custom dmesg logging formatter""" + def format(self, record): + timestamp = time.time() + formatted_time = f"[{timestamp:>10.6f}]" + log_message = f"{formatted_time} {record.getMessage()}" + return log_message + + +def config_logging(log_level, log_file="checktransupdate.log"): + """configure logging based on the log level""" + # set up the root logger + logger = logging.getLogger() + logger.setLevel(log_level) + + # Create console handler + console_handler = logging.StreamHandler() + console_handler.setLevel(log_level) + + # Create file handler + file_handler = logging.FileHandler(log_file) + file_handler.setLevel(log_level) + + # Create formatter and add it to the handlers + formatter = DmesgFormatter() + console_handler.setFormatter(formatter) + file_handler.setFormatter(formatter) + + # Add the handler to the logger + logger.addHandler(console_handler) + logger.addHandler(file_handler) def main(): + """Main function of the script""" script_path = os.path.dirname(os.path.abspath(__file__)) linux_path = os.path.join(script_path, "..") @@ -134,62 +202,62 @@ def main(): parser.add_argument( "-l", "--locale", + default="zh_CN", + type=valid_locales, help="Locale to check when files are not specified", ) + parser.add_argument( - "--print-commits", + "--print-missing-translations", action=BooleanOptionalAction, default=True, - help="Print commits between the origin and the translation", + help="Print files that do not have translations", ) parser.add_argument( - "--print-updated-files", - action=BooleanOptionalAction, - default=False, - help="Print files that do no need to be updated", - ) + '--log', + default='INFO', + choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'], + help='Set the logging level') parser.add_argument( - "--debug", - action=BooleanOptionalAction, - help="Print debug information", - default=False, - ) + '--logfile', + default='checktransupdate.log', + help='Set the logging file (default: checktransupdate.log)') parser.add_argument( "files", nargs="*", help="Files to check, if not specified, check all files" ) args = parser.parse_args() - global flag_p_c, flag_p_uf, flag_debug - flag_p_c = args.print_commits - flag_p_uf = args.print_updated_files - flag_debug = args.debug + # Configure logging based on the --log argument + log_level = getattr(logging, args.log.upper(), logging.INFO) + config_logging(log_level) - # get files related to linux path + # Get files related to linux path files = args.files if len(files) == 0: - if args.locale is not None: - files = ( - os.popen( - "find {}/Documentation/translations/{} -type f".format( - linux_path, args.locale - ) - ) - .read() - .split("\n") - ) - else: - files = ( - os.popen( - "find {}/Documentation/translations -type f".format(linux_path) - ) - .read() - .split("\n") - ) + offical_files = list_files_with_excluding_folders( + os.path.join(linux_path, "Documentation"), ["translations", "output"], "rst" + ) + + for file in offical_files: + # split the path into parts + path_parts = file.split(os.sep) + # find the index of the "Documentation" directory + kindex = path_parts.index("Documentation") + # insert the translations and locale after the Documentation directory + new_path_parts = path_parts[:kindex + 1] + ["translations", args.locale] \ + + path_parts[kindex + 1 :] + # join the path parts back together + new_file = os.sep.join(new_path_parts) + if os.path.isfile(new_file): + files.append(new_file) + else: + if args.print_missing_translations: + logging.info(os.path.relpath(os.path.abspath(file), linux_path)) + logging.info("No translation in the locale of %s\n", args.locale) - files = list(filter(lambda x: x != "", files)) files = list(map(lambda x: os.path.relpath(os.path.abspath(x), linux_path), files)) # cd to linux root directory diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl index ee1aed7e090c..5ac02e198737 100755 --- a/scripts/get_maintainer.pl +++ b/scripts/get_maintainer.pl @@ -54,6 +54,7 @@ my $output_section_maxlen = 50; my $scm = 0; my $tree = 1; my $web = 0; +my $bug = 0; my $subsystem = 0; my $status = 0; my $letters = ""; @@ -271,6 +272,7 @@ if (!GetOptions( 'scm!' => \$scm, 'tree!' => \$tree, 'web!' => \$web, + 'bug!' => \$bug, 'letters=s' => \$letters, 'pattern-depth=i' => \$pattern_depth, 'k|keywords!' => \$keywords, @@ -320,13 +322,14 @@ if ($sections || $letters ne "") { $status = 0; $subsystem = 0; $web = 0; + $bug = 0; $keywords = 0; $keywords_in_file = 0; $interactive = 0; } else { - my $selections = $email + $scm + $status + $subsystem + $web; + my $selections = $email + $scm + $status + $subsystem + $web + $bug; if ($selections == 0) { - die "$P: Missing required option: email, scm, status, subsystem or web\n"; + die "$P: Missing required option: email, scm, status, subsystem, web or bug\n"; } } @@ -631,6 +634,7 @@ my %hash_list_to; my @list_to = (); my @scm = (); my @web = (); +my @bug = (); my @subsystem = (); my @status = (); my %deduplicate_name_hash = (); @@ -662,6 +666,11 @@ if ($web) { output(@web); } +if ($bug) { + @bug = uniq(@bug); + output(@bug); +} + exit($exit); sub self_test { @@ -847,6 +856,7 @@ sub get_maintainers { @list_to = (); @scm = (); @web = (); + @bug = (); @subsystem = (); @status = (); %deduplicate_name_hash = (); @@ -1069,6 +1079,7 @@ MAINTAINER field selection options: --status => print status if any --subsystem => print subsystem name if any --web => print website(s) if any + --bug => print bug reporting info if any Output type options: --separator [, ] => separator for multiple entries on 1 line @@ -1382,6 +1393,8 @@ sub add_categories { push(@scm, $pvalue . $suffix); } elsif ($ptype eq "W") { push(@web, $pvalue . $suffix); + } elsif ($ptype eq "B") { + push(@bug, $pvalue . $suffix); } elsif ($ptype eq "S") { push(@status, $pvalue . $suffix); } diff --git a/scripts/sphinx-pre-install b/scripts/sphinx-pre-install index c1121f098542..ad9945ccb0cf 100755 --- a/scripts/sphinx-pre-install +++ b/scripts/sphinx-pre-install @@ -300,8 +300,6 @@ sub check_sphinx() } $cur_version = get_sphinx_version($sphinx); - die ("$sphinx returned an error") if (!$cur_version); - die "$sphinx didn't return its version" if (!$cur_version); if ($cur_version lt $min_version) { diff --git a/tools/arch/x86/kcpuid/cpuid.csv b/tools/arch/x86/kcpuid/cpuid.csv index e0c25b75327e..d751eb8585d0 100644 --- a/tools/arch/x86/kcpuid/cpuid.csv +++ b/tools/arch/x86/kcpuid/cpuid.csv @@ -1,451 +1,1053 @@ +# SPDX-License-Identifier: CC0-1.0 +# Generator: x86-cpuid-db v1.0 + +# +# Auto-generated file. +# Please submit all updates and bugfixes to https://x86-cpuid.org +# + # The basic row format is: -# LEAF, SUBLEAF, register_name, bits, short_name, long_description +# LEAF, SUBLEAVES, reg, bits, short_name , long_description -# Leaf 00H - 0, 0, EAX, 31:0, max_basic_leafs, Max input value for supported subleafs +# Leaf 0H +# Maximum standard leaf number + CPU vendor string -# Leaf 01H - 1, 0, EAX, 3:0, stepping, Stepping ID - 1, 0, EAX, 7:4, model, Model - 1, 0, EAX, 11:8, family, Family ID - 1, 0, EAX, 13:12, processor, Processor Type - 1, 0, EAX, 19:16, model_ext, Extended Model ID - 1, 0, EAX, 27:20, family_ext, Extended Family ID + 0, 0, eax, 31:0, max_std_leaf , Highest cpuid standard leaf supported + 0, 0, ebx, 31:0, cpu_vendorid_0 , CPU vendor ID string bytes 0 - 3 + 0, 0, ecx, 31:0, cpu_vendorid_2 , CPU vendor ID string bytes 8 - 11 + 0, 0, edx, 31:0, cpu_vendorid_1 , CPU vendor ID string bytes 4 - 7 - 1, 0, EBX, 7:0, brand, Brand Index - 1, 0, EBX, 15:8, clflush_size, CLFLUSH line size (value * 8) in bytes - 1, 0, EBX, 23:16, max_cpu_id, Maxim number of addressable logic cpu in this package - 1, 0, EBX, 31:24, apic_id, Initial APIC ID +# Leaf 1H +# CPU FMS (Family/Model/Stepping) + standard feature flags - 1, 0, ECX, 0, sse3, Streaming SIMD Extensions 3(SSE3) - 1, 0, ECX, 1, pclmulqdq, PCLMULQDQ instruction supported - 1, 0, ECX, 2, dtes64, DS area uses 64-bit layout - 1, 0, ECX, 3, mwait, MONITOR/MWAIT supported - 1, 0, ECX, 4, ds_cpl, CPL Qualified Debug Store which allows for branch message storage qualified by CPL - 1, 0, ECX, 5, vmx, Virtual Machine Extensions supported - 1, 0, ECX, 6, smx, Safer Mode Extension supported - 1, 0, ECX, 7, eist, Enhanced Intel SpeedStep Technology - 1, 0, ECX, 8, tm2, Thermal Monitor 2 - 1, 0, ECX, 9, ssse3, Supplemental Streaming SIMD Extensions 3 (SSSE3) - 1, 0, ECX, 10, l1_ctx_id, L1 data cache could be set to either adaptive mode or shared mode (check IA32_MISC_ENABLE bit 24 definition) - 1, 0, ECX, 11, sdbg, IA32_DEBUG_INTERFACE MSR for silicon debug supported - 1, 0, ECX, 12, fma, FMA extensions using YMM state supported - 1, 0, ECX, 13, cmpxchg16b, 'CMPXCHG16B - Compare and Exchange Bytes' supported - 1, 0, ECX, 14, xtpr_update, xTPR Update Control supported - 1, 0, ECX, 15, pdcm, Perfmon and Debug Capability present - 1, 0, ECX, 17, pcid, Process-Context Identifiers feature present - 1, 0, ECX, 18, dca, Prefetching data from a memory mapped device supported - 1, 0, ECX, 19, sse4_1, SSE4.1 feature present - 1, 0, ECX, 20, sse4_2, SSE4.2 feature present - 1, 0, ECX, 21, x2apic, x2APIC supported - 1, 0, ECX, 22, movbe, MOVBE instruction supported - 1, 0, ECX, 23, popcnt, POPCNT instruction supported - 1, 0, ECX, 24, tsc_deadline_timer, LAPIC supports one-shot operation using a TSC deadline value - 1, 0, ECX, 25, aesni, AESNI instruction supported - 1, 0, ECX, 26, xsave, XSAVE/XRSTOR processor extended states (XSETBV/XGETBV/XCR0) - 1, 0, ECX, 27, osxsave, OS has set CR4.OSXSAVE bit to enable XSETBV/XGETBV/XCR0 - 1, 0, ECX, 28, avx, AVX instruction supported - 1, 0, ECX, 29, f16c, 16-bit floating-point conversion instruction supported - 1, 0, ECX, 30, rdrand, RDRAND instruction supported + 1, 0, eax, 3:0, stepping , Stepping ID + 1, 0, eax, 7:4, base_model , Base CPU model ID + 1, 0, eax, 11:8, base_family_id , Base CPU family ID + 1, 0, eax, 13:12, cpu_type , CPU type + 1, 0, eax, 19:16, ext_model , Extended CPU model ID + 1, 0, eax, 27:20, ext_family , Extended CPU family ID + 1, 0, ebx, 7:0, brand_id , Brand index + 1, 0, ebx, 15:8, clflush_size , CLFLUSH instruction cache line size + 1, 0, ebx, 23:16, n_logical_cpu , Logical CPU (HW threads) count + 1, 0, ebx, 31:24, local_apic_id , Initial local APIC physical ID + 1, 0, ecx, 0, pni , Streaming SIMD Extensions 3 (SSE3) + 1, 0, ecx, 1, pclmulqdq , PCLMULQDQ instruction support + 1, 0, ecx, 2, dtes64 , 64-bit DS save area + 1, 0, ecx, 3, monitor , MONITOR/MWAIT support + 1, 0, ecx, 4, ds_cpl , CPL Qualified Debug Store + 1, 0, ecx, 5, vmx , Virtual Machine Extensions + 1, 0, ecx, 6, smx , Safer Mode Extensions + 1, 0, ecx, 7, est , Enhanced Intel SpeedStep + 1, 0, ecx, 8, tm2 , Thermal Monitor 2 + 1, 0, ecx, 9, ssse3 , Supplemental SSE3 + 1, 0, ecx, 10, cid , L1 Context ID + 1, 0, ecx, 11, sdbg , Sillicon Debug + 1, 0, ecx, 12, fma , FMA extensions using YMM state + 1, 0, ecx, 13, cx16 , CMPXCHG16B instruction support + 1, 0, ecx, 14, xtpr , xTPR Update Control + 1, 0, ecx, 15, pdcm , Perfmon and Debug Capability + 1, 0, ecx, 17, pcid , Process-context identifiers + 1, 0, ecx, 18, dca , Direct Cache Access + 1, 0, ecx, 19, sse4_1 , SSE4.1 + 1, 0, ecx, 20, sse4_2 , SSE4.2 + 1, 0, ecx, 21, x2apic , X2APIC support + 1, 0, ecx, 22, movbe , MOVBE instruction support + 1, 0, ecx, 23, popcnt , POPCNT instruction support + 1, 0, ecx, 24, tsc_deadline_timer , APIC timer one-shot operation + 1, 0, ecx, 25, aes , AES instructions + 1, 0, ecx, 26, xsave , XSAVE (and related instructions) support + 1, 0, ecx, 27, osxsave , XSAVE (and related instructions) are enabled by OS + 1, 0, ecx, 28, avx , AVX instructions support + 1, 0, ecx, 29, f16c , Half-precision floating-point conversion support + 1, 0, ecx, 30, rdrand , RDRAND instruction support + 1, 0, ecx, 31, guest_status , System is running as guest; (para-)virtualized system + 1, 0, edx, 0, fpu , Floating-Point Unit on-chip (x87) + 1, 0, edx, 1, vme , Virtual-8086 Mode Extensions + 1, 0, edx, 2, de , Debugging Extensions + 1, 0, edx, 3, pse , Page Size Extension + 1, 0, edx, 4, tsc , Time Stamp Counter + 1, 0, edx, 5, msr , Model-Specific Registers (RDMSR and WRMSR support) + 1, 0, edx, 6, pae , Physical Address Extensions + 1, 0, edx, 7, mce , Machine Check Exception + 1, 0, edx, 8, cx8 , CMPXCHG8B instruction + 1, 0, edx, 9, apic , APIC on-chip + 1, 0, edx, 11, sep , SYSENTER, SYSEXIT, and associated MSRs + 1, 0, edx, 12, mtrr , Memory Type Range Registers + 1, 0, edx, 13, pge , Page Global Extensions + 1, 0, edx, 14, mca , Machine Check Architecture + 1, 0, edx, 15, cmov , Conditional Move Instruction + 1, 0, edx, 16, pat , Page Attribute Table + 1, 0, edx, 17, pse36 , Page Size Extension (36-bit) + 1, 0, edx, 18, pn , Processor Serial Number + 1, 0, edx, 19, clflush , CLFLUSH instruction + 1, 0, edx, 21, dts , Debug Store + 1, 0, edx, 22, acpi , Thermal monitor and clock control + 1, 0, edx, 23, mmx , MMX instructions + 1, 0, edx, 24, fxsr , FXSAVE and FXRSTOR instructions + 1, 0, edx, 25, sse , SSE instructions + 1, 0, edx, 26, sse2 , SSE2 instructions + 1, 0, edx, 27, ss , Self Snoop + 1, 0, edx, 28, ht , Hyper-threading + 1, 0, edx, 29, tm , Thermal Monitor + 1, 0, edx, 30, ia64 , Legacy IA-64 (Itanium) support bit, now resreved + 1, 0, edx, 31, pbe , Pending Break Enable - 1, 0, EDX, 0, fpu, x87 FPU on chip - 1, 0, EDX, 1, vme, Virtual-8086 Mode Enhancement - 1, 0, EDX, 2, de, Debugging Extensions - 1, 0, EDX, 3, pse, Page Size Extensions - 1, 0, EDX, 4, tsc, Time Stamp Counter - 1, 0, EDX, 5, msr, RDMSR and WRMSR Support - 1, 0, EDX, 6, pae, Physical Address Extensions - 1, 0, EDX, 7, mce, Machine Check Exception - 1, 0, EDX, 8, cx8, CMPXCHG8B instr - 1, 0, EDX, 9, apic, APIC on Chip - 1, 0, EDX, 11, sep, SYSENTER and SYSEXIT instrs - 1, 0, EDX, 12, mtrr, Memory Type Range Registers - 1, 0, EDX, 13, pge, Page Global Bit - 1, 0, EDX, 14, mca, Machine Check Architecture - 1, 0, EDX, 15, cmov, Conditional Move Instrs - 1, 0, EDX, 16, pat, Page Attribute Table - 1, 0, EDX, 17, pse36, 36-Bit Page Size Extension - 1, 0, EDX, 18, psn, Processor Serial Number - 1, 0, EDX, 19, clflush, CLFLUSH instr -# 1, 0, EDX, 20, - 1, 0, EDX, 21, ds, Debug Store - 1, 0, EDX, 22, acpi, Thermal Monitor and Software Controlled Clock Facilities - 1, 0, EDX, 23, mmx, Intel MMX Technology - 1, 0, EDX, 24, fxsr, XSAVE and FXRSTOR Instrs - 1, 0, EDX, 25, sse, SSE - 1, 0, EDX, 26, sse2, SSE2 - 1, 0, EDX, 27, ss, Self Snoop - 1, 0, EDX, 28, hit, Max APIC IDs - 1, 0, EDX, 29, tm, Thermal Monitor -# 1, 0, EDX, 30, - 1, 0, EDX, 31, pbe, Pending Break Enable +# Leaf 2H +# Intel cache and TLB information one-byte descriptors -# Leaf 02H -# cache and TLB descriptor info + 2, 0, eax, 7:0, iteration_count , Number of times this CPUD leaf must be queried + 2, 0, eax, 15:8, desc1 , Descriptor #1 + 2, 0, eax, 23:16, desc2 , Descriptor #2 + 2, 0, eax, 30:24, desc3 , Descriptor #3 + 2, 0, eax, 31, eax_invalid , Descriptors 1-3 are invalid if set + 2, 0, ebx, 7:0, desc4 , Descriptor #4 + 2, 0, ebx, 15:8, desc5 , Descriptor #5 + 2, 0, ebx, 23:16, desc6 , Descriptor #6 + 2, 0, ebx, 30:24, desc7 , Descriptor #7 + 2, 0, ebx, 31, ebx_invalid , Descriptors 4-7 are invalid if set + 2, 0, ecx, 7:0, desc8 , Descriptor #8 + 2, 0, ecx, 15:8, desc9 , Descriptor #9 + 2, 0, ecx, 23:16, desc10 , Descriptor #10 + 2, 0, ecx, 30:24, desc11 , Descriptor #11 + 2, 0, ecx, 31, ecx_invalid , Descriptors 8-11 are invalid if set + 2, 0, edx, 7:0, desc12 , Descriptor #12 + 2, 0, edx, 15:8, desc13 , Descriptor #13 + 2, 0, edx, 23:16, desc14 , Descriptor #14 + 2, 0, edx, 30:24, desc15 , Descriptor #15 + 2, 0, edx, 31, edx_invalid , Descriptors 12-15 are invalid if set -# Leaf 03H -# Precessor Serial Number, introduced on Pentium III, not valid for -# latest models +# Leaf 4H +# Intel deterministic cache parameters -# Leaf 04H -# thread/core and cache topology - 4, 0, EAX, 4:0, cache_type, Cache type like instr/data or unified - 4, 0, EAX, 7:5, cache_level, Cache Level (starts at 1) - 4, 0, EAX, 8, cache_self_init, Cache Self Initialization - 4, 0, EAX, 9, fully_associate, Fully Associative cache -# 4, 0, EAX, 13:10, resvd, resvd - 4, 0, EAX, 25:14, max_logical_id, Max number of addressable IDs for logical processors sharing the cache - 4, 0, EAX, 31:26, max_phy_id, Max number of addressable IDs for processors in phy package + 4, 31:0, eax, 4:0, cache_type , Cache type field + 4, 31:0, eax, 7:5, cache_level , Cache level (1-based) + 4, 31:0, eax, 8, cache_self_init , Self-initialializing cache level + 4, 31:0, eax, 9, fully_associative , Fully-associative cache + 4, 31:0, eax, 25:14, num_threads_sharing , Number logical CPUs sharing this cache + 4, 31:0, eax, 31:26, num_cores_on_die , Number of cores in the physical package + 4, 31:0, ebx, 11:0, cache_linesize , System coherency line size (0-based) + 4, 31:0, ebx, 21:12, cache_npartitions , Physical line partitions (0-based) + 4, 31:0, ebx, 31:22, cache_nways , Ways of associativity (0-based) + 4, 31:0, ecx, 30:0, cache_nsets , Cache number of sets (0-based) + 4, 31:0, edx, 0, wbinvd_rll_no_guarantee, WBINVD/INVD not guaranteed for Remote Lower-Level caches + 4, 31:0, edx, 1, ll_inclusive , Cache is inclusive of Lower-Level caches + 4, 31:0, edx, 2, complex_indexing , Not a direct-mapped cache (complex function) - 4, 0, EBX, 11:0, cache_linesize, Size of a cache line in bytes - 4, 0, EBX, 21:12, cache_partition, Physical Line partitions - 4, 0, EBX, 31:22, cache_ways, Ways of associativity - 4, 0, ECX, 31:0, cache_sets, Number of Sets - 1 - 4, 0, EDX, 0, c_wbinvd, 1 means WBINVD/INVD is not ganranteed to act upon lower level caches of non-originating threads sharing this cache - 4, 0, EDX, 1, c_incl, Whether cache is inclusive of lower cache level - 4, 0, EDX, 2, c_comp_index, Complex Cache Indexing +# Leaf 5H +# MONITOR/MWAIT instructions enumeration -# Leaf 05H -# MONITOR/MWAIT - 5, 0, EAX, 15:0, min_mon_size, Smallest monitor line size in bytes - 5, 0, EBX, 15:0, max_mon_size, Largest monitor line size in bytes - 5, 0, ECX, 0, mwait_ext, Enum of Monitor-Mwait extensions supported - 5, 0, ECX, 1, mwait_irq_break, Largest monitor line size in bytes - 5, 0, EDX, 3:0, c0_sub_stats, Number of C0* sub C-states supported using MWAIT - 5, 0, EDX, 7:4, c1_sub_stats, Number of C1* sub C-states supported using MWAIT - 5, 0, EDX, 11:8, c2_sub_stats, Number of C2* sub C-states supported using MWAIT - 5, 0, EDX, 15:12, c3_sub_stats, Number of C3* sub C-states supported using MWAIT - 5, 0, EDX, 19:16, c4_sub_stats, Number of C4* sub C-states supported using MWAIT - 5, 0, EDX, 23:20, c5_sub_stats, Number of C5* sub C-states supported using MWAIT - 5, 0, EDX, 27:24, c6_sub_stats, Number of C6* sub C-states supported using MWAIT - 5, 0, EDX, 31:28, c7_sub_stats, Number of C7* sub C-states supported using MWAIT + 5, 0, eax, 15:0, min_mon_size , Smallest monitor-line size, in bytes + 5, 0, ebx, 15:0, max_mon_size , Largest monitor-line size, in bytes + 5, 0, ecx, 0, mwait_ext , Enumeration of MONITOR/MWAIT extensions is supported + 5, 0, ecx, 1, mwait_irq_break , Interrupts as a break-event for MWAIT is supported + 5, 0, edx, 3:0, n_c0_substates , Number of C0 sub C-states supported using MWAIT + 5, 0, edx, 7:4, n_c1_substates , Number of C1 sub C-states supported using MWAIT + 5, 0, edx, 11:8, n_c2_substates , Number of C2 sub C-states supported using MWAIT + 5, 0, edx, 15:12, n_c3_substates , Number of C3 sub C-states supported using MWAIT + 5, 0, edx, 19:16, n_c4_substates , Number of C4 sub C-states supported using MWAIT + 5, 0, edx, 23:20, n_c5_substates , Number of C5 sub C-states supported using MWAIT + 5, 0, edx, 27:24, n_c6_substates , Number of C6 sub C-states supported using MWAIT + 5, 0, edx, 31:28, n_c7_substates , Number of C7 sub C-states supported using MWAIT -# Leaf 06H -# Thermal & Power Management +# Leaf 6H +# Thermal and Power Management enumeration - 6, 0, EAX, 0, dig_temp, Digital temperature sensor supported - 6, 0, EAX, 1, turbo, Intel Turbo Boost - 6, 0, EAX, 2, arat, Always running APIC timer -# 6, 0, EAX, 3, resv, Reserved - 6, 0, EAX, 4, pln, Power limit notifications supported - 6, 0, EAX, 5, ecmd, Clock modulation duty cycle extension supported - 6, 0, EAX, 6, ptm, Package thermal management supported - 6, 0, EAX, 7, hwp, HWP base register - 6, 0, EAX, 8, hwp_notify, HWP notification - 6, 0, EAX, 9, hwp_act_window, HWP activity window - 6, 0, EAX, 10, hwp_energy, HWP energy performance preference - 6, 0, EAX, 11, hwp_pkg_req, HWP package level request -# 6, 0, EAX, 12, resv, Reserved - 6, 0, EAX, 13, hdc, HDC base registers supported - 6, 0, EAX, 14, turbo3, Turbo Boost Max 3.0 - 6, 0, EAX, 15, hwp_cap, Highest Performance change supported - 6, 0, EAX, 16, hwp_peci, HWP PECI override is supported - 6, 0, EAX, 17, hwp_flex, Flexible HWP is supported - 6, 0, EAX, 18, hwp_fast, Fast access mode for the IA32_HWP_REQUEST MSR is supported -# 6, 0, EAX, 19, resv, Reserved - 6, 0, EAX, 20, hwp_ignr, Ignoring Idle Logical Processor HWP request is supported + 6, 0, eax, 0, dtherm , Digital temprature sensor + 6, 0, eax, 1, turbo_boost , Intel Turbo Boost + 6, 0, eax, 2, arat , Always-Running APIC Timer (not affected by p-state) + 6, 0, eax, 4, pln , Power Limit Notification (PLN) event + 6, 0, eax, 5, ecmd , Clock modulation duty cycle extension + 6, 0, eax, 6, pts , Package thermal management + 6, 0, eax, 7, hwp , HWP (Hardware P-states) base registers are supported + 6, 0, eax, 8, hwp_notify , HWP notification (IA32_HWP_INTERRUPT MSR) + 6, 0, eax, 9, hwp_act_window , HWP activity window (IA32_HWP_REQUEST[bits 41:32]) supported + 6, 0, eax, 10, hwp_epp , HWP Energy Performance Preference + 6, 0, eax, 11, hwp_pkg_req , HWP Package Level Request + 6, 0, eax, 13, hdc_base_regs , HDC base registers are supported + 6, 0, eax, 14, turbo_boost_3_0 , Intel Turbo Boost Max 3.0 + 6, 0, eax, 15, hwp_capabilities , HWP Highest Performance change + 6, 0, eax, 16, hwp_peci_override , HWP PECI override + 6, 0, eax, 17, hwp_flexible , Flexible HWP + 6, 0, eax, 18, hwp_fast , IA32_HWP_REQUEST MSR fast access mode + 6, 0, eax, 19, hfi , HW_FEEDBACK MSRs supported + 6, 0, eax, 20, hwp_ignore_idle , Ignoring idle logical CPU HWP req is supported + 6, 0, eax, 23, thread_director , Intel thread director support + 6, 0, eax, 24, therm_interrupt_bit25 , IA32_THERM_INTERRUPT MSR bit 25 is supported + 6, 0, ebx, 3:0, n_therm_thresholds , Digital thermometer thresholds + 6, 0, ecx, 0, aperfmperf , MPERF/APERF MSRs (effective frequency interface) + 6, 0, ecx, 3, epb , IA32_ENERGY_PERF_BIAS MSR support + 6, 0, ecx, 15:8, thrd_director_nclasses , Number of classes, Intel thread director + 6, 0, edx, 0, perfcap_reporting , Performance capability reporting + 6, 0, edx, 1, encap_reporting , Energy efficiency capability reporting + 6, 0, edx, 11:8, feedback_sz , HW feedback interface struct size, in 4K pages + 6, 0, edx, 31:16, this_lcpu_hwfdbk_idx , This logical CPU index @ HW feedback struct, 0-based - 6, 0, EBX, 3:0, therm_irq_thresh, Number of Interrupt Thresholds in Digital Thermal Sensor - 6, 0, ECX, 0, aperfmperf, Presence of IA32_MPERF and IA32_APERF - 6, 0, ECX, 3, energ_bias, Performance-energy bias preference supported +# Leaf 7H +# Extended CPU features enumeration -# Leaf 07H -# ECX == 0 -# AVX512 refers to https://en.wikipedia.org/wiki/AVX-512 -# XXX: Do we really need to enumerate each and every AVX512 sub features + 7, 0, eax, 31:0, leaf7_n_subleaves , Number of cpuid 0x7 subleaves + 7, 0, ebx, 0, fsgsbase , FSBASE/GSBASE read/write support + 7, 0, ebx, 1, tsc_adjust , IA32_TSC_ADJUST MSR supported + 7, 0, ebx, 2, sgx , Intel SGX (Software Guard Extensions) + 7, 0, ebx, 3, bmi1 , Bit manipulation extensions group 1 + 7, 0, ebx, 4, hle , Hardware Lock Elision + 7, 0, ebx, 5, avx2 , AVX2 instruction set + 7, 0, ebx, 6, fdp_excptn_only , FPU Data Pointer updated only on x87 exceptions + 7, 0, ebx, 7, smep , Supervisor Mode Execution Protection + 7, 0, ebx, 8, bmi2 , Bit manipulation extensions group 2 + 7, 0, ebx, 9, erms , Enhanced REP MOVSB/STOSB + 7, 0, ebx, 10, invpcid , INVPCID instruction (Invalidate Processor Context ID) + 7, 0, ebx, 11, rtm , Intel restricted transactional memory + 7, 0, ebx, 12, cqm , Intel RDT-CMT / AMD Platform-QoS cache monitoring + 7, 0, ebx, 13, zero_fcs_fds , Deprecated FPU CS/DS (stored as zero) + 7, 0, ebx, 14, mpx , Intel memory protection extensions + 7, 0, ebx, 15, rdt_a , Intel RDT / AMD Platform-QoS Enforcemeent + 7, 0, ebx, 16, avx512f , AVX-512 foundation instructions + 7, 0, ebx, 17, avx512dq , AVX-512 double/quadword instructions + 7, 0, ebx, 18, rdseed , RDSEED instruction + 7, 0, ebx, 19, adx , ADCX/ADOX instructions + 7, 0, ebx, 20, smap , Supervisor mode access prevention + 7, 0, ebx, 21, avx512ifma , AVX-512 integer fused multiply add + 7, 0, ebx, 23, clflushopt , CLFLUSHOPT instruction + 7, 0, ebx, 24, clwb , CLWB instruction + 7, 0, ebx, 25, intel_pt , Intel processor trace + 7, 0, ebx, 26, avx512pf , AVX-512 prefetch instructions + 7, 0, ebx, 27, avx512er , AVX-512 exponent/reciprocal instrs + 7, 0, ebx, 28, avx512cd , AVX-512 conflict detection instrs + 7, 0, ebx, 29, sha_ni , SHA/SHA256 instructions + 7, 0, ebx, 30, avx512bw , AVX-512 BW (byte/word granular) instructions + 7, 0, ebx, 31, avx512vl , AVX-512 VL (128/256 vector length) extensions + 7, 0, ecx, 0, prefetchwt1 , PREFETCHWT1 (Intel Xeon Phi only) + 7, 0, ecx, 1, avx512vbmi , AVX-512 Vector byte manipulation instrs + 7, 0, ecx, 2, umip , User mode instruction protection + 7, 0, ecx, 3, pku , Protection keys for user-space + 7, 0, ecx, 4, ospke , OS protection keys enable + 7, 0, ecx, 5, waitpkg , WAITPKG instructions + 7, 0, ecx, 6, avx512_vbmi2 , AVX-512 vector byte manipulation instrs group 2 + 7, 0, ecx, 7, cet_ss , CET shadow stack features + 7, 0, ecx, 8, gfni , Galois field new instructions + 7, 0, ecx, 9, vaes , Vector AES instrs + 7, 0, ecx, 10, vpclmulqdq , VPCLMULQDQ 256-bit instruction support + 7, 0, ecx, 11, avx512_vnni , Vector neural network instructions + 7, 0, ecx, 12, avx512_bitalg , AVX-512 bit count/shiffle + 7, 0, ecx, 13, tme , Intel total memory encryption + 7, 0, ecx, 14, avx512_vpopcntdq , AVX-512: POPCNT for vectors of DW/QW + 7, 0, ecx, 16, la57 , 57-bit linear addreses (five-level paging) + 7, 0, ecx, 21:17, mawau_val_lm , BNDLDX/BNDSTX MAWAU value in 64-bit mode + 7, 0, ecx, 22, rdpid , RDPID instruction + 7, 0, ecx, 23, key_locker , Intel key locker support + 7, 0, ecx, 24, bus_lock_detect , OS bus-lock detection + 7, 0, ecx, 25, cldemote , CLDEMOTE instruction + 7, 0, ecx, 27, movdiri , MOVDIRI instruction + 7, 0, ecx, 28, movdir64b , MOVDIR64B instruction + 7, 0, ecx, 29, enqcmd , Enqueue stores supported (ENQCMD{,S}) + 7, 0, ecx, 30, sgx_lc , Intel SGX launch configuration + 7, 0, ecx, 31, pks , Protection keys for supervisor-mode pages + 7, 0, edx, 1, sgx_keys , Intel SGX attestation services + 7, 0, edx, 2, avx512_4vnniw , AVX-512 neural network instructions + 7, 0, edx, 3, avx512_4fmaps , AVX-512 multiply accumulation single precision + 7, 0, edx, 4, fsrm , Fast short REP MOV + 7, 0, edx, 5, uintr , CPU supports user interrupts + 7, 0, edx, 8, avx512_vp2intersect , VP2INTERSECT{D,Q} instructions + 7, 0, edx, 9, srdbs_ctrl , SRBDS mitigation MSR available + 7, 0, edx, 10, md_clear , VERW MD_CLEAR microcode support + 7, 0, edx, 11, rtm_always_abort , XBEGIN (RTM transaction) always aborts + 7, 0, edx, 13, tsx_force_abort , MSR TSX_FORCE_ABORT, RTM_ABORT bit, supported + 7, 0, edx, 14, serialize , SERIALIZE instruction + 7, 0, edx, 15, hybrid_cpu , The CPU is identified as a 'hybrid part' + 7, 0, edx, 16, tsxldtrk , TSX suspend/resume load address tracking + 7, 0, edx, 18, pconfig , PCONFIG instruction + 7, 0, edx, 19, arch_lbr , Intel architectural LBRs + 7, 0, edx, 20, ibt , CET indirect branch tracking + 7, 0, edx, 22, amx_bf16 , AMX-BF16: tile bfloat16 support + 7, 0, edx, 23, avx512_fp16 , AVX-512 FP16 instructions + 7, 0, edx, 24, amx_tile , AMX-TILE: tile architecture support + 7, 0, edx, 25, amx_int8 , AMX-INT8: tile 8-bit integer support + 7, 0, edx, 26, spec_ctrl , Speculation Control (IBRS/IBPB: indirect branch restrictions) + 7, 0, edx, 27, intel_stibp , Single thread indirect branch predictors + 7, 0, edx, 28, flush_l1d , FLUSH L1D cache: IA32_FLUSH_CMD MSR + 7, 0, edx, 29, arch_capabilities , Intel IA32_ARCH_CAPABILITIES MSR + 7, 0, edx, 30, core_capabilities , IA32_CORE_CAPABILITIES MSR + 7, 0, edx, 31, spec_ctrl_ssbd , Speculative store bypass disable + 7, 1, eax, 4, avx_vnni , AVX-VNNI instructions + 7, 1, eax, 5, avx512_bf16 , AVX-512 bFloat16 instructions + 7, 1, eax, 6, lass , Linear address space separation + 7, 1, eax, 7, cmpccxadd , CMPccXADD instructions + 7, 1, eax, 8, arch_perfmon_ext , ArchPerfmonExt: CPUID leaf 0x23 is supported + 7, 1, eax, 10, fzrm , Fast zero-length REP MOVSB + 7, 1, eax, 11, fsrs , Fast short REP STOSB + 7, 1, eax, 12, fsrc , Fast Short REP CMPSB/SCASB + 7, 1, eax, 17, fred , FRED: Flexible return and event delivery transitions + 7, 1, eax, 18, lkgs , LKGS: Load 'kernel' (userspace) GS + 7, 1, eax, 19, wrmsrns , WRMSRNS instr (WRMSR-non-serializing) + 7, 1, eax, 21, amx_fp16 , AMX-FP16: FP16 tile operations + 7, 1, eax, 22, hreset , History reset support + 7, 1, eax, 23, avx_ifma , Integer fused multiply add + 7, 1, eax, 26, lam , Linear address masking + 7, 1, eax, 27, rd_wr_msrlist , RDMSRLIST/WRMSRLIST instructions + 7, 1, ebx, 0, intel_ppin , Protected processor inventory number (PPIN{,_CTL} MSRs) + 7, 1, edx, 4, avx_vnni_int8 , AVX-VNNI-INT8 instructions + 7, 1, edx, 5, avx_ne_convert , AVX-NE-CONVERT instructions + 7, 1, edx, 8, amx_complex , AMX-COMPLEX instructions (starting from Granite Rapids) + 7, 1, edx, 14, prefetchit_0_1 , PREFETCHIT0/1 instructions + 7, 1, edx, 18, cet_sss , CET supervisor shadow stacks safe to use + 7, 2, edx, 0, intel_psfd , Intel predictive store forward disable + 7, 2, edx, 1, ipred_ctrl , MSR bits IA32_SPEC_CTRL.IPRED_DIS_{U,S} + 7, 2, edx, 2, rrsba_ctrl , MSR bits IA32_SPEC_CTRL.RRSBA_DIS_{U,S} + 7, 2, edx, 3, ddp_ctrl , MSR bit IA32_SPEC_CTRL.DDPD_U + 7, 2, edx, 4, bhi_ctrl , MSR bit IA32_SPEC_CTRL.BHI_DIS_S + 7, 2, edx, 5, mcdt_no , MCDT mitigation not needed + 7, 2, edx, 6, uclock_disable , UC-lock disable is supported - 7, 0, EBX, 0, fsgsbase, RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE supported - 7, 0, EBX, 1, tsc_adjust, TSC_ADJUST MSR supported - 7, 0, EBX, 2, sgx, Software Guard Extensions - 7, 0, EBX, 3, bmi1, BMI1 - 7, 0, EBX, 4, hle, Hardware Lock Elision - 7, 0, EBX, 5, avx2, AVX2 -# 7, 0, EBX, 6, fdp_excp_only, x87 FPU Data Pointer updated only on x87 exceptions - 7, 0, EBX, 7, smep, Supervisor-Mode Execution Prevention - 7, 0, EBX, 8, bmi2, BMI2 - 7, 0, EBX, 9, rep_movsb, Enhanced REP MOVSB/STOSB - 7, 0, EBX, 10, invpcid, INVPCID instruction - 7, 0, EBX, 11, rtm, Restricted Transactional Memory - 7, 0, EBX, 12, rdt_m, Intel RDT Monitoring capability - 7, 0, EBX, 13, depc_fpu_cs_ds, Deprecates FPU CS and FPU DS - 7, 0, EBX, 14, mpx, Memory Protection Extensions - 7, 0, EBX, 15, rdt_a, Intel RDT Allocation capability - 7, 0, EBX, 16, avx512f, AVX512 Foundation instr - 7, 0, EBX, 17, avx512dq, AVX512 Double and Quadword AVX512 instr - 7, 0, EBX, 18, rdseed, RDSEED instr - 7, 0, EBX, 19, adx, ADX instr - 7, 0, EBX, 20, smap, Supervisor Mode Access Prevention - 7, 0, EBX, 21, avx512ifma, AVX512 Integer Fused Multiply Add -# 7, 0, EBX, 22, resvd, resvd - 7, 0, EBX, 23, clflushopt, CLFLUSHOPT instr - 7, 0, EBX, 24, clwb, CLWB instr - 7, 0, EBX, 25, intel_pt, Intel Processor Trace instr - 7, 0, EBX, 26, avx512pf, Prefetch - 7, 0, EBX, 27, avx512er, AVX512 Exponent Reciproca instr - 7, 0, EBX, 28, avx512cd, AVX512 Conflict Detection instr - 7, 0, EBX, 29, sha, Intel Secure Hash Algorithm Extensions instr - 7, 0, EBX, 30, avx512bw, AVX512 Byte & Word instr - 7, 0, EBX, 31, avx512vl, AVX512 Vector Length Extentions (VL) - 7, 0, ECX, 0, prefetchwt1, X - 7, 0, ECX, 1, avx512vbmi, AVX512 Vector Byte Manipulation Instructions - 7, 0, ECX, 2, umip, User-mode Instruction Prevention +# Leaf 9H +# Intel DCA (Direct Cache Access) enumeration - 7, 0, ECX, 3, pku, Protection Keys for User-mode pages - 7, 0, ECX, 4, ospke, CR4 PKE set to enable protection keys -# 7, 0, ECX, 16:5, resvd, resvd - 7, 0, ECX, 21:17, mawau, The value of MAWAU used by the BNDLDX and BNDSTX instructions in 64-bit mode - 7, 0, ECX, 22, rdpid, RDPID and IA32_TSC_AUX -# 7, 0, ECX, 29:23, resvd, resvd - 7, 0, ECX, 30, sgx_lc, SGX Launch Configuration -# 7, 0, ECX, 31, resvd, resvd + 9, 0, eax, 0, dca_enabled_in_bios , DCA is enabled in BIOS -# Leaf 08H -# +# Leaf AH +# Intel PMU (Performance Monitoring Unit) enumeration + 0xa, 0, eax, 7:0, pmu_version , Performance monitoring unit version ID + 0xa, 0, eax, 15:8, pmu_n_gcounters , Number of general PMU counters per logical CPU + 0xa, 0, eax, 23:16, pmu_gcounters_nbits , Bitwidth of PMU general counters + 0xa, 0, eax, 31:24, pmu_cpuid_ebx_bits , Length of cpuid leaf 0xa EBX bit vector + 0xa, 0, ebx, 0, no_core_cycle_evt , Core cycle event not available + 0xa, 0, ebx, 1, no_insn_retired_evt , Instruction retired event not available + 0xa, 0, ebx, 2, no_refcycle_evt , Reference cycles event not available + 0xa, 0, ebx, 3, no_llc_ref_evt , LLC-reference event not available + 0xa, 0, ebx, 4, no_llc_miss_evt , LLC-misses event not available + 0xa, 0, ebx, 5, no_br_insn_ret_evt , Branch instruction retired event not available + 0xa, 0, ebx, 6, no_br_mispredict_evt , Branch mispredict retired event not available + 0xa, 0, ebx, 7, no_td_slots_evt , Topdown slots event not available + 0xa, 0, ecx, 31:0, pmu_fcounters_bitmap , Fixed-function PMU counters support bitmap + 0xa, 0, edx, 4:0, pmu_n_fcounters , Number of fixed PMU counters + 0xa, 0, edx, 12:5, pmu_fcounters_nbits , Bitwidth of PMU fixed counters + 0xa, 0, edx, 15, anythread_depr , AnyThread deprecation -# Leaf 09H -# Direct Cache Access (DCA) information - 9, 0, ECX, 31:0, dca_cap, The value of IA32_PLATFORM_DCA_CAP +# Leaf BH +# CPUs v1 extended topology enumeration -# Leaf 0AH -# Architectural Performance Monitoring -# -# Do we really need to print out the PMU related stuff? -# Does normal user really care about it? -# - 0xA, 0, EAX, 7:0, pmu_ver, Performance Monitoring Unit version - 0xA, 0, EAX, 15:8, pmu_gp_cnt_num, Numer of general-purose PMU counters per logical CPU - 0xA, 0, EAX, 23:16, pmu_cnt_bits, Bit wideth of PMU counter - 0xA, 0, EAX, 31:24, pmu_ebx_bits, Length of EBX bit vector to enumerate PMU events + 0xb, 1:0, eax, 4:0, x2apic_id_shift , Bit width of this level (previous levels inclusive) + 0xb, 1:0, ebx, 15:0, domain_lcpus_count , Logical CPUs count across all instances of this domain + 0xb, 1:0, ecx, 7:0, domain_nr , This domain level (subleaf ID) + 0xb, 1:0, ecx, 15:8, domain_type , This domain type + 0xb, 1:0, edx, 31:0, x2apic_id , x2APIC ID of current logical CPU - 0xA, 0, EBX, 0, pmu_no_core_cycle_evt, Core cycle event not available - 0xA, 0, EBX, 1, pmu_no_instr_ret_evt, Instruction retired event not available - 0xA, 0, EBX, 2, pmu_no_ref_cycle_evt, Reference cycles event not available - 0xA, 0, EBX, 3, pmu_no_llc_ref_evt, Last-level cache reference event not available - 0xA, 0, EBX, 4, pmu_no_llc_mis_evt, Last-level cache misses event not available - 0xA, 0, EBX, 5, pmu_no_br_instr_ret_evt, Branch instruction retired event not available - 0xA, 0, EBX, 6, pmu_no_br_mispredict_evt, Branch mispredict retired event not available +# Leaf DH +# Processor extended state enumeration - 0xA, 0, ECX, 4:0, pmu_fixed_cnt_num, Performance Monitoring Unit version - 0xA, 0, ECX, 12:5, pmu_fixed_cnt_bits, Numer of PMU counters per logical CPU + 0xd, 0, eax, 0, xcr0_x87 , XCR0.X87 (bit 0) supported + 0xd, 0, eax, 1, xcr0_sse , XCR0.SEE (bit 1) supported + 0xd, 0, eax, 2, xcr0_avx , XCR0.AVX (bit 2) supported + 0xd, 0, eax, 3, xcr0_mpx_bndregs , XCR0.BNDREGS (bit 3) supported (MPX BND0-BND3 regs) + 0xd, 0, eax, 4, xcr0_mpx_bndcsr , XCR0.BNDCSR (bit 4) supported (MPX BNDCFGU/BNDSTATUS regs) + 0xd, 0, eax, 5, xcr0_avx512_opmask , XCR0.OPMASK (bit 5) supported (AVX-512 k0-k7 regs) + 0xd, 0, eax, 6, xcr0_avx512_zmm_hi256 , XCR0.ZMM_Hi256 (bit 6) supported (AVX-512 ZMM0->ZMM7/15 regs) + 0xd, 0, eax, 7, xcr0_avx512_hi16_zmm , XCR0.HI16_ZMM (bit 7) supported (AVX-512 ZMM16->ZMM31 regs) + 0xd, 0, eax, 9, xcr0_pkru , XCR0.PKRU (bit 9) supported (XSAVE PKRU reg) + 0xd, 0, eax, 11, xcr0_cet_u , AMD XCR0.CET_U (bit 11) supported (CET supervisor state) + 0xd, 0, eax, 12, xcr0_cet_s , AMD XCR0.CET_S (bit 12) support (CET user state) + 0xd, 0, eax, 17, xcr0_tileconfig , XCR0.TILECONFIG (bit 17) supported (AMX can manage TILECONFIG) + 0xd, 0, eax, 18, xcr0_tiledata , XCR0.TILEDATA (bit 18) supported (AMX can manage TILEDATA) + 0xd, 0, ebx, 31:0, xsave_sz_xcr0_enabled , XSAVE/XRSTR area byte size, for XCR0 enabled features + 0xd, 0, ecx, 31:0, xsave_sz_max , XSAVE/XRSTR area max byte size, all CPU features + 0xd, 0, edx, 30, xcr0_lwp , AMD XCR0.LWP (bit 62) supported (Light-weight Profiling) + 0xd, 1, eax, 0, xsaveopt , XSAVEOPT instruction + 0xd, 1, eax, 1, xsavec , XSAVEC instruction + 0xd, 1, eax, 2, xgetbv1 , XGETBV instruction with ECX = 1 + 0xd, 1, eax, 3, xsaves , XSAVES/XRSTORS instructions (and XSS MSR) + 0xd, 1, eax, 4, xfd , Extended feature disable support + 0xd, 1, ebx, 31:0, xsave_sz_xcr0_xmms_enabled, XSAVE area size, all XCR0 and XMMS features enabled + 0xd, 1, ecx, 8, xss_pt , PT state, supported + 0xd, 1, ecx, 10, xss_pasid , PASID state, supported + 0xd, 1, ecx, 11, xss_cet_u , CET user state, supported + 0xd, 1, ecx, 12, xss_cet_p , CET supervisor state, supported + 0xd, 1, ecx, 13, xss_hdc , HDC state, supported + 0xd, 1, ecx, 14, xss_uintr , UINTR state, supported + 0xd, 1, ecx, 15, xss_lbr , LBR state, supported + 0xd, 1, ecx, 16, xss_hwp , HWP state, supported + 0xd, 63:2, eax, 31:0, xsave_sz , Size of save area for subleaf-N feature, in bytes + 0xd, 63:2, ebx, 31:0, xsave_offset , Offset of save area for subleaf-N feature, in bytes + 0xd, 63:2, ecx, 0, is_xss_bit , Subleaf N describes an XSS bit, otherwise XCR0 bit + 0xd, 63:2, ecx, 1, compacted_xsave_64byte_aligned, When compacted, subleaf-N feature xsave area is 64-byte aligned -# Leaf 0BH -# Extended Topology Enumeration Leaf -# +# Leaf FH +# Intel RDT / AMD PQoS resource monitoring - 0xB, 0, EAX, 4:0, id_shift, Number of bits to shift right on x2APIC ID to get a unique topology ID of the next level type - 0xB, 0, EBX, 15:0, cpu_nr, Number of logical processors at this level type - 0xB, 0, ECX, 15:8, lvl_type, 0-Invalid 1-SMT 2-Core - 0xB, 0, EDX, 31:0, x2apic_id, x2APIC ID the current logical processor - - -# Leaf 0DH -# Processor Extended State - - 0xD, 0, EAX, 0, x87, X87 state - 0xD, 0, EAX, 1, sse, SSE state - 0xD, 0, EAX, 2, avx, AVX state - 0xD, 0, EAX, 4:3, mpx, MPX state - 0xD, 0, EAX, 7:5, avx512, AVX-512 state - 0xD, 0, EAX, 9, pkru, PKRU state - - 0xD, 0, EBX, 31:0, max_sz_xcr0, Maximum size (bytes) required by enabled features in XCR0 - 0xD, 0, ECX, 31:0, max_sz_xsave, Maximum size (bytes) of the XSAVE/XRSTOR save area - - 0xD, 1, EAX, 0, xsaveopt, XSAVEOPT available - 0xD, 1, EAX, 1, xsavec, XSAVEC and compacted form supported - 0xD, 1, EAX, 2, xgetbv, XGETBV supported - 0xD, 1, EAX, 3, xsaves, XSAVES/XRSTORS and IA32_XSS supported - - 0xD, 1, EBX, 31:0, max_sz_xcr0, Maximum size (bytes) required by enabled features in XCR0 - 0xD, 1, ECX, 8, pt, PT state - 0xD, 1, ECX, 11, cet_usr, CET user state - 0xD, 1, ECX, 12, cet_supv, CET supervisor state - 0xD, 1, ECX, 13, hdc, HDC state - 0xD, 1, ECX, 16, hwp, HWP state - -# Leaf 0FH -# Intel RDT Monitoring - - 0xF, 0, EBX, 31:0, rmid_range, Maximum range (zero-based) of RMID within this physical processor of all types - 0xF, 0, EDX, 1, l3c_rdt_mon, L3 Cache RDT Monitoring supported - - 0xF, 1, ECX, 31:0, rmid_range, Maximum range (zero-based) of RMID of this types - 0xF, 1, EDX, 0, l3c_ocp_mon, L3 Cache occupancy Monitoring supported - 0xF, 1, EDX, 1, l3c_tbw_mon, L3 Cache Total Bandwidth Monitoring supported - 0xF, 1, EDX, 2, l3c_lbw_mon, L3 Cache Local Bandwidth Monitoring supported + 0xf, 0, ebx, 31:0, core_rmid_max , RMID max, within this core, all types (0-based) + 0xf, 0, edx, 1, cqm_llc , LLC QoS-monitoring supported + 0xf, 1, eax, 7:0, l3c_qm_bitwidth , L3 QoS-monitoring counter bitwidth (24-based) + 0xf, 1, eax, 8, l3c_qm_overflow_bit , QM_CTR MSR bit 61 is an overflow bit + 0xf, 1, ebx, 31:0, l3c_qm_conver_factor , QM_CTR MSR conversion factor to bytes + 0xf, 1, ecx, 31:0, l3c_qm_rmid_max , L3 QoS-monitoring max RMID + 0xf, 1, edx, 0, cqm_occup_llc , L3 QoS occupancy monitoring supported + 0xf, 1, edx, 1, cqm_mbm_total , L3 QoS total bandwidth monitoring supported + 0xf, 1, edx, 2, cqm_mbm_local , L3 QoS local bandwidth monitoring supported # Leaf 10H -# Intel RDT Allocation - - 0x10, 0, EBX, 1, l3c_rdt_alloc, L3 Cache Allocation supported - 0x10, 0, EBX, 2, l2c_rdt_alloc, L2 Cache Allocation supported - 0x10, 0, EBX, 3, mem_bw_alloc, Memory Bandwidth Allocation supported +# Intel RDT / AMD PQoS allocation enumeration + 0x10, 0, ebx, 1, cat_l3 , L3 Cache Allocation Technology supported + 0x10, 0, ebx, 2, cat_l2 , L2 Cache Allocation Technology supported + 0x10, 0, ebx, 3, mba , Memory Bandwidth Allocation supported + 0x10, 2:1, eax, 4:0, cat_cbm_len , L3/L2_CAT capacity bitmask length, minus-one notation + 0x10, 2:1, ebx, 31:0, cat_units_bitmap , L3/L2_CAT bitmap of allocation units + 0x10, 2:1, ecx, 1, l3_cat_cos_infreq_updates, L3_CAT COS updates should be infrequent + 0x10, 2:1, ecx, 2, cdp_l3 , L3/L2_CAT CDP (Code and Data Prioritization) + 0x10, 2:1, ecx, 3, cat_sparse_1s , L3/L2_CAT non-contiguous 1s value supported + 0x10, 2:1, edx, 15:0, cat_cos_max , L3/L2_CAT max COS (Class of Service) supported + 0x10, 3, eax, 11:0, mba_max_delay , Max MBA throttling value; minus-one notation + 0x10, 3, ecx, 0, per_thread_mba , Per-thread MBA controls are supported + 0x10, 3, ecx, 2, mba_delay_linear , Delay values are linear + 0x10, 3, edx, 15:0, mba_cos_max , MBA max Class of Service supported # Leaf 12H -# SGX Capability -# -# Some detailed SGX features not added yet - - 0x12, 0, EAX, 0, sgx1, L3 Cache Allocation supported - 0x12, 1, EAX, 0, sgx2, L3 Cache Allocation supported +# Intel Software Guard Extensions (SGX) enumeration + 0x12, 0, eax, 0, sgx1 , SGX1 leaf functions supported + 0x12, 0, eax, 1, sgx2 , SGX2 leaf functions supported + 0x12, 0, eax, 5, enclv_leaves , ENCLV leaves (E{INC,DEC}VIRTCHILD, ESETCONTEXT) supported + 0x12, 0, eax, 6, encls_leaves , ENCLS leaves (ENCLS ETRACKC, ERDINFO, ELDBC, ELDUC) supported + 0x12, 0, eax, 7, enclu_everifyreport2 , ENCLU leaf EVERIFYREPORT2 supported + 0x12, 0, eax, 10, encls_eupdatesvn , ENCLS leaf EUPDATESVN supported + 0x12, 0, eax, 11, sgx_edeccssa , ENCLU leaf EDECCSSA supported + 0x12, 0, ebx, 0, miscselect_exinfo , SSA.MISC frame: reporting #PF and #GP exceptions inside enclave supported + 0x12, 0, ebx, 1, miscselect_cpinfo , SSA.MISC frame: reporting #CP exceptions inside enclave supported + 0x12, 0, edx, 7:0, max_enclave_sz_not64 , Maximum enclave size in non-64-bit mode (log2) + 0x12, 0, edx, 15:8, max_enclave_sz_64 , Maximum enclave size in 64-bit mode (log2) + 0x12, 1, eax, 0, secs_attr_init , ATTRIBUTES.INIT supported (enclave initialized by EINIT) + 0x12, 1, eax, 1, secs_attr_debug , ATTRIBUTES.DEBUG supported (enclave permits debugger read/write) + 0x12, 1, eax, 2, secs_attr_mode64bit , ATTRIBUTES.MODE64BIT supported (enclave runs in 64-bit mode) + 0x12, 1, eax, 4, secs_attr_provisionkey , ATTRIBUTES.PROVISIONKEY supported (provisioning key available) + 0x12, 1, eax, 5, secs_attr_einittoken_key, ATTRIBUTES.EINITTOKEN_KEY supported (EINIT token key available) + 0x12, 1, eax, 6, secs_attr_cet , ATTRIBUTES.CET supported (enable CET attributes) + 0x12, 1, eax, 7, secs_attr_kss , ATTRIBUTES.KSS supported (Key Separation and Sharing enabled) + 0x12, 1, eax, 10, secs_attr_aexnotify , ATTRIBUTES.AEXNOTIFY supported (enclave threads may get AEX notifications + 0x12, 1, ecx, 0, xfrm_x87 , Enclave XFRM.X87 (bit 0) supported + 0x12, 1, ecx, 1, xfrm_sse , Enclave XFRM.SEE (bit 1) supported + 0x12, 1, ecx, 2, xfrm_avx , Enclave XFRM.AVX (bit 2) supported + 0x12, 1, ecx, 3, xfrm_mpx_bndregs , Enclave XFRM.BNDREGS (bit 3) supported (MPX BND0-BND3 regs) + 0x12, 1, ecx, 4, xfrm_mpx_bndcsr , Enclave XFRM.BNDCSR (bit 4) supported (MPX BNDCFGU/BNDSTATUS regs) + 0x12, 1, ecx, 5, xfrm_avx512_opmask , Enclave XFRM.OPMASK (bit 5) supported (AVX-512 k0-k7 regs) + 0x12, 1, ecx, 6, xfrm_avx512_zmm_hi256 , Enclave XFRM.ZMM_Hi256 (bit 6) supported (AVX-512 ZMM0->ZMM7/15 regs) + 0x12, 1, ecx, 7, xfrm_avx512_hi16_zmm , Enclave XFRM.HI16_ZMM (bit 7) supported (AVX-512 ZMM16->ZMM31 regs) + 0x12, 1, ecx, 9, xfrm_pkru , Enclave XFRM.PKRU (bit 9) supported (XSAVE PKRU reg) + 0x12, 1, ecx, 17, xfrm_tileconfig , Enclave XFRM.TILECONFIG (bit 17) supported (AMX can manage TILECONFIG) + 0x12, 1, ecx, 18, xfrm_tiledata , Enclave XFRM.TILEDATA (bit 18) supported (AMX can manage TILEDATA) + 0x12, 31:2, eax, 3:0, subleaf_type , Subleaf type (dictates output layout) + 0x12, 31:2, eax, 31:12, epc_sec_base_addr_0 , EPC section base addr, bits[12:31] + 0x12, 31:2, ebx, 19:0, epc_sec_base_addr_1 , EPC section base addr, bits[32:51] + 0x12, 31:2, ecx, 3:0, epc_sec_type , EPC section type / property encoding + 0x12, 31:2, ecx, 31:12, epc_sec_size_0 , EPC section size, bits[12:31] + 0x12, 31:2, edx, 19:0, epc_sec_size_1 , EPC section size, bits[32:51] # Leaf 14H -# Intel Processor Tracer -# +# Intel Processor Trace enumeration + + 0x14, 0, eax, 31:0, pt_max_subleaf , Max cpuid 0x14 subleaf + 0x14, 0, ebx, 0, cr3_filtering , IA32_RTIT_CR3_MATCH is accessible + 0x14, 0, ebx, 1, psb_cyc , Configurable PSB and cycle-accurate mode + 0x14, 0, ebx, 2, ip_filtering , IP/TraceStop filtering; Warm-reset PT MSRs preservation + 0x14, 0, ebx, 3, mtc_timing , MTC timing packet; COFI-based packets suppression + 0x14, 0, ebx, 4, ptwrite , PTWRITE support + 0x14, 0, ebx, 5, power_event_trace , Power Event Trace support + 0x14, 0, ebx, 6, psb_pmi_preserve , PSB and PMI preservation support + 0x14, 0, ebx, 7, event_trace , Event Trace packet generation through IA32_RTIT_CTL.EventEn + 0x14, 0, ebx, 8, tnt_disable , TNT packet generation disable through IA32_RTIT_CTL.DisTNT + 0x14, 0, ecx, 0, topa_output , ToPA output scheme support + 0x14, 0, ecx, 1, topa_multiple_entries , ToPA tables can hold multiple entries + 0x14, 0, ecx, 2, single_range_output , Single-range output scheme supported + 0x14, 0, ecx, 3, trance_transport_output, Trace Transport subsystem output support + 0x14, 0, ecx, 31, ip_payloads_lip , IP payloads have LIP values (CS base included) + 0x14, 1, eax, 2:0, num_address_ranges , Filtering number of configurable Address Ranges + 0x14, 1, eax, 31:16, mtc_periods_bmp , Bitmap of supported MTC period encodings + 0x14, 1, ebx, 15:0, cycle_thresholds_bmp , Bitmap of supported Cycle Threshold encodings + 0x14, 1, ebx, 31:16, psb_periods_bmp , Bitmap of supported Configurable PSB frequency encodings # Leaf 15H -# Time Stamp Counter and Nominal Core Crystal Clock Information +# Intel TSC (Time Stamp Counter) enumeration - 0x15, 0, EAX, 31:0, tsc_denominator, The denominator of the TSC/”core crystal clock” ratio - 0x15, 0, EBX, 31:0, tsc_numerator, The numerator of the TSC/”core crystal clock” ratio - 0x15, 0, ECX, 31:0, nom_freq, Nominal frequency of the core crystal clock in Hz + 0x15, 0, eax, 31:0, tsc_denominator , Denominator of the TSC/'core crystal clock' ratio + 0x15, 0, ebx, 31:0, tsc_numerator , Numerator of the TSC/'core crystal clock' ratio + 0x15, 0, ecx, 31:0, cpu_crystal_hz , Core crystal clock nominal frequency, in Hz # Leaf 16H -# Processor Frequency Information +# Intel processor fequency enumeration - 0x16, 0, EAX, 15:0, cpu_base_freq, Processor Base Frequency in MHz - 0x16, 0, EBX, 15:0, cpu_max_freq, Maximum Frequency in MHz - 0x16, 0, ECX, 15:0, bus_freq, Bus (Reference) Frequency in MHz + 0x16, 0, eax, 15:0, cpu_base_mhz , Processor base frequency, in MHz + 0x16, 0, ebx, 15:0, cpu_max_mhz , Processor max frequency, in MHz + 0x16, 0, ecx, 15:0, bus_mhz , Bus reference frequency, in MHz # Leaf 17H -# System-On-Chip Vendor Attribute +# Intel SoC vendor attributes enumeration - 0x17, 0, EAX, 31:0, max_socid, Maximum input value of supported sub-leaf - 0x17, 0, EBX, 15:0, soc_vid, SOC Vendor ID - 0x17, 0, EBX, 16, std_vid, SOC Vendor ID is assigned via an industry standard scheme - 0x17, 0, ECX, 31:0, soc_pid, SOC Project ID assigned by vendor - 0x17, 0, EDX, 31:0, soc_sid, SOC Stepping ID + 0x17, 0, eax, 31:0, soc_max_subleaf , Max cpuid leaf 0x17 subleaf + 0x17, 0, ebx, 15:0, soc_vendor_id , SoC vendor ID + 0x17, 0, ebx, 16, is_vendor_scheme , Assigned by industry enumaeratoion scheme (not Intel) + 0x17, 0, ecx, 31:0, soc_proj_id , SoC project ID, assigned by vendor + 0x17, 0, edx, 31:0, soc_stepping_id , Soc project stepping ID, assigned by vendor + 0x17, 3:1, eax, 31:0, vendor_brand_a , Vendor Brand ID string, bytes subleaf_nr * (0 -> 3) + 0x17, 3:1, ebx, 31:0, vendor_brand_b , Vendor Brand ID string, bytes subleaf_nr * (4 -> 7) + 0x17, 3:1, ecx, 31:0, vendor_brand_c , Vendor Brand ID string, bytes subleaf_nr * (8 -> 11) + 0x17, 3:1, edx, 31:0, vendor_brand_d , Vendor Brand ID string, bytes subleaf_nr * (12 -> 15) # Leaf 18H -# Deterministic Address Translation Parameters +# Intel determenestic address translation (TLB) parameters + 0x18, 31:0, eax, 31:0, tlb_max_subleaf , Max cpuid 0x18 subleaf + 0x18, 31:0, ebx, 0, tlb_4k_page , TLB 4KB-page entries supported + 0x18, 31:0, ebx, 1, tlb_2m_page , TLB 2MB-page entries supported + 0x18, 31:0, ebx, 2, tlb_4m_page , TLB 4MB-page entries supported + 0x18, 31:0, ebx, 3, tlb_1g_page , TLB 1GB-page entries supported + 0x18, 31:0, ebx, 10:8, hard_partitioning , (Hard/Soft) partitioning between logical CPUs sharing this struct + 0x18, 31:0, ebx, 31:16, n_way_associative , Ways of associativity + 0x18, 31:0, ecx, 31:0, n_sets , Number of sets + 0x18, 31:0, edx, 4:0, tlb_type , Translation cache type (TLB type) + 0x18, 31:0, edx, 7:5, tlb_cache_level , Translation cache level (1-based) + 0x18, 31:0, edx, 8, is_fully_associative , Fully-associative structure + 0x18, 31:0, edx, 25:14, tlb_max_addressible_ids, Max num of addressible IDs for logical CPUs sharing this TLB - 1 # Leaf 19H -# Key Locker Leaf +# Intel Key Locker enumeration + 0x19, 0, eax, 0, kl_cpl0_only , CPL0-only key Locker restriction supported + 0x19, 0, eax, 1, kl_no_encrypt , No-encrypt key locker restriction supported + 0x19, 0, eax, 2, kl_no_decrypt , No-decrypt key locker restriction supported + 0x19, 0, ebx, 0, aes_keylocker , AES key locker instructions supported + 0x19, 0, ebx, 2, aes_keylocker_wide , AES wide key locker instructions supported + 0x19, 0, ebx, 4, kl_msr_iwkey , Key locker MSRs and IWKEY backups supported + 0x19, 0, ecx, 0, loadiwkey_no_backup , LOADIWKEY NoBackup parameter supported + 0x19, 0, ecx, 1, iwkey_rand , IWKEY randomization (KeySource encoding 1) supported # Leaf 1AH -# Hybrid Information +# Intel hybrid CPUs identification (e.g. Atom, Core) - 0x1A, 0, EAX, 31:24, core_type, 20H-Intel_Atom 40H-Intel_Core + 0x1a, 0, eax, 23:0, core_native_model , This core's native model ID + 0x1a, 0, eax, 31:24, core_type , This core's type +# Leaf 1BH +# Intel PCONFIG (Platform configuration) enumeration + + 0x1b, 31:0, eax, 11:0, pconfig_subleaf_type , CPUID 0x1b subleaf type + 0x1b, 31:0, ebx, 31:0, pconfig_target_id_x , A supported PCONFIG target ID + 0x1b, 31:0, ecx, 31:0, pconfig_target_id_y , A supported PCONFIG target ID + 0x1b, 31:0, edx, 31:0, pconfig_target_id_z , A supported PCONFIG target ID + +# Leaf 1CH +# Intel LBR (Last Branch Record) enumeration + + 0x1c, 0, eax, 0, lbr_depth_8 , Max stack depth (number of LBR entries) = 8 + 0x1c, 0, eax, 1, lbr_depth_16 , Max stack depth (number of LBR entries) = 16 + 0x1c, 0, eax, 2, lbr_depth_24 , Max stack depth (number of LBR entries) = 24 + 0x1c, 0, eax, 3, lbr_depth_32 , Max stack depth (number of LBR entries) = 32 + 0x1c, 0, eax, 4, lbr_depth_40 , Max stack depth (number of LBR entries) = 40 + 0x1c, 0, eax, 5, lbr_depth_48 , Max stack depth (number of LBR entries) = 48 + 0x1c, 0, eax, 6, lbr_depth_56 , Max stack depth (number of LBR entries) = 56 + 0x1c, 0, eax, 7, lbr_depth_64 , Max stack depth (number of LBR entries) = 64 + 0x1c, 0, eax, 30, lbr_deep_c_reset , LBRs maybe cleared on MWAIT C-state > C1 + 0x1c, 0, eax, 31, lbr_ip_is_lip , LBR IP contain Last IP, otherwise effective IP + 0x1c, 0, ebx, 0, lbr_cpl , CPL filtering (non-zero IA32_LBR_CTL[2:1]) supported + 0x1c, 0, ebx, 1, lbr_branch_filter , Branch filtering (non-zero IA32_LBR_CTL[22:16]) supported + 0x1c, 0, ebx, 2, lbr_call_stack , Call-stack mode (IA32_LBR_CTL[3] = 1) supported + 0x1c, 0, ecx, 0, lbr_mispredict , Branch misprediction bit supported (IA32_LBR_x_INFO[63]) + 0x1c, 0, ecx, 1, lbr_timed_lbr , Timed LBRs (CPU cycles since last LBR entry) supported + 0x1c, 0, ecx, 2, lbr_branch_type , Branch type field (IA32_LBR_INFO_x[59:56]) supported + 0x1c, 0, ecx, 19:16, lbr_events_gpc_bmp , LBR PMU-events logging support; bitmap for first 4 GP (general-purpose) Counters + +# Leaf 1DH +# Intel AMX (Advanced Matrix Extensions) tile information + + 0x1d, 0, eax, 31:0, amx_max_palette , Highest palette ID / subleaf ID + 0x1d, 1, eax, 15:0, amx_palette_size , AMX palette total tiles size, in bytes + 0x1d, 1, eax, 31:16, amx_tile_size , AMX single tile's size, in bytes + 0x1d, 1, ebx, 15:0, amx_tile_row_size , AMX tile single row's size, in bytes + 0x1d, 1, ebx, 31:16, amx_palette_nr_tiles , AMX palette number of tiles + 0x1d, 1, ecx, 15:0, amx_tile_nr_rows , AMX tile max number of rows + +# Leaf 1EH +# Intel AMX, TMUL (Tile-matrix MULtiply) accelerator unit enumeration + + 0x1e, 0, ebx, 7:0, tmul_maxk , TMUL unit maximum height, K (rows or columns) + 0x1e, 0, ebx, 23:8, tmul_maxn , TMUL unit maxiumum SIMD dimension, N (column bytes) # Leaf 1FH -# V2 Extended Topology - A preferred superset to leaf 0BH +# Intel extended topology enumeration v2 + 0x1f, 5:0, eax, 4:0, x2apic_id_shift , Bit width of this level (previous levels inclusive) + 0x1f, 5:0, ebx, 15:0, domain_lcpus_count , Logical CPUs count across all instances of this domain + 0x1f, 5:0, ecx, 7:0, domain_level , This domain level (subleaf ID) + 0x1f, 5:0, ecx, 15:8, domain_type , This domain type + 0x1f, 5:0, edx, 31:0, x2apic_id , x2APIC ID of current logical CPU -# According to SDM -# 40000000H - 4FFFFFFFH is invalid range +# Leaf 20H +# Intel HRESET (History Reset) enumeration + + 0x20, 0, eax, 31:0, hreset_nr_subleaves , CPUID 0x20 max subleaf + 1 + 0x20, 0, ebx, 0, hreset_thread_director , HRESET of Intel thread director is supported + +# Leaf 21H +# Intel TD (Trust Domain) guest execution environment enumeration + + 0x21, 0, ebx, 31:0, tdx_vendorid_0 , TDX vendor ID string bytes 0 - 3 + 0x21, 0, ecx, 31:0, tdx_vendorid_2 , CPU vendor ID string bytes 8 - 11 + 0x21, 0, edx, 31:0, tdx_vendorid_1 , CPU vendor ID string bytes 4 - 7 + +# Leaf 23H +# Intel Architectural Performance Monitoring Extended (ArchPerfmonExt) + + 0x23, 0, eax, 1, subleaf_1_counters , Subleaf 1, PMU counters bitmaps, is valid + 0x23, 0, eax, 3, subleaf_3_events , Subleaf 3, PMU events bitmaps, is valid + 0x23, 0, ebx, 0, unitmask2 , IA32_PERFEVTSELx MSRs UnitMask2 is supported + 0x23, 0, ebx, 1, zbit , IA32_PERFEVTSELx MSRs Z-bit is supported + 0x23, 1, eax, 31:0, pmu_gp_counters_bitmap , General-purpose PMU counters bitmap + 0x23, 1, ebx, 31:0, pmu_f_counters_bitmap , Fixed PMU counters bitmap + 0x23, 3, eax, 0, core_cycles_evt , Core cycles event supported + 0x23, 3, eax, 1, insn_retired_evt , Instructions retired event supported + 0x23, 3, eax, 2, ref_cycles_evt , Reference cycles event supported + 0x23, 3, eax, 3, llc_refs_evt , Last-level cache references event supported + 0x23, 3, eax, 4, llc_misses_evt , Last-level cache misses event supported + 0x23, 3, eax, 5, br_insn_ret_evt , Branch instruction retired event supported + 0x23, 3, eax, 6, br_mispr_evt , Branch mispredict retired event supported + 0x23, 3, eax, 7, td_slots_evt , Topdown slots event supported + 0x23, 3, eax, 8, td_backend_bound_evt , Topdown backend bound event supported + 0x23, 3, eax, 9, td_bad_spec_evt , Topdown bad speculation event supported + 0x23, 3, eax, 10, td_frontend_bound_evt , Topdown frontend bound event supported + 0x23, 3, eax, 11, td_retiring_evt , Topdown retiring event support + +# Leaf 40000000H +# Maximum hypervisor standard leaf + hypervisor vendor string + +0x40000000, 0, eax, 31:0, max_hyp_leaf , Maximum hypervisor standard leaf number +0x40000000, 0, ebx, 31:0, hypervisor_id_0 , Hypervisor ID string bytes 0 - 3 +0x40000000, 0, ecx, 31:0, hypervisor_id_1 , Hypervisor ID string bytes 4 - 7 +0x40000000, 0, edx, 31:0, hypervisor_id_2 , Hypervisor ID string bytes 8 - 11 + +# Leaf 80000000H +# Maximum extended leaf number + CPU vendor string (AMD) + +0x80000000, 0, eax, 31:0, max_ext_leaf , Maximum extended cpuid leaf supported +0x80000000, 0, ebx, 31:0, cpu_vendorid_0 , Vendor ID string bytes 0 - 3 +0x80000000, 0, ecx, 31:0, cpu_vendorid_2 , Vendor ID string bytes 8 - 11 +0x80000000, 0, edx, 31:0, cpu_vendorid_1 , Vendor ID string bytes 4 - 7 # Leaf 80000001H -# Extended Processor Signature and Feature Bits +# Extended CPU feature identifiers -0x80000001, 0, EAX, 27:20, extfamily, Extended family -0x80000001, 0, EAX, 19:16, extmodel, Extended model -0x80000001, 0, EAX, 11:8, basefamily, Description of Family -0x80000001, 0, EAX, 11:8, basemodel, Model numbers vary with product -0x80000001, 0, EAX, 3:0, stepping, Processor stepping (revision) for a specific model +0x80000001, 0, eax, 3:0, e_stepping_id , Stepping ID +0x80000001, 0, eax, 7:4, e_base_model , Base processor model +0x80000001, 0, eax, 11:8, e_base_family , Base processor family +0x80000001, 0, eax, 19:16, e_ext_model , Extended processor model +0x80000001, 0, eax, 27:20, e_ext_family , Extended processor family +0x80000001, 0, ebx, 15:0, brand_id , Brand ID +0x80000001, 0, ebx, 31:28, pkg_type , Package type +0x80000001, 0, ecx, 0, lahf_lm , LAHF and SAHF in 64-bit mode +0x80000001, 0, ecx, 1, cmp_legacy , Multi-processing legacy mode (No HT) +0x80000001, 0, ecx, 2, svm , Secure Virtual Machine +0x80000001, 0, ecx, 3, extapic , Extended APIC space +0x80000001, 0, ecx, 4, cr8_legacy , LOCK MOV CR0 means MOV CR8 +0x80000001, 0, ecx, 5, abm , LZCNT advanced bit manipulation +0x80000001, 0, ecx, 6, sse4a , SSE4A support +0x80000001, 0, ecx, 7, misalignsse , Misaligned SSE mode +0x80000001, 0, ecx, 8, 3dnowprefetch , 3DNow PREFETCH/PREFETCHW support +0x80000001, 0, ecx, 9, osvw , OS visible workaround +0x80000001, 0, ecx, 10, ibs , Instruction based sampling +0x80000001, 0, ecx, 11, xop , XOP: extended operation (AVX instructions) +0x80000001, 0, ecx, 12, skinit , SKINIT/STGI support +0x80000001, 0, ecx, 13, wdt , Watchdog timer support +0x80000001, 0, ecx, 15, lwp , Lightweight profiling +0x80000001, 0, ecx, 16, fma4 , 4-operand FMA instruction +0x80000001, 0, ecx, 17, tce , Translation cache extension +0x80000001, 0, ecx, 19, nodeid_msr , NodeId MSR (0xc001100c) +0x80000001, 0, ecx, 21, tbm , Trailing bit manipulations +0x80000001, 0, ecx, 22, topoext , Topology Extensions (cpuid leaf 0x8000001d) +0x80000001, 0, ecx, 23, perfctr_core , Core performance counter extensions +0x80000001, 0, ecx, 24, perfctr_nb , NB/DF performance counter extensions +0x80000001, 0, ecx, 26, bpext , Data access breakpoint extension +0x80000001, 0, ecx, 27, ptsc , Performance time-stamp counter +0x80000001, 0, ecx, 28, perfctr_llc , LLC (L3) performance counter extensions +0x80000001, 0, ecx, 29, mwaitx , MWAITX/MONITORX support +0x80000001, 0, ecx, 30, addr_mask_ext , Breakpoint address mask extension (to bit 31) +0x80000001, 0, edx, 0, e_fpu , Floating-Point Unit on-chip (x87) +0x80000001, 0, edx, 1, e_vme , Virtual-8086 Mode Extensions +0x80000001, 0, edx, 2, e_de , Debugging Extensions +0x80000001, 0, edx, 3, e_pse , Page Size Extension +0x80000001, 0, edx, 4, e_tsc , Time Stamp Counter +0x80000001, 0, edx, 5, e_msr , Model-Specific Registers (RDMSR and WRMSR support) +0x80000001, 0, edx, 6, pae , Physical Address Extensions +0x80000001, 0, edx, 7, mce , Machine Check Exception +0x80000001, 0, edx, 8, cx8 , CMPXCHG8B instruction +0x80000001, 0, edx, 9, apic , APIC on-chip +0x80000001, 0, edx, 11, syscall , SYSCALL and SYSRET instructions +0x80000001, 0, edx, 12, mtrr , Memory Type Range Registers +0x80000001, 0, edx, 13, pge , Page Global Extensions +0x80000001, 0, edx, 14, mca , Machine Check Architecture +0x80000001, 0, edx, 15, cmov , Conditional Move Instruction +0x80000001, 0, edx, 16, pat , Page Attribute Table +0x80000001, 0, edx, 17, pse36 , Page Size Extension (36-bit) +0x80000001, 0, edx, 19, mp , Out-of-spec AMD Multiprocessing bit +0x80000001, 0, edx, 20, nx , No-execute page protection +0x80000001, 0, edx, 22, mmxext , AMD MMX extensions +0x80000001, 0, edx, 24, e_fxsr , FXSAVE and FXRSTOR instructions +0x80000001, 0, edx, 25, fxsr_opt , FXSAVE and FXRSTOR optimizations +0x80000001, 0, edx, 26, pdpe1gb , 1-GB large page support +0x80000001, 0, edx, 27, rdtscp , RDTSCP instruction +0x80000001, 0, edx, 29, lm , Long mode (x86-64, 64-bit support) +0x80000001, 0, edx, 30, 3dnowext , AMD 3DNow extensions +0x80000001, 0, edx, 31, 3dnow , 3DNow instructions -0x80000001, 0, EBX, 31:28, pkgtype, Specifies the package type +# Leaf 80000002H +# CPU brand ID string, bytes 0 - 15 -0x80000001, 0, ECX, 0, lahf_lm, LAHF/SAHF available in 64-bit mode -0x80000001, 0, ECX, 1, cmplegacy, Core multi-processing legacy mode -0x80000001, 0, ECX, 2, svm, Indicates support for: VMRUN, VMLOAD, VMSAVE, CLGI, VMMCALL, and INVLPGA -0x80000001, 0, ECX, 3, extapicspace, Extended APIC register space -0x80000001, 0, ECX, 4, altmovecr8, Indicates support for LOCK MOV CR0 means MOV CR8 -0x80000001, 0, ECX, 5, lzcnt, LZCNT -0x80000001, 0, ECX, 6, sse4a, EXTRQ, INSERTQ, MOVNTSS, and MOVNTSD instruction support -0x80000001, 0, ECX, 7, misalignsse, Misaligned SSE Mode -0x80000001, 0, ECX, 8, prefetchw, PREFETCHW -0x80000001, 0, ECX, 9, osvw, OS Visible Work-around support -0x80000001, 0, ECX, 10, ibs, Instruction Based Sampling -0x80000001, 0, ECX, 11, xop, Extended operation support -0x80000001, 0, ECX, 12, skinit, SKINIT and STGI support -0x80000001, 0, ECX, 13, wdt, Watchdog timer support -0x80000001, 0, ECX, 15, lwp, Lightweight profiling support -0x80000001, 0, ECX, 16, fma4, Four-operand FMA instruction support -0x80000001, 0, ECX, 17, tce, Translation cache extension -0x80000001, 0, ECX, 22, TopologyExtensions, Indicates support for Core::X86::Cpuid::CachePropEax0 and Core::X86::Cpuid::ExtApicId -0x80000001, 0, ECX, 23, perfctrextcore, Indicates support for Core::X86::Msr::PERF_CTL0 - 5 and Core::X86::Msr::PERF_CTR -0x80000001, 0, ECX, 24, perfctrextdf, Indicates support for Core::X86::Msr::DF_PERF_CTL and Core::X86::Msr::DF_PERF_CTR -0x80000001, 0, ECX, 26, databreakpointextension, Indicates data breakpoint support for Core::X86::Msr::DR0_ADDR_MASK, Core::X86::Msr::DR1_ADDR_MASK, Core::X86::Msr::DR2_ADDR_MASK and Core::X86::Msr::DR3_ADDR_MASK -0x80000001, 0, ECX, 27, perftsc, Performance time-stamp counter supported -0x80000001, 0, ECX, 28, perfctrextllc, Indicates support for L3 performance counter extensions -0x80000001, 0, ECX, 29, mwaitextended, MWAITX and MONITORX capability is supported -0x80000001, 0, ECX, 30, admskextn, Indicates support for address mask extension (to 32 bits and to all 4 DRs) for instruction breakpoints +0x80000002, 0, eax, 31:0, cpu_brandid_0 , CPU brand ID string, bytes 0 - 3 +0x80000002, 0, ebx, 31:0, cpu_brandid_1 , CPU brand ID string, bytes 4 - 7 +0x80000002, 0, ecx, 31:0, cpu_brandid_2 , CPU brand ID string, bytes 8 - 11 +0x80000002, 0, edx, 31:0, cpu_brandid_3 , CPU brand ID string, bytes 12 - 15 -0x80000001, 0, EDX, 0, fpu, x87 floating point unit on-chip -0x80000001, 0, EDX, 1, vme, Virtual-mode enhancements -0x80000001, 0, EDX, 2, de, Debugging extensions, IO breakpoints, CR4.DE -0x80000001, 0, EDX, 3, pse, Page-size extensions (4 MB pages) -0x80000001, 0, EDX, 4, tsc, Time stamp counter, RDTSC/RDTSCP instructions, CR4.TSD -0x80000001, 0, EDX, 5, msr, Model-specific registers (MSRs), with RDMSR and WRMSR instructions -0x80000001, 0, EDX, 6, pae, Physical-address extensions (PAE) -0x80000001, 0, EDX, 7, mce, Machine Check Exception, CR4.MCE -0x80000001, 0, EDX, 8, cmpxchg8b, CMPXCHG8B instruction -0x80000001, 0, EDX, 9, apic, advanced programmable interrupt controller (APIC) exists and is enabled -0x80000001, 0, EDX, 11, sysret, SYSCALL/SYSRET supported -0x80000001, 0, EDX, 12, mtrr, Memory-type range registers -0x80000001, 0, EDX, 13, pge, Page global extension, CR4.PGE -0x80000001, 0, EDX, 14, mca, Machine check architecture, MCG_CAP -0x80000001, 0, EDX, 15, cmov, Conditional move instructions, CMOV, FCOMI, FCMOV -0x80000001, 0, EDX, 16, pat, Page attribute table -0x80000001, 0, EDX, 17, pse36, Page-size extensions -0x80000001, 0, EDX, 20, exec_dis, Execute Disable Bit available -0x80000001, 0, EDX, 22, mmxext, AMD extensions to MMX instructions -0x80000001, 0, EDX, 23, mmx, MMX instructions -0x80000001, 0, EDX, 24, fxsr, FXSAVE and FXRSTOR instructions -0x80000001, 0, EDX, 25, ffxsr, FXSAVE and FXRSTOR instruction optimizations -0x80000001, 0, EDX, 26, 1gb_page, 1GB page supported -0x80000001, 0, EDX, 27, rdtscp, RDTSCP and IA32_TSC_AUX are available -0x80000001, 0, EDX, 29, lm, 64b Architecture supported -0x80000001, 0, EDX, 30, threednowext, AMD extensions to 3DNow! instructions -0x80000001, 0, EDX, 31, threednow, 3DNow! instructions +# Leaf 80000003H +# CPU brand ID string, bytes 16 - 31 -# Leaf 80000002H/80000003H/80000004H -# Processor Brand String +0x80000003, 0, eax, 31:0, cpu_brandid_4 , CPU brand ID string bytes, 16 - 19 +0x80000003, 0, ebx, 31:0, cpu_brandid_5 , CPU brand ID string bytes, 20 - 23 +0x80000003, 0, ecx, 31:0, cpu_brandid_6 , CPU brand ID string bytes, 24 - 27 +0x80000003, 0, edx, 31:0, cpu_brandid_7 , CPU brand ID string bytes, 28 - 31 + +# Leaf 80000004H +# CPU brand ID string, bytes 32 - 47 + +0x80000004, 0, eax, 31:0, cpu_brandid_8 , CPU brand ID string, bytes 32 - 35 +0x80000004, 0, ebx, 31:0, cpu_brandid_9 , CPU brand ID string, bytes 36 - 39 +0x80000004, 0, ecx, 31:0, cpu_brandid_10 , CPU brand ID string, bytes 40 - 43 +0x80000004, 0, edx, 31:0, cpu_brandid_11 , CPU brand ID string, bytes 44 - 47 # Leaf 80000005H -# Reserved +# AMD L1 cache and L1 TLB enumeration + +0x80000005, 0, eax, 7:0, l1_itlb_2m_4m_nentries , L1 ITLB #entires, 2M and 4M pages +0x80000005, 0, eax, 15:8, l1_itlb_2m_4m_assoc , L1 ITLB associativity, 2M and 4M pages +0x80000005, 0, eax, 23:16, l1_dtlb_2m_4m_nentries , L1 DTLB #entires, 2M and 4M pages +0x80000005, 0, eax, 31:24, l1_dtlb_2m_4m_assoc , L1 DTLB associativity, 2M and 4M pages +0x80000005, 0, ebx, 7:0, l1_itlb_4k_nentries , L1 ITLB #entries, 4K pages +0x80000005, 0, ebx, 15:8, l1_itlb_4k_assoc , L1 ITLB associativity, 4K pages +0x80000005, 0, ebx, 23:16, l1_dtlb_4k_nentries , L1 DTLB #entries, 4K pages +0x80000005, 0, ebx, 31:24, l1_dtlb_4k_assoc , L1 DTLB associativity, 4K pages +0x80000005, 0, ecx, 7:0, l1_dcache_line_size , L1 dcache line size, in bytes +0x80000005, 0, ecx, 15:8, l1_dcache_nlines , L1 dcache lines per tag +0x80000005, 0, ecx, 23:16, l1_dcache_assoc , L1 dcache associativity +0x80000005, 0, ecx, 31:24, l1_dcache_size_kb , L1 dcache size, in KB +0x80000005, 0, edx, 7:0, l1_icache_line_size , L1 icache line size, in bytes +0x80000005, 0, edx, 15:8, l1_icache_nlines , L1 icache lines per tag +0x80000005, 0, edx, 23:16, l1_icache_assoc , L1 icache associativity +0x80000005, 0, edx, 31:24, l1_icache_size_kb , L1 icache size, in KB # Leaf 80000006H -# Extended L2 Cache Features - -0x80000006, 0, ECX, 7:0, clsize, Cache Line size in bytes -0x80000006, 0, ECX, 15:12, l2c_assoc, L2 Associativity -0x80000006, 0, ECX, 31:16, csize, Cache size in 1K units +# (Mostly AMD) L2 TLB, L2 cache, and L3 cache enumeration +0x80000006, 0, eax, 11:0, l2_itlb_2m_4m_nentries , L2 iTLB #entries, 2M and 4M pages +0x80000006, 0, eax, 15:12, l2_itlb_2m_4m_assoc , L2 iTLB associativity, 2M and 4M pages +0x80000006, 0, eax, 27:16, l2_dtlb_2m_4m_nentries , L2 dTLB #entries, 2M and 4M pages +0x80000006, 0, eax, 31:28, l2_dtlb_2m_4m_assoc , L2 dTLB associativity, 2M and 4M pages +0x80000006, 0, ebx, 11:0, l2_itlb_4k_nentries , L2 iTLB #entries, 4K pages +0x80000006, 0, ebx, 15:12, l2_itlb_4k_assoc , L2 iTLB associativity, 4K pages +0x80000006, 0, ebx, 27:16, l2_dtlb_4k_nentries , L2 dTLB #entries, 4K pages +0x80000006, 0, ebx, 31:28, l2_dtlb_4k_assoc , L2 dTLB associativity, 4K pages +0x80000006, 0, ecx, 7:0, l2_line_size , L2 cache line size, in bytes +0x80000006, 0, ecx, 11:8, l2_nlines , L2 cache number of lines per tag +0x80000006, 0, ecx, 15:12, l2_assoc , L2 cache associativity +0x80000006, 0, ecx, 31:16, l2_size_kb , L2 cache size, in KB +0x80000006, 0, edx, 7:0, l3_line_size , L3 cache line size, in bytes +0x80000006, 0, edx, 11:8, l3_nlines , L3 cache number of lines per tag +0x80000006, 0, edx, 15:12, l3_assoc , L3 cache associativity +0x80000006, 0, edx, 31:18, l3_size_range , L3 cache size range # Leaf 80000007H +# CPU power management (mostly AMD) and AMD RAS enumeration -0x80000007, 0, EDX, 8, nonstop_tsc, Invariant TSC available - +0x80000007, 0, ebx, 0, overflow_recov , MCA overflow conditions not fatal +0x80000007, 0, ebx, 1, succor , Software containment of UnCORRectable errors +0x80000007, 0, ebx, 2, hw_assert , Hardware assert MSRs +0x80000007, 0, ebx, 3, smca , Scalable MCA (MCAX MSRs) +0x80000007, 0, ecx, 31:0, cpu_pwr_sample_ratio , CPU power sample time ratio +0x80000007, 0, edx, 0, digital_temp , Digital temprature sensor +0x80000007, 0, edx, 1, powernow_freq_id , PowerNOW! frequency scaling +0x80000007, 0, edx, 2, powernow_volt_id , PowerNOW! voltage scaling +0x80000007, 0, edx, 3, thermal_trip , THERMTRIP (Thermal Trip) +0x80000007, 0, edx, 4, hw_thermal_control , Hardware thermal control +0x80000007, 0, edx, 5, sw_thermal_control , Software thermal control +0x80000007, 0, edx, 6, 100mhz_steps , 100 MHz multiplier control +0x80000007, 0, edx, 7, hw_pstate , Hardware P-state control +0x80000007, 0, edx, 8, constant_tsc , TSC ticks at constant rate across all P and C states +0x80000007, 0, edx, 9, cpb , Core performance boost +0x80000007, 0, edx, 10, eff_freq_ro , Read-only effective frequency interface +0x80000007, 0, edx, 11, proc_feedback , Processor feedback interface (deprecated) +0x80000007, 0, edx, 12, acc_power , Processor power reporting interface +0x80000007, 0, edx, 13, connected_standby , CPU Connected Standby support +0x80000007, 0, edx, 14, rapl , Runtime Average Power Limit interface # Leaf 80000008H +# CPU capacity parameters and extended feature flags (mostly AMD) -0x80000008, 0, EAX, 7:0, phy_adr_bits, Physical Address Bits -0x80000008, 0, EAX, 15:8, lnr_adr_bits, Linear Address Bits -0x80000007, 0, EBX, 9, wbnoinvd, WBNOINVD +0x80000008, 0, eax, 7:0, phys_addr_bits , Max physical address bits +0x80000008, 0, eax, 15:8, virt_addr_bits , Max virtual address bits +0x80000008, 0, eax, 23:16, guest_phys_addr_bits , Max nested-paging guest physical address bits +0x80000008, 0, ebx, 0, clzero , CLZERO supported +0x80000008, 0, ebx, 1, irperf , Instruction retired counter MSR +0x80000008, 0, ebx, 2, xsaveerptr , XSAVE/XRSTOR always saves/restores FPU error pointers +0x80000008, 0, ebx, 3, invlpgb , INVLPGB broadcasts a TLB invalidate to all threads +0x80000008, 0, ebx, 4, rdpru , RDPRU (Read Processor Register at User level) supported +0x80000008, 0, ebx, 6, mba , Memory Bandwidth Allocation (AMD bit) +0x80000008, 0, ebx, 8, mcommit , MCOMMIT (Memory commit) supported +0x80000008, 0, ebx, 9, wbnoinvd , WBNOINVD supported +0x80000008, 0, ebx, 12, amd_ibpb , Indirect Branch Prediction Barrier +0x80000008, 0, ebx, 13, wbinvd_int , Interruptible WBINVD/WBNOINVD +0x80000008, 0, ebx, 14, amd_ibrs , Indirect Branch Restricted Speculation +0x80000008, 0, ebx, 15, amd_stibp , Single Thread Indirect Branch Prediction mode +0x80000008, 0, ebx, 16, ibrs_always_on , IBRS always-on preferred +0x80000008, 0, ebx, 17, amd_stibp_always_on , STIBP always-on preferred +0x80000008, 0, ebx, 18, ibrs_fast , IBRS is preferred over software solution +0x80000008, 0, ebx, 19, ibrs_same_mode , IBRS provides same mode protection +0x80000008, 0, ebx, 20, no_efer_lmsle , EFER[LMSLE] bit (Long-Mode Segment Limit Enable) unsupported +0x80000008, 0, ebx, 21, tlb_flush_nested , INVLPGB RAX[5] bit can be set (nested translations) +0x80000008, 0, ebx, 23, amd_ppin , Protected Processor Inventory Number +0x80000008, 0, ebx, 24, amd_ssbd , Speculative Store Bypass Disable +0x80000008, 0, ebx, 25, virt_ssbd , virtualized SSBD (Speculative Store Bypass Disable) +0x80000008, 0, ebx, 26, amd_ssb_no , SSBD not needed (fixed in HW) +0x80000008, 0, ebx, 27, cppc , Collaborative Processor Performance Control +0x80000008, 0, ebx, 28, amd_psfd , Predictive Store Forward Disable +0x80000008, 0, ebx, 29, btc_no , CPU not affected by Branch Type Confusion +0x80000008, 0, ebx, 30, ibpb_ret , IBPB clears RSB/RAS too +0x80000008, 0, ebx, 31, brs , Branch Sampling supported +0x80000008, 0, ecx, 7:0, cpu_nthreads , Number of physical threads - 1 +0x80000008, 0, ecx, 15:12, apicid_coreid_len , Number of thread core ID bits (shift) in APIC ID +0x80000008, 0, ecx, 17:16, perf_tsc_len , Performance time-stamp counter size +0x80000008, 0, edx, 15:0, invlpgb_max_pages , INVLPGB maximum page count +0x80000008, 0, edx, 31:16, rdpru_max_reg_id , RDPRU max register ID (ECX input) -# 0x8000001E -# EAX: Extended APIC ID -0x8000001E, 0, EAX, 31:0, extended_apic_id, Extended APIC ID -# EBX: Core Identifiers -0x8000001E, 0, EBX, 7:0, core_id, Identifies the logical core ID -0x8000001E, 0, EBX, 15:8, threads_per_core, The number of threads per core is threads_per_core + 1 -# ECX: Node Identifiers -0x8000001E, 0, ECX, 7:0, node_id, Node ID -0x8000001E, 0, ECX, 10:8, nodes_per_processor, Nodes per processor { 0: 1 node, else reserved } +# Leaf 8000000AH +# AMD SVM (Secure Virtual Machine) enumeration -# 8000001F: AMD Secure Encryption -0x8000001F, 0, EAX, 0, sme, Secure Memory Encryption -0x8000001F, 0, EAX, 1, sev, Secure Encrypted Virtualization -0x8000001F, 0, EAX, 2, vmpgflush, VM Page Flush MSR -0x8000001F, 0, EAX, 3, seves, SEV Encrypted State -0x8000001F, 0, EBX, 5:0, c-bit, Page table bit number used to enable memory encryption -0x8000001F, 0, EBX, 11:6, mem_encrypt_physaddr_width, Reduction of physical address space in bits with SME enabled -0x8000001F, 0, ECX, 31:0, num_encrypted_guests, Maximum ASID value that may be used for an SEV-enabled guest -0x8000001F, 0, EDX, 31:0, minimum_sev_asid, Minimum ASID value that must be used for an SEV-enabled, SEV-ES-disabled guest +0x8000000a, 0, eax, 7:0, svm_version , SVM revision number +0x8000000a, 0, ebx, 31:0, svm_nasid , Number of address space identifiers (ASID) +0x8000000a, 0, edx, 0, npt , Nested paging +0x8000000a, 0, edx, 1, lbrv , LBR virtualization +0x8000000a, 0, edx, 2, svm_lock , SVM lock +0x8000000a, 0, edx, 3, nrip_save , NRIP save support on #VMEXIT +0x8000000a, 0, edx, 4, tsc_scale , MSR based TSC rate control +0x8000000a, 0, edx, 5, vmcb_clean , VMCB clean bits support +0x8000000a, 0, edx, 6, flushbyasid , Flush by ASID + Extended VMCB TLB_Control +0x8000000a, 0, edx, 7, decodeassists , Decode Assists support +0x8000000a, 0, edx, 10, pausefilter , Pause intercept filter +0x8000000a, 0, edx, 12, pfthreshold , Pause filter threshold +0x8000000a, 0, edx, 13, avic , Advanced virtual interrupt controller +0x8000000a, 0, edx, 15, v_vmsave_vmload , Virtual VMSAVE/VMLOAD (nested virt) +0x8000000a, 0, edx, 16, vgif , Virtualize the Global Interrupt Flag +0x8000000a, 0, edx, 17, gmet , Guest mode execution trap +0x8000000a, 0, edx, 18, x2avic , Virtual x2APIC +0x8000000a, 0, edx, 19, sss_check , Supervisor Shadow Stack restrictions +0x8000000a, 0, edx, 20, v_spec_ctrl , Virtual SPEC_CTRL +0x8000000a, 0, edx, 21, ro_gpt , Read-Only guest page table support +0x8000000a, 0, edx, 23, h_mce_override , Host MCE override +0x8000000a, 0, edx, 24, tlbsync_int , TLBSYNC intercept + INVLPGB/TLBSYNC in VMCB +0x8000000a, 0, edx, 25, vnmi , NMI virtualization +0x8000000a, 0, edx, 26, ibs_virt , IBS Virtualization +0x8000000a, 0, edx, 27, ext_lvt_off_chg , Extended LVT offset fault change +0x8000000a, 0, edx, 28, svme_addr_chk , Guest SVME addr check + +# Leaf 80000019H +# AMD TLB 1G-pages enumeration + +0x80000019, 0, eax, 11:0, l1_itlb_1g_nentries , L1 iTLB #entries, 1G pages +0x80000019, 0, eax, 15:12, l1_itlb_1g_assoc , L1 iTLB associativity, 1G pages +0x80000019, 0, eax, 27:16, l1_dtlb_1g_nentries , L1 dTLB #entries, 1G pages +0x80000019, 0, eax, 31:28, l1_dtlb_1g_assoc , L1 dTLB associativity, 1G pages +0x80000019, 0, ebx, 11:0, l2_itlb_1g_nentries , L2 iTLB #entries, 1G pages +0x80000019, 0, ebx, 15:12, l2_itlb_1g_assoc , L2 iTLB associativity, 1G pages +0x80000019, 0, ebx, 27:16, l2_dtlb_1g_nentries , L2 dTLB #entries, 1G pages +0x80000019, 0, ebx, 31:28, l2_dtlb_1g_assoc , L2 dTLB associativity, 1G pages + +# Leaf 8000001AH +# AMD instruction optimizations enumeration + +0x8000001a, 0, eax, 0, fp_128 , Internal FP/SIMD exec data path is 128-bits wide +0x8000001a, 0, eax, 1, movu_preferred , SSE: MOVU* better than MOVL*/MOVH* +0x8000001a, 0, eax, 2, fp_256 , internal FP/SSE exec data path is 256-bits wide + +# Leaf 8000001BH +# AMD IBS (Instruction-Based Sampling) enumeration + +0x8000001b, 0, eax, 0, ibs_flags_valid , IBS feature flags valid +0x8000001b, 0, eax, 1, ibs_fetch_sampling , IBS fetch sampling supported +0x8000001b, 0, eax, 2, ibs_op_sampling , IBS execution sampling supported +0x8000001b, 0, eax, 3, ibs_rdwr_op_counter , IBS read/write of op counter supported +0x8000001b, 0, eax, 4, ibs_op_count , IBS OP counting mode supported +0x8000001b, 0, eax, 5, ibs_branch_target , IBS branch target address reporting supported +0x8000001b, 0, eax, 6, ibs_op_counters_ext , IBS IbsOpCurCnt/IbsOpMaxCnt extend by 7 bits +0x8000001b, 0, eax, 7, ibs_rip_invalid_chk , IBS invalid RIP indication supported +0x8000001b, 0, eax, 8, ibs_op_branch_fuse , IBS fused branch micro-op indication supported +0x8000001b, 0, eax, 9, ibs_fetch_ctl_ext , IBS Fetch Control Extended MSR (0xc001103c) supported +0x8000001b, 0, eax, 10, ibs_op_data_4 , IBS op data 4 MSR supported +0x8000001b, 0, eax, 11, ibs_l3_miss_filter , IBS L3-miss filtering supported (Zen4+) + +# Leaf 8000001CH +# AMD LWP (Lightweight Profiling) + +0x8000001c, 0, eax, 0, os_lwp_avail , LWP is available to application programs (supported by OS) +0x8000001c, 0, eax, 1, os_lpwval , LWPVAL instruction (EventId=1) is supported by OS +0x8000001c, 0, eax, 2, os_lwp_ire , Instructions Retired Event (EventId=2) is supported by OS +0x8000001c, 0, eax, 3, os_lwp_bre , Branch Retired Event (EventId=3) is supported by OS +0x8000001c, 0, eax, 4, os_lwp_dme , DCache Miss Event (EventId=4) is supported by OS +0x8000001c, 0, eax, 5, os_lwp_cnh , CPU Clocks Not Halted event (EventId=5) is supported by OS +0x8000001c, 0, eax, 6, os_lwp_rnh , CPU Reference clocks Not Halted event (EventId=6) is supported by OS +0x8000001c, 0, eax, 29, os_lwp_cont , LWP sampling in continuous mode is supported by OS +0x8000001c, 0, eax, 30, os_lwp_ptsc , Performance Time Stamp Counter in event records is supported by OS +0x8000001c, 0, eax, 31, os_lwp_int , Interrupt on threshold overflow is supported by OS +0x8000001c, 0, ebx, 7:0, lwp_lwpcb_sz , LWP Control Block size, in quadwords +0x8000001c, 0, ebx, 15:8, lwp_event_sz , LWP event record size, in bytes +0x8000001c, 0, ebx, 23:16, lwp_max_events , LWP max supported EventId value (EventID 255 not included) +0x8000001c, 0, ebx, 31:24, lwp_event_offset , LWP events area offset in the LWP Control Block +0x8000001c, 0, ecx, 4:0, lwp_latency_max , Num of bits in cache latency counters (10 to 31) +0x8000001c, 0, ecx, 5, lwp_data_adddr , Cache miss events report the data address of the reference +0x8000001c, 0, ecx, 8:6, lwp_latency_rnd , Amount by which cache latency is rounded +0x8000001c, 0, ecx, 15:9, lwp_version , LWP implementation version +0x8000001c, 0, ecx, 23:16, lwp_buf_min_sz , LWP event ring buffer min size, in units of 32 event records +0x8000001c, 0, ecx, 28, lwp_branch_predict , Branches Retired events can be filtered +0x8000001c, 0, ecx, 29, lwp_ip_filtering , IP filtering (IPI, IPF, BaseIP, and LimitIP @ LWPCP) supported +0x8000001c, 0, ecx, 30, lwp_cache_levels , Cache-related events can be filtered by cache level +0x8000001c, 0, ecx, 31, lwp_cache_latency , Cache-related events can be filtered by latency +0x8000001c, 0, edx, 0, hw_lwp_avail , LWP is available in Hardware +0x8000001c, 0, edx, 1, hw_lpwval , LWPVAL instruction (EventId=1) is available in HW +0x8000001c, 0, edx, 2, hw_lwp_ire , Instructions Retired Event (EventId=2) is available in HW +0x8000001c, 0, edx, 3, hw_lwp_bre , Branch Retired Event (EventId=3) is available in HW +0x8000001c, 0, edx, 4, hw_lwp_dme , DCache Miss Event (EventId=4) is available in HW +0x8000001c, 0, edx, 5, hw_lwp_cnh , CPU Clocks Not Halted event (EventId=5) is available in HW +0x8000001c, 0, edx, 6, hw_lwp_rnh , CPU Reference clocks Not Halted event (EventId=6) is available in HW +0x8000001c, 0, edx, 29, hw_lwp_cont , LWP sampling in continuous mode is available in HW +0x8000001c, 0, edx, 30, hw_lwp_ptsc , Performance Time Stamp Counter in event records is available in HW +0x8000001c, 0, edx, 31, hw_lwp_int , Interrupt on threshold overflow is available in HW + +# Leaf 8000001DH +# AMD deterministic cache parameters + +0x8000001d, 31:0, eax, 4:0, cache_type , Cache type field +0x8000001d, 31:0, eax, 7:5, cache_level , Cache level (1-based) +0x8000001d, 31:0, eax, 8, cache_self_init , Self-initializing cache level +0x8000001d, 31:0, eax, 9, fully_associative , Fully-associative cache +0x8000001d, 31:0, eax, 25:14, num_threads_sharing , Number of logical CPUs sharing cache +0x8000001d, 31:0, ebx, 11:0, cache_linesize , System coherency line size (0-based) +0x8000001d, 31:0, ebx, 21:12, cache_npartitions , Physical line partitions (0-based) +0x8000001d, 31:0, ebx, 31:22, cache_nways , Ways of associativity (0-based) +0x8000001d, 31:0, ecx, 30:0, cache_nsets , Cache number of sets (0-based) +0x8000001d, 31:0, edx, 0, wbinvd_rll_no_guarantee, WBINVD/INVD not guaranteed for Remote Lower-Level caches +0x8000001d, 31:0, edx, 1, ll_inclusive , Cache is inclusive of Lower-Level caches + +# Leaf 8000001EH +# AMD CPU topology enumeration + +0x8000001e, 0, eax, 31:0, ext_apic_id , Extended APIC ID +0x8000001e, 0, ebx, 7:0, core_id , Unique per-socket logical core unit ID +0x8000001e, 0, ebx, 15:8, core_nthreas , #Threads per core (zero-based) +0x8000001e, 0, ecx, 7:0, node_id , Node (die) ID of invoking logical CPU +0x8000001e, 0, ecx, 10:8, nnodes_per_socket , #nodes in invoking logical CPU's package/socket + +# Leaf 8000001FH +# AMD encrypted memory capabilities enumeration (SME/SEV) + +0x8000001f, 0, eax, 0, sme , Secure Memory Encryption supported +0x8000001f, 0, eax, 1, sev , Secure Encrypted Virtualization supported +0x8000001f, 0, eax, 2, vm_page_flush , VM Page Flush MSR (0xc001011e) available +0x8000001f, 0, eax, 3, sev_es , SEV Encrypted State supported +0x8000001f, 0, eax, 4, sev_nested_paging , SEV secure nested paging supported +0x8000001f, 0, eax, 5, vm_permission_levels , VMPL supported +0x8000001f, 0, eax, 6, rpmquery , RPMQUERY instruction supported +0x8000001f, 0, eax, 7, vmpl_sss , VMPL supervisor shadwo stack supported +0x8000001f, 0, eax, 8, secure_tsc , Secure TSC supported +0x8000001f, 0, eax, 9, v_tsc_aux , Hardware virtualizes TSC_AUX +0x8000001f, 0, eax, 10, sme_coherent , HW enforces cache coherency across encryption domains +0x8000001f, 0, eax, 11, req_64bit_hypervisor , SEV guest mandates 64-bit hypervisor +0x8000001f, 0, eax, 12, restricted_injection , Restricted Injection supported +0x8000001f, 0, eax, 13, alternate_injection , Alternate Injection supported +0x8000001f, 0, eax, 14, debug_swap , SEV-ES: full debug state swap is supported +0x8000001f, 0, eax, 15, disallow_host_ibs , SEV-ES: Disallowing IBS use by the host is supported +0x8000001f, 0, eax, 16, virt_transparent_enc , Virtual Transparent Encryption +0x8000001f, 0, eax, 17, vmgexit_paremeter , VmgexitParameter is supported in SEV_FEATURES +0x8000001f, 0, eax, 18, virt_tom_msr , Virtual TOM MSR is supported +0x8000001f, 0, eax, 19, virt_ibs , IBS state virtualization is supported for SEV-ES guests +0x8000001f, 0, eax, 24, vmsa_reg_protection , VMSA register protection is supported +0x8000001f, 0, eax, 25, smt_protection , SMT protection is supported +0x8000001f, 0, eax, 28, svsm_page_msr , SVSM communication page MSR (0xc001f000h) is supported +0x8000001f, 0, eax, 29, nested_virt_snp_msr , VIRT_RMPUPDATE/VIRT_PSMASH MSRs are supported +0x8000001f, 0, ebx, 5:0, pte_cbit_pos , PTE bit number used to enable memory encryption +0x8000001f, 0, ebx, 11:6, phys_addr_reduction_nbits, Reduction of phys address space when encryption is enabled, in bits +0x8000001f, 0, ebx, 15:12, vmpl_count , Number of VM permission levels (VMPL) supported +0x8000001f, 0, ecx, 31:0, enc_guests_max , Max supported number of simultaneous encrypted guests +0x8000001f, 0, edx, 31:0, min_sev_asid_no_sev_es , Mininum ASID for SEV-enabled SEV-ES-disabled guest + +# Leaf 80000020H +# AMD Platform QoS extended feature IDs + +0x80000020, 0, ebx, 1, mba , Memory Bandwidth Allocation support +0x80000020, 0, ebx, 2, smba , Slow Memory Bandwidth Allocation support +0x80000020, 0, ebx, 3, bmec , Bandwidth Monitoring Event Configuration support +0x80000020, 0, ebx, 4, l3rr , L3 Range Reservation support +0x80000020, 1, eax, 31:0, mba_limit_len , MBA enforcement limit size +0x80000020, 1, edx, 31:0, mba_cos_max , MBA max Class of Service number (zero-based) +0x80000020, 2, eax, 31:0, smba_limit_len , SMBA enforcement limit size +0x80000020, 2, edx, 31:0, smba_cos_max , SMBA max Class of Service number (zero-based) +0x80000020, 3, ebx, 7:0, bmec_num_events , BMEC number of bandwidth events available +0x80000020, 3, ecx, 0, bmec_local_reads , Local NUMA reads can be tracked +0x80000020, 3, ecx, 1, bmec_remote_reads , Remote NUMA reads can be tracked +0x80000020, 3, ecx, 2, bmec_local_nontemp_wr , Local NUMA non-temporal writes can be tracked +0x80000020, 3, ecx, 3, bmec_remote_nontemp_wr , Remote NUMA non-temporal writes can be tracked +0x80000020, 3, ecx, 4, bmec_local_slow_mem_rd , Local NUMA slow-memory reads can be tracked +0x80000020, 3, ecx, 5, bmec_remote_slow_mem_rd, Remote NUMA slow-memory reads can be tracked +0x80000020, 3, ecx, 6, bmec_all_dirty_victims , Dirty QoS victims to all types of memory can be tracked + +# Leaf 80000021H +# AMD extended features enumeration 2 + +0x80000021, 0, eax, 0, no_nested_data_bp , No nested data breakpoints +0x80000021, 0, eax, 1, fsgs_non_serializing , WRMSR to {FS,GS,KERNEL_GS}_BASE is non-serializing +0x80000021, 0, eax, 2, lfence_rdtsc , LFENCE always serializing / synchronizes RDTSC +0x80000021, 0, eax, 3, smm_page_cfg_lock , SMM paging configuration lock is supported +0x80000021, 0, eax, 6, null_sel_clr_base , Null selector clears base +0x80000021, 0, eax, 7, upper_addr_ignore , EFER MSR Upper Address Ignore Enable bit supported +0x80000021, 0, eax, 8, autoibrs , EFER MSR Automatic IBRS enable bit supported +0x80000021, 0, eax, 9, no_smm_ctl_msr , SMM_CTL MSR (0xc0010116) is not present +0x80000021, 0, eax, 10, fsrs_supported , Fast Short Rep Stosb (FSRS) is supported +0x80000021, 0, eax, 11, fsrc_supported , Fast Short Repe Cmpsb (FSRC) is supported +0x80000021, 0, eax, 13, prefetch_ctl_msr , Prefetch control MSR is supported +0x80000021, 0, eax, 17, user_cpuid_disable , #GP when executing CPUID at CPL > 0 is supported +0x80000021, 0, eax, 18, epsf_supported , Enhanced Predictive Store Forwarding (EPSF) is supported +0x80000021, 0, ebx, 11:0, microcode_patch_size , Size of microcode patch, in 16-byte units + +# Leaf 80000022H +# AMD Performance Monitoring v2 enumeration + +0x80000022, 0, eax, 0, perfmon_v2 , Performance monitoring v2 supported +0x80000022, 0, eax, 1, lbr_v2 , Last Branch Record v2 extensions (LBR Stack) +0x80000022, 0, eax, 2, lbr_pmc_freeze , Freezing core performance counters / LBR Stack supported +0x80000022, 0, ebx, 3:0, n_pmc_core , Number of core perfomance counters +0x80000022, 0, ebx, 9:4, lbr_v2_stack_size , Number of available LBR stack entries +0x80000022, 0, ebx, 15:10, n_pmc_northbridge , Number of available northbridge (data fabric) performance counters +0x80000022, 0, ebx, 21:16, n_pmc_umc , Number of available UMC performance counters +0x80000022, 0, ecx, 31:0, active_umc_bitmask , Active UMCs bitmask + +# Leaf 80000023H +# AMD Secure Multi-key Encryption enumeration + +0x80000023, 0, eax, 0, mem_hmk_mode , MEM-HMK encryption mode is supported +0x80000023, 0, ebx, 15:0, mem_hmk_avail_keys , MEM-HMK mode: total num of available encryption keys + +# Leaf 80000026H +# AMD extended topology enumeration v2 + +0x80000026, 3:0, eax, 4:0, x2apic_id_shift , Bit width of this level (previous levels inclusive) +0x80000026, 3:0, eax, 29, core_has_pwreff_ranking, This core has a power efficiency ranking +0x80000026, 3:0, eax, 30, domain_has_hybrid_cores, This domain level has hybrid (E, P) cores +0x80000026, 3:0, eax, 31, domain_core_count_asymm, The 'Core' domain has asymmetric cores count +0x80000026, 3:0, ebx, 15:0, domain_lcpus_count , Number of logical CPUs at this domain instance +0x80000026, 3:0, ebx, 23:16, core_pwreff_ranking , This core's static power efficiency ranking +0x80000026, 3:0, ebx, 27:24, core_native_model_id , This core's native model ID +0x80000026, 3:0, ebx, 31:28, core_type , This core's type +0x80000026, 3:0, ecx, 7:0, domain_level , This domain level (subleaf ID) +0x80000026, 3:0, ecx, 15:8, domain_type , This domain type +0x80000026, 3:0, edx, 31:0, x2apic_id , x2APIC ID of current logical CPU diff --git a/tools/arch/x86/kcpuid/kcpuid.c b/tools/arch/x86/kcpuid/kcpuid.c index 24b7d017ec2c..1b25c0a95d3f 100644 --- a/tools/arch/x86/kcpuid/kcpuid.c +++ b/tools/arch/x86/kcpuid/kcpuid.c @@ -7,7 +7,8 @@ #include #include -#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) +#define min(a, b) (((a) < (b)) ? (a) : (b)) typedef unsigned int u32; typedef unsigned long long u64; @@ -76,7 +77,6 @@ struct cpuid_range { */ struct cpuid_range *leafs_basic, *leafs_ext; -static int num_leafs; static bool is_amd; static bool show_details; static bool show_raw; @@ -98,27 +98,17 @@ static inline void cpuid(u32 *eax, u32 *ebx, u32 *ecx, u32 *edx) static inline bool has_subleafs(u32 f) { - if (f == 0x7 || f == 0xd) - return true; + u32 with_subleaves[] = { + 0x4, 0x7, 0xb, 0xd, 0xf, 0x10, 0x12, + 0x14, 0x17, 0x18, 0x1b, 0x1d, 0x1f, 0x23, + 0x8000001d, 0x80000020, 0x80000026, + }; - if (is_amd) { - if (f == 0x8000001d) + for (unsigned i = 0; i < ARRAY_SIZE(with_subleaves); i++) + if (f == with_subleaves[i]) return true; - return false; - } - switch (f) { - case 0x4: - case 0xb: - case 0xf: - case 0x10: - case 0x14: - case 0x18: - case 0x1f: - return true; - default: - return false; - } + return false; } static void leaf_print_raw(struct subleaf *leaf) @@ -204,15 +194,12 @@ static void raw_dump_range(struct cpuid_range *range) } } -#define MAX_SUBLEAF_NUM 32 +#define MAX_SUBLEAF_NUM 64 struct cpuid_range *setup_cpuid_range(u32 input_eax) { - u32 max_func, idx_func; - int subleaf; + u32 max_func, idx_func, subleaf, max_subleaf; + u32 eax, ebx, ecx, edx, f = input_eax; struct cpuid_range *range; - u32 eax, ebx, ecx, edx; - u32 f = input_eax; - int max_subleaf; bool allzero; eax = input_eax; @@ -246,7 +233,6 @@ struct cpuid_range *setup_cpuid_range(u32 input_eax) allzero = cpuid_store(range, f, subleaf, eax, ebx, ecx, edx); if (allzero) continue; - num_leafs++; if (!has_subleafs(f)) continue; @@ -257,11 +243,18 @@ struct cpuid_range *setup_cpuid_range(u32 input_eax) * Some can provide the exact number of subleafs, * others have to be tried (0xf) */ - if (f == 0x7 || f == 0x14 || f == 0x17 || f == 0x18) - max_subleaf = (eax & 0xff) + 1; - + if (f == 0x7 || f == 0x14 || f == 0x17 || f == 0x18 || f == 0x1d) + max_subleaf = min((eax & 0xff) + 1, max_subleaf); if (f == 0xb) max_subleaf = 2; + if (f == 0x1f) + max_subleaf = 6; + if (f == 0x23) + max_subleaf = 4; + if (f == 0x80000020) + max_subleaf = 4; + if (f == 0x80000026) + max_subleaf = 5; for (subleaf = 1; subleaf < max_subleaf; subleaf++) { eax = f; @@ -272,7 +265,6 @@ struct cpuid_range *setup_cpuid_range(u32 input_eax) eax, ebx, ecx, edx); if (allzero) continue; - num_leafs++; } } @@ -313,6 +305,8 @@ static int parse_line(char *line) struct bits_desc *bdesc; int reg_index; char *start, *end; + u32 subleaf_start, subleaf_end; + unsigned bit_start, bit_end; /* Skip comments and NULL line */ if (line[0] == '#' || line[0] == '\n') @@ -351,13 +345,25 @@ static int parse_line(char *line) return 0; /* subleaf */ - sub = strtoul(tokens[1], NULL, 0); - if ((int)sub > func->nr) - return -1; + buf = tokens[1]; + end = strtok(buf, ":"); + start = strtok(NULL, ":"); + subleaf_end = strtoul(end, NULL, 0); - leaf = &func->leafs[sub]; + /* A subleaf range is given? */ + if (start) { + subleaf_start = strtoul(start, NULL, 0); + subleaf_end = min(subleaf_end, (u32)(func->nr - 1)); + if (subleaf_start > subleaf_end) + return 0; + } else { + subleaf_start = subleaf_end; + if (subleaf_start > (u32)(func->nr - 1)) + return 0; + } + + /* register */ buf = tokens[2]; - if (strcasestr(buf, "EAX")) reg_index = R_EAX; else if (strcasestr(buf, "EBX")) @@ -369,23 +375,23 @@ static int parse_line(char *line) else goto err_exit; - reg = &leaf->info[reg_index]; - bdesc = ®->descs[reg->nr++]; - /* bit flag or bits field */ buf = tokens[3]; - end = strtok(buf, ":"); - bdesc->end = strtoul(end, NULL, 0); - bdesc->start = bdesc->end; - - /* start != NULL means it is bit fields */ start = strtok(NULL, ":"); - if (start) - bdesc->start = strtoul(start, NULL, 0); + bit_end = strtoul(end, NULL, 0); + bit_start = (start) ? strtoul(start, NULL, 0) : bit_end; - strcpy(bdesc->simp, tokens[4]); - strcpy(bdesc->detail, tokens[5]); + for (sub = subleaf_start; sub <= subleaf_end; sub++) { + leaf = &func->leafs[sub]; + reg = &leaf->info[reg_index]; + bdesc = ®->descs[reg->nr++]; + + bdesc->end = bit_end; + bdesc->start = bit_start; + strcpy(bdesc->simp, strtok(tokens[4], " \t")); + strcpy(bdesc->detail, tokens[5]); + } return 0; err_exit: @@ -452,8 +458,9 @@ static void decode_bits(u32 value, struct reg_desc *rdesc, enum cpuid_reg reg) if (start == end) { /* single bit flag */ if (value & (1 << start)) - printf("\t%-20s %s%s\n", + printf("\t%-20s %s%s%s\n", bdesc->simp, + show_flags_only ? "" : "\t\t\t", show_details ? "-" : "", show_details ? bdesc->detail : "" ); diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 5f2ca591c956..e10b87376fde 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -90,6 +90,7 @@ CAN_BUILD_X86_64 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_64bit_pr CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_program.c -no-pie) VMTARGETS := protection_keys +VMTARGETS += pkey_sighandler_tests BINARIES_32 := $(VMTARGETS:%=%_32) BINARIES_64 := $(VMTARGETS:%=%_64) diff --git a/tools/testing/selftests/mm/pkey-helpers.h b/tools/testing/selftests/mm/pkey-helpers.h index 15608350fc01..9ab6a3ee153b 100644 --- a/tools/testing/selftests/mm/pkey-helpers.h +++ b/tools/testing/selftests/mm/pkey-helpers.h @@ -79,7 +79,18 @@ extern void abort_hooks(void); } \ } while (0) -__attribute__((noinline)) int read_ptr(int *ptr); +#define barrier() __asm__ __volatile__("": : :"memory") +#ifndef noinline +# define noinline __attribute__((noinline)) +#endif + +noinline int read_ptr(int *ptr) +{ + /* Keep GCC from optimizing this away somehow */ + barrier(); + return *ptr; +} + void expected_pkey_fault(int pkey); int sys_pkey_alloc(unsigned long flags, unsigned long init_val); int sys_pkey_free(unsigned long pkey); diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c new file mode 100644 index 000000000000..a8088b645ad6 --- /dev/null +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c @@ -0,0 +1,481 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Tests Memory Protection Keys (see Documentation/core-api/protection-keys.rst) + * + * The testcases in this file exercise various flows related to signal handling, + * using an alternate signal stack, with the default pkey (pkey 0) disabled. + * + * Compile with: + * gcc -mxsave -o pkey_sighandler_tests -O2 -g -std=gnu99 -pthread -Wall pkey_sighandler_tests.c -I../../../../tools/include -lrt -ldl -lm + * gcc -mxsave -m32 -o pkey_sighandler_tests -O2 -g -std=gnu99 -pthread -Wall pkey_sighandler_tests.c -I../../../../tools/include -lrt -ldl -lm + */ +#define _GNU_SOURCE +#define __SANE_USERSPACE_TYPES__ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pkey-helpers.h" + +#define STACK_SIZE PTHREAD_STACK_MIN + +void expected_pkey_fault(int pkey) {} + +pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; +pthread_cond_t cond = PTHREAD_COND_INITIALIZER; +siginfo_t siginfo = {0}; + +/* + * We need to use inline assembly instead of glibc's syscall because glibc's + * syscall will attempt to access the PLT in order to call a library function + * which is protected by MPK 0 which we don't have access to. + */ +static inline __always_inline +long syscall_raw(long n, long a1, long a2, long a3, long a4, long a5, long a6) +{ + unsigned long ret; +#ifdef __x86_64__ + register long r10 asm("r10") = a4; + register long r8 asm("r8") = a5; + register long r9 asm("r9") = a6; + asm volatile ("syscall" + : "=a"(ret) + : "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10), "r"(r8), "r"(r9) + : "rcx", "r11", "memory"); +#elif defined __i386__ + asm volatile ("int $0x80" + : "=a"(ret) + : "a"(n), "b"(a1), "c"(a2), "d"(a3), "S"(a4), "D"(a5) + : "memory"); +#else +# error syscall_raw() not implemented +#endif + return ret; +} + +static void sigsegv_handler(int signo, siginfo_t *info, void *ucontext) +{ + pthread_mutex_lock(&mutex); + + memcpy(&siginfo, info, sizeof(siginfo_t)); + + pthread_cond_signal(&cond); + pthread_mutex_unlock(&mutex); + + syscall_raw(SYS_exit, 0, 0, 0, 0, 0, 0); +} + +static void sigusr1_handler(int signo, siginfo_t *info, void *ucontext) +{ + pthread_mutex_lock(&mutex); + + memcpy(&siginfo, info, sizeof(siginfo_t)); + + pthread_cond_signal(&cond); + pthread_mutex_unlock(&mutex); +} + +static void sigusr2_handler(int signo, siginfo_t *info, void *ucontext) +{ + /* + * pkru should be the init_pkru value which enabled MPK 0 so + * we can use library functions. + */ + printf("%s invoked.\n", __func__); +} + +static void raise_sigusr2(void) +{ + pid_t tid = 0; + + tid = syscall_raw(SYS_gettid, 0, 0, 0, 0, 0, 0); + + syscall_raw(SYS_tkill, tid, SIGUSR2, 0, 0, 0, 0); + + /* + * We should return from the signal handler here and be able to + * return to the interrupted thread. + */ +} + +static void *thread_segv_with_pkey0_disabled(void *ptr) +{ + /* Disable MPK 0 (and all others too) */ + __write_pkey_reg(0x55555555); + + /* Segfault (with SEGV_MAPERR) */ + *(int *) (0x1) = 1; + return NULL; +} + +static void *thread_segv_pkuerr_stack(void *ptr) +{ + /* Disable MPK 0 (and all others too) */ + __write_pkey_reg(0x55555555); + + /* After we disable MPK 0, we can't access the stack to return */ + return NULL; +} + +static void *thread_segv_maperr_ptr(void *ptr) +{ + stack_t *stack = ptr; + int *bad = (int *)1; + + /* + * Setup alternate signal stack, which should be pkey_mprotect()ed by + * MPK 0. The thread's stack cannot be used for signals because it is + * not accessible by the default init_pkru value of 0x55555554. + */ + syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0); + + /* Disable MPK 0. Only MPK 1 is enabled. */ + __write_pkey_reg(0x55555551); + + /* Segfault */ + *bad = 1; + syscall_raw(SYS_exit, 0, 0, 0, 0, 0, 0); + return NULL; +} + +/* + * Verify that the sigsegv handler is invoked when pkey 0 is disabled. + * Note that the new thread stack and the alternate signal stack is + * protected by MPK 0. + */ +static void test_sigsegv_handler_with_pkey0_disabled(void) +{ + struct sigaction sa; + pthread_attr_t attr; + pthread_t thr; + + sa.sa_flags = SA_SIGINFO; + + sa.sa_sigaction = sigsegv_handler; + sigemptyset(&sa.sa_mask); + if (sigaction(SIGSEGV, &sa, NULL) == -1) { + perror("sigaction"); + exit(EXIT_FAILURE); + } + + memset(&siginfo, 0, sizeof(siginfo)); + + pthread_attr_init(&attr); + pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); + + pthread_create(&thr, &attr, thread_segv_with_pkey0_disabled, NULL); + + pthread_mutex_lock(&mutex); + while (siginfo.si_signo == 0) + pthread_cond_wait(&cond, &mutex); + pthread_mutex_unlock(&mutex); + + ksft_test_result(siginfo.si_signo == SIGSEGV && + siginfo.si_code == SEGV_MAPERR && + siginfo.si_addr == (void *)1, + "%s\n", __func__); +} + +/* + * Verify that the sigsegv handler is invoked when pkey 0 is disabled. + * Note that the new thread stack and the alternate signal stack is + * protected by MPK 0, which renders them inaccessible when MPK 0 + * is disabled. So just the return from the thread should cause a + * segfault with SEGV_PKUERR. + */ +static void test_sigsegv_handler_cannot_access_stack(void) +{ + struct sigaction sa; + pthread_attr_t attr; + pthread_t thr; + + sa.sa_flags = SA_SIGINFO; + + sa.sa_sigaction = sigsegv_handler; + sigemptyset(&sa.sa_mask); + if (sigaction(SIGSEGV, &sa, NULL) == -1) { + perror("sigaction"); + exit(EXIT_FAILURE); + } + + memset(&siginfo, 0, sizeof(siginfo)); + + pthread_attr_init(&attr); + pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); + + pthread_create(&thr, &attr, thread_segv_pkuerr_stack, NULL); + + pthread_mutex_lock(&mutex); + while (siginfo.si_signo == 0) + pthread_cond_wait(&cond, &mutex); + pthread_mutex_unlock(&mutex); + + ksft_test_result(siginfo.si_signo == SIGSEGV && + siginfo.si_code == SEGV_PKUERR, + "%s\n", __func__); +} + +/* + * Verify that the sigsegv handler that uses an alternate signal stack + * is correctly invoked for a thread which uses a non-zero MPK to protect + * its own stack, and disables all other MPKs (including 0). + */ +static void test_sigsegv_handler_with_different_pkey_for_stack(void) +{ + struct sigaction sa; + static stack_t sigstack; + void *stack; + int pkey; + int parent_pid = 0; + int child_pid = 0; + + sa.sa_flags = SA_SIGINFO | SA_ONSTACK; + + sa.sa_sigaction = sigsegv_handler; + + sigemptyset(&sa.sa_mask); + if (sigaction(SIGSEGV, &sa, NULL) == -1) { + perror("sigaction"); + exit(EXIT_FAILURE); + } + + stack = mmap(0, STACK_SIZE, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + + assert(stack != MAP_FAILED); + + /* Allow access to MPK 0 and MPK 1 */ + __write_pkey_reg(0x55555550); + + /* Protect the new stack with MPK 1 */ + pkey = pkey_alloc(0, 0); + pkey_mprotect(stack, STACK_SIZE, PROT_READ | PROT_WRITE, pkey); + + /* Set up alternate signal stack that will use the default MPK */ + sigstack.ss_sp = mmap(0, STACK_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + sigstack.ss_flags = 0; + sigstack.ss_size = STACK_SIZE; + + memset(&siginfo, 0, sizeof(siginfo)); + + /* Use clone to avoid newer glibcs using rseq on new threads */ + long ret = syscall_raw(SYS_clone, + CLONE_VM | CLONE_FS | CLONE_FILES | + CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM | + CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | + CLONE_DETACHED, + (long) ((char *)(stack) + STACK_SIZE), + (long) &parent_pid, + (long) &child_pid, 0, 0); + + if (ret < 0) { + errno = -ret; + perror("clone"); + } else if (ret == 0) { + thread_segv_maperr_ptr(&sigstack); + syscall_raw(SYS_exit, 0, 0, 0, 0, 0, 0); + } + + pthread_mutex_lock(&mutex); + while (siginfo.si_signo == 0) + pthread_cond_wait(&cond, &mutex); + pthread_mutex_unlock(&mutex); + + ksft_test_result(siginfo.si_signo == SIGSEGV && + siginfo.si_code == SEGV_MAPERR && + siginfo.si_addr == (void *)1, + "%s\n", __func__); +} + +/* + * Verify that the PKRU value set by the application is correctly + * restored upon return from signal handling. + */ +static void test_pkru_preserved_after_sigusr1(void) +{ + struct sigaction sa; + unsigned long pkru = 0x45454544; + + sa.sa_flags = SA_SIGINFO; + + sa.sa_sigaction = sigusr1_handler; + sigemptyset(&sa.sa_mask); + if (sigaction(SIGUSR1, &sa, NULL) == -1) { + perror("sigaction"); + exit(EXIT_FAILURE); + } + + memset(&siginfo, 0, sizeof(siginfo)); + + __write_pkey_reg(pkru); + + raise(SIGUSR1); + + pthread_mutex_lock(&mutex); + while (siginfo.si_signo == 0) + pthread_cond_wait(&cond, &mutex); + pthread_mutex_unlock(&mutex); + + /* Ensure the pkru value is the same after returning from signal. */ + ksft_test_result(pkru == __read_pkey_reg() && + siginfo.si_signo == SIGUSR1, + "%s\n", __func__); +} + +static noinline void *thread_sigusr2_self(void *ptr) +{ + /* + * A const char array like "Resuming after SIGUSR2" won't be stored on + * the stack and the code could access it via an offset from the program + * counter. This makes sure it's on the function's stack frame. + */ + char str[] = {'R', 'e', 's', 'u', 'm', 'i', 'n', 'g', ' ', + 'a', 'f', 't', 'e', 'r', ' ', + 'S', 'I', 'G', 'U', 'S', 'R', '2', + '.', '.', '.', '\n', '\0'}; + stack_t *stack = ptr; + + /* + * Setup alternate signal stack, which should be pkey_mprotect()ed by + * MPK 0. The thread's stack cannot be used for signals because it is + * not accessible by the default init_pkru value of 0x55555554. + */ + syscall(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0); + + /* Disable MPK 0. Only MPK 2 is enabled. */ + __write_pkey_reg(0x55555545); + + raise_sigusr2(); + + /* Do something, to show the thread resumed execution after the signal */ + syscall_raw(SYS_write, 1, (long) str, sizeof(str) - 1, 0, 0, 0); + + /* + * We can't return to test_pkru_sigreturn because it + * will attempt to use a %rbp value which is on the stack + * of the main thread. + */ + syscall_raw(SYS_exit, 0, 0, 0, 0, 0, 0); + return NULL; +} + +/* + * Verify that sigreturn is able to restore altstack even if the thread had + * disabled pkey 0. + */ +static void test_pkru_sigreturn(void) +{ + struct sigaction sa = {0}; + static stack_t sigstack; + void *stack; + int pkey; + int parent_pid = 0; + int child_pid = 0; + + sa.sa_handler = SIG_DFL; + sa.sa_flags = 0; + sigemptyset(&sa.sa_mask); + + /* + * For this testcase, we do not want to handle SIGSEGV. Reset handler + * to default so that the application can crash if it receives SIGSEGV. + */ + if (sigaction(SIGSEGV, &sa, NULL) == -1) { + perror("sigaction"); + exit(EXIT_FAILURE); + } + + sa.sa_flags = SA_SIGINFO | SA_ONSTACK; + sa.sa_sigaction = sigusr2_handler; + sigemptyset(&sa.sa_mask); + + if (sigaction(SIGUSR2, &sa, NULL) == -1) { + perror("sigaction"); + exit(EXIT_FAILURE); + } + + stack = mmap(0, STACK_SIZE, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + + assert(stack != MAP_FAILED); + + /* + * Allow access to MPK 0 and MPK 2. The child thread (to be created + * later in this flow) will have its stack protected by MPK 2, whereas + * the current thread's stack is protected by the default MPK 0. Hence + * both need to be enabled. + */ + __write_pkey_reg(0x55555544); + + /* Protect the stack with MPK 2 */ + pkey = pkey_alloc(0, 0); + pkey_mprotect(stack, STACK_SIZE, PROT_READ | PROT_WRITE, pkey); + + /* Set up alternate signal stack that will use the default MPK */ + sigstack.ss_sp = mmap(0, STACK_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + sigstack.ss_flags = 0; + sigstack.ss_size = STACK_SIZE; + + /* Use clone to avoid newer glibcs using rseq on new threads */ + long ret = syscall_raw(SYS_clone, + CLONE_VM | CLONE_FS | CLONE_FILES | + CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM | + CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | + CLONE_DETACHED, + (long) ((char *)(stack) + STACK_SIZE), + (long) &parent_pid, + (long) &child_pid, 0, 0); + + if (ret < 0) { + errno = -ret; + perror("clone"); + } else if (ret == 0) { + thread_sigusr2_self(&sigstack); + syscall_raw(SYS_exit, 0, 0, 0, 0, 0, 0); + } + + child_pid = ret; + /* Check that thread exited */ + do { + sched_yield(); + ret = syscall_raw(SYS_tkill, child_pid, 0, 0, 0, 0, 0); + } while (ret != -ESRCH && ret != -EINVAL); + + ksft_test_result_pass("%s\n", __func__); +} + +static void (*pkey_tests[])(void) = { + test_sigsegv_handler_with_pkey0_disabled, + test_sigsegv_handler_cannot_access_stack, + test_sigsegv_handler_with_different_pkey_for_stack, + test_pkru_preserved_after_sigusr1, + test_pkru_sigreturn +}; + +int main(int argc, char *argv[]) +{ + int i; + + ksft_print_header(); + ksft_set_plan(ARRAY_SIZE(pkey_tests)); + + for (i = 0; i < ARRAY_SIZE(pkey_tests); i++) + (*pkey_tests[i])(); + + ksft_finished(); + return 0; +} diff --git a/tools/testing/selftests/mm/protection_keys.c b/tools/testing/selftests/mm/protection_keys.c index 0789981b72b9..4990f7ab4cb7 100644 --- a/tools/testing/selftests/mm/protection_keys.c +++ b/tools/testing/selftests/mm/protection_keys.c @@ -954,16 +954,6 @@ void close_test_fds(void) nr_test_fds = 0; } -#define barrier() __asm__ __volatile__("": : :"memory") -__attribute__((noinline)) int read_ptr(int *ptr) -{ - /* - * Keep GCC from optimizing this away somehow - */ - barrier(); - return *ptr; -} - void test_pkey_alloc_free_attach_pkey0(int *ptr, u16 pkey) { int i, err;