Zein
An Overview of x86 Architecture and Binary Interface

amd64, intel64, x86-64, IA-32e, and x64 are all classified as 64-bit x86 architectures; IA-32 is a 32-bit x86 architecture.

A 64-bit OS enables x64 mode (long mode): after the OS loads, the CPU initially runs in traditional protected mode with paging, and PAE is then enabled as part of entering long mode.

  1. Compatibility sub-mode: Allows running 32-bit or 16-bit applications without modifying or recompiling the applications. However, virtual 8086 mode, task switching, and stack parameter copying features are not supported in compatibility mode. Therefore, there are some limitations when running traditional applications.
  2. 64-bit sub-mode: Specifically runs 64-bit applications and supports a 64-bit linear address space. The general registers in the CPU are extended to 64 bits, and 8 additional general registers (R8~R15) are added. Additionally, there are new SIMD extension registers (XMM8~XMM15) to improve processing efficiency. The 64-bit sub-mode also supports a more efficient instruction set, extended memory addressing, and a new interrupt priority control mechanism, providing more powerful computing capabilities.

32-bit OS enables x86-32 mode:

  1. Real mode: Logical addresses directly map to physical addresses, with no paging mechanism; the 20-bit external address bus can only access 1MB of memory space. The BIOS typically initializes hardware and data structures in this mode.
  2. Protected mode: The mechanisms introduced by protected mode are the foundation for modern operating systems (such as Windows, Linux). In this mode, the CPU can achieve memory protection, virtual memory, task switching, and other functions, allowing the OS to manage hardware resources using these features. The system can access a larger physical address space and supports multitasking and memory protection.
  3. Virtual 8086 mode: Allows running 16-bit programs in protected mode, which can run in an environment similar to real mode. The OS uses virtualization technology to run each 16-bit program in an independent virtual environment, allowing it to run without exiting protected mode.
| Operating Mode | Sub-mode | Enabling OS bitness | Application recompilation needed | Default address size | Default operand size | Extended registers | General register width |
| --- | --- | --- | --- | --- | --- | --- | --- |
| x86-32 mode | Virtual 8086 mode | 16-bit | No | 16 | 16 | Not used | 16 |
| x86-32 mode | Real mode | 16-bit or 32-bit | No | 16 | 16 | Not used | 16 |
| x86-32 mode | Protected mode | 16-bit or 32-bit | No | 16 or 32 | 16 or 32 | Not used | 16 or 32 |
| x64 mode (long mode) | 64-bit mode | 64-bit | Yes | 64 | 32 | Used | 64 |
| x64 mode (long mode) | Compatibility mode | 64-bit | No | 16 or 32 | 16 or 32 | Not used | 16 or 32 |

The following description is based on the classic architecture of x86-32 protected mode and x64 compatibility mode, since subsequent architectural improvements build on it; improvements specific to x64 64-bit mode are noted where relevant.

CPU Architecture#

Memory Architecture#

Address Space#

Physical Address Space#

Memory and other hardware devices that the CPU can use are uniformly addressed as physical address space; the CPU accesses it using physical addresses. The MMU converts the virtual addresses passed by the CPU into physical addresses and accesses memory through the external address bus; the bit width of the external address bus determines the size of the physical address space, and the bit width of the external address bus is greater than or equal to the internal address bus bit width.

The Memory Management Unit (MMU) has a base register that stores the physical address to which address 0 of the process's address space is mapped; the MMU then computes:
physical address = virtual address + base. Because the process space starts at zero, the virtual address is effectively an offset within the process space.

Linear Address Space#

image

The size of the linear address space is determined by the bit width of the CPU's internal address bus. The internal address bus connects to the CPU execution unit, and its width usually matches the CPU's bit width. The linear address space is the virtual address space that each program believes it exclusively owns; it is mapped onto part or all of the physical address space. A hardware platform can have multiple linear address spaces.

When the paging mechanism is not enabled, the linear address is the same as the physical address; when the CPU uses the paging mechanism, the linear address must be converted to a physical address to access platform memory or other hardware devices.

Logical Address/Segment Address#

image

Logical addresses are a historical burden of x86's early segmentation mechanism. A logical address is the address used directly by the program; it consists of a 16-bit segment Selector plus a 32-bit or 64-bit offset. In the statement below, the pointer variable p stores the offset part of variable a's logical address; the linker and loader assign the segment Selector corresponding to that offset. The CPU obtains the segment base, limit, permissions, and so on from the segment register or by querying the mapping table (Descriptor table); segment base + offset then yields the virtual address.

int a = 100;
int *p = &a;

Segmentation Mechanism#

Process data is logically divided into code segments, data segments, stack segments, etc.; the program uses a Selector to identify each logical segment and an offset to identify the position of data within the segment.

Each logical segment maps to a memory block described by the virtual address space segment base + segment limit; when the program accesses memory data, it must convert the variable's logical address Selector + offset into a linear address base + offset:
The segment register caches the Selector → Descriptor relationship; the Descriptor contains the segment base address and various attributes, permission flags;

  1. Perform checks on attributes, access permissions, etc.;
  2. Obtain the base address of the virtual address space mapped by the segment from the Descriptor part of the DS data segment register;
  3. Add the variable's Segment offset to the base address of that segment in the linear space to obtain the linear address of that variable.

Segment register cache miss:
During process loading, the segment mapping data structures in the LDT are created and the LDT's address is registered in the GDT. The process's LDT Selector indexes the GDT to obtain the Descriptor describing where the LDT resides, and this Selector → Descriptor pair is cached in the LDTR register. Likewise, the process's CS, DS, and SS are loaded with their Selectors; the CPU uses each Selector's TI field to index the appropriate Descriptor table, fetches the corresponding Descriptor, and caches the Selector → Descriptor pair in the matching segment register.

image

Descriptor#

image

Descriptor describes the base address, length, and various attributes (such as read/write, access permissions, etc.) of the segment:

  1. The Base field describes the base address of the segment.
  2. The Limit field describes the length of the segment.
  3. The DPL field indicates the minimum privilege required to access this segment.

When the CPU obtains the Descriptor corresponding to the segment, it checks the access using various attribute fields in the Descriptor. Once access is confirmed to be legal, the CPU adds the base (i.e., the base address of the virtual address space mapped by the Segment) and the offset of the logical address in the program to obtain the linear address corresponding to the logical address.

Selector & GDT/LDT#

image

Selector:
The program only stores and uses the logical address offset part; the Selector is the visible part of the logical address that the program uses, and its modification and allocation are completed by the linker and loader, used to index the Descriptor table to obtain the corresponding Descriptor of the segment:

  1. Index: Index of the Descriptor table.
  2. TI: Indicates whether to index the GDT or LDT:
    a. TI=0 indicates indexing the Global Descriptor Table;
    b. TI=1 indicates indexing the Local Descriptor Table.
  3. RPL: Requested Privilege Level, the required privilege level. RPL exists in the low 2 bits of the segment selection register, adding a level of check when the program accesses the segment.

GDT/LDT:
There is at least one global Descriptor table in the system that can be accessed by all processes. There can be one or more local Descriptor tables in the system, which can be private to a process or shared by multiple processes; they are just mapping tables of Selector → Descriptor, mostly implemented as linear tables:

  1. The location of the GDT in memory is described by the base address and table limit. It stores the mapping relationships of segments that need to be shared by multiple processes, such as kernel code segment, shared library segment, kernel data segment, etc.
  2. The LDT stores the mapping relationships of private segments that do not need to be shared between processes, such as user code segments, user data segments, heap, stack, etc.; the location of the LDT in memory is described by the Descriptor in the GDT; the number of LDTs in the system corresponds to the number of Descriptors in the GDT.
  3. GDTR and LDTR registers: These two registers help the CPU quickly find the GDT/LDT mapping table:
    a. GDTR: GDT base address + table limit.
    b. LDTR: The location of the LDT in memory is described by the Descriptor in the GDT; the Descriptor will be cached into the LDTR; thus, the structure of the LDTR is the same as that of the segment register (Selector + Descriptor).

The process of indexing the GDT/LDT through the Selector is shown in the diagram below:
The TI field of the Selector indicates whether to index the GDT or LDT; by using the LGDT/SGDT instructions to read and write the GDTR to obtain the mapping table address; during process switching, the value of the LDTR will be replaced with the corresponding LDTR of the new process.

image

Segment Register#

Indexing the GDT/LDT through the Selector on every access to obtain the segment's base in the virtual address space would be costly; the fix is the familiar space-for-time optimization. x86 provides 6 segment registers, each consisting of a visible 16-bit Selector followed by a hidden Descriptor register that caches the Descriptor of the mapped segment:

  1. CS (Code-Segment): Stores the Selector → Descriptor of the code segment.
  2. DS (Data-Segment): Stores the Selector → Descriptor of the data segment.
  3. SS (Stack-Segment): Stores the Selector → Descriptor of the stack.
  4. ES, FS, GS: Can store the Selector → Descriptor of three additional data segments, used freely by the program.

When a Segment register loads a new Selector, the CPU automatically caches the Descriptor indexed by that Selector into an invisible Descriptor register, meaning that the CPU only indexes the Descriptor table when updating the segment register (for example, when switching bound threads, the segment register will be updated).

image

Paging Mechanism#

The virtual address space is divided into multiple pages, and the physical address space is similarly divided, reducing memory fragmentation;
The paging mechanism allows infrequently used pages to be moved to the disk's swap space (such as Linux's Swap partition, Windows's virtual memory file) when physical memory is tight, which can be considered a fundamental aspect of memory virtualization.

The virtual address space of a 32-bit internal address bus is $2^{32}B=4\cdot2^{10}\cdot2^{10}\cdot2^{10}B=4GB$ (a 64-bit internal address bus typically uses 48 address bits, for 256TB of addressable memory).
The x86 architecture allows page sizes larger than 4KB (such as 2MB or 4MB); however, the typical x86-32 page size is $4KB=4\cdot2^{10}B$, so 4GB of memory divides into 1024×1024 pages.
It follows that the total number of page table entries is proportional to (virtual address space size) × (number of processes).

CPU data access flow:
When the CPU accesses data, it must convert the process's virtual address (virtual page VFN + offset) into the actual physical address (physical page PFN + offset);

  1. TLB lookup: The CPU first looks up the physical address of the virtual address in the TLB. If the TLB hits, it directly uses this physical address for access.
  2. Page Table lookup: If TLB Miss, the CPU looks up the corresponding page table entry based on the page table index in the virtual address. The page table entry stores the physical page base address corresponding to that virtual page.
  3. If there is no valid mapping in the page table entry (such as the page not being in physical memory), a Page Fault interrupt is triggered. The OS loads the page from the disk's swap space into physical memory and updates the corresponding page table entry PTE, setting the P bit to 1 and making corresponding settings to other fields; finally, it returns from the page fault handler; the CPU re-looks up the page table and inserts the corresponding mapping into the TLB.
  4. PFN + offset, to obtain the corresponding physical address.

image

TLB#

The Translation Lookaside Buffer caches recently used page mappings (VFN→PFN). When the CPU accesses a certain linear address, if the mapping of the page it is in exists in the TLB, there is no need to check the page table, and the PFN corresponding to that linear address can be obtained. The CPU then adds the PFN to the offset of the linear address to obtain the physical address. If the TLB misses, the Page Table is queried and the TLB is updated.

Page Table#

image

The Page Table data structure stores the mapping from Virtual Frame Number to Physical Frame Number; considering the TLB hit rate and the memory occupied by each page table, the Page Table is usually implemented as a multi-level page table.

In x86-32 protected mode, the Page Table is implemented as a two-level page table; thus the TLB does not cache $2^{20}$ page mappings from one giant page table at a time; instead, it caches entries from one of $2^{10}$ page tables, each holding $2^{10}$ page mappings. CR3 points to a page directory with $2^{10}$ entries (each entry 4B, so the page directory is 4KB); each page directory entry points to one 4KB page table, for 1024 page tables in total. The 4KB paging structure is shown in the diagram below.

image

A 4KB page requires only 12 bits for the page offset; a second-level page table holds $2^{10}$ PTEs, so 10 bits index a PTE within one page table; the remaining 10 bits index the $2^{10}$ second-level Page Tables.
Thus the high 20 bits of the virtual address exactly encode the index into the two-level page table structure;

  1. Page Table Entry: Stores the VFN → PFN mapping. The middle 10 bits of the virtual address index one of the $2^{10}$ PTEs stored in a second-level Page Table. With the VFN → PFN mapping from the PTE, the physical address PFN + offset corresponding to the linear address VFN + offset is determined.

  2. Page Directory Entry: The high 10 bits of the virtual address index the set of second-level Page Tables; each PDE maps one such 10-bit index to the base address of a second-level Page Table. All PDEs are stored in the first-level Page Table (the page directory); each PDE is 4B, and the 1024 PDEs occupy one 4KB physical page.

Page Fault#

Both PDE and PTE contain a P (Present) field:

  1. P=1, the physical page is in physical memory, and after the CPU completes the address conversion, it can directly access that page.
  2. P=0, the physical page is not in physical memory. When the CPU accesses that page, a page fault interrupt occurs, and it jumps to the page fault handler. The OS usually loads the page stored in the swap space into physical memory, allowing access to continue. Due to the locality of reference of programs, the OS will load pages near that page into physical memory for easier access by the CPU.

Physical Address Extension (PAE)#

After enabling PAE, each level of page table remains 4KB in size, but the page mapping structure becomes a three-level page table, and each page table entry widens from 32 to 64 bits to hold additional address bits. Two bits of the linear address index the first-level table, 9 bits index the second level, and 9 bits index the third level; the second- and third-level tables therefore hold only $2^{9}=512$ mappings each, half as many as in the two-level scheme. CR3 points to the base address of the first-level table, which contains just 4 entries.

image

PDBR/CR3 Register#

The Page-Directory Base Register or CR3 Register stores the physical address of the top-level page table. A process must store its top-level page table base address into CR3 before it runs, and the top-level page table base address must be aligned to a 4KB page boundary.

64-bit Sub-mode Memory Management Improvements#

Address Space:
Uses 64-bit linear addresses, but the usable virtual address space is effectively limited to 48 bits: bits 48-63 must all equal bit 47 (so-called canonical addresses), allowing programs to access $2^{48}B=256TB$ of virtual address space; the physical address space is effectively limited to 52 bits.

The segmentation mechanism is not disabled but its effect is weakened:
The CPU no longer uses segment bases for address translation; the bases of segment registers such as CS, DS, ES, and SS are set to 0. The FS and GS segment registers still store segment bases and can be used for certain specific operations, such as local data addressing and OS internal data structure management (such as thread-local storage). Additionally, segment length checks are disabled, meaning the CPU will not check segment sizes.

Paging Mechanism Optimization:
Memory page sizes can be 4KB, 2MB, or 1GB. PAE must be enabled; once enabled, the OS uses a four-level page table—adding a fourth-level page mapping table called the Page-Map Level 4 Table, abbreviated as PML4 Table. This helps convert 48-bit linear addresses to 52-bit physical addresses. The physical base address of the PML4 is stored in CR3;
PML4 entries: contain a physical base address of PML3, access permissions, and memory management information.
PML3 entries: contain a physical base address of PML2, access permissions, and memory management information.
PML2 entries: contain a physical base address of PML1, access permissions, and memory management information.
PML1 entries: contain VFN→PFN, access permissions, and memory management information.

image

Interrupt & Exception Architecture#

| Sub-category | Cause | Asynchronous/Synchronous | Return Behavior | Example |
| --- | --- | --- | --- | --- |
| Interrupt | Signal from I/O devices | Asynchronous | Always returns to the next instruction | Requests from external devices, such as a key press |
| Fault | Potentially recoverable error | Synchronous | May return to the current instruction | Not necessarily fatal errors, such as page faults |
| Trap | Intentional exception | Synchronous | Always returns to the next instruction | Requesting a system call, such as a file read |
| Abort (Termination) | Unrecoverable error | Synchronous | Does not return | Fatal errors, such as parity errors |

Interrupt Architecture#

Some exceptions and interrupts break the sequential execution of the instruction stream and enter a completely different execution path. Modern computer architectures are driven by a large number of interrupt events. The interrupt mechanism allows external hardware devices to interrupt the CPU's current task so that the CPU can service the device; interrupts are forwarded from devices to the CPU via the "interrupt controller" (except for MSI).

PIC (Programmable Interrupt Controller)#

The Programmable Interrupt Controller (PIC) is the earliest widely used interrupt controller, which can only work on UP (single-processor) platforms and cannot be used on MP (multi-processor) platforms. It has 8 interrupt pins (IR0~IR7) connected to external devices; when an external device needs CPU processing, it sends a signal through the corresponding interrupt line to trigger an interrupt.

Main Registers:
1) IRR (Interrupt Request Register): Records the currently requested interrupts. If an external device issues an IR interrupt request and that request is not masked, the corresponding position in the IRR is set to 1.
2) ISR (In Service Register): Records the currently processed interrupts. When the CPU responds and begins processing the interrupt, the corresponding interrupt position in the ISR is set to 1.
3) IMR (Interrupt Mask Register): Masks specific interrupt lines. If the corresponding interrupt position in the IMR is set to 1, the interrupt request from the corresponding IR pin will be masked, and the CPU will not respond to that interrupt.

Interrupt Handling Process:

  1. An external device issues an interrupt signal; if that interrupt is not masked, the corresponding bit in the IRR is set to 1.
  2. The PIC notifies the CPU that an interrupt has occurred via the INT pin.
  3. The CPU responds to the PIC via the INTA (interrupt acknowledge) pin, indicating that it has received the interrupt request.
  4. After receiving the CPU's response, the PIC clears the highest priority interrupt request in the IRR and sets the corresponding bit in the ISR to 1, indicating that the interrupt is being processed.
  5. The CPU then issues a second pulse via INTA, and the PIC provides the corresponding interrupt vector based on priority and sends it to the CPU's data bus.
  6. After the CPU finishes processing the interrupt, it informs the PIC that the interrupt handling is complete by writing to the EOI (End of Interrupt) register, and the PIC clears the corresponding bit in the ISR.

APIC (Advanced Programmable Interrupt Controller)#

image

In multi-processor (MP) systems, the APIC system allows multiple CPUs to coordinate work and share interrupt handling tasks. Each CPU's LAPIC can communicate and collaborate through the IPI mechanism. This interrupt control mechanism not only supports efficient interrupt handling but also optimizes interrupt balancing in multi-core processor systems.

Components:
1) LAPIC (Local APIC): Each CPU core is equipped with a LAPIC, responsible for handling local interrupt signals. It stores the status of current interrupt requests in the IRR (Interrupt Request Register) and the flags of processed interrupts in the ISR (Interrupt Service Register). The CPU informs the LAPIC that the interrupt has been processed by writing to the EOI (End of Interrupt) register, allowing other interrupts to be processed.
2) IOAPIC (I/O APIC): Typically located in the southbridge chip (or low-speed I/O controller), it is responsible for receiving external device interrupt requests and forwarding them to the LAPIC of a specific CPU. The IOAPIC usually has multiple interrupt input pins (typically 24), each of which can connect to external devices.

Inter-Processor Interrupt (IPI): IPI (Inter-Processor Interrupt) allows one CPU to send interrupt signals to other CPUs. This is crucial for operations such as process migration, load balancing, and TLB flushing in multi-core systems. The system can specify the target CPU and send an IPI interrupt through the LAPIC's ICR (Interrupt Command Register).

Interrupt Delivery Process:

  1. When an external device triggers an interrupt, the IOAPIC receives the interrupt request signal. The IOAPIC uses the PRT (Programmable Redirection Table) to look up routing information for the interrupt request.
  2. Based on the configuration of the PRT, the IOAPIC formats the interrupt information into RTE (Redirection Table Entry) and passes it to the system bus.
  3. The system bus sends the interrupt message to the LAPIC of the target CPU.
  4. After receiving the interrupt message, the LAPIC records it in the IRR register and decides whether to process it. If the interrupt meets the processing conditions, the LAPIC sets the corresponding bit in the ISR register and ultimately delivers the interrupt to the processor.

Exception Architecture#

Interrupts are generated by external devices and are unrelated to the instructions currently executed by the CPU. Exceptions are generated internally by the CPU, caused by issues with the instructions currently being executed by the CPU.

The causes of exceptions are classified by severity:

  1. Fault: Caused by some error condition, generally correctable by an error handler. When an error occurs, the processor hands over control to the corresponding handler. The previously mentioned page fault falls into this category.
  2. Trap: An exception caused by executing a special instruction. Traps are intentional exceptions, and their most important use is to provide a procedure-like interface between user programs and the kernel (i.e., system calls). The INT 0x80 instruction used by Linux to implement system calls belongs to this category.
  3. Abort: Refers to serious unrecoverable errors that will cause the program to terminate. Typical examples include some hardware errors.

IDT#

The CPU can look up the linear address of the entry point of the handler corresponding to the interrupt or exception vector number through the Interrupt Descriptor Table (IDT);
The IDT data structure stores the mapping relationship from vector to Descriptor; the Descriptor contains the Selector of the segment where the handling routine is located and the offset within the segment, as well as various attribute flags;
The base address of the IDT table is stored in the IDTR register:
1) Base: IDT base address.
2) Limit: The byte size of the IDT minus one (each gate Descriptor is 8 bytes).

Interrupt Gate & Trap Gate#

The Interrupt Gate is a Descriptor in the IDT; it describes which interrupt service routine (ISR) to jump to when a specific interrupt or exception occurs. Its format is similar to that of a segment Descriptor (but with some special attributes); to distinguish it, it is called a System Descriptor, and its S bit (Segment type) is used to differentiate between segment Descriptors and System Descriptors.

  1. Offset: Points to the address of the interrupt service routine (ISR); it is split into two fields holding the high 16 bits and the low 16 bits.
  2. Selector: Indicates the code segment where the ISR is located.
  3. Type: Interrupt Gate type, determining how the Interrupt Gate responds to interrupts.
  4. DPL (Descriptor Privilege Level): Indicates the privilege level of the Interrupt Gate; the DPL of interrupt gates and trap gates is only checked when an interrupt or exception is triggered by the INT n instruction; hardware-generated interrupts or exceptions are not checked.
  5. S bit (Segment Descriptor): Controls whether the Descriptor is a segment Descriptor or a System Descriptor.
  6. P (present): Indicates whether the interrupt gate is valid; set to 0 indicates invalid.

The Trap Gate is similar to the Interrupt Gate (the format is exactly the same), but it is used for exception handling. The only difference between the interrupt gate and the trap gate is that when the program jumps through the interrupt gate, the IF bit of the EFLAGS register is automatically cleared, disabling interrupts. The trap gate does not have this effect.

I/O Architecture#

The tasks that a computer processes can actually be divided into two types: CPU computation and I/O operations. I/O is the method by which the CPU accesses external devices. Devices typically present their functions to the CPU through registers and device RAM, and the CPU accesses the device through reading/writing device registers and RAM.

In modern computer architecture, Port I/O is gradually being phased out, especially in most x86 systems, where MMIO has become the mainstream I/O access method, with almost all external devices (such as graphics cards, network cards, storage controllers, etc.) communicating via MMIO. Although MMIO provides better performance, for some simple I/O devices (such as low-speed serial ports), Port I/O may still have certain advantages, so it is still introduced.

Port I/O#

x86 allocates 65536 8-bit I/O ports for Port I/O, with the Port I/O address space ranging from 0x0000 to 0xFFFF, totaling 64KB. The I/O port address space is independent and is not part of the linear address space or physical address space. The CPU accesses device registers through the I/O port addresses.

Access Method: The CPU executes the IN instruction to read data from a specified I/O port into a register, and the OUT instruction to write data from a register to a specified I/O port. For example, IN AL, 0x60 reads a byte from the I/O port at address 0x60 into the AL register.

The CPU distinguishes I/O operations from memory operations through a dedicated bus signal (such as the M/IO# signal). Two or four consecutive 8-bit I/O ports can form a 16-bit or 32-bit I/O port;
Limitations: One of the biggest drawbacks of port I/O is its relatively slow speed, as each I/O port must be accessed through a separate path, and different devices may require interrupts or polling to handle. This has led to the gradual replacement of port I/O in modern computer systems by MMIO, especially in scenarios requiring efficient data exchange.

MMIO#

  • Memory-mapped I/O: Device registers or memory areas are mapped to physical address space. They can be accessed through memory access instructions (such as MOV) without the need for special I/O instructions.

Since modern CPUs and systems support efficient memory access, MMIO can provide faster access speeds and greater bandwidth than port I/O. Additionally, because it uses standard memory access instructions, it simplifies programming and hardware design.

Access Method: If a device's register is mapped to the physical memory address 0xA0000, then executing MOV AX, [0xA0000] will read the contents of that device register.

Since the state of I/O registers reflects the status of external devices, the MMIO address region is usually excluded from the CPU's caches, as caching could return stale device state; its pages are mapped as uncacheable. This means MMIO operations can incur higher latency, especially for frequently accessed I/O devices.

DMA#

Direct Memory Access (DMA) allows devices to bypass the CPU to directly copy or read data into memory. If data copying from devices to memory goes through the CPU, the CPU would have a large interrupt load, and during the interrupt process, the CPU cannot be used for other tasks, which is detrimental to system performance. With DMA, the CPU is only responsible for initialization, while the transfer action is performed by the DMA controller (DMAC).

DMA Transfer Process:
In implementing DMA transfer, the DMAC directly controls the bus. Before DMA transfer, the CPU must hand over bus control to the DMAC, and after the DMA transfer is completed, the DMAC immediately returns bus control to the CPU.

  1. DMA request: The CPU initializes the DMAC and issues an operation command to the I/O port, which requests DMA.
  2. DMA response: The DMAC prioritizes and masks the DMA request, then requests bus arbitration logic for bus access. The CPU releases bus control after completing the current bus cycle. At this point, the bus arbitration logic sends a bus acknowledgment, indicating that DMA has been granted, and notifies the I/O port to start DMA transfer through the DMAC.
  3. DMA transfer: Once the DMAC obtains bus control, the CPU can suspend or only perform internal operations, with the DMAC issuing read/write commands to directly control RAM and the I/O port for DMA transfer.
  4. DMA completion: After completing the specified batch data transfer, the DMAC releases bus control and sends a completion signal to the I/O port. When the I/O port receives the completion signal, it stops the I/O device's operation and requests an interrupt from the CPU, allowing the CPU to execute a piece of code to check the correctness of this DMA transfer operation and exit from the non-intervention state.

DMA does not require the CPU to directly control the transfer, nor does it have the process of saving and restoring context like interrupt handling; it opens a direct data transfer path between RAM and I/O devices through hardware (DMAC), greatly improving CPU efficiency. It is important to note that the memory accessed by DMA operations must be contiguous physical memory.


Clock#

Many events in the OS are driven by the clock, such as process scheduling and timers.

  1. Periodic Timer: The most common kind; the clock generates interrupts at a fixed frequency. A periodic clock usually has a counter that either counts down to 0 to raise an interrupt, as in the PIT (Programmable Interval Timer), or counts up steadily and raises an interrupt when it reaches a threshold, after which the threshold is automatically increased by a fixed step so the counter can keep running, as in the HPET (High Precision Event Timer).

  2. One-shot Timer: Most clocks can be configured in this way, such as PIT and HPET. Its operation is similar to that of a periodic clock that generates an interrupt upon reaching a threshold, except that after generating the interrupt, the threshold does not automatically increase but requires software, such as the clock interrupt handler, to increase the threshold. This provides software with the ability to dynamically adjust the timing of the next clock interrupt.

x86 provides various clocks, including PIT, RTC (Real Time Clock), TSC (Time Stamp Counter), LAPIC Timer, and HPET, etc. The OS can use one or more of these clocks as needed, but using multiple clocks simultaneously will lead to excessive clock interrupts, affecting system performance. When high-precision clocks are available, modern operating systems often disable low-precision clocks and simulate low-precision clocks using high-precision clocks as needed.

x64 Register#

General Register#


| Compatible x86-32 General Register | Used to save temporary variables, stack pointers during function calls, etc. |
| --- | --- |
| eax | Arithmetic accumulator, usually used for addition; the return value of a function call is generally placed here. |
| ebx | Data storage. |
| ecx | Usually used as a counter, such as in for loops. |
| edx | I/O pointer. |
| esi | Stores the source address in string operations; often used together with edi for string copy operations. |
| edi | Stores the destination address in string operations. |
| esp | Stack pointer register: points to the top of the stack. |
| ebp | Base pointer register: points to the base of the current stack frame; local variables on the stack are usually addressed as ebp + offset. |
| r8d~r15d | The low 32 bits of the 8 registers added in x64. |

In the x64 architecture, the above general registers are expanded to 64-bit versions, and their names have been upgraded. To maintain compatibility with 32-bit mode programs, the above names can still be accessed, which corresponds to accessing the low 32 bits of the 64-bit registers.

| 64-bit General Register | Function Description |
| --- | --- |
| rax | Usually stores the return value of function calls. |
| rsp | Stack pointer, points to the top of the stack. |
| rdi | First argument. |
| rsi | Second argument. |
| rdx | Third argument. |
| rcx | Fourth argument. |
| r8 | Fifth argument. |
| r9 | Sixth argument. |
| rbx | Data storage, following the callee-save convention. |
| rbp | Data storage, following the callee-save convention. |
| r12~r15 | Data storage, following the callee-save convention. |
| r10~r11 | Data storage, following the caller-save convention. |

Register parameter passing is fast and reduces the number of reads and writes to memory. Whether a given parameter travels on the stack or in a register is decided by the calling convention when the compiler generates CPU instructions.

In the 32-bit era, there were fewer general registers, and parameters were mostly passed through the thread's stack during function calls (there were also cases where registers were used for passing, such as the C++ this pointer using the ecx register for passing);

In the x64 era, register resources have become abundant, and the vast majority of parameter passing is done using registers.

Status Register/RFLAGS#

Usually translated as the flag register; it contains numerous flag bits that record the CPU's state during instruction execution, most of which are set and modified automatically by the CPU. In the x64 architecture, the original eflags register was widened to the 64-bit rflags, but the upper 32 bits add no new functionality and are reserved for future use.


| Status Register Bit | Function Description |
| --- | --- |
| CF | Carry flag. |
| PF | Parity flag. |
| ZF | Zero flag. |
| SF | Sign flag. |
| OF | Two's complement overflow flag. |
| TF | Trace flag. |
| IF | Interrupt flag. |

Instruction Pointer/RIP#

Instruction pointer register; eip stores the address of the next instruction to be executed. The CPU continuously fetches the instruction eip points to, executes it, and advances eip to the following instruction, repeating this cycle. In the x64 architecture, the 32-bit eip was widened to the 64-bit rip register.
In vulnerability attacks, hackers go to great lengths to overwrite the stored next-instruction address so that execution is redirected to their malicious code.

Isn't this the PC pointer?

Memory Management Registers#

Descriptor Table Registers#

| Descriptor Table Register | Contents |
| --- | --- |
| GDTR | GDT address. |
| LDTR | LDT address. |

Segment Registers#

| Segment Register | Stores a logical segment selector that maps to a descriptor |
| --- | --- |
| cs | Code segment. |
| ds | Data segment. |
| ss | Stack segment. |
| es | Extra segment. |
| fs | Data segment. |
| gs | Data segment. |

Control Registers#

Used to manage and check the state of the CPU; these registers determine the mode and characteristics of CPU operation. The x86-64 64-bit mode introduces CR8, defined as the Task Priority Register (TPR), which allows the operating system to control whether to allow external interrupts to interrupt the processor based on the priority level of interrupts.

| Name | Purpose |
| --- | --- |
| CR0 | Basic CPU operation flags. |
| CR1 | Reserved. |
| CR2 | Page-fault linear address. |
| CR3 | Physical base address of the paging structures (page-directory base). |
| CR4 | Protected mode operation flags. |
| CR5~CR7 | Reserved. |
| CR8 | Task priority register (TPR). |
| CR9~CR15 | Reserved. |

Debug Registers/DR#

A set of registers that support software debugging. Debugging a program hinges on being able to suspend and resume it at chosen points — the breakpoints we set.
Interpreted languages (PHP, Python, JavaScript) and virtual-machine languages (Java) can implement this easily, since their execution is already under the control of the interpreter or VM. For languages like C and C++, whose code is compiled into native CPU machine instructions, debugging support from the CPU itself is required.

Task Register#

Model Specific Register/MSR#

Machine Check Register#

Performance Monitoring Register#

Memory Type Range Register (MTRR)#

Streaming SIMD Extensions (SSE) Register#
