System call optimization with the SYSENTER instruction
My previous article, "How do Windows NT system calls
REALLY work?", explains how Windows NT calls system services
by using an 'int 2e' software interrupt. Newer platforms
such as Windows XP and 2003 normally use another
method to call system services. As explained in my
previous article, the 'int 2e' instruction uses both an interrupt gate
and a code segment descriptor to find the interrupt
service routine (KiSystemService) which services the 'int 2e'
software interrupt. Since the CPU has to
load one interrupt gate and one segment descriptor from
memory in order to know which interrupt service routine
to call, significant overhead is involved in making an
'int 2e' system call. The SYSENTER instruction
drastically reduces this overhead.
By John Gulbrandsen
John.Gulbrandsen@SummitSoftConsulting.com
Why is SYSENTER faster?
As explained in my previous article, the interrupt gate
(entry 2e in the Interrupt Descriptor Table) identifies the entry in the Global
Descriptor Table which in turn identifies the code segment that contains the
KiSystemService function. Loading the 8-byte interrupt gate and segment
descriptor from memory is sped up by keeping them cached in
the processor's on-chip (level 1) or off-chip (level 2) cache. The CPU is very
likely to find them cached since each and every Windows NT
system call uses the same interrupt gate and code segment descriptor when
making a system call via the 'int 2e' software interrupt. However, the CPU must
still perform memory read cycles to read from the cache, make access privilege
checks etc. every time it switches the privilege level via the 'int 2e'
software interrupt. After having analyzed the whole sequence of events involved
in switching to kernel-mode it is clear that it would be much faster if the CPU
could be hard coded to always switch to the same location in a kernel-mode
segment when a system call is issued. With the destination hard coded,
no memory reads would be necessary to find out where the system call
should end up, which would speed up system calls significantly. This is exactly
what is being done by the Intel SYSENTER and the AMD SYSCALL instructions which
are present in the Pentium II, AMD K7 and newer CPUs. These instructions are
collectively referred to as "Fast System Call" instructions.
SYSENTER or SYSCALL?
Why are there two different instructions to make a fast system
call? Most likely Intel and AMD simultaneously and independently developed
their versions of the Fast System Call instructions. The two serve the same
purpose but they use somewhat different configuration registers in the CPU to
set up the destination segment and the offset within the destination segment
where the system call function resides. Because they are so similar I will
mainly describe the SYSENTER version below and point out differences where they
matter.
How does a system call via the SYSENTER instruction work?
As explained above, the SYSENTER call uses hard-coded code
segment descriptors to describe the target code segment. Instead of setting up
the CPU according to a specification in memory described by a code segment
descriptor (segment base, segment size, segment privilege level etc.) the CPU
always sets the target segment's base to 0, its size to 4GB and its privilege
level to 0 (kernel-mode). What is NOT hard-coded is the exact target location
within the target segment, i.e. the address of the function being called in the
kernel-mode code segment. This function is called 'KiFastCallEntry' in Windows
XP and newer platforms.
So if the address of the KiFastCallEntry function is
not hard-coded, how does the CPU know where to jump after switching to the
target code segment? The answer is that the CPU uses "Model Specific
Registers" (MSRs). MSRs are configuration registers that are used only by the
operating system; application programs never use them. The contents of the MSRs
define how the CPU will behave. The RDMSR (Read MSR) and WRMSR (Write MSR)
instructions are used to read and modify the MSRs. The CPU uses an MSR called
SYSENTER_EIP_MSR to know where to jump when the SYSENTER instruction
is executed. In other words, the SYSENTER_EIP_MSR register contains the address
of the KiFastCallEntry function. This MSR must be set up by the operating
system very early in the boot process in order for system calls via the
SYSENTER instruction to work.
As explained in my previous article, the
operating system switches to the kernel-mode stack when an operating system
call is made. This behavior must be the same when making a SYSENTER call or
else the stability of the system will be compromised (the whole point of
switching to a kernel-mode stack is to ensure that the integrity of the stack
used in kernel-mode can be trusted). So how does the CPU switch to the
kernel-mode stack? Again, it uses Model Specific Registers. Like the Code
Segment, the Stack Segment is loaded with hard-coded values when the CPU
executes a SYSENTER instruction. It is loaded with exactly the same values that
a system call via an 'int 2e' instruction would result in, i.e. a flat model
where the base is 0 and the size is 4GB. Like the EIP, the ESP is not
hard-coded. Its value is taken from the SYSENTER_ESP_MSR which is also set up
by the operating system at boot time.
The mechanics of SYSENTER
All Model Specific Registers are 64-bit registers. They are
loaded from EDX:EAX using the WRMSR instruction. The MSR index in the ECX
register tells the WRMSR instruction which MSR to load. The RDMSR instruction
works the same way but it stores the current value of an MSR into EDX:EAX. The
programming manual for the CPU in question specifies what index to use for any given
MSR. The table below lists the MSRs used by the SYSENTER/SYSEXIT instructions.
Model Specific Register name | Index | Usage
SYSENTER_CS_MSR              | 174h  | CS Selector of the target segment
SYSENTER_ESP_MSR             | 175h  | Target ESP
SYSENTER_EIP_MSR             | 176h  | Target EIP
Table 1. The Model Specific Registers used by the SYSENTER instruction.
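The EDX:EAX convention described above can be illustrated with a small portable sketch. It does not execute WRMSR or RDMSR (those are privileged instructions); it only models how a 64-bit MSR value is divided between the two 32-bit registers. The helper names msr_split and msr_join are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* MSR indices from table 1 (the value that would be placed in ECX). */
enum {
    SYSENTER_CS_MSR  = 0x174,
    SYSENTER_ESP_MSR = 0x175,
    SYSENTER_EIP_MSR = 0x176
};

/* Hypothetical helper type: the two halves WRMSR takes from EDX:EAX. */
typedef struct {
    uint32_t edx; /* high 32 bits of the MSR value */
    uint32_t eax; /* low 32 bits of the MSR value  */
} msr_halves;

/* Split a 64-bit MSR value the way WRMSR expects to receive it. */
msr_halves msr_split(uint64_t value)
{
    msr_halves h;
    h.edx = (uint32_t)(value >> 32);
    h.eax = (uint32_t)(value & 0xFFFFFFFFu);
    return h;
}

/* Rebuild the 64-bit value from the halves RDMSR returns in EDX:EAX. */
uint64_t msr_join(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}
```

For instance, a debugger line such as "msr[176] = 00000000:804fa1e0" (seen later in this article) corresponds to msr_join(0x00000000, 0x804fa1e0).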
Note that SYSENTER_CS_MSR contains the Code Segment Selector
of the target code segment (the segment that contains the KiFastCallEntry
function). This value is loaded into the visible part of the CS register but
the SYSENTER and SYSEXIT instructions never use it to look up a descriptor! Remember that
all information related to the target code segment is hard-coded by the
SYSENTER instruction and that therefore the Segment Selector loaded into CS is
not used to find the target code segment in the GDT as in the case of the
'int 2e' method of making system calls. In order to keep consistency between
the value in the CS Segment Register and the Descriptor it points to, the
operating system must however set up a real Code Segment Descriptor in the GDT. In
fact, the operating system must set up four Segment Descriptors in the Global
Descriptor Table in order to keep consistency between the Segment Registers and
the content in the GDT. Intel specifies that these GDT descriptors must reside
contiguously in the GDT. Figure 1 below illustrates this.
As figure 1 shows, the operating system sets up four segment
descriptors in the GDT. The "CS Enter Descriptor" at index 1 in the GDT
describes the kernel-mode code segment that contains the KiFastCallEntry
routine. The "SS Enter Descriptor" describes the kernel-mode stack segment that
will be switched to when calling into kernel-mode via a SYSENTER instruction.
The "CS Exit Descriptor" and "SS Exit Descriptor" are used when switching back
from kernel-mode to user-mode via the SYSEXIT instruction. The details involved
in switching back into user-mode will be covered in detail later in this
article.
To summarize, the steps taken when executing the SYSENTER
instruction are:
1) The CPU loads the Segment Selector in the SYSENTER_CS_MSR into the visible part of the CS register.
2) The hidden part of the CS register is loaded with hard-coded values as previously described.
3) The SS register is loaded with a segment selector that points to the entry in the GDT after the CS
Enter Descriptor, i.e. to the SS Enter Descriptor. Since the SYSENTER_CS_MSR
(and also the CS register) contains the binary value 00001000 or hexadecimal
0x08, the SS will be loaded with a binary value of 00010000 or hexadecimal
0x10. The Intel Programmer's manual simply says that "the SS register is set to
the sum of 8 plus the value in SYSENTER_CS_MSR" which results in a segment
selector with an index one higher than the segment selector in SYSENTER_CS_MSR.
4) The hidden part of the SS register is loaded with hard-coded values as previously described.
5) The ESP register is loaded from the SYSENTER_ESP_MSR, the EIP register is loaded from the
SYSENTER_EIP_MSR and the CPU starts executing code in kernel-mode (KiFastCallEntry).
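The steps above can be sketched as a small software model. This is not how the CPU is implemented; it is only an illustration of the register updates, and the type and function names (seg_reg, cpu_model, model_sysenter) are hypothetical. The masking of the two low-order selector bits follows Intel's instruction reference and is not discussed in this article.

```c
#include <stdint.h>

/* A segment register: the visible selector plus the hidden cached fields. */
typedef struct {
    uint16_t selector; /* visible part                         */
    uint32_t base;     /* hidden part: segment base            */
    uint32_t limit;    /* hidden part: highest valid offset    */
    uint8_t  dpl;      /* hidden part: privilege level         */
} seg_reg;

/* Only the state touched by SYSENTER is modeled. */
typedef struct {
    uint32_t sysenter_cs_msr;
    uint32_t sysenter_esp_msr;
    uint32_t sysenter_eip_msr;
    seg_reg  cs;
    seg_reg  ss;
    uint32_t eip;
    uint32_t esp;
} cpu_model;

void model_sysenter(cpu_model *cpu)
{
    /* 1) Visible CS comes from SYSENTER_CS_MSR (low two bits cleared). */
    cpu->cs.selector = (uint16_t)(cpu->sysenter_cs_msr & 0xFFFCu);

    /* 2) Hidden CS is hard-coded: base 0, 4GB, privilege level 0. */
    cpu->cs.base  = 0;
    cpu->cs.limit = 0xFFFFFFFFu;
    cpu->cs.dpl   = 0;

    /* 3) Visible SS is the CS selector plus 8, i.e. the next GDT entry. */
    cpu->ss.selector = (uint16_t)(cpu->cs.selector + 8);

    /* 4) Hidden SS is hard-coded the same way. */
    cpu->ss.base  = 0;
    cpu->ss.limit = 0xFFFFFFFFu;
    cpu->ss.dpl   = 0;

    /* 5) ESP and EIP come from their MSRs; execution continues there. */
    cpu->esp = cpu->sysenter_esp_msr;
    cpu->eip = cpu->sysenter_eip_msr;
}
```

With SYSENTER_CS_MSR set to 0x08, the model ends up with CS = 0x08 and SS = 0x10, matching the values worked out in step 3.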
The mechanics of SYSEXIT
The SYSEXIT instruction works very similarly to the SYSENTER
instruction with the main difference that the hidden part of the CS Register is
now set to a privilege level of 3 (user-mode) instead of 0 (kernel-mode). As shown in
figure 1 above, the GDT contains the CS Exit Descriptor and SS Exit Descriptor
at index 3 and 4. As in the case of the SYSENTER instruction, the CS and SS
Exit Descriptors are not used at all by the SYSEXIT instruction. These
descriptors are only there to create consistency between the selectors loaded
into the CS and SS registers and the corresponding CS and SS Exit Descriptors
when returning to user-mode. The selectors loaded into the CS and SS Registers
by the SYSEXIT instruction correctly point to the unused Exit CS and SS
Descriptors in the GDT. These selectors are:
Selector (binary and hexadecimal) | Usage
00011000b = 18h                   | Points to the CS Exit Descriptor (Index 3 in GDT)
00100000b = 20h                   | Points to the SS Exit Descriptor (Index 4 in GDT)
Table 2. The CS and SS Exit Selectors used by the SYSEXIT instruction.
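As when loading the SS selector during the
SYSENTER instruction, the SYSEXIT instruction loads CS and SS with
selectors whose GDT indices are 2 and 3 higher than the index in the
segment selector in the SYSENTER_CS_MSR register. The CPU also sets the
requested privilege level (the two low-order bits) of both selectors to 3, so
the values actually held in CS and SS after SYSEXIT are 1Bh and 23h.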
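A short sketch of the selector arithmetic may help here. A selector holds the GDT index in bits 3 and up, a table indicator in bit 2 and a requested privilege level (RPL) in bits 0 and 1. Per Intel's instruction reference, SYSEXIT adds 16 and 24 to the value in SYSENTER_CS_MSR and forces the RPL bits to 3. The function names below are hypothetical.

```c
#include <stdint.h>

/* The three fields packed into a 16-bit segment selector. */
typedef struct {
    unsigned index; /* descriptor index in the table          */
    unsigned table; /* 0 = GDT, 1 = LDT                       */
    unsigned rpl;   /* requested privilege level, 0 through 3 */
} selector_fields;

selector_fields decode_selector(uint16_t sel)
{
    selector_fields f;
    f.index = sel >> 3;
    f.table = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;
    return f;
}

/* CS selector after SYSEXIT: two GDT entries above the Enter CS, RPL 3. */
uint16_t sysexit_cs(uint32_t sysenter_cs_msr)
{
    return (uint16_t)((sysenter_cs_msr + 16u) | 3u);
}

/* SS selector after SYSEXIT: three GDT entries above the Enter CS, RPL 3. */
uint16_t sysexit_ss(uint32_t sysenter_cs_msr)
{
    return (uint16_t)((sysenter_cs_msr + 24u) | 3u);
}
```

With SYSENTER_CS_MSR equal to 08h this gives 1Bh and 23h, which decode to GDT indices 3 and 4, the two Exit Descriptors listed in table 2.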
If you have paid close attention so far you might have noticed
that there are no "SYSEXIT_EIP_MSR" or "SYSEXIT_ESP_MSR" registers. So how does
the SYSEXIT instruction know where to return to in the user-mode code that
initially called SYSENTER? When you think about it, such information could not
be fixed in an MSR because each system call can potentially originate from
completely different locations in user-mode. Therefore, it is the
responsibility of the caller (the code that calls SYSENTER) to place the
address the CPU is to return to after the system call has returned in the EDX
register. The caller must also place the current stack pointer (the value of
ESP) in the ECX register. The SYSEXIT instruction will then restore the
original value in the EIP and ESP by copying the content from EDX and ECX
respectively. This will cause the execution to continue at the instruction
after the original SYSENTER instruction.
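The register hand-off performed by SYSEXIT can be modeled in a few lines. This is only a sketch of the data movement described above; the structure and function names are hypothetical, and segment register updates are left out.

```c
#include <stdint.h>

/* The registers involved in returning to user-mode. */
typedef struct {
    uint32_t eip;
    uint32_t esp;
    uint32_t ecx; /* must hold the user-mode stack pointer  */
    uint32_t edx; /* must hold the user-mode return address */
    unsigned cpl; /* current privilege level                */
} regs_model;

/* Model of SYSEXIT: EIP and ESP are restored from EDX and ECX. */
void model_sysexit(regs_model *r)
{
    r->eip = r->edx;
    r->esp = r->ecx;
    r->cpl = 3; /* back in user-mode */
}
```

As a usage example, if EDX holds 7ffe0304 (the address following the SYSENTER instruction in the stub disassembled later in this article) and ECX holds an assumed stack pointer such as 0012ff00, the model continues at 7ffe0304 with ESP = 0012ff00.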
SYSENTER or 'int 2e'?
How does the operating system (XP or newer) know if it should
use the new SYSENTER instruction when calling a kernel-mode function? The
answer is that the operating system queries the CPU to find out if the SYSENTER
instruction is supported via the CPUID instruction. If the SEP (SysEnter
Present) bit is set, the operating system will use the SYSENTER instruction
instead of 'int 2e'. This information is cached by the operating system so that
once it has been determined that SYSENTER is supported it will always be used
instead of 'int 2e'. The same is true for the SYSCALL instruction on AMD CPUs.
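The check can be sketched as follows. To keep the example portable it does not execute CPUID itself; the feature word and the family/model/stepping signature are passed in as parameters. The function name is hypothetical. The extra signature test is taken from Intel's documentation, which notes that some early family 6 processors report the SEP bit without supporting the instructions; it is not something this article covers.

```c
#include <stdbool.h>
#include <stdint.h>

/* SEP is bit 11 of the feature flags. */
#define FEATURE_SEP (1u << 11)

/* Returns true if SYSENTER/SYSEXIT can be used on the described CPU. */
bool sysenter_supported(uint32_t features,
                        unsigned family, unsigned model, unsigned stepping)
{
    if ((features & FEATURE_SEP) == 0)
        return false;

    /* Early family 6 parts set SEP but lack the instructions. */
    if (family == 6 && model < 3 && stepping < 3)
        return false;

    return true;
}
```

For the Pentium III examined later in this article (signature 6,8,3 and a feature word of 00002fff), the function returns true.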
Are there different operating system binaries for SYSENTER and 'int
2e'?
As described in my previous article, the NTDLL.dll system
call stub DLL is responsible for executing the 'int 2e' instruction whenever
calls into the kernel were made on Windows NT (Windows 2000 and older, not
including Windows 9x which has a completely different architecture). Since
Windows XP now has three different ways to call a kernel-mode function, will
the operating system have to check which method to use before each and every
system call? The answer is no. Instead, NTDLL calls into a special page of memory,
mapped into all processes, called the "SharedUserData" page, which contains a
function called "SystemCallStub". NTDLL calls the SystemCallStub for each
system call. Since it is the SystemCallStub that enters kernel-mode differently
depending on whether SYSENTER, SYSCALL or 'int 2e' is used, the operating system
binaries are identical regardless of the capabilities of the CPU.
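The idea of choosing the entry method once and then always calling through the same stub can be sketched with a function pointer. This is an analogy only, under the assumption that the selection is made once at startup; Windows places actual machine code in the shared page, and every name below is hypothetical.

```c
/* The three ways of entering kernel-mode discussed in this article. */
typedef enum { CALL_INT2E, CALL_SYSENTER, CALL_SYSCALL } call_style;

typedef call_style (*stub_fn)(void);

/* Stand-ins for the three stub bodies; each reports what it would execute. */
static call_style stub_int2e(void)    { return CALL_INT2E; }
static call_style stub_sysenter(void) { return CALL_SYSENTER; }
static call_style stub_syscall(void)  { return CALL_SYSCALL; }

/* Plays the role of the SystemCallStub slot in the shared page. */
static stub_fn system_call_stub = stub_int2e;

/* Done once, after the CPU capabilities have been determined. */
void install_system_call_stub(call_style style)
{
    switch (style) {
    case CALL_SYSENTER: system_call_stub = stub_sysenter; break;
    case CALL_SYSCALL:  system_call_stub = stub_syscall;  break;
    default:            system_call_stub = stub_int2e;    break;
    }
}

/* What every NTDLL stub does: call through the slot, with no per-call check. */
call_style make_system_call(void)
{
    return system_call_stub();
}
```

The caller's code is the same in all three cases; only the contents of the slot differ, which is why a single set of binaries suffices.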
KiFastCallEntry reuses the good old KiSystemService function
KiSystemService still does all the hard work involved in the
actual dispatching of the system call once kernel-mode has been reached.
KiFastCallEntry simply calls the implementation of KiSystemService after first
having prepared a stack image identical to one produced by an 'int 2e' style
system call (see my previous article for the details of how KiSystemService
expects the stack to be set up). The question now is: how does
KiSystemService know if SYSEXIT, SYSRET or 'iretd' should be used to return
to user-mode? For this to work the end of the KiSystemService function has been
modified to handle any of the three system call types. In fact, there are three
different exit routines depending on what call style was used to enter
kernel-mode:
Kernel Function Name | Call style | Exit instruction
KiSystemCallExit     | 'int 2e'   | iretd
KiSystemCallExit2    | SYSENTER   | SYSEXIT
KiSystemCallExit3    | SYSCALL    | SYSRET
Table 3. The three different ways to exit a system call.
The really interested reader can disassemble these functions
to see what is really going on but this is not done in this article. The bottom
line is that the choice of which of these three functions to use to return to
user-mode is made in the "KiSystemServiceExit" function based on the
feature-bits of the CPU (returned from the CPUID instruction).
Windows 2000 Experiment
We can confirm that the information presented in this article
is correct through a couple of debugging sessions with WinDbg on Windows 2000
and Windows XP systems. Let's first see what the content of the MSRs are on our
Windows 2000 OS running on a dual Pentium III machine:
0: kd> rdmsr 174
msr[174] = 00000000:00000000
0: kd> rdmsr 175
msr[175] = 00000000:00000000
0: kd> rdmsr 176
msr[176] = 00000000:00000000
The MSRs are all zero as expected since Windows 2000 is not
aware of the SYSENTER instruction. It therefore does not initialize the
SYSENTER_CS_MSR, SYSENTER_EIP_MSR or SYSENTER_ESP_MSR Model Specific Registers.
Let's confirm that the SEP bit is set in the result returned from the CPUID
instruction:
0: kd> !cpuinfo
CP F/M/S Manufacturer MHz Update Signature Features
0 6,8,3 GenuineIntel 797>0000001300000000<00002fff
1 6,8,3 GenuineIntel 797 0000000c00000000 00002fff
The feature bits (00002fff) translated into binary are 0010
1111 1111 1111. As can be seen, the SEP bit (bit 11) is set which tells us that
the CPU supports the SYSENTER and SYSEXIT instructions but Windows 2000 doesn't
(since the MSRs were not set up).
We can confirm that Windows 2000 uses the 'int 2e' method of
calling system functions by disassembling an arbitrary system call. Let's pick
CreateMutex, which ultimately ends up in the user-mode stub ZwCreateMutant in
NTDLL.dll:
ntdll!ZwCreateMutant:
77f853b8 b825000000 mov eax,0x25
77f853bd 8d542404 lea edx,[esp+0x4]
77f853c1 cd2e int 2e
77f853c3 c21000 ret 0x10
As can be seen, our Windows 2000 system indeed uses 'int 2e'
to make the system call.
Windows XP Experiment
If we run the exact same tests on a Windows XP OS
running on our Pentium III machine we should be able to verify that the system
uses SYSENTER instead of 'int 2e' when system calls are made. Let's first check
the MSRs:
0: kd> RDMSR 174
msr[174] = 00000000:00000008
0: kd> RDMSR 175
msr[175] = 00000000:00000000
0: kd> RDMSR 176
msr[176] = 00000000:804fa1e0
As expected, the MSRs are set up by Windows XP. As previously
explained, the MSR with ID 174 is the SYSENTER_CS_MSR. It contains the selector
that points to the Code Segment Descriptor in the GDT that describes the
kernel-mode segment that contains the system call function (KiFastCallEntry).
The selector in SYSENTER_CS_MSR (MSR index 174) is 0008h, which decodes to GDT index 1 with a requested privilege level of 0.
If we peek into the GDT at index 1 with the "ProtMode" WinDbg
debugger extension DLL presented in my previous article, we see the following
information:
0: kd> !ProtMode.Descriptor GDT 1
----------------- Code Segment Descriptor -----------------
GDT base = 0x8003F000, Index = 0x01, Descriptor @ 0x8003f008
8003f008 ff ff 00 00 00 9b cf 00
Segment size is in 4KB pages, 32-bit default operand and data size
Segment is present, DPL = 0, Not system segment, Code segment
Segment is not conforming, Segment is readable, Segment is accessed
Target code segment base address = 0x00000000
Target code segment size = 0x000fffff
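The eight raw bytes in the dump above (ff ff 00 00 00 9b cf 00) can also be decoded by hand. The sketch below follows the standard x86 segment descriptor layout and is not the code of the ProtMode extension; the type and function names are hypothetical, and only the fields printed in the dump are extracted.

```c
#include <stdint.h>

typedef struct {
    uint32_t base;          /* segment base address              */
    uint32_t limit;         /* raw 20-bit limit field            */
    unsigned present;       /* P bit                             */
    unsigned dpl;           /* descriptor privilege level        */
    unsigned is_code;       /* non-system segment of type "code" */
    unsigned page_granular; /* G bit: limit counts 4KB pages     */
    unsigned default_32bit; /* D bit: 32-bit default operand size */
} descriptor_fields;

/* Decode a descriptor from its 8 bytes, lowest address first. */
descriptor_fields decode_descriptor(const uint8_t d[8])
{
    descriptor_fields f;
    f.limit = (uint32_t)d[0] | ((uint32_t)d[1] << 8)
            | ((uint32_t)(d[6] & 0x0Fu) << 16);
    f.base  = (uint32_t)d[2] | ((uint32_t)d[3] << 8)
            | ((uint32_t)d[4] << 16) | ((uint32_t)d[7] << 24);
    f.present       = (d[5] >> 7) & 1u;
    f.dpl           = (d[5] >> 5) & 3u;
    f.is_code       = ((d[5] >> 4) & 1u) && ((d[5] >> 3) & 1u);
    f.page_granular = (d[6] >> 7) & 1u;
    f.default_32bit = (d[6] >> 6) & 1u;
    return f;
}

/* Segment size in bytes, taking the granularity bit into account. */
uint64_t segment_size_bytes(const descriptor_fields *f)
{
    uint64_t units = (uint64_t)f->limit + 1u;
    return f->page_granular ? units * 4096u : units;
}
```

For the bytes shown, this yields base 0, a raw limit of 0x000fffff in 4KB pages (a 4GB segment), DPL 0 and a present, 32-bit code segment, in agreement with the debugger output.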
As can be seen, this is the same descriptor that was described
in my previous article (the single 4GB kernel-mode segment that contains the
system address space). The descriptor table base is however different on the
Windows XP system (0x8003F000) compared to (0x80036000) on the Windows 2000
system used in my previous article. The MSR with MSR index 176
(SYSENTER_EIP_MSR) contains the address of the kernel-mode function that will
be called when a SYSENTER instruction is executed. Let's verify that the
address 804fa1e0 indeed is the address of KiFastCallEntry:
0: kd> u 804fa1e0
nt!KiFastCallEntry:
804fa1e0 b930000000 mov ecx,0x30
804fa1e5 8ee1 mov fs,ecx
804fa1e7 648b0d40000000 mov ecx,fs:[00000040]
804fa1ee 368b6104 mov esp,ss:[ecx+0x4]
804fa1f2 b90403fe7f mov ecx,0x7ffe0304
Let's finally see what our CreateMutex call looks like on our
Windows XP system:
ntdll!ZwCreateMutant:
77f7e663 b82b000000 mov eax,0x2b
77f7e668 ba0003fe7f mov edx,0x7ffe0300
77f7e66d ffd2 call edx {SharedUserData!SystemCallStub (7ffe0300)}
77f7e66f c21000 ret 0x10
77f7e672 90 nop
Here we see that the ZwCreateMutant stub function in NTDLL no
longer calls directly into kernel-mode but instead calls the SystemCallStub
function that resides in the SharedUserData page as described above. Below is
a disassembly of the SystemCallStub itself:
SharedUserData!SystemCallStub:
7ffe0300 8bd4 mov edx,esp
7ffe0302 0f34 sysenter
7ffe0304 c3 ret
Ah, finally we reach the SYSENTER instruction!
How much faster is SYSENTER than 'int 2e'?
The test program below calls CreateMutex approximately 16.7 million times and
then displays the times at which the loop started and finished. The results are
shown in table 4 below.
Platform                             | Time
Windows 2000 SP4 on PIII Dual 800MHz | 4 minutes
Windows XP SP0 on PIII Dual 800MHz   | 1 minute 30 seconds
Table 4. The SYSENTER system call performance improvement over 'int 2e'.
As table 4 shows, the SYSENTER way of making system calls is
roughly 2.7 times as fast as 'int 2e' in this test (4 minutes versus 1 minute
30 seconds, i.e. about 167% faster). This is quite impressive and it may be a hidden but
very good reason to upgrade to Windows XP. Of course, very few applications
call system services with this frequency but the SYSENTER instruction still
does a very good optimization job.
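The arithmetic behind these figures can be written out as a short sketch. The function names are hypothetical; the inputs are the times from table 4 (240 and 90 seconds) and the loop count of the test program, 0x00FFFFFF = 16,777,215 iterations.

```c
/* How many times faster the new run is than the old one. */
double speedup(double old_seconds, double new_seconds)
{
    return old_seconds / new_seconds;
}

/* Average cost of one loop iteration in microseconds. */
double microseconds_per_iteration(double seconds, unsigned long iterations)
{
    return seconds * 1e6 / (double)iterations;
}
```

speedup(240, 90) is about 2.67. Per iteration, the Windows 2000 run averages roughly 14.3 microseconds and the Windows XP run roughly 5.4 microseconds. Note that each iteration also calls CloseHandle, so these averages cover more than a single kernel transition.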
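#include <windows.h>
#include <crtdbg.h>
#include <stdio.h>   // sprintf

// Shows the given time as HH:MM:SS in a message box titled pszHdr.
void DisplaySystemTime(LPSYSTEMTIME pSystemTime, const char* pszHdr)
{
    char szBuf[1024];
    sprintf(szBuf, "%02d:%02d:%02d",
        pSystemTime->wHour,
        pSystemTime->wMinute,
        pSystemTime->wSecond);
    // The ANSI variant is used explicitly so that the char strings
    // also compile in a Unicode build.
    MessageBoxA(NULL, szBuf, pszHdr, MB_OK);
}

int main(int argc, char* argv[])
{
    SYSTEMTIME stStart;
    GetSystemTime(&stStart);

    // 0x00FFFFFF iterations (about 16.7 million). Each iteration enters
    // the kernel to create the mutex and again to close its handle.
    for (DWORD dwCount = 0; dwCount < 0x00FFFFFF; dwCount++)
    {
        HANDLE hMutex = CreateMutex(
            NULL,   // SD.
            FALSE,  // Initial owner?
            NULL);  // Name.
        _ASSERTE(hMutex != NULL);
        CloseHandle(hMutex);
    }

    SYSTEMTIME stEnd;
    GetSystemTime(&stEnd);

    DisplaySystemTime(&stStart, "Start time");
    DisplaySystemTime(&stEnd, "End time");
    return 0;
}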
Further Reading
For information on the Protected Mode of the Intel x86 CPU there are two great sources:
1) "Intel Architecture Software Developers Manual, Volume 3 - System Programming Guide". Available from
Intel's web site in PDF format.
2) "Protected Mode Software Architecture" by Tom Shanley. Available from Amazon.com (published by
Addison Wesley).
For more programming details about the x86 CPU, must-haves are:
1) Intel Architecture Software Developers Manual, Volume 1 - Basic Architecture.
2) Intel Architecture Software Developers Manual, Volume 2 - Instruction Set Reference Manual.
Both these books are available in PDF format on the Intel web site (you can also get a free hardcopy of
these two books; Volume 3 is however only available in PDF format).
About the Author
John Gulbrandsen is the founder
and president of Summit Soft Consulting. John has a formal background in
Microprocessor-, digital- and analog- electronics design as well as in embedded
and Windows systems development. John has programmed Windows since 1992
(Windows 3.0). He is as comfortable with programming Windows applications and
web systems in C++, C# and VB as he is writing and debugging Windows kernel
mode device drivers in SoftIce.
To contact John drop him an email:
John.Gulbrandsen@SummitSoftConsulting.com
About Summit Soft Consulting
Summit Soft Consulting is a
Southern California-based consulting firm specializing in Microsoft's operating
systems and core technologies. Our specialty is Windows Systems Development
including kernel mode and NT internals programming.
To visit Summit Soft Consulting on
the web: http://www.summitsoftconsulting.com