Linux System Calls
How a read() becomes a kernel transition and back.
At a Glance
- One instruction — on x86-64, the `SYSCALL` instruction raises privilege and jumps to a kernel-provided entry. Number in `rax`, arguments in `rdi, rsi, rdx, r10, r8, r9`.
- Linear path — user `SYSCALL` → kernel `entry_SYSCALL_64` → `do_syscall_64` → `sys_call_table[nr]` → `SYSRET`. On the way back the kernel may also handle signals, reschedule, or touch TLS before returning to userspace.
- Kernel returns `-errno` — the raw return in `rax` is `-EFOO` for errors (the range `[-4095, -1]`). Libc wrappers translate to `-1` and set `errno`; raw `syscall(2)` users must check themselves.
- SYSCALL clobbers `rcx` and `r11` — the hardware uses them to save user `RIP` and `RFLAGS`. That's why the 4th argument is in `r10` instead of `rcx` as in the C calling convention.
- vDSO skips the transition — vDSO pages are mapped into every process; `clock_gettime`, `gettimeofday`, `getcpu`, and `time` run entirely in userspace by reading kernel-maintained memory.
- Most apps never touch SYSCALL — glibc/musl wrappers do the register setup, translate `errno`, and handle the cancellation points for `pthread_cancel`. Go skips libc and emits `SYSCALL` directly.
- Per-architecture — arm64 uses `SVC` with the number in `x8`, args in `x0..x5`. Legacy 32-bit x86 used `int 0x80`; modern 32-bit uses `sysenter`. Numbers differ per arch — see Filippo's table.
- Fully observable — `strace` via `ptrace`, `bpftrace` via tracepoints, and `seccomp-bpf` for filtering.
x86-64 SYSCALL ABI
Where each piece of the call lives. The syscall ABI overlaps the C calling convention but differs in two places: argument 4 moves from `rcx` to `r10`, and `rax` carries the syscall number on entry (in the C ABI it is only the return register).
| Register | Role | C ABI role (for comparison) |
|---|---|---|
| `rax` | syscall number (from `syscall_64.tbl`); return value on exit | return value |
| `rdi` | arg 1 | arg 1 |
| `rsi` | arg 2 | arg 2 |
| `rdx` | arg 3 | arg 3 |
| `r10` | arg 4 — not `rcx` | (`rcx` is arg 4 in C) |
| `r8` | arg 5 | arg 5 |
| `r9` | arg 6 | arg 6 |
| `rcx` | clobbered — holds saved user `RIP` | arg 4 |
| `r11` | clobbered — holds saved user `RFLAGS` | caller-saved scratch |
| `r12`–`r15`, `rbx`, `rbp`, `rsp` | preserved across the syscall | callee-saved |
Minimal call from hand-written assembly
```asm
; write(1, "hi\n", 3)
mov rax, 1          ; __NR_write
mov rdi, 1          ; fd = stdout
lea rsi, [rel msg]  ; buf
mov rdx, 3          ; count
syscall             ; rax = return value or -errno
; rcx and r11 now clobbered
```

End-to-End Sequence
What happens between the user's SYSCALL and the instruction right after it, for a typical read(fd, buf, n).
What SYSCALL Does (Hardware Level)
SYSCALL is a one-instruction privilege transition. It's fast because it bypasses segmentation descriptor loads; all the per-CPU state it needs lives in model-specific registers programmed once at boot.
| MSR | Set at boot to | SYSCALL reads it to… |
|---|---|---|
| `MSR_LSTAR` | `entry_SYSCALL_64` | …load the new `RIP` |
| `MSR_STAR` | kernel CS/SS + user CS/SS | …set CS/SS selectors and CPL=0 |
| `MSR_FMASK` | bits to mask out of `RFLAGS` | …disable IF/DF/TF/AC before entry |
| `MSR_GS_BASE` / `MSR_KERNEL_GS_BASE` | per-CPU area base | …`swapgs` flips these so `%gs` points to per-CPU data |
Effect in one bullet list:
- `RCX ← RIP`, `R11 ← RFLAGS`, `RFLAGS &= ~MSR_FMASK`.
- `CS ← MSR_STAR[47:32]`, `SS ← CS + 8`, both CPL=0.
- `RIP ← MSR_LSTAR`. The kernel is now executing, still on the user stack for a few instructions until it explicitly switches.
- `SYSRET` undoes all of this: `RIP ← RCX`, `RFLAGS ← R11`, selectors restored from `MSR_STAR[63:48]`, CPL=3.
Kernel Entry Path
Simplified view of what entry_SYSCALL_64 does before handing off to a C-level handler. The real assembly lives in arch/x86/entry/entry_64.S.
```asm
entry_SYSCALL_64:
    swapgs                                          ; %gs now points at per-CPU area
    mov %rsp, PER_CPU_VAR(cpu_tss + TSS_sp2)        ; save user %rsp
    mov PER_CPU_VAR(cpu_current_top_of_stack), %rsp ; kernel stack
    pushq $__USER_DS                                ; build pt_regs
    pushq PER_CPU_VAR(cpu_tss + TSS_sp2)            ; user %rsp
    pushq %r11                                      ; saved RFLAGS
    pushq $__USER_CS                                ; user CS
    pushq %rcx                                      ; saved RIP
    pushq %rax                                      ; syscall nr
    pushq %rdi ... %r15                             ; all GPRs
    mov %rsp, %rdi                                  ; pt_regs *
    call do_syscall_64                              ; C land
    ; ... exit work (signals, resched) ...
    sysretq                                         ; or iretq
```

`do_syscall_64(regs, nr)` dispatches via the `sys_call_table`:
```c
// kernel/entry/common.c, simplified
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    if (likely((unsigned int)nr < NR_syscalls)) {  // unsigned compare rejects nr < 0
        nr = array_index_nospec(nr, NR_syscalls);  // Spectre-v1 gadget guard
        regs->ax = sys_call_table[nr](regs);       // e.g. __x64_sys_read
    } else {
        regs->ax = -ENOSYS;
    }
    syscall_exit_to_user_mode(regs);
}
```

Why r10, Not rcx
A subtle trap when calling syscalls directly.
The CPU uses rcx to save the user RIP, and r11 for RFLAGS. So the syscall ABI can't pass the 4th argument in rcx the way the C ABI does — it would be overwritten the instant SYSCALL executes. The Linux ABI shifts arg 4 to r10 instead. A glibc wrapper for a 4+ arg syscall therefore looks like:
```asm
mov %rcx, %r10      ; slot the C-ABI arg 4 into the syscall-ABI arg-4 register
mov $SYS_foo, %eax
syscall
; rcx is garbage (== RIP), r11 is garbage (== RFLAGS), all other callee-saved regs intact
```

The libc Wrapper
Why read() in your C program is a function call, not an instruction. Everything below is glibc/musl boilerplate; Go emits syscalls inline.
- Copy args into the syscall registers (with the `rcx → r10` dance if needed).
- Execute `SYSCALL`.
- Check if the return is in `[-4095, -1]`. If so, negate and store in `__errno_location()`, return `-1`. Otherwise return the value as-is.
- If the thread is in a cancellation region (POSIX cancellation points), test `pthread_cancel` flags before and after the call.
- On some syscalls, fall back to a legacy number on old kernels (e.g. `statx` → `stat`).
For a syscall with no wrapper (or to bypass one), use the generic syscall(2):
```c
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

long ret = syscall(SYS_gettid);
if (ret == -1) perror("gettid");
```

The errno Convention
There is no out-of-band error flag. The value in `rax` is the error flag.
- The kernel returns `-errno` in `rax`. `-EAGAIN` is `-11`, `-EINVAL` is `-22`, etc.
- To detect an error, libc checks whether `(unsigned long)rax >= -4095UL` — equivalently, whether `rax` is in the range `[-4095, -1]`. That carve-out at the top of the address space is the reason pointers returned by `mmap` can never collide with error codes.
- Error: set `errno = -rax`, return `-1`. Success: return `rax` unchanged (could be a large positive value, could be a pointer for `brk`/`mmap`).
- A handful of syscalls (`getpid`, `getuid`, …) cannot fail and their wrappers never set `errno`.
vDSO: Syscalls That Aren't Syscalls
For a few hot calls the kernel publishes user-callable code + data pages; the call happens entirely in userspace.
```console
$ grep vdso /proc/self/maps
7fff14b88000-7fff14b8a000 r-xp 00000000 00:00 0    [vdso]
```

- What's mapped — a shared object (`linux-vdso.so.1`) containing thin implementations of `__vdso_clock_gettime`, `__vdso_gettimeofday`, `__vdso_getcpu`, and `__vdso_time` (on x86-64), plus a read-only data page the kernel updates with the current `timekeeper` state and CPU ID.
- How libc finds it — the kernel passes the vDSO base via the auxv entry `AT_SYSINFO_EHDR`. Libc walks the ELF dynsym table and caches function pointers.
- Why it's a win — `clock_gettime(CLOCK_MONOTONIC)` in the vDSO reads the TSC plus timekeeper data and runs in ~10–20 ns. The same call via a real syscall costs ~1 µs (~50× slower).
- Fallback — if the clocksource isn't vDSO-safe (e.g. running under a hypervisor without a stable TSC, or an unsupported clock id), the vDSO stub falls through to a real syscall.
Returning to Userspace
The fast path out of the kernel is SYSRET; any pending thread-flag work forces a slow IRET.
- `SYSRETQ` — ~5 ns return, but requires that `rcx` and `r11` still hold a valid user `RIP`/`RFLAGS` and the canonical bit is right. Any weird state (e.g. the user set `RFLAGS.TF` for single-stepping, or `RIP` isn't canonical) forces `IRETQ`.
- `IRETQ` — slower but can restore any architectural state. Always used when returning from an interrupt or when signal delivery modified the stack.
Other Architectures
Same shape, different letters. The syscall number is also per-arch: __NR_read is 0 on x86-64, 63 on arm64, 3 on x86-32. Sticking to SYS_read from <sys/syscall.h> keeps code portable.
| Arch | Enter | Number in | Args in | Return in | Clobbers |
|---|---|---|---|---|---|
| x86-64 | SYSCALL | rax | rdi, rsi, rdx, r10, r8, r9 | rax | rcx, r11 |
| arm64 (AArch64) | SVC #0 | x8 | x0, x1, x2, x3, x4, x5 | x0 | — (args in x0…x5 preserved on return other than x0) |
| x86-32 (i386) | int 0x80 or sysenter | eax | ebx, ecx, edx, esi, edi, ebp | eax | — (int 0x80 preserves all GPRs) |
| RISC-V | ecall | a7 | a0, a1, a2, a3, a4, a5 | a0 | — |
| ppc64le | sc | r0 | r3, r4, r5, r6, r7, r8 | r3 (CR0.SO = error flag) | — |
Observing and Filtering
Every syscall is a tracepoint. Between ptrace, tracepoints, and BPF, nothing slips past.
| Tool | Mechanism | Typical use |
|---|---|---|
| strace | `ptrace(PTRACE_SYSCALL)` — stop the target on syscall entry + exit, decode registers. | Debugging: which syscall failed with what errno, with timing. |
| perf trace | `raw_syscalls:sys_enter` / `sys_exit` tracepoints via the perf ring buffer. | Lower-overhead replacement for strace; aggregation across a whole host. |
| bpftrace / bcc | eBPF attached to the same tracepoints, with in-kernel filtering. | Production: count, histogram, or selectively capture with negligible overhead. |
| seccomp-bpf | A BPF program attached to the task; runs on every syscall and returns allow / errno / kill / trap / user_notify. | Sandboxing: Docker/containerd, systemd units, browsers, CRIU. |
| audit | Kernel auditing subsystem, rules configured via auditctl. | Compliance logging; more heavyweight than tracepoints. |
Performance
A syscall is cheap but not free; in tight loops, avoiding or batching them matters.
- Raw cost — ~1–2 µs on modern x86-64, dominated by the pipeline flush and CR3 switch (KPTI). `getpid()` is the standard micro-benchmark.
- KPTI (Meltdown mitigation) — each syscall switches page tables twice. Typical overhead: 10–30% on syscall-heavy workloads. Disable at boot with `pti=off` only if you know the hardware is safe.
- Spectre-v1 `array_index_nospec` — the `sys_call_table` index is sanitized against speculation on every dispatch; a few cycles per call.
- Batching APIs — `readv`/`writev` (scatter-gather), `sendmmsg`/`recvmmsg`, `epoll`.
- `io_uring` — submit many I/O ops via a shared ring buffer, amortizing transitions to zero in the fast path. unixism.net/loti is the canonical guide.
- vDSO — as above, shaves ~50× off the hot clock reads.
References
- syscalls(2) — the master list of Linux system calls with per-arch availability.
- syscall(2) — the generic wrapper; documents per-arch register conventions in one place.
- arch/x86/entry/syscalls/syscall_64.tbl — authoritative x86-64 number table in the kernel tree.
- Filippo's Linux syscall table — interactive cross-arch lookup.
- arch/x86/entry/entry_64.S — the real entry assembly.
- Intel SDM: SYSCALL / SYSRET (Cloutier's mirror) — precise semantics and MSR reads.
- vdso(7) — which functions are exported, per arch.
- LWN: The vDSO revisited — how the vDSO is built and mapped.
- ptrace(2) — the introspection mechanism behind `strace`, `gdb`, and CRIU.
- seccomp(2) — filtering/auditing syscalls via BPF.
- System V AMD64 ABI — the C calling convention the syscall ABI borrows from.