Linux System Calls

How a read() becomes a kernel transition and back.

At a Glance

x86-64 SYSCALL ABI

Where each piece of the call lives. The syscall ABI overlaps the C calling convention but differs in two places: arg 4 moves from rcx to r10, and the CPU clobbers rcx and r11 on entry.

| Register | Role | C ABI role (for comparison) |
|---|---|---|
| rax | syscall number (from syscall_64.tbl); return value on exit | return value |
| rdi | arg 1 | arg 1 |
| rsi | arg 2 | arg 2 |
| rdx | arg 3 | arg 3 |
| r10 | arg 4 — not rcx | (rcx is arg 4 in C) |
| r8 | arg 5 | arg 5 |
| r9 | arg 6 | arg 6 |
| rcx | clobbered — holds saved user RIP | arg 4 |
| r11 | clobbered — holds saved user RFLAGS | caller-saved scratch |
| r12–r15, rbx, rbp, rsp | preserved across the syscall | callee-saved |

Minimal call from hand-written assembly

; write(1, "hi\n", 3)
mov rax, 1          ; __NR_write
mov rdi, 1          ; fd = stdout
lea rsi, [rel msg]  ; buf
mov rdx, 3          ; count
syscall             ; rax = return value or -errno
; rcx and r11 now clobbered

msg: db "hi", 0x0a  ; the 3 bytes referenced above

End-to-End Sequence

What happens between the user's SYSCALL and the instruction right after it, for a typical read(fd, buf, n).

sequenceDiagram
    participant U as User code
    participant C as CPU
    participant K as Kernel (entry_SYSCALL_64)
    participant H as sys_read handler
    participant F as VFS / fs / driver
    U->>C: SYSCALL
    C->>C: load RIP from MSR_LSTAR (kernel entry)
    C->>C: save user RIP to RCX, RFLAGS to R11
    C->>C: mask RFLAGS via MSR_FMASK
    C->>C: set CS/SS from MSR_STAR, CPL=0
    C->>K: jump to entry_SYSCALL_64
    K->>K: swapgs (per-CPU data)
    K->>K: switch to task's kernel stack
    K->>K: push user regs (pt_regs)
    K->>K: sys_call_table[rax] = sys_read
    K->>H: call sys_read(fd, buf, count)
    H->>F: vfs_read → file->f_op->read_iter → driver
    F-->>H: bytes read or -EIO / -EAGAIN / ...
    H-->>K: return in rax (positive = n, negative = -errno)
    K->>K: syscall_exit_work: signals? resched? notify?
    K->>K: restore user regs, swapgs
    K->>C: SYSRET (or IRET if work was pending)
    C->>U: resume at user RIP, one past SYSCALL

What SYSCALL Does (Hardware Level)

SYSCALL is a one-instruction privilege transition. It's fast because it bypasses segmentation descriptor loads; all the per-CPU state it needs lives in model-specific registers programmed once at boot.

| MSR | Set at boot to | SYSCALL reads it to… |
|---|---|---|
| MSR_LSTAR | entry_SYSCALL_64 | …load new RIP |
| MSR_STAR | kernel CS/SS + user CS/SS | …set CS/SS selectors and CPL=0 |
| MSR_FMASK | bits to mask out of RFLAGS | …disable IF/DF/TF/AC before entry |
| MSR_GS_BASE / MSR_KERNEL_GS_BASE | per-CPU area base | …swapgs flips these so %gs points to per-CPU data |

Effect in one bullet list:

  - RIP ← MSR_LSTAR; the old user RIP is saved in rcx.
  - RFLAGS is saved in r11, then masked with MSR_FMASK (IF cleared: entry runs with interrupts off).
  - CS/SS are loaded from MSR_STAR; CPL drops to 0.
  - No stack switch: the kernel entry code must load its own stack before pushing anything.

Kernel Entry Path

Simplified view of what entry_SYSCALL_64 does before handing off to a C-level handler. The real assembly lives in arch/x86/entry/entry_64.S.

entry_SYSCALL_64:
    swapgs                          ; %gs now points at per-CPU area
    mov    %rsp, PER_CPU_VAR(cpu_tss + TSS_sp2)  ; save user %rsp
    mov    PER_CPU_VAR(cpu_current_top_of_stack), %rsp  ; kernel stack
    pushq  $__USER_DS                                ; build pt_regs
    pushq  PER_CPU_VAR(cpu_tss + TSS_sp2)            ; user %rsp
    pushq  %r11                                      ; saved RFLAGS
    pushq  $__USER_CS                                ; user CS
    pushq  %rcx                                      ; saved RIP
    pushq  %rax                                      ; syscall nr
    pushq  %rdi ... %r15                             ; all GPRs
    mov    %rsp, %rdi                                ; pt_regs *
    call   do_syscall_64                             ; C land
    ; ... exit work (signals, resched) ...
    sysretq                                          ; or iretq

do_syscall_64(regs, nr) dispatches via the sys_call_table:

// kernel/entry/common.c, simplified
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr) {
    if (likely((unsigned int)nr < NR_syscalls)) {  // unsigned: rejects negative nr too
        nr = array_index_nospec(nr, NR_syscalls);  // Spectre-v1 gadget guard
        regs->ax = sys_call_table[nr](regs);       // e.g. __x64_sys_read
    } else {
        regs->ax = -ENOSYS;
    }
    syscall_exit_to_user_mode(regs);
}

Why r10, Not rcx

A subtle trap when calling syscalls directly.

The CPU uses rcx to save the user RIP, and r11 for RFLAGS. So the syscall ABI can't pass the 4th argument in rcx the way the C ABI does — it would be overwritten the instant SYSCALL executes. The Linux ABI shifts arg 4 to r10 instead. A glibc wrapper for a 4+ arg syscall therefore looks like:

mov    %rcx, %r10     ; slot the C-ABI arg-4 into the syscall-ABI arg-4 register
mov    $SYS_foo, %eax
syscall
; rcx is garbage (== RIP), r11 is garbage (== RFLAGS), all other callee-saved regs intact

The libc Wrapper

Why read() in your C program is a function call, not an instruction. Everything below is glibc/musl boilerplate; Go emits syscalls inline.

  1. Copy args into the syscall registers (with the rcx → r10 dance if needed).
  2. Execute SYSCALL.
  3. Check if the return is in [−4095, −1]. If so, negate and store in __errno_location(), return -1. Otherwise return the value as-is.
  4. If the thread is in a cancellation region (POSIX cancellation points), test pthread_cancel flags before and after the call.
  5. On some syscalls, fall back to a legacy number on old kernels (e.g. statx → stat).

For a syscall with no wrapper (or to bypass one), use the generic syscall(2):

#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

long ret = syscall(SYS_gettid);
if (ret == -1) perror("gettid");

The errno Convention

There is no separate error flag or errno register. The value of rax is the error flag: a return in [−4095, −1] means −errno, anything else is success, and the libc wrapper translates that band into errno plus a −1 return.

vDSO: Syscalls That Aren't Syscalls

For a few hot calls the kernel publishes user-callable code + data pages; the call happens entirely in userspace.

$ cat /proc/self/maps | grep vdso
7fff14b88000-7fff14b8a000 r-xp 00000000 00:00 0   [vdso]

Returning to Userspace

The fast path out of the kernel is SYSRET; any pending thread-flag work forces a slow IRET.

flowchart TB
    A["sys_call handler returns"] --> B["syscall_exit_to_user_mode()"]
    B --> C{"any TIF_* work flags set?"}
    C -- no --> D["SYSRETQ (fast)"]
    C -- yes --> E["exit_to_user_mode_loop()"]
    E --> F["TIF_SIGPENDING: run signal handler setup"]
    E --> G["TIF_NEED_RESCHED: schedule()"]
    E --> H["TIF_NOTIFY_RESUME: rseq, uprobes, etc."]
    F --> I["IRETQ (slow)"]
    G --> I
    H --> I
    D --> Z["back in user code"]
    I --> Z

Other Architectures

Same shape, different letters. The syscall number is also per-arch: __NR_read is 0 on x86-64, 63 on arm64, 3 on x86-32. Sticking to SYS_read from <sys/syscall.h> keeps code portable.

| Arch | Enter | Number in | Args in | Return in | Clobbers |
|---|---|---|---|---|---|
| x86-64 | SYSCALL | rax | rdi, rsi, rdx, r10, r8, r9 | rax | rcx, r11 |
| arm64 (AArch64) | SVC #0 | x8 | x0, x1, x2, x3, x4, x5 | x0 | — (args other than x0 preserved on return) |
| x86-32 (i386) | int 0x80 or sysenter | eax | ebx, ecx, edx, esi, edi, ebp | eax | — (int 0x80 preserves all GPRs) |
| RISC-V | ecall | a7 | a0, a1, a2, a3, a4, a5 | a0 | — |
| ppc64le | sc | r0 | r3, r4, r5, r6, r7, r8 | r3 | (CR0.SO = error flag) |

Observing and Filtering

Every syscall is a tracepoint. Between ptrace, tracepoints, and BPF, nothing slips past.

| Tool | Mechanism | Typical use |
|---|---|---|
| strace | ptrace(PTRACE_SYSCALL) — stop the target on syscall entry + exit, decode registers | Debugging: which syscall failed with what errno, with timing |
| perf trace | raw_syscalls:sys_enter / sys_exit tracepoints via perf ring buffer | Lower-overhead replacement for strace; aggregation across a whole host |
| bpftrace / bcc | eBPF attached to the same tracepoints, with in-kernel filtering | Production: count, histogram, or selectively capture with negligible overhead |
| seccomp-bpf | A BPF program attached to the task; runs on every syscall and returns allow / errno / kill / trap / user_notify | Sandboxing: Docker/containerd, systemd units, browsers, CRIU |
| audit | Kernel auditing subsystem, rules configured via auditctl | Compliance logging; more heavyweight than tracepoints |

Performance

A syscall is cheap but not free; in tight loops, avoiding or batching them matters.

References