Lachlan Cox

Booting Go on Bare Metal

These days Docker has pretty much taken over how you deploy software. With compiled languages, the ideal is a statically compiled binary built in a multi-stage Dockerfile whose final stage is a scratch container that runs the binary, giving the smallest possible image footprint.

I was thinking, what if instead of deploying a Docker image as a scratch container, I could deploy a statically compiled binary to a minimal Linux image and have it just run. No container runtime, no orchestration, just UEFI firmware loading a binary.

So I wrote a few-line main.go that runs an HTTP server, serving GET / with a simple Hello, World!. The goal: build it into a minilinux.img that I can boot in QEMU. The question: how small can I get this image?

The final result is a ~7 MiB bootable image where the only userspace binary is a static Go executable running as PID 1. Oh yeah, secure.

What Does It Take to Boot?

As I’m developing this on Apple Silicon, I didn’t want to mess around with cross-compilation toolchains. So we are going to use Docker to build everything, like a madman.

What does it take to actually have a Linux image boot our binary? Well, we need a bootloader, a kernel, a way to set up the initramfs, and something to act as init to run our binary as a service.

If you haven’t encountered initramfs before, it’s a compressed cpio archive that the kernel extracts into a tmpfs filesystem at boot. tmpfs is a filesystem that lives entirely in RAM, so it’s fast but volatile. Everything in it disappears on reboot. The kernel executes /init from this temporary root. On a normal Linux system, initramfs contains just enough tools to find and mount the real root filesystem (maybe it needs to load storage drivers or unlock an encrypted volume), then switch_root hands control to the actual init system on disk. But nothing says you have to switch. If your initramfs contains everything you need, you can just stay there. That’s the key insight that makes this whole project work.

Something else we need to deal with is the network interface and DHCP. So we will use ip to bring the links up, then udhcpc to get an IP address so we can send a request to <host>:8080.

The First Attempt

My first attempt had a partition layout of:

img (GPT)
|- boot (FAT32, ESP)
|   |- EFI/BOOT/BOOTAA64.EFI (GRUB)
|   |- vmlinuz
|   |- initramfs.img
|- data (ext4)
    |- rootfs.squashfs

Getting here was not straightforward. The first few boots dropped me into a GRUB rescue shell, or showed nothing at all, or kernel panicked with cryptic messages about not finding init.

The console output was the first hurdle. QEMU with -nographic expects the guest to output on the serial port, but the kernel defaults to the graphical console. Without console=ttyS0 (x86) or console=ttyAMA0 (arm64) in the kernel command line, the kernel boots silently and you have no idea what’s happening. Debugging a black screen is not fun.

GRUB configuration was the next puzzle. GRUB needs to know where the kernel and initramfs are, what command line to pass, and which device to boot from. Getting the paths right when your ESP is partition 1 of a loop mounted image built inside Docker took some trial and error. The grub.cfg ended up looking something like:

set timeout=0
menuentry "minilinux" {
    linux /vmlinuz console=ttyAMA0 root=/dev/vda2 ro
    initrd /initramfs.img
}

Then the initramfs /init script had to actually work. It needed to mount the ext4 partition, find the squashfs inside it, mount that as an overlay, then switch_root into the new root and exec the real init. Each step is a chance for something to go wrong, and when it does, you get “Kernel panic: not syncing: Attempted to kill init!” which tells you absolutely nothing about which step failed.

Eventually it all clicked. GRUB loaded the kernel, the kernel unpacked the initramfs, busybox ran /init which mounted the squashfs, switch_root’d into it, runit started, and my Go binary came up serving HTTP. Success!

Except the image was 180 MiB. The Debian kernel alone was 80 MiB. That seemed excessive for a “Hello, World!” HTTP server.

Stripping It Down

Looking at what was actually in the image, the breakdown was roughly:

Component                 Size
Debian kernel + modules   ~80 MiB
GRUB + EFI files          ~15 MiB
Squashfs rootfs           ~5 MiB
My Go binary              ~2 MiB
Initramfs (busybox)       ~5 MiB
Partition overhead        ~38 MiB

Note: This is what I think it was; I don’t actually remember, but it feels right. I’m not going back, you can’t make me go back.

The squashfs layer was the first to go. Why did I even have it? The original idea was that the initramfs would be tiny (just enough to mount the real root), and the squashfs would contain the actual system. But my “actual system” was just busybox, runit, and a Go binary. The complexity of having an initramfs that mounts an ext4 partition, finds a squashfs file, mounts that, then switch_roots into it was solving a problem I didn’t have.

Instead, I moved everything directly into the initramfs and stayed there. No ext4 partition, no squashfs, no switch_root. The new layout:

img (GPT)
|- boot (FAT32, ESP)
    |- EFI/BOOT/BOOTAA64.EFI (GRUB)
    |- vmlinuz
    |- initramfs.img (contains busybox + runit + Go binary)

The initramfs went from a minimal “find the real root” setup to being the entire userspace. Busybox provided the shell and basic utilities, runit supervised services, and my Go binary was the only actual service.

Down to ~120 MiB. Still mostly kernel.

Next I switched from Debian’s kernel to Alpine’s linux-virt. Debian’s kernel is built for physical hardware: it includes drivers for SATA controllers, USB devices, graphics cards, sound cards, network adapters from a dozen vendors. None of that matters in a VM where everything is virtio. Alpine’s kernel is configured for virtual machines and strips out all that hardware support. The image dropped to ~60 MiB.

Better, but I started wondering: do I even need busybox? The Go binary is statically linked anyway, and I could probably just do the syscalls on startup.

Going Full Go

The turning point was realising that everything busybox and runit were doing could be done in Go. Looking at what my init script actually did:

  1. Mount virtual filesystems (/proc, /sys, /dev)
  2. Bring up network interfaces
  3. Run DHCP to get an IP address
  4. Start the API server
  5. Supervise the service (restart if it crashes)

None of that requires a shell. Go can make syscalls directly, and there are pure Go libraries for networking. The only reason I had busybox was because “that’s how you do embedded Linux.” But statically compiled Go binaries don’t need libc, don’t need a shell, don’t need anything except the kernel™.

So I rewrote the init system in Go. The entire program structure is simple:

func main() {
    log.SetFlags(0)

    runSysinit()
    runAPI()
}

The runSysinit function replaces the shell script that used to do system initialisation:

func runSysinit() {
    mountFilesystems()
    bringUpInterfaces()
    lease := configureDHCP()
    setHostname(lease)
    log.Println(":: System ready")
}

Let me break down each piece.

Mounting Virtual Filesystems

The first thing any init does is mount the virtual filesystems that make Linux actually usable:

//   source       target       fstype
var mounts = [][3]string{
    {"proc",     "/proc",     "proc"},     // Process info
    {"sysfs",    "/sys",      "sysfs"},    // Hardware/driver info
    {"devtmpfs", "/dev",      "devtmpfs"}, // Device nodes
    {"devpts",   "/dev/pts",  "devpts"},   // Pseudoterminals
    {"tmpfs",    "/dev/shm",  "tmpfs"},    // Shared memory
    {"tmpfs",    "/tmp",      "tmpfs"},    // Temp files
    {"tmpfs",    "/run",      "tmpfs"},    // Runtime data
}

func mountFilesystems() {
    for _, m := range mounts {
        os.MkdirAll(m[1], 0o755)                     // Create mount point
        syscall.Mount(m[0], m[1], m[2], 0, "")       // Mount it
    }
}

Most of these aren’t real filesystems with files stored in RAM. They’re virtual filesystems that act as interfaces to the kernel:

/proc: Process information and kernel parameters. Reading /proc/cpuinfo doesn’t read a file, it asks the kernel to generate CPU info on the fly.
/sys: Hardware and driver information. This is where we find the ACPI power button device.
/dev: Device nodes. The kernel populates this with entries like /dev/null, /dev/urandom, and network interfaces.
/dev/pts: Pseudoterminal devices for SSH sessions (not that we have SSH).
/tmp, /run, /dev/shm: These actually are tmpfs, used for runtime scratch space.

The initramfs itself only contains the Go binary and empty directories for these mount points. Everything else appears at runtime when we mount these virtual filesystems.

Bringing Up the Network

Normally you’d use ip link set eth0 up to bring up a network interface. But ip is a binary that comes from iproute2, and we don’t have that. Under the hood, ip just talks to the kernel via netlink sockets. Go can do that too.

I used vishvananda/netlink, a pure Go implementation of the netlink protocol. This is the same library Docker uses for container networking, so it’s well tested:

var ifaces = []string{
    "lo",       // Loopback Interface
    "eth0",     // Default Interface
}

func bringUpInterfaces() {
    for _, name := range ifaces {
        link, _ := netlink.LinkByName(name)  // Get interface by name
        netlink.LinkSetUp(link)              // Equivalent to `ip link set <name> up`
    }
}

The interface is now up, but it doesn’t have an IP address yet.

DHCP Without dhclient

DHCP clients like dhclient or udhcpc are typically written in C and shell scripts. But DHCP is just UDP packets with a specific format:

  1. send DISCOVER
  2. receive OFFER
  3. send REQUEST
  4. receive ACK

Simple enough to implement from scratch, but I decided to be lazy and use insomniacslk/dhcp which implements the protocol in pure Go and handles all the edge cases I’d inevitably get wrong. I could’ve just done static IPs, but why not be somewhat versatile.

One gotcha. The library uses getrandom(2) to generate transaction IDs, which blocks until the kernel’s random number generator is initialised. In a minimal VM with no hardware entropy source, this can hang forever. The fix is setting UROOT_NOHWRNG=1 before calling the library, which makes it fall back to /dev/urandom. I absolutely didn’t get affected by this.

func configureDHCP() *dhcpLease {
    os.Setenv("UROOT_NOHWRNG", "1") // Don't block waiting for hardware entropy
    lease, _ := dhcpExchange("eth0")

    // Assign the leased IP to eth0 (like `ip addr add`)
    eth0, _ := netlink.LinkByName("eth0")
    netlink.AddrAdd(eth0, &netlink.Addr{
        IPNet: &net.IPNet{IP: lease.IP, Mask: lease.Mask},
    })

    // Add default route via gateway (like `ip route add default via`)
    netlink.RouteAdd(&netlink.Route{Gw: lease.Gateway})

    // Write DNS config so resolution works
    os.WriteFile("/etc/resolv.conf",
        fmt.Appendf(nil, "nameserver %s\n", lease.DNS),
        0o644,
    )

    return lease
}

Once we have a lease, we use netlink to assign the IP address and add the default route. Finally, we write /etc/resolv.conf so DNS resolution works. Technically we don’t need this since we’re just serving HTTP and never making outbound requests that need DNS, but it’s nice to have if you ever want to extend the binary to fetch something. I do end up removing this later.

Graceful Shutdown

The last piece was graceful shutdown. When QEMU or a hypervisor sends an ACPI power button event, you want the VM to shut down cleanly rather than just dying. Normally this is handled by acpid or systemd, but we have neither.

The ACPI power button shows up as a Linux input device. The Go binary scans /sys/class/input/ to find the device named “Power Button”, then reads input events from /dev/input/eventN.

Each event is a 24 byte input_event struct:

| timestamp (16 bytes) | type (2) | code (2) | value (4) |

We’re looking for type=EV_KEY, code=KEY_POWER, value=1 (pressed):

const (
    EV_KEY    = 0x01
    KEY_POWER = 116
)

func listenPowerButton(shutdownCh chan<- struct{}) {
    f, _ := os.Open(findPowerButtonDevice())
    defer f.Close()

    buf := make([]byte, 24)
    for {
        f.Read(buf)
        evType  := binary.LittleEndian.Uint16(buf[16:18])
        evCode  := binary.LittleEndian.Uint16(buf[18:20])
        evValue := binary.LittleEndian.Uint32(buf[20:24])

        if evType == EV_KEY && evCode == KEY_POWER && evValue == 1 {
            shutdownCh <- struct{}{}
            return
        }
    }
}

When the power button is pressed, the binary syncs filesystems and calls syscall.Reboot(syscall.LINUX_REBOOT_CMD_POWER_OFF). Clean shutdown, no external tools required.

The Result

Now the initramfs contained exactly one file: /init, my Go binary. No busybox, no shell, no nothing.

Removing GRUB

With the Go init working, I turned my attention to the bootloader. GRUB is powerful but it’s also “large” and complex. Modern UEFI firmware can load executables directly from the ESP if they’re placed at the fallback path /EFI/BOOT/BOOT{X64,AA64}.EFI.

systemd-boot is much simpler than GRUB. It reads a config file, loads a kernel and initramfs, and hands off. That’s it. So I replaced GRUB with systemd-boot and the image got smaller again.

But then I discovered Unified Kernel Images. A UKI bundles the kernel, initramfs, and command line into a single PE executable. UEFI firmware loads it directly. No bootloader configuration, no separate files, just one blob.

ukify build \
    --linux=/uki/vmlinuz \
    --initrd=/uki/initramfs.img \
    --cmdline="console=tty0 console=ttyS0 random.trust_cpu=on" \
    --output=/uki/BOOT.EFI

Now the ESP contains exactly one file: EFI/BOOT/BOOTX64.EFI, which is the kernel, initramfs, and command line all in one.

Building Our Own Kernel

At this point I was still using Alpine’s kernel, and it was still the largest thing in the image. The kernel has loadable modules, but I wasn’t loading any of them. The kernel has thousands of drivers, but I only needed virtio.

Time to build a custom kernel.

I started with Linux 6.18 LTS and a minimal config. The key insight is that CONFIG_MODULES=n means there’s no module loading at all. Every driver is compiled directly into the kernel binary. This sounds limiting, but it means no /lib/modules directory, no modprobe, no depmod. The kernel is entirely self contained, and we know exactly what’s in it.

What we need (compiled in):

CONFIG_VIRTIO_NET, CONFIG_VIRTIO_BLK: VirtIO drivers for QEMU networking and block devices
CONFIG_VIRTIO_PCI, CONFIG_PCI: PCI bus support (virtio devices attach via PCI)
CONFIG_EFI_STUB: Kernel can boot directly from UEFI firmware
CONFIG_BLK_DEV_INITRD, CONFIG_RD_XZ: Initramfs support with XZ decompression
CONFIG_PROC_FS, CONFIG_SYSFS, CONFIG_DEVTMPFS: Virtual filesystems we mount
CONFIG_PACKET, CONFIG_INET, CONFIG_UNIX: Networking stack for DHCP and sockets
CONFIG_INPUT_EVDEV, CONFIG_ACPI_BUTTON: ACPI power button for graceful shutdown
CONFIG_HW_RANDOM_VIRTIO, CONFIG_RANDOM_TRUST_CPU: Entropy sources so DHCP doesn’t block
CONFIG_SERIAL_8250 / CONFIG_SERIAL_AMBA_PL011: Serial console (arch dependent)

What we disable:

CONFIG_MODULES: No module loading, everything built in
CONFIG_ETHERNET: Disables all vendor NIC drivers (virtio_net is separate)
CONFIG_EXT4_FS, CONFIG_XFS_FS, etc: No disk filesystems, rootfs is tmpfs
CONFIG_SCSI, CONFIG_ATA, CONFIG_NVME: No storage drivers, boot from initramfs
CONFIG_USB_SUPPORT: No USB in a VM
CONFIG_SOUND, CONFIG_DRM, CONFIG_FB: No audio or graphics
CONFIG_NETFILTER, CONFIG_WIRELESS: No firewall or wifi
CONFIG_KVM, CONFIG_XEN: We’re a guest, not a hypervisor host
CONFIG_FTRACE, CONFIG_DEBUG_KERNEL: No debugging or tracing overhead

There are more options than this (TTY support, size optimisations, arch specific drivers, etc.) but these are the important ones. Many options also pull in dependencies automatically, so the actual config ends up longer than you’d expect.

Setting CONFIG_EXPERT=y is crucial here. Many of these options are forced on by default in non-expert mode; without it, the kernel config system silently ignores your attempts to disable things like CONFIG_INPUT or CONFIG_HID. Which I didn’t realise until I’d wasted copious amounts of time on kernel compilations.

The build is a Docker stage that downloads the kernel source, applies my config fragments, and compiles:

cd "linux-${KERNEL_VERSION}"
make defconfig
scripts/kconfig/merge_config.sh -m .config /configs/config.common
make -j"$(nproc)" bzImage
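For reference, the config.common fragment merged above would contain lines like the following. This is a sketch reconstructed from the tables earlier; the real fragment is longer and the exact contents are an assumption. Disabled options use Kconfig’s “# CONFIG_FOO is not set” spelling, which merge_config.sh understands:

```
# config.common (sketch)
CONFIG_EXPERT=y
CONFIG_PCI=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_NET=y
CONFIG_VIRTIO_BLK=y
CONFIG_EFI_STUB=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_RD_XZ=y
CONFIG_INPUT_EVDEV=y
CONFIG_ACPI_BUTTON=y
# CONFIG_MODULES is not set
# CONFIG_ETHERNET is not set
# CONFIG_USB_SUPPORT is not set
# CONFIG_SOUND is not set
# CONFIG_DRM is not set
```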

The resulting kernel is ~3.7 MiB for x86_64. Combined with the ~2 MiB Go binary (compressed in the initramfs), the final image comes in around 8 MiB for amd64.

The Final Architecture

UEFI firmware
 |-- loads /EFI/BOOT/BOOTX64.EFI (the UKI)
     |-- kernel unpacks initramfs into tmpfs
         |-- execs /init (the Go binary)
             |-- mounts proc, sys, dev
             |-- brings up eth0, runs DHCP
             |-- listens for ACPI power button
             |-- starts HTTP server on :8080
             |-- on power button: sync, poweroff

No bootloader menu. No module loading. No shell. No package manager. The only userspace binary is the Go program, and it does everything.

The disk layout is equally simple:

minilinux.img (GPT)
|-- Partition 1: FAT32 ESP
    |-- EFI/BOOT/BOOTX64.EFI   (UKI: kernel + initramfs + cmdline)

That’s it. One partition, one file.

The Result

$ just build
... terrifying docker logs ...
SUCCESS: output/minilinux.amd64.img (7.7M)

$ just run
... kernel logs ...
:: Mounting filesystems...
:: Configuring network...
:: Running DHCP on eth0...
:: Lease: 10.0.2.15/24 gw 10.0.2.2 dns 10.0.2.3
:: Setting hostname...
:: Hostname: minilinux-525400
:: System ready
:: HTTP server listening on :8080

# From another terminal:
$ curl http://localhost:8080/
Hello, World!

A 7.7 MiB bootable image serving HTTP. For comparison, the scratch Docker image with just the Go binary would be about 2 MiB. So we’re paying ~5.7 MiB for a custom kernel and the ability to boot on bare metal (kind of… VM metal).

The arm64 build is even smaller at ~7 MiB because the aarch64 kernel seems to compress better with EFI zboot.

When Would You Actually Use This?

Honestly? Probably never. Containers exist for a reason and orchestration tools like Kubernetes solve real problems. But there are some edge cases where something this minimal could make sense.

For me, the real value was the journey. I now understand initramfs, kernel configuration, UEFI boot, and systemd-boot in a way I never did before. The next time I debug a boot failure, I’ll actually know what’s happening.

I’m also toying with the idea of building some form of software-based router using nftables (well, its netlink interface) with a web controller/UI.

What’s Next

The project is functional, but there’s more to explore.

The code will be at github.com/lcox74/bingo once I clean it up. The only requirement is Docker and just.


#Golang #Linux #Kernel