Linux Boot Process: Kernel Decompression

2019-03-11 11:47

This article was originally published at http://www.binss.me/blog/boot-process-of-linux-decompress-kernel/. Please credit the source when reposting.

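This article continues from "Linux Boot Process: From Boot to GRUB" and looks at how the Linux kernel initializes itself and starts running once GRUB hands over control. As before, the analysis is based on the x86_64 source code of Linux kernel 4.8.10.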

To understand how the kernel is loaded, we first need to understand the kernel image itself, the vmlinuz-<version> file under /boot: where it comes from and what it is made of.

Where the kernel image comes from

According to grub.cfg, the kernel file that GRUB loads with the linux command is /boot/vmlinuz-<version>. Its md5 / sha256 checksum is identical to that of arch/x86/boot/bzImage, the file that make bzImage produces in the kernel source tree. In other words, one of the steps of make install is to copy arch/x86/boot/bzImage to /boot/vmlinuz-... .

The bzImage target is defined in arch/x86/Makefile:

bzImage: vmlinux
ifeq ($(CONFIG_X86_DECODER_SELFTEST),y)
        $(Q)$(MAKE) $(build)=arch/x86/tools posttest
endif
        $(Q)$(MAKE) $(build)=$(boot) $(KBUILD_IMAGE)
        $(Q)mkdir -p $(objtree)/arch/$(UTS_MACHINE)/boot
        $(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/$(UTS_MACHINE)/boot/$@

The $(MAKE) $(build)=$(boot) $(KBUILD_IMAGE) step descends into arch/x86/boot and processes arch/x86/boot/Makefile, which contains the rule that generates bzImage:

$(obj)/bzImage: $(obj)/setup.bin $(obj)/vmlinux.bin $(obj)/tools/build FORCE
  $(call if_changed,image)
  @echo 'Kernel: $@ is ready' ' (#'`cat .version`')'

It invokes the image command, which is defined in the same file, arch/x86/boot/Makefile:

cmd_image = $(obj)/tools/build $(obj)/setup.bin $(obj)/vmlinux.bin \
             $(obj)/zoffset.h $@

According to the source of the build tool, arch/x86/boot/tools/build.c, it joins setup.bin and vmlinux.bin using the offsets provided by zoffset.h, then appends a computed CRC as a checksum to produce bzImage.

zoffset.h itself is generated by the following rules, also in arch/x86/boot/Makefile:

$(obj)/zoffset.h: $(obj)/compressed/vmlinux FORCE
    $(call if_changed,zoffset)

$(obj)/compressed/vmlinux: FORCE
  $(Q)$(MAKE) $(build)=$(obj)/compressed $@

zoffset.h records the offsets of several symbols in the compressed kernel image; we will look at it later.

vmlinux.bin

OBJCOPYFLAGS_vmlinux.bin := -O binary -R .note -R .comment -S
$(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
        $(call if_changed,objcopy)

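According to arch/x86/boot/Makefile, vmlinux.bin is produced from arch/x86/boot/compressed/vmlinux by objcopy: the file is converted to a raw binary (-O binary), with the .note and .comment sections removed and symbols stripped (-S).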

arch/x86/boot/compressed/vmlinux is built in arch/x86/boot/compressed/Makefile. It is linked from the .o files listed in vmlinux-objs-y, using the linker script compressed/vmlinux.lds:

$(obj)/vmlinux: $(vmlinux-objs-y) FORCE
  $(call if_changed,check_data_rel)
  $(call if_changed,ld)

vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
  $(obj)/string.o $(obj)/cmdline.o $(obj)/error.o \
  $(obj)/piggy.o $(obj)/cpuflags.o

Among these, piggy.o is assembled from piggy.S:

$(obj)/piggy.S: $(obj)/vmlinux.bin.$(suffix-y) $(obj)/mkpiggy FORCE
    $(call if_changed,mkpiggy)

cmd_mkpiggy = $(obj)/mkpiggy $< > $@ || ( rm -f $@ ; false )

$(obj)/vmlinux.bin.gz: $(vmlinux.bin.all-y) FORCE
  $(call if_changed,gzip)

vmlinux.bin.all-y := $(obj)/vmlinux.bin

According to arch/x86/boot/compressed/mkpiggy.c, the mkpiggy program takes vmlinux.bin.gz as input and generates piggy.S, whose contents look like this:

.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 6857603
.globl z_output_len
z_output_len = 22096448
.globl input_data, input_data_end
input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:

When piggy.S is assembled, arch/x86/boot/compressed/vmlinux.bin.gz is embedded into piggy.o as a section (.rodata..compressed).

Because the configuration selects gzip as the compression algorithm, $(suffix-y) is gz, and vmlinux.bin.gz is produced by gzip-compressing the vmlinux.bin in arch/x86/boot/compressed. (Note that this is not the vmlinux.bin in this section's title, which lives in arch/x86/boot.)

So where does the vmlinux.bin in arch/x86/boot/compressed come from? Back to arch/x86/boot/compressed/Makefile:

OBJCOPYFLAGS_vmlinux.bin :=  -R .comment -S
$(obj)/vmlinux.bin: vmlinux FORCE
  $(call if_changed,objcopy)

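According to this rule, it is the vmlinux in the root of the source tree with the .comment section removed and symbols stripped. That top-level vmlinux is produced by scripts/link-vmlinux.sh, which links together the object files that make up the kernel; we will not go deeper into that here.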

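To summarize, arch/x86/boot/vmlinux.bin is produced as follows:

The kernel is compiled and linked into vmlinux.
vmlinux is copied, with the .comment section removed and symbols stripped, to arch/x86/boot/compressed/vmlinux.bin.
arch/x86/boot/compressed/vmlinux.bin is gzip-compressed into arch/x86/boot/compressed/vmlinux.bin.gz, which is then embedded into arch/x86/boot/compressed/piggy.o.
arch/x86/boot/compressed/piggy.o is linked with the other .o files into arch/x86/boot/compressed/vmlinux.
arch/x86/boot/compressed/vmlinux is converted to a raw binary, with the .note and .comment sections removed, and written to arch/x86/boot/vmlinux.bin.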

setup.bin

Now for the other component of bzImage, setup.bin. Its rule is in arch/x86/boot/Makefile:

OBJCOPYFLAGS_setup.bin  := -O binary
$(obj)/setup.bin: $(obj)/setup.elf FORCE
  $(call if_changed,objcopy)

setup.bin is obtained from arch/x86/boot/setup.elf by dropping symbol and relocation information (-O binary converts the object file into a raw binary, which discards that information).

setup.elf in turn is produced by the linker, which links SETUP_OBJS using the linker script setup.ld:

LDFLAGS_setup.elf := -T
$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
  $(call if_changed,ld)

SETUP_OBJS is defined as follows:

setup-y         += a20.o bioscall.o cmdline.o copy.o cpu.o cpuflags.o cpucheck.o
setup-y         += early_serial_console.o edd.o header.o main.o memory.o
setup-y         += pm.o pmjump.o printf.o regs.o string.o tty.o video.o
setup-y         += video-mode.o version.o
setup-$(CONFIG_X86_APM_BOOT) += apm.o
setup-y         += video-vga.o
setup-y         += video-vesa.o
setup-y         += video-bios.o

SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))

So setup.elf is linked from the objects in setup-y. Among them, header.o is assembled from arch/x86/boot/header.S and depends on two generated headers:

$(obj)/header.o: $(obj)/voffset.h $(obj)/zoffset.h

According to arch/x86/boot/.voffset.h.cmd, voffset.h is generated as follows:

nm vmlinux | sed -n -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p' > arch/x86/boot/compressed/../voffset.h

In short, nm lists the symbol table of vmlinux, sed picks out the addresses of _text, __bss_start and _end, and each one is wrapped in a macro:

#define VO___bss_start _AC(0xffffffff82112000,UL)
#define VO__end _AC(0xffffffff8227e000,UL)
#define VO__text _AC(0xffffffff81000000,UL)

According to arch/x86/boot/.zoffset.h.cmd, zoffset.h is generated as follows:

nm arch/x86/boot/compressed/vmlinux | sed -n -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|input_data\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p' > arch/x86/boot/zoffset.h

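As with voffset.h, nm lists the symbol table of arch/x86/boot/compressed/vmlinux (the image that wraps the compressed kernel), sed picks out _text, the startup entry points and a few other symbols, and each one is wrapped in a macro: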

#define ZO__ehead 0x00000000000003b4
#define ZO__end 0x00000000006af000
#define ZO__text 0x000000000068a740
#define ZO_efi32_stub_entry 0x0000000000000190
#define ZO_efi64_stub_entry 0x0000000000000390
#define ZO_efi_pe_entry 0x0000000000000210
#define ZO_input_data 0x00000000000003b4
#define ZO_startup_32 0x0000000000000000
#define ZO_startup_64 0x0000000000000200
#define ZO_z_input_len 0x000000000068a383
#define ZO_z_output_len 0x0000000001512a40

Summary

The sections above traced the generation of bzImage backwards. The following build log shows the same process in forward order:

  LD      drivers/built-in.o
  LINK    vmlinux
  LD      vmlinux.o
  MODPOST vmlinux.o
  GEN     .version
  CHK     include/generated/compile.h
  UPD     include/generated/compile.h
  CC      init/version.o
  LD      init/built-in.o
  KSYM    .tmp_kallsyms1.o
  KSYM    .tmp_kallsyms2.o
  LD      vmlinux
  SORTEX  vmlinux
  SYSMAP  System.map
  CC      arch/x86/boot/a20.o
  AS      arch/x86/boot/bioscall.o
  CC      arch/x86/boot/cmdline.o
  AS      arch/x86/boot/copy.o
  HOSTCC  arch/x86/boot/mkcpustr
  CC      arch/x86/boot/cpuflags.o
  CC      arch/x86/boot/cpucheck.o
  CC      arch/x86/boot/early_serial_console.o
  CC      arch/x86/boot/edd.o
  CC      arch/x86/boot/main.o
  CC      arch/x86/boot/memory.o
  CC      arch/x86/boot/pm.o
  AS      arch/x86/boot/pmjump.o
  CC      arch/x86/boot/printf.o
  CC      arch/x86/boot/regs.o
  CC      arch/x86/boot/tty.o
  CC      arch/x86/boot/video.o
  CC      arch/x86/boot/string.o
  CC      arch/x86/boot/video-mode.o
  CC      arch/x86/boot/version.o
  CC      arch/x86/boot/video-vga.o
  CC      arch/x86/boot/video-vesa.o
  CC      arch/x86/boot/video-bios.o
  HOSTCC  arch/x86/boot/tools/build
  Building modules, stage 2.
  CPUSTR  arch/x86/boot/cpustr.h
  CC      arch/x86/boot/cpu.o
  LDS     arch/x86/boot/compressed/vmlinux.lds
  AS      arch/x86/boot/compressed/head_64.o
  VOFFSET arch/x86/boot/compressed/../voffset.h
  CC      arch/x86/boot/compressed/string.o
  CC      arch/x86/boot/compressed/cmdline.o
  CC      arch/x86/boot/compressed/error.o
  OBJCOPY arch/x86/boot/compressed/vmlinux.bin
  HOSTCC  arch/x86/boot/compressed/mkpiggy
  CC      arch/x86/boot/compressed/cpuflags.o
  CC      arch/x86/boot/compressed/early_serial_console.o
  CC      arch/x86/boot/compressed/eboot.o
  AS      arch/x86/boot/compressed/efi_stub_64.o
  AS      arch/x86/boot/compressed/efi_thunk_64.o
  GZIP    arch/x86/boot/compressed/vmlinux.bin.gz
  MODPOST 4553 modules
  CC      arch/x86/boot/compressed/misc.o
  MKPIGGY arch/x86/boot/compressed/piggy.S
  AS      arch/x86/boot/compressed/piggy.o
  DATAREL arch/x86/boot/compressed/vmlinux
  LD      arch/x86/boot/compressed/vmlinux
  ZOFFSET arch/x86/boot/zoffset.h
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Setup is 17436 bytes (padded to 17920 bytes).
System is 6653 kB
CRC 77bd4dc3
Kernel: arch/x86/boot/bzImage is ready  (#2)

The resulting kernel image, arch/x86/boot/bzImage, is laid out as follows:

            -------------------------
            |       setup.bin       |
        ->  ------------------------- _head (startup_32)
            |.head.text             |
            |                       |               uncompressed code(arch/x86/boot/compressed/head_64.S:ENTRY(startup_32))
            |                       |
            |                       |
vmlinux.bin |                       |
            -------------------------
            |                       |
            |   .rodata..compressed |               compressed code (vmlinux.bin.gz)
            |                       |
            ------------------------- _text
            |.text                  |
            ------------------------- _rodata       uncompressed code (part of head_64.o, misc.o, string.o ... )
            |.rodata                |
            ------------------------- _got
            |.got                   |
            ------------------------- _data
            |.data                  |
        ->  ------------------------- _bss
            |                       |
            ------------------------- _ebss
            |                       |
            ------------------------- _end

The uncompressed part of the kernel image is compiled as position-independent code (PIC), which is why it contains a GOT.

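Initialization

Now back to the Linux initialization flow. After GRUB loads the setup.bin and vmlinux.bin parts of vmlinuz to their respective locations, it normally jumps to setup.bin to start execution.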

header.S

As mentioned above, setup.bin is the raw binary form of setup.elf, and setup.elf is linked from the objects in setup-y using the linker script setup.ld. Let's look at setup.ld first:

/*
 * setup.ld
 *
 * Linker script for the i386 setup code
 */
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)                               /* target architecture */
ENTRY(_start)                                   /* program entry point */

SECTIONS
{
        . = 0;                                  /* bootsector start */
        .bstext         : { *(.bstext) }        /* gather the .bstext sections of all input files into the output .bstext section; likewise below */
        .bsdata         : { *(.bsdata) }

        . = 495;                                /* move the location counter to 495; any gap after .bsdata is padding */
        .header         : { *(.header) }        /* bootsector end */
        .entrytext      : { *(.entrytext) }
        .inittext       : { *(.inittext) }
        .initdata       : { *(.initdata) }
        __end_init = .;

        .text           : { *(.text) }
        .text32         : { *(.text32) }

        . = ALIGN(16);                          /* align to a 16-byte boundary */
        .rodata         : { *(.rodata*) }

        .videocards     : {
                video_cards = .;
                *(.videocards)
                video_cards_end = .;
        }

        . = ALIGN(16);
        .data           : { *(.data*) }

        .signature      : {
                setup_sig = .;
                LONG(0x5a5aaa55)
        }


        . = ALIGN(16);
        .bss            :
        {
                __bss_start = .;
                *(.bss)
                __bss_end = .;
        }
        . = ALIGN(16);
        _end = .;

        /DISCARD/ : { *(.note*) }

        /*
         * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility:
         */
        . = ASSERT(_end <= 0x8000, "Setup too big!");                     /* setup must not exceed 32 KB */
        . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
        /* Necessary for the very-old-loader check to work... */
        . = ASSERT(__end_init <= 5*512, "init sections too big!");

}

Of the objects linked into setup.elf, the most important is header.o, assembled from arch/x86/boot/header.S. It defines most of the early sections named in this linker script:

        .code16
        .section ".bstext", "ax"

        .global bootsect_start
bootsect_start:
// Support booting via UEFI. A UEFI image uses the PE32+ format, which starts with "MZ"
#ifdef CONFIG_EFI_STUB
        # "MZ", MS-DOS header
        .byte 0x4d
        .byte 0x5a
#endif

        # Normalize the start address
        ljmp    $BOOTSEG, $start2

start2:
        movw    %cs, %ax                // initialize the segment registers so that ds = es = ss = cs
        movw    %ax, %ds
        movw    %ax, %es
        movw    %ax, %ss
        xorw    %sp, %sp                // zero sp
        sti                             // enable interrupts
        cld

        movw    $bugger_off_msg, %si

msg_loop:                               // print the message one character at a time
        lodsb
        andb    %al, %al
        jz      bs_die
        movb    $0xe, %ah               // Write Character in TTY Mode
        movw    $7, %bx
        int     $0x10
        jmp     msg_loop

bs_die:
        # Allow the user to press a key, then reboot
        xorw    %ax, %ax
        int     $0x16                   // BIOS keyboard service: wait for a key press
        int     $0x19                   // reboot

        # int 0x19 should never return.  In case it does anyway,
        # invoke the BIOS reset code...
        ljmp    $0xf000,$0xfff0
...
  .section ".bsdata", "a"
bugger_off_msg:
  .ascii  "Use a boot loader.\r\n"
  .ascii  "\n"
  .ascii  "Remove disk and press any key to reboot...\r\n"
  .byte 0
...

The first part of the header is 17 bytes long (the trailing comments give the running byte count):

  .section ".header", "a"
  .globl  sentinel
sentinel: .byte 0xff, 0xff        /* Used to detect broken loaders */     // 2

  .globl  hdr
hdr:
setup_sects:  .byte 0     /* Filled in by build.c */                      // 3
root_flags: .word ROOT_RDONLY                                             // 5
syssize:  .long 0     /* Filled in by build.c */                          // 9
ram_size: .word 0     /* Obsolete */                                      // 11
vid_mode: .word SVGA_MODE                                                 // 13
root_dev: .word 0     /* Filled in by build.c */                          // 15
boot_flag:  .word 0xAA55                                                  // 17

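Recall that setup.ld places ". = 495;" after the .bsdata section. The .bstext and .bsdata sections plus padding therefore occupy 495 bytes, and together with the first 17 bytes of the header they add up to exactly 512 bytes: one sector, the boot sector.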

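In the old days, a Linux kernel image could be written to a floppy disk and booted directly, and this code was the first to run. Once bootloaders such as GRUB came along, the first sector of the disk belonged to the bootloader, so the bootloader runs first. After the necessary initialization, it loads the setup.bin part at some address X (0x8a000 in my environment) and jumps straight to X + 0x200 (512), skipping the boot sector. This code therefore never gets a chance to run.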

For compatibility, today's kernel still keeps the boot sector, but its only job is to tell the user that the kernel was not started from a bootloader: msg_loop prints "Use a boot loader." and "Remove disk and press any key to reboot...", and bs_die waits for a key press before rebooting.

Since the boot sector is skipped, let's look at the code that follows it (starting at 0x8a200, i.e. _start):

  # offset 512, entry point

  .globl  _start
_start:
    # Explicitly enter this as bytes, or the assembler
    # tries to generate a 3-byte jump here, which causes
    # everything else to push off to the wrong offset.
    # The jump is emitted as raw bytes because the assembler might otherwise pick a 3-byte jmp encoding, which would shift everything after it
    .byte 0xeb    # short (2-byte) jump
    .byte start_of_setup-1f
1:

    # Part 2 of the header, from the old setup.S

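The eb opcode is a short relative jump. It is followed by a one-byte offset from 00 to ff, where 00 to 7f jump forward and 80 to ff jump backward. Here the offset is the distance from the first label named 1 after the jump (1f) to start_of_setup, i.e. start_of_setup minus 1f. Because that label immediately follows the jump instruction, the jump effectively lands on start_of_setup. What gets skipped is the rest of the header section, which corresponds to the latter part of linux_kernel_params in GRUB. GRUB has already filled in these fields; they are used to pass boot parameters.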

Now let's look at start_of_setup, located in the .entrytext section:

  .section ".entrytext", "ax"
start_of_setup:
# Force %es = %ds
    /* set es = ds */
    /* when the bootloader hands control to the kernel, es, ds and ss should already be equal; this is repeated here to be safe */
    movw    %ds, %ax
    movw    %ax, %es
    /* clear the direction flag */
    cld

# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,
# which happened to work by accident for the old code.  Recalculate the stack
# pointer if %ss is invalid.  Otherwise leave it alone, LOADLIN sets up the
# stack behind its own code, so we can't blindly put it directly past the heap.

    /* check whether ss equals ds; if not, the stack has to be set up again */
    movw    %ss, %dx
    cmpw    %ax, %dx    # %ds == %ss?
    movw    %sp, %dx
    je  2f      # -> assume %sp is reasonably set

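    /* not equal: the existing stack is invalid and must be rebuilt */
    # Invalid %ss, make up a new stack
    /* start from _end (the end of the BSS) as the low end of the stack (dx) */
    movw    $_end, %dx
    /* if the heap may be used, leave room for it */
    testb   $CAN_USE_HEAP, loadflags
    jz  1f
    /*
     * use heap_end_ptr (the end of the heap) as the low end of the stack (dx);
     * GRUB sets it to GRUB_LINUX_HEAP_END_OFFSET, i.e. (0x9000 - 0x200)
     */
    movw    heap_end_ptr, %dx
    /* reserve the stack: STACK_SIZE (512 bytes) above that point */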
1:  addw    $STACK_SIZE, %dx
    jnc 2f
    xorw    %dx, %dx    # Prevent wraparound

2:  # Now %dx should point to the end of our stack space
    /* align dx (the future sp) to 4 bytes (dword) by clearing its two lowest bits */
    andw    $~3, %dx    # dword align (might as well...)
    jnz 3f
    /*
     * if the aligned value is 0 (bits 2 and above are all zero), use 0xfffc
     * instead (also 4-byte aligned); otherwise keep the value as is
     */
    movw    $0xfffc, %dx    # Make sure we're not zero

    /* load ss and the stack pointer; the stack is now ready */
3:  movw    %ax, %ss
    movzwl  %dx, %esp   # Clear upper half of %esp
    sti         # Now we should have a working stack

# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
    /* push ds and the address of label 6, so that lretw loads ds:6f into cs:ip and execution continues at label 6 */
    pushw   %ds
    pushw   $6f
    lretw
6:

# Check signature at end of setup
/*
 * verify the signature and report an error if it does not match
 * setup_sig is defined in arch/x86/boot/setup.ld
 */
    cmpl    $0x5a5aaa55, setup_sig
    jne setup_bad

# Zero the bss
/* clear the BSS: zero everything from __bss_start to _end */
    movw    $__bss_start, %di
    movw    $_end+3, %cx
    xorl    %eax, %eax
    /* iteration count = (cx - di) >> 2 */
    subw    %di, %cx
    shrw    $2, %cx
    rep; stosl

# Jump to C code (should not return)
/* call main() in boot/main.c */
    calll   main

Summary

Besides holding the boot sector and carrying the Linux boot parameters, header.S contains the start_of_setup code, which does the following:

Sets up the stack
Enables interrupts
Clears the BSS
Calls main() in boot/main.c

Only once the stack and the BSS have been initialized is it safe to run C code, which is why control passes to main() in boot/main.c at this point.

Without a heap, the memory layout at this point is:

            |        vmlinux         |
100000      +------------------------+
            |          ...           |
            +------------------------+   esp (initial stack pointer)
STACK_SIZE  |         Stack          |
            +------------------------+   _end (low end of stack)
            |          BSS           |
            +------------------------+   __bss_start
            |    setup.bin other     |
08a200      +------------------------+
            |  setup.bin bootsector  |
08a000      +------------------------+
            |      Boot loader       |
007c00      +------------------------+
            |                        |
            +------------------------+

With a heap:

            |        vmlinux         |
100000      +------------------------+
            |          ...           |
            +------------------------+   esp (initial stack pointer)
STACK_SIZE  |         Stack          |
009000-200  +------------------------+   heap_end_ptr (low end of stack / end of heap)
    ?       |         Heap           |
            +------------------------+
            |          ...           |
            +------------------------+   _end
            |          BSS           |
            +------------------------+   __bss_start
            |    setup.bin other     |
08a200      +------------------------+
            |  setup.bin bootsector  |
08a000      +------------------------+
            |      Boot loader       |
007c00      +------------------------+
            |                        |
            +------------------------+

arch/x86/boot/main.c

void main(void)
{
    /* First, copy the boot header into the "zeropage" */
    /*
     * Copy the setup_header that GRUB filled in (in header.S) into boot_params.hdr.
     * Note that this memcpy is not the one from glibc but the one defined in copy.S:
     * GLOBAL(memcpy)
     *   pushw %si
     *   pushw %di
     *   movw  %ax, %di
     *   movw  %dx, %si
     *   pushw %cx
     *   shrw  $2, %cx
     *   rep; movsl
     *   popw  %cx
     *   andw  $3, %cx
     *   rep; movsb
     *   popw  %di
     *   popw  %si
     *   retl
     * ENDPROC(memcpy)
     * It takes its arguments in ax, dx and cx, copies 4 bytes at a time, then copies the remainder byte by byte.
     * If the kernel was booted with the old command-line protocol, boot_params.hdr.cmd_line_ptr is updated as well.
     */
    copy_boot_params();

    /* Initialize the early-boot console */
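    /*
     * Initialize the console. If the earlyprintk option is present, the given device is used as the console, e.g. serial,0x3f8,115200.
     * From here on, characters can be printed via puts => putchar, which ultimately calls the BIOS through int 0x10.
     */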
    console_init();
    if (cmdline_find_option_bool("debug"))
        puts("early console in setup code\n");

    /* End of heap check */
    /* if CAN_USE_HEAP is set, initialize the heap */
    init_heap();

    /* Make sure we have all the proper CPU support */
    /* check that the CPU level is high enough for this kernel and that every required feature is present; the information comes from cpuid or rdmsr */
    if (validate_cpu()) {
        puts("Unable to boot - please use a kernel appropriate "
             "for your CPU.\n");
        die();
    }

    /* Tell the BIOS what CPU mode we intend to run in. */
    /* x86_64 only: tell the BIOS via int 0x15 that we are going to enter long mode */
    set_bios_mode();

    /* Detect memory layout */
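    /*
     * Detect the memory layout.
     * Three interfaces are tried: 0xe820, 0xe801 and 0x88; all of them query memory through int 0x15.
     * The BIOS reports usable, reserved and other memory regions, with the start address, length and type of each,
     * and this information is stored in boot_params.e820_map.
     * It can be seen in dmesg, in the lines following "BIOS-provided physical RAM map".
     */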
    detect_memory();

    /* Set keyboard repeat rate (why?) and query the lock flags */
    /* initialize the keyboard: query its state via BIOS int 0x16, then set the repeat rate (how fast a held key generates characters) */
    keyboard_init();

    /* Query Intel SpeedStep (IST) information */
    query_ist();

    /* Query APM information */
    /* query Advanced Power Management information via BIOS int 0x15, then call again to connect the 32-bit interface (done twice; the second call is a check) */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
    query_apm_bios();
#endif

    /* Query EDD information */
    /* collect Enhanced Disk Drive information (support for large disks) into boot_params.eddbuf and boot_params.edd_mbr_sig_buffer */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
    query_edd();
#endif

    /* Set the video mode */
    set_video();

    /* Do the last things and invoke protected mode */
    /* switch to protected mode */
    go_to_protected_mode();
}

P.S. boot_params is defined in arch/x86/include/uapi/asm/bootparam.h

init_heap

static void init_heap(void)
{
    char *stack_end;

    if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
        /* compute the lowest address the stack may reach: stack_end = esp - STACK_SIZE */
        asm("leal %P1(%%esp),%0"
            : "=r" (stack_end) : "i" (-STACK_SIZE));
        /* compute the end of the heap */
        heap_end = (char *)
           ((size_t)boot_params.hdr.heap_end_ptr + 0x200);
        /* clamp the heap so that it does not run into the stack */
        if (heap_end > stack_end)
            heap_end = stack_end;
    } else {
        /* Boot protocol 2.00 only, no heap available */
        puts("WARNING: Ancient bootloader, some functionality "
             "may be limited!\n");
    }
}

RESET_HEAP sets HEAP to _end, which marks the start of the heap. The memory layout is:

            |        vmlinux         |
100000      +------------------------+
            |          ...           |
            +------------------------+   esp (initial stack pointer)
STACK_SIZE  |         Stack          |
009000-200  +------------------------+   stack_end (low end of stack)  heap_end_ptr (end of heap)
    ?       |         Heap           |
            +------------------------+   _end   HEAP (start of heap)
            |          BSS           |
            +------------------------+   __bss_start
            |    setup.bin other     |
08a200      +------------------------+
            |  setup.bin bootsector  |
08a000      +------------------------+
            |      Boot loader       |
007c00      +------------------------+
            |                        |
            +------------------------+

set_video

void set_video(void)
{
    /* vid_mode is filled in by the bootloader */
    u16 mode = boot_params.hdr.vid_mode;

    /* reset the heap pointer (HEAP) to _end */
    RESET_HEAP();
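    /*
     * save the display parameters into boot_params.screen_info
     * set video_segment to the appropriate address for the current video mode
     * save the font size in boot_params.screen_info.orig_video_points
     * save the number of rows in boot_params.screen_info.orig_video_lines
     * save the number of columns in boot_params.screen_info.orig_video_cols
     */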
    store_mode_params();
    /*
     * copy the information set in boot_params.screen_info into saved_screen
     * allocate rows * columns * sizeof(u16) bytes on the heap for saved_screen.data, to hold the contents that video_segment points to (32 KB)
     */
    save_screen();
    /* walk all video_cards (card_info) and call their probe functions to find out which modes they support */
    probe_cards(0);

    /* try to set the video mode */
    for (;;) {
        /* if vid_mode=ask, show a menu and let the user choose */
        if (mode == ASK_VGA)
            mode = mode_menu();
        /* call set_mode on the first card that supports this mode */
        if (!set_mode(mode))
            break;

        printf("Undefined video mode number: %x\n", mode);
        mode = ASK_VGA;
    }
    boot_params.hdr.vid_mode = mode;
    /* store the Extended Display Identification Data (EDID) */
    vesa_store_edid();
    /* save the display parameters again */
    store_mode_params();

    /* restore the screen contents from saved_screen */
    if (do_restore)
        restore_screen();
}

In short, set_video reads the video mode that the bootloader stored in setup_header and applies it through the video driver that supports that mode.

go_to_protected_mode

Last comes the crucial go_to_protected_mode. It disables and masks interrupts, sets up the IDT and GDT, and then calls protected_mode_jump to switch to protected mode:

void go_to_protected_mode(void)
{
    /* Hook before leaving real mode, also disables interrupts */
    /* call the boot_params.hdr.realmode_swtch hook if one is set; otherwise disable maskable interrupts and the non-maskable interrupt (NMI) */
    realmode_switch_hook();

    /* Enable the A20 gate */
    /* enable the A20 address line; on failure, print an error and halt */
    if (enable_a20()) {
        puts("A20 gate not responding, unable to boot...\n");
        die();
    }

    /* Reset coprocessor (IGNNE#) */
    /* clear and reset the coprocessor */
    reset_coprocessor();

    /* Mask all interrupts in the PIC */
    /* mask every PIC interrupt (except IRQ2 on the primary PIC, which cascades the secondary PIC) */
    mask_all_interrupts();

    /* Actual transition to protected mode... */
    /* load IDTR so that it points to null_idt */
    setup_idt();
    /* load GDTR so that it points to boot_gdt */
    setup_gdt();
    /* switch to protected mode */
    protected_mode_jump(boot_params.hdr.code32_start,
            (u32)&boot_params + (ds() << 4));
}

The GDT is defined as follows:

static const u64 boot_gdt[] __attribute__((aligned(16))) = {
    /* CS: code, read/execute, 4 GB, base 0 */
    [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
    /* DS: data, read/write, 4 GB, base 0 */
    [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
    /* TSS: 32-bit tss, 104 bytes, base 4096 */
    /* We only have a TSS here to keep Intel VT happy;
       we don't actually use it for anything. */
    [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};

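It contains three entries: GDT_ENTRY_BOOT_CS (code segment), GDT_ENTRY_BOOT_DS (data segment) and GDT_ENTRY_BOOT_TSS (a TSS that is never used and exists only to keep Intel VT happy). Each entry is 8 bytes long (a u64); the table as a whole is aligned to 16 bytes.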

Each entry is built with the GDT_ENTRY macro, whose parameters are (flags, base, limit).

protected_mode_jump

It is defined in arch/x86/boot/pmjump.S:

/*
 * void protected_mode_jump(u32 entrypoint, u32 bootparams);
 */
/*
 * the arguments are passed in eax and edx
 * eax holds the code entry point, boot_params.hdr.code32_start
 * edx holds the address of the boot parameters (bootparams)
 */
GLOBAL(protected_mode_jump)
    movl    %edx, %esi      # Pointer to boot_params table

    xorl    %ebx, %ebx
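    /* bx = cs, the real-mode code segment */
    movw    %cs, %bx
    shll    $4, %ebx
    /* ebx = cs << 4; add it to the offset stored at label 2 so that the far jump below targets the linear address of in_pm32 */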
    addl    %ebx, 2f
    jmp 1f          # Short jump to serialize on 386/486
1:

    /* load the selectors of the GDT data segment and the TSS into cx and di */
    movw    $__BOOT_DS, %cx
    movw    $__BOOT_TSS, %di

    /* set bit 0 of cr0 (Protection Enable) to turn on protected mode */
    movl    %cr0, %edx
    orb $X86_CR0_PE, %dl    # Protected mode
    movl    %edx, %cr0

    # Transition to 32-bit mode
    /* far jump to __BOOT_CS:in_pm32 */
    .byte   0x66, 0xea      # ljmpl opcode
2:  .long   in_pm32         # offset
    .word   __BOOT_CS       # segment
ENDPROC(protected_mode_jump)

    .code32
    .section ".text32","ax"
/* protected-mode code */
GLOBAL(in_pm32)
    # Set up data segments for flat 32-bit mode
    /* reload the data segment registers with __BOOT_DS (passed in ecx) */
    movl    %ecx, %ds
    movl    %ecx, %es
    movl    %ecx, %fs
    movl    %ecx, %gs
    movl    %ecx, %ss
    # The 32-bit code sets up its own stack, but this way we do have
    # a valid stack if some debugging hack wants to use it.
    /* add the real-mode segment base (cs << 4, kept in ebx) so that esp becomes a linear address */
    addl    %ebx, %esp

    # Set up TR to make Intel VT happy
    /* load the task register (tr) with __BOOT_TSS */
    ltr %di

    # Clear registers to allow for future extensions to the
    # 32-bit boot protocol
    xorl    %ecx, %ecx
    xorl    %edx, %edx
    xorl    %ebx, %ebx
    xorl    %ebp, %ebp
    xorl    %edi, %edi

    # Set up LDTR to make Intel VT happy
    /* ecx was cleared just above, so this loads a null selector into ldtr */
    lldt    %cx

    /* jump to boot_params.hdr.code32_start */
    jmpl    *%eax           # Jump to the 32-bit entrypoint
ENDPROC(in_pm32)

arch/x86/include/asm/segment.h defines the GDT indices and the corresponding selectors for convenience:

#define GDT_ENTRY_BOOT_CS 2
#define GDT_ENTRY_BOOT_DS 3
#define GDT_ENTRY_BOOT_TSS  4
#define __BOOT_CS   (GDT_ENTRY_BOOT_CS*8)
#define __BOOT_DS   (GDT_ENTRY_BOOT_DS*8)
#define __BOOT_TSS    (GDT_ENTRY_BOOT_TSS*8)

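Because the far jump targets __BOOT_CS:in_pm32, the code segment register (cs) now holds 0x10. The first few instructions of in_pm32 set every other segment register to __BOOT_DS, which is 0x18, and tr is then set to __BOOT_TSS, which is 0x20.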

Finally, control jumps to boot_params.hdr.code32_start, which GRUB has already set; here it is 0x1000000.

startup_32

Set a breakpoint at 0x1000000 in gdb. Note that the architecture must be set to 32-bit (set arch i386); otherwise gdb gets stuck at 0x0000000000000000:

(gdb) hb *0x1000000
(gdb) c
Continuing.

Thread 1 hit Breakpoint 1, 0x01000000 in ?? ()
(gdb) info reg
eax            0x7ffc2b01 2147232513
ecx            0x7ffbaec0 2147200704
edx            0x0  0
ebx            0x0  0
esp            0x8a000  0x8a000
ebp            0x0  0x0 <irq_stack_union>
esi            0x8a000  565248
edi            0x0  0
eip            0x1000000  0x1000000
eflags         0x200046 [ PF ZF ID ]
cs             0x10 16
ss             0x18 24
ds             0x18 24
es             0x18 24
fs             0x18 24
gs             0x18 24

Disassembling shows that we have arrived at ENTRY(startup_32) in arch/x86/boot/compressed/head_64.S.

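Oddly, breakpoints set around arch/x86/boot/header.S:_start (0x8a200) never triggered. I also noticed that GRUB has already set up a GDT and IDT, enabled A20 and so on by itself. My guess is that at run time the setup.bin code is skipped altogether, which would be consistent with the 32-bit boot protocol in Documentation/x86/boot.txt, where the bootloader enters the kernel directly at its 32-bit entry point with esi pointing to boot_params. Execution would then start at startup_32 in vmlinux.bin: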

ENTRY(startup_32)
    /*
     * 32bit entry is 0 and it is ABI so immutable!
     * If we come here directly from a bootloader,
     * kernel(text+data+bss+brk) ramdisk, zero_page, command line
     * all need to be under the 4G limit.
     */
    /* clear the direction flag (DF), since string operations are used later */
    cld
    /*
     * Test KEEP_SEGMENTS flag to see if the bootloader is asking
     * us to not reload segments
     */
    /* if GRUB set bit 6 of loadflags (KEEP_SEGMENTS), the segment registers do not need to be reloaded */
    testb $KEEP_SEGMENTS, BP_loadflags(%esi)
    jnz 1f

    cli
    movl    $(__BOOT_DS), %eax
    movl    %eax, %ds
    movl    %eax, %es
    movl    %eax, %ss
1:

/*
 * Calculate the delta between where we were compiled to run
 * at and where we were actually loaded at.  This can only be done
 * with a short local call on x86.  Nothing  else will tell us what
 * address we are running at.  The reserved chunk of the real-mode
 * data at 0x1e4 (defined as a scratch field) are used as the stack
 * for this calculation. Only 4 bytes are needed.
 */
    /* obtain the address of the current instruction */
    leal    (BP_scratch+4)(%esi), %esp
    call    1f
1:  popl    %ebp
    /* compute the difference between the link-time address and the actual load address */
    subl    $1b, %ebp

/* setup a stack and make sure cpu supports long mode. */
    /* add the delta to the link-time address of the stack top to get its real address, and load that into esp */
    movl    $boot_stack_end, %eax
    addl    %ebp, %eax
    movl    %eax, %esp

    /*
     * arch/x86/kernel/verify_cpu.S
     * uses cpuid to check whether the CPU supports long mode and SSE
     */
    call    verify_cpu
    testl   %eax, %eax
    /* not supported: no_longmode halts in a loop (hlt, then jump back to the preceding label 1 whenever an interrupt wakes the CPU) */
    jnz no_longmode

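From vmlinux.bin's point of view, startup_32 is at address 0. In reality it is not loaded at address 0; the breakpoint above shows that it sits at 0x1000000. How can the code find this out? It uses a trick: call pushes the address of the next instruction onto the stack. If the call target is that very next instruction, the call pushes the address and continues there as if no jump had happened. If that instruction is a pop, it pops the pushed address, which gives us "the address of the current instruction". This involves a push and a pop, but no stack has been set up yet (we have only just arrived). boot_params therefore reserves 4 bytes to hold the pushed address, from boot_params.BP_scratch to boot_params.BP_scratch + 4, and esp is set to BP_scratch + 4 (the stack grows from high addresses to low).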

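ebp thus holds the difference between the link-time address of label 1 (0x0 + offset_of_label) and its actual load address (0x1000000 + offset_of_label), that is, the actual offset of the vmlinux.bin code in memory, 0x1000000. This offset is then rounded up to the kernel alignment and stored in ebx: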

#ifdef CONFIG_RELOCATABLE
    movl    %ebp, %ebx
    movl    BP_kernel_alignment(%esi), %eax
    decl    %eax
    addl    %eax, %ebx
    notl    %eax
    andl    %eax, %ebx
    /* Compare the aligned address in ebx with LOAD_PHYSICAL_ADDR; use it if it is greater or equal */
    cmpl    $LOAD_PHYSICAL_ADDR, %ebx
    jge 1f
#endif
    movl    $LOAD_PHYSICAL_ADDR, %ebx
1:

    /* Target address to relocate to for decompression */
    movl    BP_init_size(%esi), %eax
    subl    $_end, %eax
    addl    %eax, %ebx

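At build time CONFIG_PHYSICAL_START sets the default physical start address to 0x1000000. However, if the kernel crashes, a rescue kernel (used to collect diagnostic information) may need to be loaded at a different address. CONFIG_RELOCATABLE exists for this: it allows the kernel to run from a physical address other than the default. The code therefore takes the actual load address, rounds it up to kernel_alignment, and keeps it only if it is not below LOAD_PHYSICAL_ADDR; otherwise it falls back to LOAD_PHYSICAL_ADDR (0x1000000).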
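
The address selection in the CONFIG_RELOCATABLE branch can be sketched in C. This is only an illustration of the arithmetic: the function name choose_decompress_base and its parameters are hypothetical and do not appear in the kernel, and the comparison here is unsigned, whereas the assembly uses the signed jge.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the CONFIG_RELOCATABLE branch of startup_32.
 * load_addr          - where the image actually runs (the delta held in ebp)
 * kernel_alignment   - boot_params.hdr.kernel_alignment, assumed a power of two
 * load_physical_addr - the build-time default (LOAD_PHYSICAL_ADDR)
 */
uint32_t choose_decompress_base(uint32_t load_addr,
                                uint32_t kernel_alignment,
                                uint32_t load_physical_addr)
{
    uint32_t mask = kernel_alignment - 1;        /* decl %eax */
    uint32_t base = (load_addr + mask) & ~mask;  /* addl, notl, andl: round up */

    /* cmpl + jge: keep the aligned address unless it is below the default */
    return base >= load_physical_addr ? base : load_physical_addr;
}
```

With a 2 MB alignment, a load address of 0x2100000 would be rounded up to 0x2200000, while a load address of 0x100000 would fall back to the default 0x1000000.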

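ebx then has init_size - _end added to it, giving the target address to which the compressed image is relocated for decompression; it serves as the offset for later address calculations. Next comes the preparation for jumping to 64-bit code: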

/*
 * Prepare for entering 64 bit mode
 */

    /* Load new GDT with the 64bit segments using 32bit descriptor */
    /*
     * Store the actual address of gdt in the field at gdt+2 (skipping the .word gdt_end - gdt)
     * gdt is defined in the .data section of the same file
     */
    leal    gdt(%ebp), %eax
    movl    %eax, gdt+2(%ebp)
    lgdt    gdt(%ebp)

    /* Enable PAE mode */
    /* Set the PAE bit in cr4 to enable Physical Address Extension */
    movl    %cr4, %eax
    orl $X86_CR4_PAE, %eax
    movl    %eax, %cr4

 /*
  * Build early 4G boot pagetable
  */
    /* Initialize Page tables to 0 */
    /*
     * Initialize the page tables
     * 6 tables of 4 KB each, so 24 KB of memory must be cleared
     * ecx is therefore 6144, since 6144 * 4 bytes = 24 KB
     */
    leal    pgtable(%ebx), %edi
    xorl    %eax, %eax
    movl    $(BOOT_INIT_PGT_SIZE/4), %ecx
    rep stosl

    /* Build Level 4 */
    /*
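     * PGD (Page Global Directory), i.e. the PML4 table
     * edi + 0x1007 = address of the next 4 KB page (edi + 0x1000, the PUD) plus the flags 0x7 (PRESENT + RW + USER)
     * The resulting PUD address with flags is stored in the first PML4 entry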
     */
    leal    pgtable + 0(%ebx), %edi
    leal    0x1007 (%edi), %eax
    movl    %eax, 0(%edi)

    /* Build Level 3 */
    /*
     * PUD (Page Upper Directory)
     * eax = address of the first PMD (edi + 0x1000) plus the flags 0x7
     */
    leal    pgtable + 0x1000(%ebx), %edi
    leal    0x1007(%edi), %eax

    /*
     * PMD (Page Middle Directory)
     * 4 of them, 0x1000 bytes each; their addresses are recorded in the PUD (4 entries of 8 bytes)
     */
    movl    $4, %ecx
1:  movl    %eax, 0x00(%edi)
    addl    $0x00001000, %eax
    addl    $8, %edi
    decl    %ecx
    jnz 1b

    /* Build Level 2 */
    /*
     * PD (Page Directory)
     * Fill in the PMDs: 4 PMDs hold 2048 entries in total
     * Each entry carries the flags 0x183: PRESENT + RW + PS (2 MB page) + GLOBAL
     * 2048 entries, each mapping a 2 MB (0x200000) page, cover 4 GB of memory
     */
    leal    pgtable + 0x2000(%ebx), %edi
    movl    $0x00000183, %eax
    movl    $2048, %ecx
1:  movl    %eax, 0(%edi)
    addl    $0x00200000, %eax
    addl    $8, %edi
    decl    %ecx
    jnz 1b

    /* Enable the boot page tables */
    /* Load the address of the PGD into cr3 */
    leal    pgtable(%ebx), %eax
    movl    %eax, %cr3

The GDT is defined as follows:

gdt:
  .word gdt_end - gdt           /* GDT length */
  .long gdt                     /* GDT base address (filled in at run time) */
  .word 0                       /* padding for alignment */
  .quad 0x0000000000000000      /* NULL descriptor */
  .quad 0x00af9a000000ffff      /* __KERNEL_CS */
  .quad 0x00cf92000000ffff      /* __KERNEL_DS */
  .quad 0x0080890000000000      /* TS descriptor */
  .quad   0x0000000000000000    /* TS continued */
gdt_end:

The page table structure is shown in the figure:

This page table can address 2 MB * 2048 = 4 GB of memory.
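
To make the layout concrete, here is a minimal C sketch that fills an array the same way the assembly fills pgtable. The function name build_boot_pagetable and the phys_base parameter are my own illustrative choices, not kernel names; the sketch assumes phys_base is 4 KB aligned, so OR-ing in the flags is equivalent to the additions done with leal.

```c
#include <stdint.h>
#include <string.h>

#define ENTRIES_PER_TABLE 512
#define BOOT_TABLES       6   /* 1 level-4 + 1 level-3 + 4 level-2 tables */

/*
 * Fill pgt (BOOT_TABLES * ENTRIES_PER_TABLE entries) like startup_32 does,
 * assuming the tables will live at physical address phys_base.
 */
void build_boot_pagetable(uint64_t *pgt, uint64_t phys_base)
{
    /* rep stosl: zero all six tables */
    memset(pgt, 0, BOOT_TABLES * ENTRIES_PER_TABLE * sizeof(uint64_t));

    /* Level 4: entry 0 points at the level-3 table, flags 0x7 */
    pgt[0] = (phys_base + 0x1000) | 0x7;

    /* Level 3: four entries point at the four level-2 tables */
    for (int i = 0; i < 4; i++)
        pgt[ENTRIES_PER_TABLE + i] =
            (phys_base + 0x2000 + (uint64_t)i * 0x1000) | 0x7;

    /* Level 2: 2048 entries, each identity-maps one 2 MB page, flags 0x183 */
    for (int i = 0; i < 2048; i++)
        pgt[2 * ENTRIES_PER_TABLE + i] = ((uint64_t)i * 0x200000) | 0x183;
}
```

The last level-2 entry maps the 2 MB page starting at 0xFFE00000, which is where the 4 GB limit comes from.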

After the real address of pgtable has been loaded into cr3, paging is still not active, because the PG bit of cr0 has not been set yet.

    /* Enable Long mode in EFER (Extended Feature Enable Register) */
    /* Set the Long Mode Enable (LME) bit in the EFER MSR (0xc0000080) */
    movl    $MSR_EFER, %ecx
    rdmsr
    btsl    $_EFER_LME, %eax
    wrmsr

    /* After gdt is loaded */
    /* Clear ldtr and load tr with the TSS entry of the GDT */
    xorl    %eax, %eax
    lldt    %ax
    movl    $__BOOT_TSS, %eax
    ltr %ax

    /*
     * Setup for the jump to 64bit mode
     *
     * When the jump is performend we will be in long mode but
     * in 32bit compatibility mode with EFER.LME = 1, CS.L = 0, CS.D = 1
     * (and in turn EFER.LMA = 1).  To jump into 64bit mode we use
     * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
     * We place all of the values on our mini stack so lret can
     * used to perform that far jump.
     */
    /* Push the kernel code segment selector (consumed by lret below) */
    pushl   $__KERNEL_CS
    leal    startup_64(%ebp), %eax
#ifdef CONFIG_EFI_MIXED
    movl    efi32_config(%ebp), %ebx
    cmp $0, %ebx
    jz  1f
    leal    handover_entry(%ebp), %eax
1:
#endif
    /* Push the actual load address of startup_64 */
    pushl   %eax

    /* Enter paged protected Mode, activating Long Mode */
    /* Set the PG bit in cr0 to enable paging */
    movl    $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
    movl    %eax, %cr0

    /*
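     * Once EFER.LME is 1 and paging is enabled, the CPU is in compatibility mode, a sub-mode of long mode
     * As the name suggests, this mode can still execute 32-bit code
     * To switch to 64-bit code, a far transfer that loads __KERNEL_CS into cs is all that is needed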
     */
    /* Jump from 32bit compatibility mode into 64bit mode. */
    /* Far return: jump to __KERNEL_CS:startup_64 and execute 64-bit code */
    lret
ENDPROC(startup_32)

Paging is finally enabled, and lret jumps to startup_64.
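
What makes the far return land in 64-bit mode is the L bit of the __KERNEL_CS descriptor in the GDT shown earlier. The following C sketch decodes a few bits of a descriptor; the helper names are hypothetical, and the bit positions follow the x86 segment descriptor layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected bits of a 64-bit segment descriptor (helper names are illustrative). */
bool desc_present(uint64_t d)    { return (d >> 47) & 1; } /* P: segment present */
bool desc_is_code(uint64_t d)    { return (d >> 43) & 1; } /* executable bit of the type field */
bool desc_long_mode(uint64_t d)  { return (d >> 53) & 1; } /* L: 64-bit code segment */
bool desc_default_32(uint64_t d) { return (d >> 54) & 1; } /* D/B: 32-bit default size */
```

Applied to the values in the GDT above, 0x00af9a000000ffff (__KERNEL_CS) decodes as a present code segment with L = 1 and D = 0, while 0x00cf92000000ffff (__KERNEL_DS) is a present data segment with L = 0 and D = 1.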

startup_64

Located in arch/x86/boot/compressed/head_64.S.

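Note that tracing this in a debugger is a little tricky. Before the switch from 32-bit to 64-bit, gdb was configured with set arch i386, and after the jump gdb reports an error. The workaround is to set a breakpoint in the 64-bit code (how do we know where the 64-bit code starts? Look at the eax value pushed earlier). In my case it is 0x1000200, so after setting the breakpoint I use the reconnect trick: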

(gdb) hb *0x1000200
Hardware assisted breakpoint 3 at 0x1000200
(gdb) c
Continuing.
Remote 'g' packet reply is too long: 01000080000000000000000000000000800000c000000000000000000000000000a008000000000000305302000000000000000100000000c0ea6b01000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000020001000000004600200010000000180000001800000018000000180000001800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007f0300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000801f0000
(gdb) disconnect
Ending remote debugging.
(gdb) set arch i386:x86-64
The target architecture is assumed to be i386:x86-64
(gdb) target remote :8887
Remote debugging using :8887
0x0000000001000200 in ?? ()

We arrive at the following code:

ENTRY(startup_64)
    /*
     * 64bit entry is 0x200 and it is ABI so immutable!
     * We come here either from startup_32 or directly from a
     * 64bit bootloader.
     * If we come here from a bootloader, kernel(text+data+bss+brk),
     * ramdisk, zero_page, command line could be above 4G.
     * We depend on an identity mapped page table being provided
     * that maps our entire kernel(text+data+bss+brk), zero page
     * and command line.
     */
#ifdef CONFIG_EFI_STUB
    /*
     * The entry point for the PE/COFF executable is efi_pe_entry, so
     * only legacy boot loaders will execute this jmp.
     */
    jmp preferred_addr
...
preferred_addr:
#endif

    /* Setup data segments. */
    /* Zero the segment registers again */
    xorl %eax, %eax
    movl %eax, %ds
    movl %eax, %es
    movl %eax, %ss
    movl %eax, %fs
    movl %eax, %gs
    /*
     * Compute the decompressed kernel start address.  It is where
     * we were loaded at aligned to a 2M boundary. %rbp contains the
     * decompressed kernel start address.
     *
     * If it is a relocatable kernel then decompress and run the kernel
     * from load address aligned to 2MB addr, otherwise decompress and
     * run the kernel from LOAD_PHYSICAL_ADDR
     *
     * We cannot rely on the calculation done in 32-bit mode, since we
     * may have been invoked via the 64-bit entry point.
     */
    /* Start with the delta to where the kernel will run at. */
    /*
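     * Compute the delta between the link-time address and the actual load address
     * This was already done in 32-bit mode, but a bootloader may skip the 32-bit code and boot directly from startup_64
     * So it has to be computed again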
     */
#ifdef CONFIG_RELOCATABLE
    leaq startup_32(%rip) /* - $startup_32 */, %rbp
    movl BP_kernel_alignment(%rsi), %eax
    decl %eax
    addq %rax, %rbp
    notq %rax
    andq %rax, %rbp
    cmpq $LOAD_PHYSICAL_ADDR, %rbp
    jge  1f
#endif
    movq $LOAD_PHYSICAL_ADDR, %rbp
1:

    /* Target address to relocate to for decompression */
    /*
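     * Read boot_params.init_size, the kernel initialization size
     * Subtract _end to get the offset, then add the actual start address in rbp:
     * the result is the address the compressed image is relocated to, saved in rbx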
     */
    movl BP_init_size(%rsi), %ebx
    subl $_end, %ebx
    addq %rbp, %rbx

    /* Set up the stack */
    /* Set the top of the stack */
    leaq boot_stack_end(%rbx), %rsp

    /* Zero EFLAGS */
    pushq     $0
    popfq

/*
 * Copy the compressed kernel to the end of our buffer
 * where decompression in place becomes safe.
 */
    /*
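     * Copy the compressed kernel to its new location (everything between _head and _bss)
     * The copy runs from high addresses to low
     * The loop below uses rsi, which currently points to boot_params, so save it on the stack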
     */
    pushq     %rsi
    /* Actual address where _bss-8 is currently loaded */
    leaq (_bss-8)(%rip), %rsi
    /* Target address of _bss-8 after relocation */
    leaq (_bss-8)(%rbx), %rdi
    movq $_bss /* - $startup_32 */, %rcx
    /* Each movsq copies 8 bytes, so divide the count by 8 */
    shrq $3, %rcx
    std
    rep  movsq
    /* Clear DF and restore rsi (boot_params) */
    cld
    popq %rsi

/*
 * Jump to the relocated address.
 */
    /* Jump to relocated */
    leaq relocated(%rbx), %rax
    jmp  *%rax

According to the layout diagram:

0           -------------------------
            |       setup.bin       |
        ->  ------------------------- _head (startup_32)
            |.head.text             |
            |                       |               uncompressed code(arch/x86/boot/compressed/head_64.S:ENTRY(startup_32))
            |                       |
            |                       |
vmlinux.bin |                       |
            -------------------------
            |                       |
            |   .rodata..compressed |               compressed code (vmlinux.bin.gz)
            |                       |
            ------------------------- _text
            |.text                  |               uncompressed code (part of head_64.o, misc.o, string.o ... )
            ------------------------- _rodata
            |.rodata                |
            ------------------------- _got
            |.got                   |
            ------------------------- _data
            |.data                  |
        ->  ------------------------- _bss
            |                       |
            ------------------------- _ebss
            |                       |
            ------------------------- _end

and recalling how vmlinuz is built, we know that:

startup_32 in head_64.S is linked at the very beginning of vmlinux.bin, in the .head.text section. This is because the line before the definition of startup_32 is __HEAD, and #define __HEAD .section ".head.text","ax".

The .rodata..compressed section holds the compressed Linux kernel code, i.e. vmlinux.bin.gz.

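The .text section holds the uncompressed part of the code. It begins with the part of head_64.o that does not belong to .head.text, namely relocated in head_64.S, because the line before the definition of relocated is .text. After that come misc.o, string.o, cmdline.o, and so on.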

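So what we need to do here is copy the image from the current _bss-8 to the relocated _bss-8, working from back to front, with a copy length equal to the distance from _head to _bss.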
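
The reason for copying from back to front is that the destination lies above the source and the two ranges may overlap; a forward copy could overwrite source data before it has been read. A minimal C sketch of the same idea follows (copy_backwards_qwords is a hypothetical name, not a kernel function):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy nbytes / 8 quadwords from src to dst, starting with the last quadword
 * and walking towards lower addresses, in the spirit of "std; rep movsq".
 * Safe when dst is above src and the two ranges overlap.
 */
void copy_backwards_qwords(uint64_t *dst, const uint64_t *src, size_t nbytes)
{
    size_t n = nbytes >> 3;   /* shrq $3, %rcx: 8 bytes per iteration */

    while (n--)
        dst[n] = src[n];      /* highest index first */
}
```

For instance, shifting the first four elements of a six-element array up by two positions works correctly with this function, even though the source and destination share two elements.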

Once the copy is complete, we jump to the relocated copy of the relocated label, which in my case is at 0x000000000250f4f0:

relocated:
/*
 * Clear BSS (stack is currently empty)
 */
    /* Clear the BSS section, since C code is about to run */
    xorl    %eax, %eax
    leaq    _bss(%rip), %rdi
    leaq    _ebss(%rip), %rcx
    subq    %rdi, %rcx
    shrq    $3, %rcx
    rep stosq

/*
 * Adjust our own GOT
 */
    /* Add the relocation offset rbx to every entry in _got */
    leaq    _got(%rip), %rdx
    leaq    _egot(%rip), %rcx
1:
    cmpq    %rcx, %rdx
    jae 2f
    addq    %rbx, (%rdx)
    addq    $8, %rdx
    jmp 1b
2:

/*
 * Do the extraction, and jump to the new kernel..
 */
    /*
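     * Decompress the kernel code, i.e. vmlinux.bin.gz
     * The following sets up the arguments of extract_kernel:
     * rdi - rmode - pointer to boot_params
     * rsi - heap - pointer to the early-stage heap
     * rdx - input_data - start address of the compressed kernel
     * rcx - input_len - length of the compressed kernel
     * r8  - output - start address of the decompressed kernel
     * r9  - output_len - length of the decompressed kernel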
     */
    pushq   %rsi            /* Save the real mode argument */
    movq    %rsi, %rdi      /* real mode address */
    leaq    boot_heap(%rip), %rsi   /* malloc area for uncompression */
    leaq    input_data(%rip), %rdx  /* input_data */
    movl    $z_input_len, %ecx  /* input_len */
    movq    %rbp, %r8       /* output target address */
    movq    $z_output_len, %r9  /* decompressed length, end of relocs */
    call    extract_kernel      /* returns kernel location in %rax */
    /* Make rsi point to boot_params again */
    popq    %rsi

/*
 * Jump to the decompressed kernel.
 */
    /* extract_kernel returned the start address of the decompressed kernel; jump there */
    jmp *%rax
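
The GOT adjustment loop in the listing above is simple enough to restate in C. This is only a sketch: adjust_got is a hypothetical name, and delta stands for the relocation offset held in rbx.

```c
#include <stdint.h>

/*
 * Add delta to every 8-byte entry in [got, egot), as the loop between
 * labels 1 and 2 does with rdx, rcx and rbx.
 */
void adjust_got(uint64_t *got, uint64_t *egot, uint64_t delta)
{
    for (uint64_t *p = got; p < egot; p++)   /* cmpq + jae: stop at _egot */
        *p += delta;                         /* addq %rbx, (%rdx) */
}
```

After this step every GOT entry holds a run-time address rather than a link-time one, which is what the C code in misc.c relies on.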

input_data, z_input_len and z_output_len are defined in piggy.S, which mkpiggy generates from vmlinux.bin.gz at build time and which records information about the compressed kernel.

extract_kernel is defined in arch/x86/boot/compressed/misc.c:

asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
                  unsigned char *input_data,
                  unsigned long input_len,
                  unsigned char *output,
                  unsigned long output_len)
{
    /*
     * Compute the size of the decompressed kernel
     * VO__end and VO__text are defined in arch/x86/boot/voffset.h, which is generated from the symbol table of vmlinux
     */
    const unsigned long kernel_total_size = VO__end - VO__text;
    unsigned long virt_addr = (unsigned long)output;

    /* Retain x86 boot parameters pointer passed from startup_32/64. */
    boot_params = rmode;

    /* Clear flags intended for solely in-kernel use. */
    boot_params->hdr.loadflags &= ~KASLR_FLAG;

    sanitize_boot_params(boot_params);

    if (boot_params->screen_info.orig_video_mode == 7) {
        vidmem = (char *) 0xb0000;
        vidport = 0x3b4;
    } else {
        vidmem = (char *) 0xb8000;
        vidport = 0x3d4;
    }

    lines = boot_params->screen_info.orig_video_lines;
    cols = boot_params->screen_info.orig_video_cols;

    console_init();
    debug_putstr("early console in extract_kernel\n");

    /* Temporary heap used for decompression; its length is BOOT_HEAP_SIZE, 0x10000 except for bzip2 (0x400000) */
    free_mem_ptr     = heap;    /* Heap */
    free_mem_end_ptr = heap + BOOT_HEAP_SIZE;

    /* Report initial kernel position details. */
    debug_putaddr(input_data);
    debug_putaddr(input_len);
    debug_putaddr(output);
    debug_putaddr(output_len);
    debug_putaddr(kernel_total_size);

    /*
     * The memory hole needed for the kernel is the larger of either
     * the entire decompressed kernel plus relocation table, or the
     * entire decompressed kernel plus .bss and .brk sections.
     */
    /*
     * Kernel ASLR (kernel Address Space Layout Randomization)
     * Obtain a random start address for the decompressed kernel
     */
    choose_random_location((unsigned long)input_data, input_len,
                (unsigned long *)&output,
                max(output_len, kernel_total_size),
                &virt_addr);

    /* Validate memory location choices. */
    if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
        error("Destination physical address inappropriately aligned");
    if (virt_addr & (MIN_KERNEL_ALIGN - 1))
        error("Destination virtual address inappropriately aligned");
#ifdef CONFIG_X86_64
    if (heap > 0x3fffffffffffUL)
        error("Destination address too large");
#else
    if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
        error("Destination address too large");
#endif
#ifndef CONFIG_RELOCATABLE
    if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
        error("Destination address does not match LOAD_PHYSICAL_ADDR");
    if ((unsigned long)output != virt_addr)
        error("Destination virtual address changed when not relocatable");
#endif

    debug_putstr("\nDecompressing Linux... ");
    /* Decompress the kernel in place */
    __decompress(input_data, input_len, NULL, NULL, output, output_len,
            NULL, error);
    /* Move the segments of the decompressed kernel to their proper locations */
    parse_elf(output);
    handle_relocations(output, output_len, virt_addr);
    debug_putstr("done.\nBooting the kernel.\n");
    return output;
}

choose_random_location

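If kASLR is not enabled (CONFIG_RANDOMIZE_BASE is not set), choose_random_location leaves output untouched and returns immediately. Otherwise the definition in kaslr.c is used:

=> If nokaslr is set on the kernel command line, return immediately
=> initialize_identity_maps    Fill in mapping_info. If we arrived via startup_32, cr3 already holds `_pgtable`, so it is saved to level4p, BOOT_INIT_PGT_SIZE is skipped in `pgt_data(mapping_info.context)`, which tracks page-table allocations, and the pgt_data area is zeroed. Otherwise a new top-level page table page has to be allocated and stored in level4p
=> mem_avoid_init                                     Find the unsafe memory regions and store them by type (mem_avoid_index) in mem_avoid, an array of mem_vector
    => add_identity_map                               Build 2 MB page-table entries for these regions and fill in the upper-level tables (identity mapping)
=> Set min_addr, the minimum start address for decompression; it may not exceed either the output address or 512 MB
=> find_random_phys_addr => process_e820_entry        Walk the entries of boot_params->e820_map (obtained earlier through BIOS routines) and add the qualifying regions to the global slot_areas as slot_area items
                         => slots_fetch_random        Pick one of the slot_areas (qualifying memory regions) at random and return its start address
=> add_identity_map => kernel_ident_mapping_init      Create the page-table pages at every level needed for the randomly chosen region and record their addresses in the upper-level tables
=> finalize_identity_maps                             Load the level-4 page table pointer of the identity mapping into cr3 to activate it
=> find_random_virt_addr                              Choose the kernel's virtual start address: on 32-bit it is simply output, while on 64-bit it is chosen at random (LOAD_PHYSICAL_ADDR + random_addr * CONFIG_PHYSICAL_ALIGN)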

mem_avoid_init

In essence this fills in the mem_avoid array, which describes memory ranges that are unsafe or already occupied and must be avoided when decompressing the kernel.

static struct mem_vector mem_avoid[MEM_AVOID_MAX];

enum mem_avoid_index {
  MEM_AVOID_ZO_RANGE = 0,
  MEM_AVOID_INITRD,
  MEM_AVOID_CMDLINE,
  MEM_AVOID_BOOTPARAMS,
  MEM_AVOID_MAX,
};

struct mem_vector {
  unsigned long start;
  unsigned long size;
};

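mem_avoid consists of 4 mem_vector entries, each of which describes a memory range by its start and size. They represent:

the region holding the compressed kernel
the region holding the initrd (obtained from boot_params)
the kernel command line region (obtained from boot_params)
the boot parameters region (boot_params itself)

While filling them in, add_identity_map is called to create the level-3 page table pages needed for these regions and add them to the level-4 table.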

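process_e820_entry

=> If the entry's start plus its length is still below minimum, discard it
=> Build a slot_area from the entry
=> mem_avoid_overlap    Check against the regions to avoid; on overlap, trim off the overlapping part, and if too much overlaps, discard the slot_area
=> If the slot_area is now smaller than image_size, discard it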
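
The core of the overlap check is an interval test on (start, size) pairs. Below is a minimal sketch of that test, not the kernel's actual code: struct mem_vector matches the definition shown earlier, while ranges_overlap and overlaps_any are hypothetical helper names.

```c
#include <stdbool.h>
#include <stddef.h>

struct mem_vector {
    unsigned long start;
    unsigned long size;
};

/* Two half-open ranges [start, start + size) overlap unless one ends before the other begins. */
bool ranges_overlap(const struct mem_vector *a, const struct mem_vector *b)
{
    if (a->start + a->size <= b->start)   /* a lies entirely before b */
        return false;
    if (a->start >= b->start + b->size)   /* a lies entirely after b */
        return false;
    return true;
}

/* Return true if candidate overlaps any of the count regions in avoid. */
bool overlaps_any(const struct mem_vector *candidate,
                  const struct mem_vector *avoid, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (ranges_overlap(candidate, &avoid[i]))
            return true;
    return false;
}
```

A candidate that starts exactly where an avoided region ends does not count as overlapping, because the ranges are treated as half-open.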

__decompress

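Every compression algorithm provides its own __decompress function, and the one used is fixed at build time. Since our kernel was built with CONFIG_KERNEL_GZIP=y, the implementation in lib/decompress_inflate.c is used. It calls __gunzip, which decompresses the contents of [input_data, input_data + input_len] into [output, output + output_len].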

parse_elf

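It takes the start address of the decompressed kernel as its argument, checks that the kernel image is valid, and moves its segments to their final locations.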

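The decompressed kernel (vmlinux) is an ELF file, so parse_elf reads its file header and uses it to decide whether the image is valid. If it is (the first 4 bytes are 7f 45 4c 46), space for the program header table is allocated based on the header. It then walks the program header table; for each entry of type PT_LOAD (a loadable segment), with relocation enabled (CONFIG_RELOCATABLE), the segment described by the entry is moved from output + phdr->p_offset to output + phdr->p_paddr - LOAD_PHYSICAL_ADDR.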

How many loadable segments does vmlinux contain? We can check with readelf:

binss@giantvm:~/work/GDB-Kernel$ readelf -l vmlinux

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000dd3000 0x0000000000dd3000  R E    200000
  LOAD           0x0000000001000000 0xffffffff81e00000 0x0000000001e00000
                 0x0000000000186000 0x0000000000186000  RW     200000
  LOAD           0x0000000001200000 0x0000000000000000 0x0000000001f86000
                 0x000000000001aad8 0x000000000001aad8  RW     200000
  LOAD           0x00000000013a1000 0xffffffff81fa1000 0x0000000001fa1000
                 0x0000000000171000 0x00000000002dd000  RWE    200000
  NOTE           0x0000000000a6a284 0xffffffff8186a284 0x000000000186a284
                 0x0000000000000204 0x0000000000000204         4

 Section to Segment mapping:
  Segment Sections...
   00     .text .notes __ex_table .rodata __bug_table .pci_fixup .builtin_fw .tracedata __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl __ksymtab_strings __init_rodata __param __modver
   01     .data .vvar
   02     .data..percpu
   03     .init.text .altinstr_aux .init.data .x86_cpu_dev.init .parainstructions .altinstructions .altinstr_replacement .iommu_table .apicdrivers .exit.text .smp_locks .data_nosave .bss .brk
   04     .notes

So there are 4 loadable segments.
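
Using the readelf output above, the destination offset of each loadable segment relative to output can be worked out directly. The sketch below assumes LOAD_PHYSICAL_ADDR is 0x1000000, as in my build; struct load_segment and segment_dest_offset are illustrative names, not kernel definitions.

```c
#include <stdint.h>

#define LOAD_PHYSICAL_ADDR 0x1000000ULL   /* assumption: the value for this article's build */

/* The three program header fields that matter for the move (illustrative struct). */
struct load_segment {
    uint64_t p_offset;   /* where the segment sits in the decompressed ELF file */
    uint64_t p_paddr;    /* the physical address it was linked for */
    uint64_t p_filesz;   /* number of bytes to move */
};

/* Offset from output at which the segment ends up when CONFIG_RELOCATABLE is set. */
uint64_t segment_dest_offset(const struct load_segment *seg)
{
    return seg->p_paddr - LOAD_PHYSICAL_ADDR;
}
```

For the first segment (p_offset 0x200000, p_paddr 0x1000000) the destination offset is 0, so 0xdd3000 bytes move from output + 0x200000 down to output. The second segment (p_paddr 0x1e00000) lands at output + 0xe00000.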

handle_relocations

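This function is defined only if CONFIG_X86_NEED_RELOCS is configured (as it is when kASLR is enabled). It processes the relocation table and adds the load offset to the locations it lists.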

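Once everything is done, the C function extract_kernel returns the start address of the decompressed kernel, and the assembly instruction jmp *%rax jumps there. In my experiment rax was 0x0000000001000000 at this point, and execution arrived at startup_64 in arch/x86/kernel/head_64.S.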

At this point we have finally reached the main body of the kernel proper. The rest of the flow is left for the next article.

Summary

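Starting from the composition of the kernel image, this article analyzed the execution flow after GRUB finishes its job and hands control to the code in the Linux kernel image. First comes the boot program (arch/x86/boot), which calls BIOS routines to perform some device checks and initialization and, along the way, attempts the switch to protected mode. Next is startup_32 in arch/x86/boot/compressed/head_64.S, which builds the first page table, enables paging, and switches to 64-bit mode. Then the 64-bit startup_64 decompresses the kernel to the appropriate location according to the relevant parameters, and finally jumps to the decompressed code, where startup_64 in arch/x86/kernel/head_64.S continues execution.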

The whole flow involves three entry points:

------------------------------------------------------------
|       setup.bin         |         vmlinux.bin            |
|boot sector | setup code |                                |
| (header.S) |            |                                |
------------------------------------------------------------
             |            |  |
             |            |  64 bit entry point(+0x200)
             |            |
             |            32 bit entry point(+0x0)
             |
             real mode(16 bit) entry point

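Because execution may begin at any of these entry points, the code following each one carefully performs a series of initialization steps to make sure the environment is as expected.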