Troubleshooting gem5 Errors

1. Compilation Errors

1.1. Treating Warnings as Errors

If your build fails because of harmless warnings, comment out the '-Werror' line in the SConstruct file:

# Treat warnings as errors but white list some warnings that we
# want to allow (e.g., deprecation warnings).
main.Append(CCFLAGS=['-Wno-error=deprecated-declarations',
                     '-Wno-error=deprecated',
                     #'-Werror',
                    ])

1.2. Python? No such file or directory!

If your build fails with the error /usr/bin/env: 'python': No such file or directory, install python3 and, if the python command is still missing, create a symlink like this:

sudo ln -s /usr/bin/python3 /usr/bin/python

2. Runtime Errors

This section lists some gem5 runtime errors that we had to solve during our development. It is far from a complete list, but it may still help someone.

2.1. AttributeError: Can't resolve proxy 'any' of type 'XXX' from 'XXX'

2.1.1. Goal

  • This is the error we will troubleshoot:

    AttributeError: Can't resolve proxy 'any' of type 'ArmSystem' from 'system.realview.generic_timer'
    

2.1.2. Explanation

  • The proxy parameter in gem5 is a Python helper mechanism used to handle, assign and verify the parameters of a SimObject. It is implemented in the Gem5/src/python/m5/proxy.py file.
  • A special proxy parameter is a proxy parameter which has a dedicated class in the proxy.py file. Consider this special proxy parameter (Parent.any):

    system = Param.System(Parent.any, "The system the object is part of")
    
    • This is its special implementation:
    class AnyProxy(BaseProxy):
        def find(self, obj):
            return obj.find_any(self._pdesc.ptype)
    
        def path(self):
            return 'any'
    
    • This means: "Assign to the system attribute of the current object any object of type System found in the parent object". It allows assigning an object of a precise type without knowing its name in the parent object.
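
  • The lookup-by-type idea can be illustrated with a toy model. This is only a sketch, not gem5's real implementation: ToyObject, ToySystem, ToyTimer and this simplified find_any are hypothetical names written for the illustration.

```python
# Toy model of a type-based lookup. This is NOT gem5's real code: the class
# names and the simplified find_any below are assumptions for illustration.
class ToyObject:
    def __init__(self, **children):
        # Children are stored by name, but the lookup never uses the names.
        self.children = dict(children)

    def find_any(self, ptype):
        """Return the object of type `ptype` visible from here, or None."""
        if isinstance(self, ptype):
            return self
        matches = [c for c in self.children.values() if isinstance(c, ptype)]
        if len(matches) > 1:
            raise AttributeError("several objects of the requested type")
        return matches[0] if matches else None


class ToySystem(ToyObject):
    pass


class ToyTimer(ToyObject):
    pass


timer = ToyTimer()
system = ToySystem(my_timer=timer)

# The timer is found through its type alone, whatever its attribute name.
found_by_type = system.find_any(ToyTimer)
# An object also matches a request for its own type.
found_self = system.find_any(ToySystem)
# Nothing of type ToySystem is visible from the timer itself.
not_found = timer.find_any(ToySystem)
```

    The point of the design is that the caller only states the type it needs; the attribute name chosen in the parent (my_timer here) plays no role.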

2.1.3. Resolution

  • To resolve our problem, we have to find the special proxy parameter by following the path given in the error message:
    • system inherits from the System class (System.py);
    • system.realview is of the VExpress_GEM5_V1 class (RealView.py);
    • system.realview.generic_timer is of the GenericTimer class (GenericTimer.py).
  • In the GenericTimer class, we can find the special proxy parameter mentioned in the message:

    system = Param.ArmSystem(Parent.any, "system")
    
  • This parameter searches for an ArmSystem in the parent (VExpress_GEM5_V1).
  • The VExpress_GEM5_V1 class has a system attribute, which is our system object here.
  • Therefore, our GenericTimer finds a System object but not the specialized ArmSystem object, which produces the type-matching error.
  • Finally, to resolve the proxy error, we have to change our system object to an ArmSystem object, or to an object which inherits from ArmSystem.
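
  • The failure and its fix can be reproduced with a small stand-alone model. It is a sketch under assumptions: the classes below are simplified stand-ins that only borrow the names System and ArmSystem, and resolve_parent_any is a hypothetical helper that walks up the parents, not gem5's actual resolution code.

```python
# Stand-alone model of the type mismatch. The classes are simplified
# stand-ins, and resolve_parent_any is a hypothetical helper (an assumption).
class ToyObject:
    def __init__(self, name):
        self.name = name
        self.parent = None

    def add_child(self, child):
        child.parent = self
        return child

    def path(self):
        if self.parent is None:
            return self.name
        return self.parent.path() + "." + self.name


class System(ToyObject):
    pass


class ArmSystem(System):
    pass


class Platform(ToyObject):
    pass


class Timer(ToyObject):
    pass


def resolve_parent_any(obj, ptype):
    """Walk up the parents of `obj` until one is an instance of `ptype`."""
    node = obj.parent
    while node is not None:
        if isinstance(node, ptype):
            return node
        node = node.parent
    raise AttributeError("Can't resolve proxy 'any' of type '%s' from '%s'"
                         % (ptype.__name__, obj.path()))


def build(system_cls):
    """Build system -> realview -> generic_timer with the given system class."""
    system = system_cls("system")
    realview = system.add_child(Platform("realview"))
    timer = realview.add_child(Timer("generic_timer"))
    return system, timer


# With a plain System, the timer cannot find the ArmSystem it asks for.
plain_system, plain_timer = build(System)
try:
    resolve_parent_any(plain_timer, ArmSystem)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# With an ArmSystem (or a subclass of it), the lookup succeeds.
arm_system, arm_timer = build(ArmSystem)
resolved = resolve_parent_any(arm_timer, ArmSystem)
```

    In a real configuration script, the corresponding change is to instantiate ArmSystem (or a subclass) where System was used before.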

2.2. gem5 has encountered a segmentation fault!

2.2.1. Goal

  • Troubleshoot this kind of cryptic, hard-to-understand error:

    gem5 has encountered a segmentation fault!
    
    --- BEGIN LIBC BACKTRACE ---
    /opt/Gem5/build/ARM/gem5.opt(+0xd14cc9)[0x5579a2c41cc9]
    /opt/Gem5/build/ARM/gem5.opt(+0xd2781f)[0x5579a2c5481f]
    /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7f766cdc8140]
    /opt/Gem5/build/ARM/gem5.opt(+0x134d5d4)[0x5579a327a5d4]
    /opt/Gem5/build/ARM/gem5.opt(+0x13504f2)[0x5579a327d4f2]
    /opt/Gem5/build/ARM/gem5.opt(+0x9b3a8f)[0x5579a28e0a8f]
    /opt/Gem5/build/ARM/gem5.opt(+0x586ebe)[0x5579a24b3ebe]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0xa1a78)[0x7f766ce77a78]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyObject_MakeTpCall+0xa7)[0x7f766ce78817]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0xa37d0)[0x7f766ce797d0]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x7dc4d)[0x7f766ce53c4d]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7639)[0x7f766ce518f9]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x73073)[0x7f766ce49073]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0xa379a)[0x7f766ce7979a]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x7dc4d)[0x7f766ce53c4d]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7639)[0x7f766ce518f9]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x73073)[0x7f766ce49073]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0xa379a)[0x7f766ce7979a]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x7dc4d)[0x7f766ce53c4d]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7639)[0x7f766ce518f9]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x73073)[0x7f766ce49073]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x7dc4d)[0x7f766ce53c4d]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7639)[0x7f766ce518f9]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x73073)[0x7f766ce49073]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x7dc4d)[0x7f766ce53c4d]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x1292)[0x7f766ce4b552]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8df)[0x7f766cf50ebf]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCodeEx+0x3e)[0x7f766cf5125e]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyEval_EvalCode+0x1b)[0x7f766cf4faab]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x175531)[0x7f766cf4b531]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0xe60a3)[0x7f766cebc0a3]
    /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x7dc4d)[0x7f766ce53c4d]
    --- END LIBC BACKTRACE ---
    

2.2.2. Explanation

  • This often happens when a NULL pointer is dereferenced in gem5, caused by:
    • A parameter that is assumed to be set but, in fact, is not.
    • A port that is assumed to be connected but, in fact, is not.

2.2.3. Resolution

  • The best approach here is to use gdb.
  • Ideally, you should use the gem5.debug binary (the session below uses gem5.opt):

    gdb $GEM5/build/ARM/gem5.opt
    run --debug-break=2000 -d /tmp $GEM5_SCRIPTS/RPIv4.py -v --fs --fs-kernel=$gem5_kernel --fs-disk-image=$gem5_disk
    
  • Use trial and error to refine the starting tick given to --debug-break until you arrive where you want to go.
  • At some point, you will reach your segfault:

    Program received signal SIGSEGV, Segmentation fault.
    0x00005555568a15d4 in ArmSystem::ArmSystem (this=0x5555595cfb00, p=0x555558cba1a0) at build/ARM/arch/arm/system.cc:77
    77	        _resetAddr = workload->getEntry();
    
    $rsp   : 0x00007fffffffc6c0  →  0x00007ffff50c6398  →  0x0000000000000000
    $rbp   : 0x00005555595cfb00  →  0x0000555557e10020  →  0x0000555556d1fd70  →  <ArmSystem::~ArmSystem()+0> lea rax, [rip+0x10f02a9]        # 0x555557e10020 <_ZTV9ArmSystem+16>
    $rsi   : 0x0000555557f3e0a0  →  0x0000555558f53140  →  0x0000555558f53120  →  0x000055555961c540  →  0x000055555961c560  →  0x000055555961c580  →  0x000055555961c5a0  →  0x000055555961c5c0
    $rdi   : 0x0               
    $rip   : 0x00005555568a15d4  →  <ArmSystem::ArmSystem(ArmSystemParams*)+276> mov rax, QWORD PTR [rdi]
    
      0x5555568a15c4 <ArmSystem::ArmSystem(ArmSystemParams*)+260> cmp    BYTE PTR [rbx+0x144], 0x0
      0x5555568a15cb <ArmSystem::ArmSystem(ArmSystemParams*)+267> je     0x5555568a1648 <ArmSystem::ArmSystem(ArmSystemParams*)+392>
      0x5555568a15cd <ArmSystem::ArmSystem(ArmSystemParams*)+269> mov    rdi, QWORD PTR [rbp+0x190]
    → 0x5555568a15d4 <ArmSystem::ArmSystem(ArmSystemParams*)+276> mov    rax, QWORD PTR [rdi]
    
        72	       _havePAN(p->have_pan),
        73	       semihosting(p->semihosting),
        74	       multiProc(p->multi_proc)
        75	 {
        76	     if (p->auto_reset_addr) {
    →   77	         _resetAddr = workload->getEntry();
    
  • We have found the source of the segfault:
    • workload->getEntry(); dereferences the workload pointer to call the getEntry() function.
    • mov rax, QWORD PTR [rdi] is the pointer dereference in assembly.
    • rdi is set to 0x0.
    • This leads to the segmentation fault. Hence, our workload is not correctly passed to our ArmSystem object. In fact, our workload had inadvertently been attached to the wrong SimObject.
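
  • As a complement to gdb, the gem5 frames of the libc backtrace can sometimes be mapped back to source lines with addr2line (from GNU binutils). The sketch below only builds the command lines. It assumes that the binary contains debug information and that the (+0x...) offsets printed in the backtrace can be passed to addr2line as-is, which should be checked on your build. The helper names parse_frame and addr2line_command are hypothetical.

```python
import re

# One backtrace frame looks like: /path/to/binary(symbol+0xoff)[0xaddress]
FRAME_RE = re.compile(
    r"^\s*(?P<binary>\S+?)\((?P<symbol>[^)]*)\)\[(?P<addr>0x[0-9a-fA-F]+)\]\s*$"
)


def parse_frame(line):
    """Split one backtrace line into (binary, symbol, address), or None."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    return match.group("binary"), match.group("symbol"), match.group("addr")


def addr2line_command(line, binary_suffix="gem5.opt"):
    """Build an addr2line command for a gem5 frame, or None for other lines."""
    frame = parse_frame(line)
    if frame is None:
        return None
    binary, symbol, _ = frame
    # Only keep frames located in the gem5 binary and given as a bare offset.
    if not binary.endswith(binary_suffix) or not symbol.startswith("+"):
        return None
    # -f prints the function name, -C demangles C++ symbols.
    return ["addr2line", "-e", binary, "-f", "-C", symbol[1:]]


gem5_frame = "/opt/Gem5/build/ARM/gem5.opt(+0x134d5d4)[0x5579a327a5d4]"
python_frame = ("/lib/x86_64-linux-gnu/libpython3.8.so.1.0"
                "(_PyObject_MakeTpCall+0xa7)[0x7f766ce78817]")

command = addr2line_command(gem5_frame)
skipped = addr2line_command(python_frame)
```

    Each resulting list can be handed to subprocess.run; frames coming from libpython or libpthread are skipped on purpose, since the interesting frames are the gem5 ones.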

2.3. fatal: XXX

2.3.1. Goal

  • Troubleshoot this kind of error:

    fatal: Must specify at least one workload!
    

2.3.2. Explanation

  • This error is generated in the C++ source code of gem5, by its error handling mechanism.

2.3.3. Resolution

  • The best approach is to search for the error message (without the error-level keyword) in the source code:

    ack "Must specify at least one workload" $GEM5/src
    
    /opt/Gem5/src/cpu/o3/deriv.cc:47:            fatal("Must specify at least one workload!");
    
  • We can then look at the surrounding source code to find the cause of the error:

    sed -n '35,54'p /opt/Gem5/src/cpu/o3/deriv.cc
    
    DerivO3CPU *
    DerivO3CPUParams::create()
    {
        ThreadID actual_num_threads;
        if (FullSystem) {
            // Full-system only supports a single thread for the moment.
            actual_num_threads = 1;
        } else {
            if (workload.size() > numThreads) {
                fatal("Workload Size (%i) > Max Supported Threads (%i) on This CPU",
                      workload.size(), numThreads);
            } else if (workload.size() == 0) {
                fatal("Must specify at least one workload!");
            }
    
            // In non-full-system mode, we infer the number of threads from
            // the workload if it's not explicitly specified.
            actual_num_threads =
                (numThreads >= workload.size()) ? numThreads : workload.size();
        }
    
  • Here, we can understand that the O3CPU takes the else branch, when it should have taken the if branch (because we are in FS mode). The CPU then looks for a workload attached to it, but there is none because, again, we are in FS mode, which produces the fatal error.
  • To fix this particular error, you have to set the full_system=True parameter of the Root object.
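
  • The branching of the C++ excerpt above can be transcribed into a few lines of Python to see why the flag matters. This is only an illustrative transcription: the function name and the use of RuntimeError in place of gem5's fatal() are assumptions.

```python
def actual_num_threads(full_system, num_workloads, num_threads):
    """Python transcription of the branching in DerivO3CPUParams::create().

    RuntimeError stands in for gem5's fatal(); this is an illustration only.
    """
    if full_system:
        # Full-system mode: workloads attached to the CPU are never inspected.
        return 1
    if num_workloads > num_threads:
        raise RuntimeError(
            "fatal: Workload Size (%i) > Max Supported Threads (%i) on This CPU"
            % (num_workloads, num_threads))
    if num_workloads == 0:
        raise RuntimeError("fatal: Must specify at least one workload!")
    # Otherwise the number of threads is inferred from the workloads.
    return num_threads if num_threads >= num_workloads else num_workloads


# Full-system flag set: no workload on the CPU is required.
threads_in_fs_mode = actual_num_threads(True, 0, 1)

# Flag left unset while no workload is attached to the CPU: the fatal path.
try:
    actual_num_threads(False, 0, 1)
    fatal_message = None
except RuntimeError as exc:
    fatal_message = str(exc)
```

    With the flag set, the workload count is never examined, which is why setting full_system=True on the Root object makes the error disappear in an FS setup.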

2.4. panic: XXX port of XXX not connected to anything!

2.4.1. Goal

  • Troubleshoot this kind of error:

    panic: Pio port of system.realview.generic_timer_mem not connected to anything!
    

2.4.2. Explanation

  • This error is generated in the C++ source code of gem5, by its error handling mechanism.
  • The reason is clear: the setup of one of the SimObject's ports is incorrect or has been forgotten.

2.4.3. Resolution

  • The port should have been connected either directly by you, or by a helper function already provided by gem5.
  • To distinguish between these two cases, search the source code for the concerned object (here, system.realview.generic_timer_mem). Understand its function, its ports, and so on.
  • One thing that can help a lot is the generated config.dot.pdf, which gives a graphical representation of the system (with the links between SimObjects).

2.5. Kernel panic - not syncing: VFS: Unable to mount root fs

2.5.1. Goal

  • Troubleshoot this kernel panic:

    [    0.224367] List of all partitions:
    [    0.224394] fe00         1048320 vda 
    [    0.224397]  driver: virtio_blk
    [    0.224440]   fe01         1048288 vda1 00000000-01
    [    0.224441] 
    [    0.224480] No filesystem could mount root, tried: 
    [    0.224481]  ext3
    [    0.224510]  ext4
    [    0.224524]  ext2
    [    0.224537]  squashfs
    [    0.224551]  vfat
    [    0.224566]  fuseblk
    [    0.224579] 
    [    0.224606] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)
    [    0.224656] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0+ #1
    [    0.224692] Hardware name: V2P-CA15 (DT)
    [    0.224717] Call trace:
    [    0.224741]  dump_backtrace+0x0/0x1c0
    [    0.224765]  show_stack+0x14/0x20
    [    0.224790]  dump_stack+0x8c/0xac
    [    0.224812]  panic+0x130/0x288
    [    0.224836]  mount_block_root+0x22c/0x294
    [    0.224861]  mount_root+0x140/0x174
    [    0.224884]  prepare_namespace+0x138/0x180
    [    0.224910]  kernel_init_freeable+0x1c0/0x1e0
    [    0.224939]  kernel_init+0x10/0x108
    [    0.224961]  ret_from_fork+0x10/0x18
    [    0.224987] Kernel Offset: disabled
    [    0.225009] CPU features: 0x21c06492
    [    0.225032] Memory Limit: 2048 MB
    [    0.225056] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0) ]---
    
    

2.5.2. Explanation

  • This error is generated by the Linux kernel, in a full-system emulation setup.
  • We can see, from the error:
    • The kernel recognizes the VirtIO block device, which means that this driver is correctly loaded.
    • The kernel tried the ext file systems (ext2, ext3 and ext4), which means that the file system drivers are correctly loaded.
    • The kernel detects a vda1 partition.

2.5.3. Resolution

  • The problem lies in the specification of the root partition on the kernel command line. In the full-system emulation script, we have to correctly set the root partition, like this:

    # Linux kernel boot command flags.
    kernel_cmd = [
        ...
        # Tell Linux where to find the root disk image.
        "root=/dev/vda1",
        ...
    ]
    system.workload.command_line = " ".join(kernel_cmd)
    
  • Don't forget to replace ... with your other options.
  • Before our modification, the VirtIO block device itself was specified (/dev/vda). The kernel needs the partition holding the root file system (/dev/vda1), not the whole block device.
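
  • The partition list printed before the panic already hints at the right value for root=. The sketch below extracts it from such a log. It is a heuristic under assumptions: listed_devices and root_candidates are hypothetical helper names, and a partition is simply taken to be a device whose name extends another listed device (vda1 extends vda).

```python
import re

# Matches lines such as "[    0.224440]   fe01         1048288 vda1 00000000-01"
PARTITION_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+[0-9a-f]{4}\s+\d+\s+(?P<name>\S+)"
)


def listed_devices(log_text):
    """Return the device names found in the 'List of all partitions' lines."""
    names = []
    for line in log_text.splitlines():
        match = PARTITION_RE.match(line.strip())
        if match:
            names.append(match.group("name"))
    return names


def root_candidates(log_text):
    """Return /dev paths of listed partitions (heuristic, see the lead-in)."""
    names = listed_devices(log_text)
    return ["/dev/" + name for name in names
            if any(name != dev and name.startswith(dev) for dev in names)]


panic_log = """\
[    0.224367] List of all partitions:
[    0.224394] fe00         1048320 vda
[    0.224397]  driver: virtio_blk
[    0.224440]   fe01         1048288 vda1 00000000-01
[    0.224480] No filesystem could mount root, tried:
[    0.224481]  ext3
"""

devices = listed_devices(panic_log)
candidates = root_candidates(panic_log)
```

    Here the only candidate is /dev/vda1, which matches the value used for root= in the corrected command line.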

Author: Pierre Ayoub

Created: 2023-07-27 Thu 17:14
