Exploring C++20 coroutines for embedded and bare-metal development on RISC-V platforms

Can C++20 coroutines build efficient, real-time embedded applications on bare-metal RISC-V platforms — no RTOS required?

7 min readNov 24, 2024

Introduction

This post is about using C++ coroutines to suspend and resume functions in real time. The objective is to create a simple way of building real-time tasks using only C++ without the need for an RTOS or operating system kernel.

Coroutines are functions that can be suspended and resumed, using the keywords. The C++20 standard introduced coroutines to the language.

C++ standardized the keywords and type concepts for coroutines, but it did not standardize a runtime. The lack of a standard runtime has made it hard to use coroutines “out of the box,” but their implementation is very adaptable to different use cases.

Here I use a simple runtime implementing C++20 coroutines on bare metal (no operating system) for RISC-V, using the co_await keyword. This is done by passing the real-time scheduler and resume time condition as the argument to the asynchronous wait operator.

The runtime is described in detail in this post.

Why coroutines?

I’m interested in coroutines for the following benefits:

Event-driven asynchronous functions can be written using a control/data flow in a single body of code that can be easy to understand.
The code is portable, so it can be tested on development OS and target systems.
Resource efficiency, in terms of memory usage (stack and heap) and CPU cycle usage.

A software timer example

This article will build a simple software timer example. The function has a loop that pauses for several microseconds before iterating again. While the loop is paused the control flow returns to the caller function.

A simple coroutines task

A simple task periodic is defined in example_simple.cpp. It takes scheduler, period and resume_count as arguments and asynchronously waits period microseconds for 10 iterations, updating the resume_count value each iteration.

The scheduler passed as an argument is not strictly necessary for C++ coroutines, but is used to make the ownership of the context of each task explicit. (It could be possible to use a global scheduler, such as when implementing via OS threads.)

The task returns nop_task. This is a special structure that is linked to the coroutines implementation. In this case, a "nop task" refers to a task that does not return a value via co_return.

template<typename SCHEDULER>
nop_task periodic(
    SCHEDULER& scheduler,
    std::chrono::microseconds period,
    volatile uint32_t& resume_count) {
    driver::timer<> mtimer;
    for (auto i = 0; i < 10; i++) {
        co_await scheduled_delay{ scheduler, period };
        *timestamp_resume[resume_count] = mtimer.get_time<driver::timer<>::timer_ticks>().count();
        resume_count = i + 1;
    }
    co_return; // Not strictly needed
}

The following sequence diagram shows an abstract coroutine execution where an abstracted OS exists to handle the scheduling of process execution. (PlantUML source)

Calling the simple coroutine task

The example_simple() function in example_simple.cpp calls the periodic function once, with 100ms as the period value.

The scheduler_delay<mtimer_clock> is a scheduler class that will manage the software timer to wake each coroutine at the appropriate time, using our RISC-V machine mode timer driver mtimer.

    driver::timer<> mtimer;
    // Class to manage timer coroutines
    scheduler_delay<mtimer_clock> scheduler;
    // Run two concurrent loops. The first loop will run concurrently to the second loop.
    auto t0 = periodic(scheduler, 100ms, resume_simple);

Resuming the coroutine tasks

For this example, the scheduler is an object instantiated in the example_simple() function. It needs to be called explicitly to calculate when each coroutine needs to be woken and resumed. This is a convention of the runtime for this example, and not a required convention for C++ coroutines.

The tasks are resumed in the WFI busy loop of example_Simple() when scheduler.update() is called. However, as the scheduler is just a C++ class, this can be called from other locations, such as a timer interrupt handler.

    do {
        // Get a delay to the next coroutines wake up
        schedule_by_delay<mtimer_clock> now;
        auto [pending, next_wake] = scheduler.update(now);
        if (pending) {
            // Next wakeup
            mtimer.set_time_cmp(next_wake->delay());
            // Timer interrupt enable
            riscv::csrs.mstatus.mie.clr();
            riscv::csrs.mie.mti.set();
            // WFI Should be called while interrupts are disabled 
            // to ensure interrupt enable and WFI is atomic.            
            core.wfi();
        ]
    } while(true)

For example, as the IRQ handler in this example is a lambda function, we could also capture the scheduler and run the timer coroutine in the IRQ handler.

    static const auto handler = [&](void) {
        ...
        schedule_by_delay<mtimer_clock> now;
        auto [pending, next_wake] = scheduler.update(now);
    };

Building and running with Platform IO

The example can be built and run using Platform IO. The default RISC-V platforms use an old version of GCC that does not support C++20, so a custom virtual platform configured to use xPack 12.2.0–3 riscv-none-elf-gcc and run on QEMU has been created in platformio/platforms/virt_riscv.

build_flags = 
    -std=c++20
    -O2
    -g
    -Wall 
    -ffunction-sections 
    -fcoroutines
    -fno-exceptions 
    -fno-rtti 
    -fno-nonansi-builtins 
    -fno-use-cxa-atexit 
    -fno-threadsafe-statics
    -nostartfiles 
    -Wl,-Map,c-hardware-access-riscv.map

The debug sequence shows entering the function example_simple(), initializing scheduler_delay<mtimer_clock> scheduler; then calling periodic(scheduler, 100ms, resume_simple);.

Once the statement co_await scheduled_delay{ scheduler, period }; is reached the context returns to example_simple(). Then when auto [pending, next_wake] = scheduler.resume(now); is called it returns to the for loop in periodic().

The coroutine handle is stored in the scheduler class by the first call to co_await. The following call to scheduler.resume() looks up the pending coroutine handle and calls resume on the handle.

The stack of the coroutine periodic() before the resume can be seen below. It's called from example_simple().

The stack of the coroutines periodic() after the resume can be seen below. It's called from coroutine_handle::resume, which is called from scheduler_ordered::resume.

The stack of example_simple() function calling resume() is also on the same stack.

Building with CMake and running with Spike

The Makefile has targets to build with CMake.

$ make target
cmake \
        -DCMAKE_TOOLCHAIN_FILE=cmake/riscv.cmake \
        -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
        -B build_target \
        -S .
cmake --build build_target --verbose

The Makefile also has targets to simulate and trace with the standard RISC-V ISA simulator, spike. The forked spike with VCD tracing is used. The forked spike is included in a docker container.

$ make spike_sim
docker run \
        -it \
        --rm \
        -v .:/project \
        fiveembeddev/forked_riscv_spike_dev_env:latest  \
        /opt/riscv-isa-sim/bin/spike \
        --log=spike_sim.log \
        --isa=rv32imac_zicsr \
        -m0x8000000:0x2000,0x80000000:0x4000,0x20010000:0x6a120 \
        --priv=m \
        --pc=0x20010000 \
        --vcd-log=spike_sim.vcd \
        --max-cycles=10000000  \
        --trace-var=timestamp_simple --trace-var=timestamp_resume --trace-var=timestamp_resume_0 --trace-var=timestamp_resume_1 --trace-var=timestamp_resume_2 --trace-var=timestamp_resume_3 --trace-var=timestamp_resume_4 --trace-var=timestamp_resume_5 --trace-var=timestamp_resume_6 --trace-var=timestamp_resume_7 --trace-var=timestamp_resume_8 --trace-var=timestamp_resume_9 --trace-var=resume_simple \
        build_target/src/main.elf
docker run \
     --rm \
    -v .:/project \
     fiveembeddev/riscv_gtkwave_base:latest \
     vcd2fst spike_sim.vcd spike_sim.fst

The results can be viewed with GTKWave. The GTKWave savefile includes address decode and opcode decode by using docker images containing the decoders.

gtkwave spike_sim.fst  spike_sim.gtkw

The benefit of tracing results from the ISA is that it is easy to confirm the periodic timing of the coroutine. (For this example the parameter to periodic() was changed to 1ms).

The periodic write to resume_count and timestamp_resume is traced to VCD so the exact timing of the coroutine execution is visible.

Using GTKWave the context switch can also be examined in detail. In the fake 1GhZ clock used by Spike, the context switch takes 104ns.

Coroutine runtime

The runtime for this example is in the header embeddev_coro.hpp, and it uses the embeddev_riscv.hpp header to provide a simple HAL for RISC-V and host emulation. Details on the runtime are in this post.

Summary

This post describes a simple working example of how to use C++ coroutines in an embedded context. The example and context are not meant to be a realistic use case, but the simplest possible use case that involves an interrupt handler and a context switch.

However, the example can be built on to explore portable and lightweight asynchronous programming techniques. Future posts will look at that topic.

Originally published at https://five-embeddev.com on November 24, 2024.