Everfree's ARMFerno - My Unholy Battle With a Rock64

2022-05-16

I’ve got this rock64, which is an aarch64 board comparable to a Raspberry Pi 3 B+ with 4 gigs of ram. For years I’ve wanted to put a distribution on here that doesn’t have a premade image available, mainly because out of all the options on that page I don’t actually like any of them. Well, except NetBSD, but NetBSD doesn’t have GPU drivers for it. Problem is, everything I do want to use provides rootfs tarballs and tells you to figure it out. To do that I’ve got to get a Linux kernel, track down the device trees so it knows what hardware it has, and then wrangle u-boot into actually booting the whole thing. I figured that would be the hard part; little did I know the depths that Single Board Computer Hell would reach.

[Image: a rock64 single board computer in front of minetest, with never gonna give you up in the background]

For the purposes of the install I decided to go with Gentoo. Yeah yeah, I know; memes aside, Gentoo made sense for this project. They make it really easy to apply custom patches to the kernel and other system packages. There’s a rootfs with all the files of a base install, but they also provide an aarch64 installation ISO. I figured I could find some way to boot up that ISO and go from there (narrator: she did not boot up that ISO). So I flashed the ISO onto an SD card, and then went on to solve the u-boot part of the problem.

dd if=/path/to/gentoo.iso of=/dev/mySDCard bs=4M status=progress

First Circle - u-boot

u-boot is a bootloader that’s commonly used in embedded systems. It’s got a lot of flexibility in the build process that lets devs adapt it for whatever convoluted boot process the system needs to get going. That’s important because the boot process for ARM SoCs is almost entirely non-standard, and any similarities between chips are largely incidental. At the extreme end you’ve got those awful broadcom chips in the raspberry pi that infamously use the GPU to boot the system. Thankfully we don’t have to deal with anything that bad here.

If I was doing this when the rock64 came out I’d expect to go looking for some fork of u-boot to work with, but we live in 2022, so I just went for mainline u-boot. Configuring this is a bit like configuring a Linux kernel. First we generate a default config file.

git clone https://github.com/u-boot/u-boot
cd u-boot
make rock64-rk3328_defconfig

Then we edit it interactively with make menuconfig if we want to change anything. Once that’s done we can build the image. Except we can’t yet, actually.

Second Circle - ARM Trusted Firmware

To actually boot up, u-boot needs to bundle in the ARM Trusted Firmware for the SoC, so we’ve got to go get that.

git clone https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/
cd trusted-firmware-a
CROSS_COMPILE=aarch64-linux-gnu- make PLAT=rk3328 bl31
# save the path of the output for use with u-boot
export BL31="$(realpath build/rk3328/release/bl31/bl31.elf)"

How did I figure this out? Why, through the power of friendship! No seriously, I just asked my friend and she told me to do this and it worked. I don’t know where else you’d find this information on your own.

Ok, back to u-boot then. From the u-boot folder again, I built my image with the BL31 file in tow.

# this relies on the BL31 environment variable we exported in the last code block.
CROSS_COMPILE=aarch64-linux-gnu- make -j4
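Since that build depends on an environment variable from a different folder (and possibly a different terminal), it's easy to lose it along the way. Here's a small guard you could run first. This is only a sketch: check_bl31 is a made-up helper name, not something u-boot or TF-A ships.

```shell
# Hypothetical helper: refuse to start the u-boot build unless BL31
# is set and points at a file that actually exists.
check_bl31() {
    if [ -z "${BL31:-}" ]; then
        echo "BL31 is not set; build TF-A first and export BL31" >&2
        return 1
    fi
    if [ ! -f "$BL31" ]; then
        echo "BL31 points at '$BL31', which does not exist" >&2
        return 1
    fi
    echo "using BL31=$BL31"
}

# Intended use, from the u-boot folder:
#   check_bl31 && CROSS_COMPILE=aarch64-linux-gnu- make -j4
```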

Now we’ve got some binaries, and the main ones we care about are idbloader.img and u-boot.itb. idbloader.img is the very first thing that runs when the chip starts booting from the SD card, and that needs to go at sector 64 (using 512 byte sectors). u-boot.itb on the other hand has an address configurable in the menuconfig, and at the time of writing the default in upstream u-boot is sector 16384. idbloader jumps into the main u-boot code after early initialization, so if we change u-boot’s offset we’ve got to reflash idbloader too.

There’s two approaches you can take to flashing this onto the SD card from here if you’re following along at home. The first option is to write a third file, u-boot-rockchip.bin, at sector 64. This is a bundle of both the idbloader.img and u-boot.itb files, with padding in between. The downside is, this also obliterates the partition table, so I flashed them separately instead.

dd if=idbloader.img of=/dev/mySDCard bs=512 seek=64
dd if=u-boot.itb of=/dev/mySDCard bs=512 seek=16384
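If you'd like to see where those offsets land before pointing dd at a real card, here's a rough dry run against a scratch image file. The image name and the two stand-in payloads are made up for the example; conv=notrunc is there so dd doesn't truncate the image file, which isn't a concern when writing to a block device.

```shell
# Sector offsets from the text, converted to bytes (512-byte sectors).
SECTOR=512
IDB_SECTOR=64
ITB_SECTOR=16384
IDB_BYTES=$((IDB_SECTOR * SECTOR))   # 32 KiB into the card
ITB_BYTES=$((ITB_SECTOR * SECTOR))   # 8 MiB into the card
echo "idbloader.img at byte $IDB_BYTES, u-boot.itb at byte $ITB_BYTES"

# Scratch image (hypothetical name) standing in for /dev/mySDCard.
work=$(mktemp -d)
img="$work/sdcard.img"
dd if=/dev/zero of="$img" bs=1048576 count=16 2>/dev/null
printf 'FAKE-IDBLOADER' > "$work/idbloader.img"   # stand-ins for the real binaries
printf 'FAKE-UBOOT-ITB' > "$work/u-boot.itb"

dd if="$work/idbloader.img" of="$img" bs=512 seek=$IDB_SECTOR conv=notrunc 2>/dev/null
dd if="$work/u-boot.itb"    of="$img" bs=512 seek=$ITB_SECTOR conv=notrunc 2>/dev/null

# Sector 0 was not touched by either write, so this comes out empty
# (every byte is zero, and tr strips zeroes and whitespace).
first_sector_nonzero=$(dd if="$img" bs=512 count=1 2>/dev/null | od -An -v -tu1 | tr -d ' \n0')
```

The first echo prints "idbloader.img at byte 32768, u-boot.itb at byte 8388608", which is a handy reminder that any partition on the card needs to start past the 8 MiB mark plus the size of u-boot.itb.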

If all goes well, you’ll get something like this when you power on the board:

U-Boot TPL 2021.07 (Apr 30 2022 - 00:50:36)
LPDDR3, 800MHz
BW=32 Col=11 Bk=8 CS0 Row=15 CS1 Row=15 CS=2 Die BW=16 Size=4096MB
Trying to boot from BOOTROM
Returning to boot ROM...

U-Boot SPL 2021.07 (Apr 30 2022 - 00:50:36 -0700)
board_init_sdmmc_pwr_en
Trying to boot from MMC1
NOTICE:  BL31: v2.6(release):v2.6
NOTICE:  BL31: Built : 00:14:00, Apr 28 2022
NOTICE:  BL31:Rockchip release version: v1.2


U-Boot 2021.07 (Apr 30 2022 - 00:50:36 -0700)

Model: Pine64 Rock64
DRAM:  4 GiB
PMIC:  RK8050 (on=0x10, off=0x08)
MMC:   mmc@ff500000: 1, mmc@ff520000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

In:    serial@ff130000
Out:   serial@ff130000
Err:   serial@ff130000
Model: Pine64 Rock64
Net:   eth0: ethernet@ff540000
Hit any key to stop autoboot:  10

Hit a key to interrupt the boot sequence and get manual control over the u-boot shell, or Control-C if it’s already started trying to boot the system.

Third Circle - pxeboot

I realized at this point that while I might be able to boot from the ISO, I wasn’t able to install from it unless I copied it into a tmpfs and remounted the in-ram ISO as /, because I was going to obliterate the ISO partition table on the SD card during the install. In retrospect I probably should have done that, but I didn’t feel like figuring it out, so I took a different road.

I re-imaged the SD card with gentoo’s rootfs tarball, but then I extracted the kernel and initramfs from the ISO and slapped those in there as well. However, when I tried to boot this with u-boot’s booti command, it thought the initramfs was corrupt. It wasn’t decompressing it properly I guess, I’m really not sure. For some reason I decided the logical next step was to try to boot it with PXE instead. You shouldn’t do this. It’s a pain. What I should have done, and what you should do, is to just use an uncompressed initramfs; I’ll tell you how to do that later. But I want to document the PXE process, so here we go.

How does pxe boot work from u-boot? Here’s the rough outline:

Run a TFTP server somewhere to host our files for u-boot to retrieve.
- I used atftp.
Set up the configuration structure for pxelinux.
Get u-boot connected to the network.
Tell u-boot the tftp server IP, either through magic DHCP settings or manually.
Run pxe get.
Run pxe boot.
Hope it works.

On my desktop I have a file tree structure in my tftp server directory a bit like this:

.
├── gentoo.igz
├── gentoo.img
└── pxelinux.cfg
    └── default-arm

The first two files are the initrd and linux kernel, and then default-arm contains this:

DEFAULT GENTOO
MENU TITLE  Installer
PROMPT 0
TIMEOUT 150

MENU WIDTH 80
MENU MARGIN 16
MENU ROWS 15
MENU TABMSGROW 20
MENU CMDLINEROW 20
MENU TIMEOUTROW 21
MENU HELPMSGROW 22

LABEL GENTOO
 MENU DEFAULT
 MENU LABEL Boot Gentoo
 KERNEL gentoo.img
 INITRD gentoo.igz
 APPEND root=LABEL=root console=ttyS2,1500000

I should tell you that a number of these config lines don’t actually do anything since u-boot only emulates a subset of real pxelinux, but they don’t hurt anything either. In particular, all those menu formatting commands are irrelevant since there’s no menu to format, but I’m leaving this file as-is since it’s what’s on my hard drive. This config also relies on your rootfs partition having the root label, but change the linux command line however you want really.
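For reference, the tree above could be laid out with something like the sketch below. make_tftp_root is a made-up helper, the source paths are whatever you extracted from the ISO, and the config is the same one minus the seven MENU formatting lines (WIDTH through HELPMSGROW). That trimmed version should behave the same under u-boot, but treat it as untested.

```shell
# Hypothetical helper: lay out a TFTP root like the tree above.
# Usage: make_tftp_root <tftp-dir> <kernel> <initrd>
make_tftp_root() {
    root="$1"; kernel="$2"; initrd="$3"
    mkdir -p "$root/pxelinux.cfg"
    cp "$kernel" "$root/gentoo.img"
    cp "$initrd" "$root/gentoo.igz"
    # Same config as above, without the MENU formatting lines.
    cat > "$root/pxelinux.cfg/default-arm" <<'EOF'
DEFAULT GENTOO
MENU TITLE  Installer
PROMPT 0
TIMEOUT 150

LABEL GENTOO
 MENU DEFAULT
 MENU LABEL Boot Gentoo
 KERNEL gentoo.img
 INITRD gentoo.igz
 APPEND root=LABEL=root console=ttyS2,1500000
EOF
}

# Demo with stand-in files in a scratch directory.
demo=$(mktemp -d)
printf 'kernel' > "$demo/kernel-from-iso"
printf 'initrd' > "$demo/initrd-from-iso"
make_tftp_root "$demo/tftp" "$demo/kernel-from-iso" "$demo/initrd-from-iso"
find "$demo/tftp" -type f | sort
```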

So with my desktop serving that, I booted my board into u-boot and ran

dhcp
setenv serverip my.desktop.ipv4.address
pxe get
pxe boot

This usually worked. Sometimes my board was able to hit my router, but nothing else on my network, and I have no idea why. Whenever that happened I had to power off the board for 10-15 minutes and then power it back on for it to work again. I also saw some mentions of ARP, so this is probably a low level networking issue that I just did not feel like figuring out at the time.

But with that all done, I had a booted gentoo system, so let’s move on.

Fourth Circle - Kernel

Gentoo proved to be perhaps the best choice I could have made for this project, though I didn’t realize it at the time. Gentoo’s facilities for applying patches and doing whatever you want with the kernel took some of the pain out of using all this hardware’s features, but I’m getting ahead of myself. Before we get to the good parts, we’ve got to address the elliephant in the room: compile times.

The Rock64 has a quad-core processor with Cortex-A53s clocked at 1.2ish GHz. In technical terms, that means compiling things is gonna take a while. I have the 4GB of ram variant so that helps at least, but to put this in perspective, compiling GCC took me about 18 hours straight. That’s the worst case scenario though, and everything else isn’t quite as bad. In some sense, the forced breaks on the project were welcome, as I could have easily been sucked in for even more hours at a time than I already was.

There wasn’t much left to do to finish the installation, but I did want to free myself from pxeboot. So, after installing some creature comforts, I loosely followed gentoo’s amd64 handbook until I got to building the kernel. Actually configuring the kernel took me a few hours as I poked through every menu and turned config options on for my hardware, and I still kept missing things along the way. I was using gentoo’s normal 5.15 source package, but if I had used ayufan’s kernel and defconfig I might have had an easier time. If you want to do that you can clone that repo and use

ARCH=arm64 make defconfig

Using this kernel will at least get you most of the way to a full working set of modules for the hardware. But building the driver modules isn’t enough on its own, because we also need to use the right ✨Device Trees✨.
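Before moving on: since it's so easy to miss an option while poking through menuconfig, a tiny checker can list which symbols are still off in a .config. This is a sketch, missing_kconfig is a made-up helper, and the two symbols in the demo (the lima and panfrost DRM drivers) are only examples of things you might check for this board.

```shell
# Hypothetical helper: print each given symbol that is neither =y nor =m.
# Usage: missing_kconfig <config-file> SYMBOL...
missing_kconfig() {
    config="$1"; shift
    for sym in "$@"; do
        grep -Eq "^CONFIG_${sym}=(y|m)$" "$config" || echo "$sym"
    done
}

# Demo against a stand-in config file rather than a real kernel tree.
cfg=$(mktemp)
printf 'CONFIG_DRM_LIMA=m\n# CONFIG_DRM_PANFROST is not set\n' > "$cfg"
missing_kconfig "$cfg" DRM_LIMA DRM_PANFROST   # prints DRM_PANFROST
```

Against a real tree you'd point it at the .config in the kernel source folder instead.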

Fifth Circle - Device Trees

On the x86 systems we’re all used to, device trees aren’t ever something we have to think about. The platform is standardized such that the kernel knows how to talk to all the platform hardware, and it can enumerate anything connected over PCIe automatically. On older systems you might have to worry about defining IRQs, but generally speaking if your hardware isn’t showing up on a modern amd64 Linux install, it’s just because you’re missing kernel modules or firmware.

Outside that ivory tower, we have device trees. Device trees are a static descriptor of the hardware available on a device. They describe what hardware exists, what address range that hardware is memory-mapped into, some information the kernel can use to decide what modules are responsible for it, and any additional device-specific configuration needed. This is all defined in a web of dts and dtsi files that all get compiled into a binary representation called a dtb file.

Our u-boot actually already has a device tree baked into it that it’s providing to our kernel when we pxeboot, but that device tree is wrong. The USB2.0 ports don’t provide power, for one thing, and the USB3.0 hardware doesn’t even show up in lsusb. So where’s the right tree? Good question! Here’s some of the places we could find a device tree that claims to be for the rock64 specifically:

mainline u-boot
ayufan’s fork of mainline u-boot
ayufan’s older fork of non-mainline u-boot
mainline linux kernel source tree
ayufan’s fork of mainline linux
ayufan’s fork of rockchip’s linux
a patch file I got from someone in the rock64 IRC that needs to be applied to HEAD of torvalds/linux

Can you guess which device tree is the right one? That’s right, it’s either the one in ayufan’s fork of mainline linux if you don’t need hardware accelerated video decoding, or the patch file applied to HEAD of torvalds/linux if you do. I’m told that patch is getting upstreamed in Linux 5.19, so once that’s out the easy choice will be to just use the Linux 5.19 source tree and call it a day. If you need that patch now, here’s a link to it on patchwork.kernel.org.

I took the patched upstream. Once I applied the patch, I deleted the arch/arm64/boot/dts/rockchip/ folder from my 5.15 kernel source tree and replaced it with the same folder from my patched upstream kernel. Then I deleted a couple definitions for other boards that were giving me compile errors.

In either case, to build the dtb files we can go into the kernel source tree and run

make dtbs

and then to install them in /boot it’s

make dtbs_install
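A cheap sanity check on the result: a compiled device tree blob starts with the big-endian magic number 0xd00dfeed, so you can at least confirm that the file you're about to hand to u-boot is a dtb at all. A sketch, with is_dtb being a made-up helper name; it says nothing about whether the tree is the right one for the board.

```shell
# Hypothetical helper: succeed if the file starts with the FDT magic d00dfeed.
is_dtb() {
    magic=$(dd if="$1" bs=1 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "d00dfeed" ]
}

# Demo on stand-in files: one with the magic bytes, one without.
fake_dtb=$(mktemp)
not_dtb=$(mktemp)
printf '\320\015\376\355rest-of-blob' > "$fake_dtb"   # octal for d0 0d fe ed
printf 'definitely not a dtb' > "$not_dtb"
is_dtb "$fake_dtb" && echo "looks like a dtb"

# Intended use once /boot is populated:
#   is_dtb /boot/dtbs/5.15.32-gentoo-r1/rockchip/rk3328-rock64.dtb
```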

Sixth Circle - Booting from SD Card

At this point we’ve got the holy trinity of booting a linux system: the kernel, the initramfs, and the device tree binaries. Let’s go! I still hadn’t automated booting at this point, so from the u-boot prompt I did something along the lines of

load mmc 1:2 ${kernel_addr_r} /vmlinuz-5.15.32-gentoo-r1.img
load mmc 1:2 ${fdt_addr_r} /dtbs/5.15.32-gentoo-r1/rockchip/rk3328-rock64.dtb
load mmc 1:2 ${ramdisk_addr_r} /initramfs-5.15.32-gentoo-r1.img
booti ${kernel_addr_r} ${ramdisk_addr_r}:${filesize} ${fdt_addr_r}
Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[    0.000000] Linux version 5.15.32-gentoo-r1 (root@localhost) (gcc (Gentoo 11.2.1_p20220115 p4) 11.2.1 20220115, GNU ld (Gentoo 2.37_p1 p2) 2.37) #4 SMP PREEMPT Sat Apr 30 05:14:29 PDT 2022
[    0.000000] Machine model: Pine64 Rock64
[    0.000000] efi: UEFI not found.
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000200000-0x00000000feffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   empty

[... it goes on for awhile ...]

And there we go. Booting!
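Back in the third circle, the advice was to hand u-boot an uncompressed initramfs. A rough sketch of that step is below, assuming the initramfs is a single gzip or xz compressed cpio archive (an assumption; check yours with the file command). unpack_initramfs is a made-up helper name.

```shell
# Hypothetical helper: write a decompressed copy of an initramfs.
# Tries gzip, then xz, and falls back to a plain copy if neither applies.
# Usage: unpack_initramfs <input> <output>
unpack_initramfs() {
    in="$1"; out="$2"
    if gzip -dc < "$in" > "$out" 2>/dev/null; then
        :
    elif xz -dc < "$in" > "$out" 2>/dev/null; then
        :
    else
        cp "$in" "$out"
    fi
}

# Demo: round-trip a stand-in file through gzip.
demo=$(mktemp -d)
printf 'pretend cpio archive' > "$demo/plain"
gzip -c "$demo/plain" > "$demo/gentoo.igz"
unpack_initramfs "$demo/gentoo.igz" "$demo/initramfs.img"
cmp "$demo/plain" "$demo/initramfs.img" && echo "round trip ok"
```

The output file is what you'd then load to ${ramdisk_addr_r}. It will be larger than the compressed one, so make sure it fits in the space between the addresses u-boot uses.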

There’s a number of ways to automate those boot commands, the simplest of which is probably baking a boot script into the u-boot image. But I never actually bothered to automate the bootup process, so I am unfortunately leaving this one as an exercise for the reader. Sorry!
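For anyone taking up that exercise, a different route from baking in a script is an extlinux.conf file, which u-boot's distro boot logic searches for in an extlinux folder on the partitions it scans. The sketch below is untested on the Rock64 and assumes a u-boot build with distro boot enabled. The file names are the ones from the load commands above, the APPEND line is borrowed from the PXE config, and write_extlinux is a made-up helper.

```shell
# Hypothetical helper: write an extlinux.conf for u-boot to pick up.
# Usage: write_extlinux <boot-partition-mountpoint> <kernel-version>
write_extlinux() {
    bootdir="$1"; kver="$2"
    mkdir -p "$bootdir/extlinux"
    cat > "$bootdir/extlinux/extlinux.conf" <<EOF
DEFAULT gentoo
TIMEOUT 30

LABEL gentoo
 KERNEL /vmlinuz-$kver.img
 INITRD /initramfs-$kver.img
 FDT /dtbs/$kver/rockchip/rk3328-rock64.dtb
 APPEND root=LABEL=root console=ttyS2,1500000
EOF
}

# Demo into a scratch directory instead of a real boot partition.
bootdir=$(mktemp -d)
write_extlinux "$bootdir" 5.15.32-gentoo-r1
cat "$bootdir/extlinux/extlinux.conf"
```

The format is the same pxelinux-style syntax as the PXE config earlier, so the two files can share a kernel command line.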

Seventh Circle - Responsive GUI

With USB working, I could finally run startx from the TTY and get a GUI, and, ooooh boy was it slow. I’m talking, you drag a window and watch it follow behind. You can watch the pixels update over a few frames after minimizing a window. I’m the girl that uses an 800MHz laptop with software rendering on the daily, and I’m saying it’s slow. So what’s the problem?

$ glxinfo | grep llvm
OpenGL renderer string: llvmpipe

Well that might do it. No GPU. You see, I had built my system with support for panfrost, but what I actually needed was lima. Who the hell is Steve Jobs, you ask?

lima balls.

[Image: a blue person pointing at someone who is exploding]

Anyway, now that I’d remembered that my GPU was a Mali-450, and got the correct VIDEO_CARDS setting in my make.conf, I ran startx again. Guess what, it ran EVEN WORSE!! I wish I was joking, but no. It was somehow less responsive. I did a test and our humble glxgears got 25fps full screen with both hardware and software rendering; the only difference was CPU usage. What is going on?

Well quite simply, X sucks on embedded hardware. I’m a noted X-apologist, and even I have to face facts on this one. So, I installed sway, a wayland window manager inspired by i3wm. Starting sway, I was pleased to see that dragging windows around was actually fast, how incredible. Not only that, glxgears bumped its way up from 25fps full screen to 50fps! There’s no escaping the fact that this GPU is extremely an embedded GPU, but at least it gets the job done.

And with all that, we can graphically multitask like I teased at the top of the post. We’ve got some low framerates, but it’s responsive!

But what about that video in the corner? It’s looking particularly choppy…

Eighth Circle - Hardware Accelerated Video Playback

Surely video playback can’t be that hard right? We’ve been doing hardware accelerated video for literally decades. How could we ever need cutting edge software for that?

Good question! The problem is, up until recently there has been almost no standardization of this stuff on SoCs. In x86 land we’ve ended up with something that feels a bit like two competing standards, VAAPI and VDPAU. VAAPI is pushed by Intel, VDPAU is pushed by Nvidia, AMD has over the years used both, and there’s wrapper libraries that translate between the two for applications’ benefit. Technically, nothing stops SoC vendors from implementing one or both of these standards. In fact, some even have! But it’s not a given, and there’s a lot of vendor-specific stuff going on.

ffmpeg and by extension mpv have support for Rockchip’s “Rockchip Media Process Platform”, so that’s what I chased down for a day or two. As it turns out, this only works with Rockchip’s fork of the Linux kernel. The video decoding hardware has support in mainline Linux, but it’s using a completely different interface called “Video4Linux2 Request”. As far as I can tell, Video4Linux started out as an API for accessing video capture devices, TV tuners, and the like. These days it’s grown beyond that, and one thing it can do is facilitate hardware video decoding. Finally it seems like we might be approaching a standard API to support SoCs’ weird signal chains.

So there’s a driver in the kernel in staging called rkvdec which supports the rock64 and rockpro64’s hardware with v4l2. It’s been in there for a while, so if you want to stick to an LTS kernel you can get it in 5.15. We’ll also need the v4l2 modules, and that dts patch I mentioned earlier to detect the hardware properly. With that all out of the way, you should see a /dev/video1 file after booting, which means we’re in business! You can confirm using v4l2-ctl:

$ v4l2-ctl -Dl 
Driver Info:
  Driver name      : hantro-vpu
  Card type        : rockchip,rk3328-vpu-dec
  Bus info         : platform: hantro-vpu
  Driver version   : 5.15.32

[...snip...]

Codec Controls

  h264_profile 0x00990a6b (menu)   : min=0 max=4 default=2 value=2 (Main)

Stateless Codec Controls

  h264_decode_mode 0x00a40900 (menu)   : min=1 max=1 default=1 value=1 (Frame-Based)
  h264_start_code 0x00a40901 (menu)   : min=1 max=1 default=1 value=1 (Annex B Start Code)
  h264_sequence_parameter_set 0x00a40902 (h264-sps): value=unsupported payload type flags=has-payload
  h264_picture_parameter_set 0x00a40903 (h264-pps): value=unsupported payload type flags=has-payload
  h264_scaling_matrix 0x00a40904 (h264-scaling-matrix): value=unsupported payload type flags=has-payload
  h264_decode_parameters 0x00a40907 (h264-decode-params): value=unsupported payload type flags=has-payload

Ok, next problem, upstream ffmpeg doesn’t have support for the V4L2-Request API yet. Right now, you can get a fork with support from jernejsk/FFmpeg on github. I’ve also created a patch file that applies cleanly to the upstream 4.4.1 source tarball if you want to use that instead. You’ll need to pass --enable-v4l2-request to configure to use it.

Finally, make sure your mpv is actually using the right ffmpeg if you have more than one installed. If it is, you can pass --hwdec=drm-copy to mpv, and you’ll be decoding video with hardware!

By the way, a lot of this is also documented in Mainline Hardware Decoding on the pine64 wiki.

So I did all that, and what was my reward? Well here’s the punchline, video playback was actually CHOPPIER than without hardware decoding. WHY?! I don’t have a perfect answer for you. My gut feeling is this is a memory bandwidth problem. You see, in wayland the only acceleration method we can use is drm-copy. As the name implies, the data path here is something like

ffmpeg demuxes our file.
h264 frames are sent to the media engine hardware.
decoded frames are sent back to system memory.
mpv then uploads these frames to the GPU in the OpenGL layer.
(?) There might be some colorspace conversion that has to happen here too.
sway composites this into the framebuffer that’s finally displayed on screen.

I may be missing some steps there, but the gist is, that’s a lot of memory bandwidth used for just a single frame. I’ll be generous and guess that we’re using 3 bytes per pixel; a 1920x1080 frame is 5.9MiB. If we have to shuffle that frame around even 3 times, we’re already at 533MiB/s of bandwidth used for a 30fps video minimum. Add onto that the other things the system has to do and the latency involved with a number of these operations, and this little thing just cannot keep up. With software decoding, yeah the CPU is doing all the decoding work, but the reduced memory bandwidth used pushes it ahead just a little bit.
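That back-of-the-envelope estimate is easy to redo with other numbers. The sketch below uses the same guesses as the paragraph above (3 bytes per pixel, 3 copies per frame, 30fps); none of them are measured values.

```shell
# Rough bandwidth estimate for shuffling decoded frames around.
width=1920
height=1080
bytes_per_pixel=3   # a guess, as in the text
copies=3            # how many times each frame gets moved
fps=30

frame_bytes=$((width * height * bytes_per_pixel))
bandwidth=$((frame_bytes * copies * fps))

echo "one frame:  $frame_bytes bytes"                    # 6220800, about 5.9 MiB
echo "bandwidth:  $bandwidth bytes/s"                    # 559872000
echo "bandwidth:  $((bandwidth / 1024 / 1024)) MiB/s"    # 533 (rounded down)
```

In decimal units that's roughly 560 MB/s, and bumping bytes_per_pixel to 4 or fps to 60 makes the picture worse in a hurry.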

“But Artemis, kodi can do smooth playback, how does it do it?”

Well, to answer that let’s go out of wayland and back to the TTY. Now run

mpv --hwdec=drm /path/to/file.mp4

Perfectly smooth playback of ambience by Quite at 1080p30fps. Not a frame dropped. When we use --hwdec=drm instead of --hwdec=drm-copy, the data path is much more direct from file to media processing engine to display. None of these intermediary copies involved. Since mpv has exclusive control over the display, it can easily take the fast path. No pesky windowing system or composition stack in the way. Same goes for kodi! But, sadly, we can’t use --hwdec=drm inside Wayland.

Technically, there’s no reason this fast path couldn’t be taken from within Wayland or even X. On raspberry pi, for example, you can use the very janky omxplayer to take the fast path over there, rendering the video as an overlay atop the session. Rockchip devices can do the same sort of thing, and with proper code you can even make it fit in with the window manager cleanly by positioning the output overtop a window. It’s just, nobody has bothered to write the code to do it.

Well, almost nobody.

<artemis> Apparently nobody has bothered to write something that can take the fast path under Wayland or X on rockchip
<artemis> Because I guess everybody just gives up and uses kodi if they want a media center, or otherwise stops using the thing
<artemis> And the hardware in the pinebookpro is fast enough it can just brute force the inefficiencies at pbp resolution
<my friend> oh uh
<my friend> my coworker did
<my friend> but he can't release it

Gotta love intellectual property law.

Ninth Circle - Adjusting the Display Resolution

It was here at the ninth circle of ARM hell that my journey came to an end. You see, I cannot get this thing to output a display signal at anything other than 1920x1080. If I set it to 1280x720, I don’t get a signal output. If I set it to any manner of standard VGA resolutions, I don’t get a signal. If I plug it into my 1280x1024 monitor, it sure claims it’s in 1280x1024 mode, but there’s no signal. This is constant across the TTY, X11, Wayland. It simply does not matter. I felt, and still feel, like I’m losing my mind when I talk about this. It’s supposed to be in the right mode, it’s just, there’s no output. Here, look!

vi@shiny ~ $ swaymsg -t get_outputs
Output HDMI-A-1 'Unknown GH18PS 0323ME0502' (focused)
  Current mode: 1280x1024 @ 60.020 Hz
  Position: 0,0
  Scale factor: 1.000000
  Scale filter: nearest
  Subpixel hinting: unknown
  Transform: normal
  Workspace: 1
  Max render time: off
  Adaptive sync: disabled
  Available modes:
    1280x1024 @ 60.020 Hz
    1024x768 @ 60.004 Hz
    800x600 @ 60.317 Hz
    800x600 @ 56.250 Hz

And at this point I had to give up. I tried to pick this apart for a day or so, but ultimately I decided enough was enough, and I powered my board down. I fought the Rock64, and the Rock64 won.

Roll Credits

Thank you to everyone that helped me along the way writing this post. There is no way I could have figured this all out on my own; it is so hard to find accurate information about these boards online, especially as the kernel and userspace are both constantly changing around these devices. In particular I want to shout out

Will, for teaching me u-boot and helping me wrangle it into working on this board.
linear, for answering gentoo and u-boot questions, and helping me get lima working.
CounterPillow in the Pine64 IRC, for getting my video acceleration working. The dts patch, the ffmpeg fork, all those links came from them.
The folks over in #gentoo on liberachat, for answering a lot of my beginner gentoo questions.