Blazingly Fast Lua Serialization

You’re writing lua, you want to serialize and deserialize data, and you want to pick the best format/library pairing for the job. What’s good? I’ve been doing some testing to find out. Here’s the short version: If you want the fastest option and you can choose the format, use lua-cbor if you need it to be pure lua, or use lua-protobuf if you’re cool with a C library. If you need JSON, use either lunajson for pure lua, or lua-cjson for a faster C implementation. And now, the details.

JSON

Not much to say about JSON. You’re not going to get great speeds out of this no matter what, but it’s everywhere, and you’ll need it eventually. The long-lived lua-cjson library is going to be fastest for you if you’re cool with a C-based library. If you want pure lua, use lunajson.
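Both libraries expose a near-identical encode/decode pair, so swapping between them is painless. A quick sketch (assuming you've installed lua-cjson or lunajson from luarocks):

```lua
-- Round-trip a table through JSON. lunajson offers the same
-- encode/decode API, so only the require line changes.
local json = require("cjson")   -- or: local json = require("lunajson")

local encoded = json.encode({ name = "widget", count = 3 })
local decoded = json.decode(encoded)
assert(decoded.count == 3)
```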

Unfortunately, JSON being JSON, there’s not a great way to put binary data in here. Both lua-cjson and lunajson will happily encode a binary string directly into the output JSON (and successfully decode it too!) regardless of whether it’s valid Unicode, immediately violating the JSON spec and making many decoders very unhappy with you. If you need to put binary data in, you’re probably best off base64-encoding it.
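If you don’t already have a base64 implementation handy, a minimal pure-lua encoder is only a few lines. This is a sketch for illustration (no decoder, no error handling; a library like basexx would be the robust choice):

```lua
local b64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

-- Encode a (possibly binary) string as base64, three input bytes
-- becoming four output characters, padded with '=' at the end.
local function base64_encode(s)
    local out = {}
    for i = 1, #s, 3 do
        local a, b, c = s:byte(i, i + 2)
        local n = a * 65536 + (b or 0) * 256 + (c or 0)
        local chunk = {}
        for j = 1, 4 do
            local idx = math.floor(n / 64 ^ (4 - j)) % 64
            chunk[j] = b64chars:sub(idx + 1, idx + 1)
        end
        if not c then chunk[4] = "=" end
        if not b then chunk[3] = "=" end
        out[#out + 1] = table.concat(chunk)
    end
    return table.concat(out)
end

assert(base64_encode("Man") == "TWFu")
```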

MessagePack

MessagePack is a schema-less format like JSON, but it’s a binary format. This makes it much more bandwidth-efficient than JSON, and it also lets parsers encode numbers and strings a lot faster. Unfortunately the lua implementations both leave a bit to be desired, so consider using CBOR instead if you can (see the next section).

There are two options here. Both have some important jankiness with regards to how they encode strings.

The first option is kieselsteini’s msgpack. It runs lua’s utf8.len function on all strings before encoding them; if that length comes back successfully, lua validated the string as utf8, and the library encodes it as a utf8 string. Otherwise it sends it as a binary string. This imposes a pretty big cost if you’re primarily transmitting binary data. It’ll also cause problems if the other end expects a value to be always-binary or always-utf8, because it’ll see both depending on the contents. But it won’t be mis-tagging binary data as utf8, so that’s good.
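You can see the check this relies on with lua’s own utf8.len (lua 5.3+), which returns nil when a string isn’t valid utf8:

```lua
-- utf8.len returns the character count for valid utf8,
-- or nil (plus the position of the first bad byte) otherwise.
assert(utf8.len("hello") == 5)
assert(utf8.len("héllo") == 5)        -- multi-byte but valid utf8
assert(utf8.len("\xff\xfe") == nil)   -- raw binary: not utf8
```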

The second option is fperrad’s lua-MessagePack. This one is janky in a different way: out of the box it tags all strings as utf8, even if they’re binary strings. As a result, you might generate errors in whatever you’re sending the data to, as it tries to decode binary data as utf8. You can change this behavior globally to tag all strings as binary by calling MessagePack.set_string('binary'), which at least is technically correct.

lua-MessagePack also lets you specify a custom encoder for a piece of data, which you can use to switch out the string tagging for a specific piece of data only, but it looks a bit cumbersome to use with nested data structures. That said, if all you need is binary strings, lua-MessagePack is probably going to be about as applicable to the task as the CBOR library I’ll talk about next, but I just find it a bit more of a hassle to use.

If lua version compatibility is a concern for you, kieselsteini’s msgpack requires lua5.3. fperrad’s lua-MessagePack provides both a 5.1-compatible version and a 5.3-compatible version, which are available as separate packages on luarocks.

CBOR

Out of JSON, MessagePack, and CBOR, you’re going to have the best time with CBOR. CBOR, like JSON and MessagePack, is a schema-less data format. CBOR is inspired by/derived from MessagePack so at a protocol level it’s very similar. It’s a binary format, it’s bandwidth-efficient, etc. The lua-cbor library is well written, incredibly fast, has good defaults, but is flexible enough to handle mixed string formats in a reasonable fashion.
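Basic usage is as simple as it gets; binary-laden tables round-trip with zero ceremony (a sketch, assuming lua-cbor is installed and loads as the cbor module):

```lua
local cbor = require("cbor")

-- encode and decode a table containing both text and raw bytes
local bytes = cbor.encode({ id = 7, blob = "\0\1\255" })
local value = cbor.decode(bytes)
assert(value.id == 7 and value.blob == "\0\1\255")
```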

Its default behavior is to tag all strings as binary strings (a safe default!). It correctly handles the strange array/map duality of lua tables in the most efficient way it safely can (serializing both variants concurrently and only writing out the correct one at the end). But it also makes it possible to override these behaviors in your data structures.

Your first option is to pass in an options table as the second argument to cbor.encode. For example, if you want to encode a data structure composed entirely of arrays, without any key-value maps, you could run:

local array_mt = { __name = "array" }
local options = {
    [array_mt] = cbor.type_encoders.array
}
cbor.encode(my_array, options)

But the bit I’m happy about is you can also set encoders in the metatables for values. In this case, I’ll return to the string example: if you have a table containing a utf8 string, and you care about tagging it as such, you can set a custom encoder in its metatable.

local table_with_utf8strings_mt = {}
table_with_utf8strings_mt.__tocbor = function(t, opts)
    -- encode the table as a map with all strings tagged as utf8 strings
    local old_string_encoder = cbor.type_encoders.string
    cbor.type_encoders.string = cbor.type_encoders.utf8string
    local encoded = cbor.type_encoders.map(t, opts)
    cbor.type_encoders.string = old_string_encoder
    return encoded
end

setmetatable(my_table, table_with_utf8strings_mt)

This is a bit grungy honestly and I’m not super thrilled about it, but at least it’s possible if you need it. You could even use these custom encoders if you want a custom wire format for encoding your table (maybe you want to omit some fields?), without having to write all the machinery yourself.

As far as lua compatibility goes, lua-cbor will work with lua5.1. With lua5.2 it’ll go faster with the help of the bitshift operators, and with lua5.3+ it’ll go faster still by using string.pack/string.unpack. All of that is handled transparently for you, it selects whatever’s available when you require() it.

There is another CBOR implementation worth mentioning: org.conman.cbor. It has the benefit of supporting CBOR extensions, which you probably don’t need, but if you have to interface with something that uses them, it’s an option. That comes at a cost, though; despite boasting about how parts of it are implemented in C, I actually saw lua-cbor encode/decode data 15-25x faster than this C library. Absurd!

Protobufs

Ah, protobufs. Love them or hate them, they exist, and you might need or even want to use them. In that case lua-protobuf has you covered, for both protobufs version 2 and 3.

Unlike the other data formats listed here, Protobufs uses a schema, meaning you define in advance what your data looks like. lua-protobuf makes this pretty ergonomic. You can include your schema directly in your lua code as a multi-line string and parse it at startup, or you can pre-compile your schema to a binary format. In exchange for using a schema and importing some C code, you get some incredible speeds.
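Here’s what the inline-schema route looks like (a sketch; the Point message is made up for illustration):

```lua
local pb = require("pb")
local protoc = require("protoc")

-- parse the schema at startup from a multi-line string
assert(protoc:load([[
    syntax = "proto3";
    message Point {
        int32 x = 1;
        int32 y = 2;
    }
]]))

-- encode/decode by message name against the loaded schema
local bytes = assert(pb.encode("Point", { x = 3, y = 4 }))
local point = assert(pb.decode("Point", bytes))
assert(point.x == 3 and point.y == 4)
```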

For data which is primarily strings or blobs, you’ll see about the same speed as lua-cbor, since for both libraries all the time spent there is pretty much just memcpy()s. On the other paw, for complex data structures, and especially anything with large arrays, you can see on the order of 20x faster speeds with protobufs than lua-cbor.

There is a quirk you need to be aware of: the currently loaded schema is global state. Thankfully, there is a way to work with this design. lua-protobuf provides pb.state(), a function you can use to grab a copy of the current state. Then you can reset it, load up some new state, do whatever serialization you need, and put the old state back. This lets you juggle multiple schemas within the same program if you need to (although for most usecases you won’t need to).
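The swap looks roughly like this (a sketch based on my reading of the pb.state API; check the lua-protobuf docs before relying on it):

```lua
local pb = require("pb")

-- pb.state(newstate) swaps in newstate and returns the old state
local saved = pb.state(nil)   -- stash the current schema state, start fresh

-- ... load a different schema and do your serialization here ...

pb.state(saved)               -- put the original schema state back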

It’s a worthwhile price to pay for the performance if you need it. I don’t think protobufs can really be beat here, short of writing your own library in C to hand-roll a protocol.

In Conclusion

JSON gets the data there. CBOR gets it there faster. Protobufs gets it there at ludicrous speed. What you ultimately use probably depends on more than just what’s fastest, particularly if you’re interoperating with some other code. But those are probably your best options.