Optimizing Storage: String Compression
Smush it into a string
I built a note-taking app recently and decided to use Lexical for the text editor. It has great built-in features and is extremely customizable, but one downside is the output file for the note state.
It generates a .lexical file, which is essentially a JSON file, and it can get pretty large. There's also a floor on how small the file can be: saving just a few lines of text results in a multi-kilobyte file.
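For context, here's roughly what the serialized state of a single short paragraph looks like (reconstructed from memory, so treat the exact fields as approximate):

```json
{
  "root": {
    "children": [
      {
        "children": [
          {
            "detail": 0, "format": 0, "mode": "normal", "style": "",
            "text": "Hello world", "type": "text", "version": 1
          }
        ],
        "direction": "ltr", "format": "", "indent": 0,
        "type": "paragraph", "version": 1
      }
    ],
    "direction": "ltr", "format": "", "indent": 0,
    "type": "root", "version": 1
  }
}
```

Every node carries the same boilerplate properties, which is what makes the file balloon.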
Since I'm storing these notes in Postgres, and saving space is a priority, I explored ways to reduce the file size.
In general, I would store large JSON objects in some blob or object storage, but in this case the compressed and minified output is a tiny string.
Minification
The first thing I did was start building a minifier for the Lexical file. Luckily, before I got too far, I found a really good existing minifier implementation for Lexical files.
Manually writing a minifier with array packing is interesting and broad enough to deserve its own article, but I suggest taking a look at the code. The sketch below shows the core idea.
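To give a flavor of what minification means here, this is a rough sketch of one part of it: recursively renaming Lexical's verbose, repeated keys to single characters. The key map and function are hypothetical, not the linked minifier's actual code:

```ts
// Hypothetical key map; the real minifier also packs values into arrays.
const KEY_MAP: Record<string, string> = {
  children: "c",
  direction: "d",
  format: "f",
  type: "t",
  version: "v",
};

// Recursively rewrite object keys using the map above.
function shortenKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(shortenKeys);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [
        KEY_MAP[k] ?? k,
        shortenKeys(v),
      ])
    );
  }
  return value; // primitives pass through unchanged
}
```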
Compression
Looking at the Lexical data structure, even in its minified form, I noticed a lot of repeated properties. With only a handful of properties and values being used over and over, this structure was an ideal candidate for compression. But not all compression algorithms are the same, so I decided to test a few to see which would perform best for my needs.
Evaluating Algorithms
Huffman Coding
Huffman coding is a frequency-sorted binary tree algorithm: it assigns shorter codes to more frequent characters. It's a great general-purpose compression method and is used in conventional compression formats like gzip. The resulting code table maps each character to a bit string, for example:
{ f:0, c:100, d:101, a:1100, ... }
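To make the mechanism concrete, here's a minimal sketch of building a Huffman code table. It's illustrative only; a real encoder would also serialize the tree and pack the bit strings into actual bytes:

```ts
type HuffNode = { freq: number; char?: string; left?: HuffNode; right?: HuffNode };

function buildCodeTable(input: string): Record<string, string> {
  // Count character frequencies.
  const freq = new Map<string, number>();
  for (const ch of input) freq.set(ch, (freq.get(ch) ?? 0) + 1);

  // Start with one leaf node per distinct character.
  const nodes: HuffNode[] = [...freq].map(([char, f]) => ({ char, freq: f }));

  // Repeatedly merge the two least-frequent nodes until one tree remains.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq);
    const [left, right] = nodes.splice(0, 2);
    nodes.push({ freq: left.freq + right.freq, left, right });
  }

  // Walk the tree: left edges emit "0", right edges emit "1".
  const codes: Record<string, string> = {};
  const walk = (node: HuffNode, path: string) => {
    if (node.char !== undefined) {
      codes[node.char] = path || "0"; // single-character input edge case
      return;
    }
    if (node.left) walk(node.left, path + "0");
    if (node.right) walk(node.right, path + "1");
  };
  if (nodes.length > 0) walk(nodes[0], "");
  return codes;
}
```

Frequent characters, like the ones in Lexical's repeated property names, end up with the shortest codes, much like the table above.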
Lempel-Ziv-Welch (LZW)
LZW builds a dictionary of patterns from the input data. It has an interesting history of legal issues. Depending on the implementation and style of encoding, the output can be:
A map of encoded character frequency:
{ "0": 123, "1": 34, "2": 99, "3": 34, ... }
A binary array:
[ 123, 34, 99, 34, ...]
An encoded string
㞂張䔲...
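Here's a minimal sketch of the classic LZW algorithm, to show where those output shapes come from. It assumes 8-bit input characters and skips real-world details like dictionary resets and variable-width bit packing:

```ts
// Classic LZW: emit a code for the longest phrase already in the
// dictionary, then add that phrase plus one character as a new entry.
function lzwEncode(input: string): number[] {
  // Seed the dictionary with all 256 single-byte characters.
  const dict = new Map<string, number>();
  for (let i = 0; i < 256; i++) dict.set(String.fromCharCode(i), i);
  let nextCode = 256;

  const codes: number[] = [];
  let phrase = "";
  for (const ch of input) {
    const candidate = phrase + ch;
    if (dict.has(candidate)) {
      phrase = candidate; // keep extending the current match
    } else {
      codes.push(dict.get(phrase)!); // emit the longest known phrase
      dict.set(candidate, nextCode++); // learn the new pattern
      phrase = ch;
    }
  }
  if (phrase) codes.push(dict.get(phrase)!);
  return codes;
}

// Mapping each code to a UTF-16 code unit gives the "encoded string"
// form above (valid while codes stay below 0x10000).
const toPackedString = (codes: number[]) =>
  codes.map((c) => String.fromCharCode(c)).join("");
```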
What Matters Most for My Use Case
When choosing a compression approach, there are four factors to consider: compression speed, decompression speed, compressed size, and output format.
Decompression Speed
In my app's case, decompression happens only once, when the frontend retrieves and processes the saved note. Whether it takes 1ms or 50ms is irrelevant. Even the slowest algorithm can decompress a single encoded string fast enough that it's imperceptible, and the GET request will account for more of the loading time.
Compression Speed
Similarly, compression happens asynchronously in the app. When a user is writing a note, the text input is debounced and an async POST request is made. This whole process is invisible to the user and doesn't affect the frontend, so speed is not a big concern here either.
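For illustration, the save path looks something like this sketch; minify and compress here are placeholders, not the app's actual API:

```ts
declare function minify(state: string): string;   // placeholder
declare function compress(state: string): string; // placeholder

let timer: ReturnType<typeof setTimeout> | undefined;

function onEditorChange(serializedState: string) {
  // Debounce: restart the timer on every keystroke.
  clearTimeout(timer);
  timer = setTimeout(() => {
    const payload = compress(minify(serializedState));
    // Fire-and-forget: the user never waits on this request.
    void fetch("/api/notes", { method: "POST", body: payload });
  }, 500); // debounce window; tune to taste
}
```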
Compressed Size
The primary metric I care about is compressed size. Smaller sizes mean reduced storage costs and faster network transfers.
Actual Output
Some of the libraries I tested output encoded characters that were incompatible with sending over HTTP or saving in a Postgres text or jsonb column.
Luckily, it's straightforward to get an encoded string out of an LZW implementation, which also happened to produce some of the smallest output while remaining fast.
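A simple guard along these lines can catch unsafe output before it hits the database (a sketch, not the exact check from my code): Postgres rejects NUL characters in text and jsonb values, and lone surrogate code units won't survive JSON serialization or UTF-8 encoding.

```ts
// Conservative check: no NUL, and no surrogate code units at all
// (unpaired surrogates are invalid; rejecting pairs too keeps it simple).
function isStorageSafe(s: string): boolean {
  return !s.includes("\u0000") && !/[\uD800-\uDFFF]/.test(s);
}
```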
Final Approach
I eventually decided to minify the file first, then compress it with a customized LZWLite implementation. In my initial testing, I was able to reduce a 20KB Lexical file to a 1.2KB string that could reliably be decompressed.
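Put together, the pipeline is roughly the following, with placeholder names standing in for the minifier and the customized LZWLite implementation:

```ts
declare function minifyLexical(json: string): string;  // placeholder
declare function lzwCompress(s: string): string;       // placeholder
declare function lzwDecompress(s: string): string;     // placeholder

function packNote(lexicalJson: string): string {
  const minified = minifyLexical(lexicalJson);
  const compressed = lzwCompress(minified);
  // Round-trip check: never store something that can't be restored.
  if (lzwDecompress(compressed) !== minified) {
    throw new Error("compression round-trip failed");
  }
  return compressed; // ~20KB in, ~1.2KB out in my testing
}
```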
If you're working with Lexical files or other JSON-based formats, I'd recommend checking out the Lexical Minifier. If you're curious about my tests or want to add more compression algorithms, feel free to open a PR on my repo!