Implement a key-value database in Python

Introduction

In this tutorial, we’ll walk through building a key-value database in Python. The design is based entirely on the bitcask project, with a few simplifications.

Our database will persist records to disk so we don’t lose all of our data in a crash. It will support just a few operations: put, get, and delete.

Here’s a sneak peek of the final result. Usage is quite simple:

>>> db = BitCask("db")
>>> db.put("message", "Hello, World!")
>>> db.get("message")
'Hello, World!'

Design

The two data structures we’ll end up using are dictionaries and files, but let’s build up to the final design intuitively.

To store a key-value (KV) pair in the database, we’ll append it to a file.

Here’s what our record looks like:

 ____________________
|          |         |
|   key    |  value  |
|__________|_________|

This plan for storing records seems simple enough, but how do we access them? One idea would be to scan through the files in our database until we find the key we’re looking for.

Where does one record end and the next begin? Since we don’t know how big each record is, we can’t say. We could make records uniform by imposing a maximum key and value size, but then we’d be putting an arbitrary cap on how large keys and values can be. There’s also the issue of wasted disk space for records smaller than the maximum size.

Instead, let’s store some metadata in our record: two fixed-size fields that hold the sizes of our key and value. To read an individual record, we can look at the metadata to figure out how much further we should read to get the key and value. Here’s our updated record:

 _________________________________________________
|             |             |           |         |
|  key size   | value size  |   key     |  value  |
|_____________|_____________|___________|_________|
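
To make this concrete, here’s a rough sketch of how a reader could use those two fixed-size fields to pull a single record out of an open file (assuming 8-byte big-endian integers for the sizes, which is what we’ll use later; the real codec comes further down):

# sketch: parse one record using the two fixed-size length fields
import struct

def parse_record(f):
    sizes = f.read(16)  # 8-byte key size followed by 8-byte value size
    key_size, value_size = struct.unpack(">qq", sizes)
    key = f.read(key_size)
    value = f.read(value_size)
    return key, value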

Efficiency aside (for now), there’s still a problem with this approach. If we store the KV pair name -> John and later update it by storing name -> Johnny, how can we be sure to find the correct record on lookup 🤔?

One option is to store a timestamp along with our KV pair. If we have two records with the same key, we consider the record with the most recent timestamp correct. We can store the timestamp in another fixed-size field on our record. Here’s the third iteration:

 _____________________________________________________________
|             |            |            |           |         |
|  timestamp  | key size   | value size |   key     |  value  |
|_____________|____________|____________|___________|_________|

Unfortunately, this made our lookup efficiency even worse. Now when we look up a key, not only do we have to scan records until we find the key, we have to scan until we find the record with that key and the highest timestamp. This means we have to scan the entire database. Ouch.

That would be like flipping through every page of an encyclopedia until we found the term we were interested in. If our database were a book, we’d just look up our word in the index and jump directly to the right page.

For the price of a bit of extra book-keeping, the index helps us look things up faster. As you might have guessed, databases leverage the same concept: a database index.

We’ll implement an index for our key-value store as follows: whenever we store a record, we’ll use a dictionary to map its key to its location on disk. If we store a new value at a key, we can just update the dictionary so it points to the current record.
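
In code, the index is nothing more than a dictionary from keys to disk locations, conceptually something like this (the names and numbers here are just illustrative; the real index class comes later):

index = {}

# remember where the latest record for "name" lives on disk
index["name"] = ("file-1", 0, 32)   # (filename, offset, size in bytes)

# on lookup, jump straight to that offset and read exactly `size` bytes
filename, offset, size = index["name"]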

This design sounds viable! It’s a bit more complicated than the initial proposal of KV-pairs on disk, but it’s really just a log-file paired with a dictionary. Time to implement it.

Implementation

Since our records are stored on disk, we’ll need a way to encode an in-memory record representation into raw bytes and a way to decode the bytes back into our in-memory representation.

The following code accomplishes this. Note that I’ve added an additional convenience method that just decodes the metadata portion of the record.

# codec.py

import struct
import bitcask_file

METADATA_STRUCT = ">dqq"
METADATA_BYTE_SIZE = 24  # 3 * 8 bytes

def encode(record):
    # fixed-size metadata: a 64-bit float timestamp and two 64-bit integer sizes
    key_size = record.keysize
    value_size = record.valuesize
    timestamp = record.timestamp

    # variable length strings
    key = record.key
    value = record.value

    metadata = struct.pack(METADATA_STRUCT, timestamp, key_size, value_size)
    data = key.encode() + value.encode()
    record_bytes = metadata + data
    return record_bytes

def decode_metadata(data):
    (timestamp, keysize, valuesize) = struct.unpack(METADATA_STRUCT, data)
    return (timestamp, keysize, valuesize)

def decode(data):
    (timestamp, ksize, vsize) = decode_metadata(data[:METADATA_BYTE_SIZE])
    string_data = data[METADATA_BYTE_SIZE:]
    # note: key and value are left as raw bytes here; callers decode them
    key = string_data[:ksize]
    value = string_data[ksize:]
    return bitcask_file.Record(timestamp, ksize, vsize, key, value)

This code makes use of the struct Python module. It lets us define the layout of our byte data so that it can be converted into a Python tuple and vice versa. The METADATA_STRUCT variable above defines a byte layout of a double (d) followed by two long longs (q), in big-endian format (>).

In the encode function, the timestamp is converted into an 8-byte float, and our key size and value size are both converted to 8-byte integers. The key and value themselves are variable-length strings, encoded as UTF-8 bytes and appended after the metadata.

The decode step is just the reverse of this, but we use struct.unpack() to interpret the byte data using the layout we defined.
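
As a quick sanity check (assuming both modules are on the path), we can encode a record and decode it back. Note that decode leaves the key and value as raw bytes; we decode them to strings at a higher layer:

# codec roundtrip check
import codec
from bitcask_file import Record

record = Record(timestamp=1700000000.0, keysize=4, valuesize=4,
                key='name', value='John')
data = codec.encode(record)
print(len(data))           # 32: 24 bytes of metadata + 8 bytes of key/value
print(codec.decode(data))  # Record(timestamp=1700000000.0, keysize=4,
                           #        valuesize=4, key=b'name', value=b'John')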

Our codec tells us how to encode/decode our records, but we still need to actually read them from and write them to files. That’s what the following is responsible for:

# bitcask_file.py

import time
import uuid
import codec
from collections import namedtuple

Record = namedtuple(
    'Record', ['timestamp', 'keysize', 'valuesize', 'key', 'value'])


class File:
    """
    Class representing a bitcask file, which is an append-only log
    of key, value pairs and their associated metadata
    """

    def __init__(self, dir, filename=None, offset=0):
        # default arguments are evaluated only once, so generate the uuid here
        filename = filename or str(uuid.uuid4())
        self.filename = '/'.join([dir, filename])
        self.offset = offset

    def _load_next_record(self):
        read_bytes = 0
        with open(self.filename, 'rb') as f:
            f.seek(self.offset, 0)
            meta_bytes = f.read(codec.METADATA_BYTE_SIZE)
            if meta_bytes:
                (tstamp, ksize, vsize) = codec.decode_metadata(meta_bytes)
                key_bytes = f.read(ksize)
                value_bytes = f.read(vsize)
                key = key_bytes.decode()
                value = value_bytes.decode()
                read_bytes += len(meta_bytes) + ksize + vsize
                self.offset += read_bytes
                return Record(tstamp, ksize, vsize, key, value)

    def write(self, key, value):
        """
        encode the data and append to the file
        """
        # sizes are byte lengths, so reads know exactly how many bytes to grab
        keysize = len(key.encode())
        valuesize = len(value.encode())
        timestamp = time.time()
        record = Record(timestamp, keysize, valuesize, key, value)
        data = codec.encode(record)
        count = 0
        with open(self.filename, 'ab') as f:
            count = f.write(data)
        curr_offset = self.offset
        self.offset += count
        return (timestamp, curr_offset, count)

    def read(self, pos, size):
        """
        read bytes from the file and decode into record
        """
        data = b''
        with open(self.filename, 'rb') as f:
            f.seek(pos, 0)
            data = f.read(size)
        return codec.decode(data).value

Let’s start with the read method. read takes a position and a size (of a record), seeks to that position in the file, and uses our decode function from before to rebuild the record and return its value.

write on the other hand doesn’t take a position, just a key/value. Our bitcask file is an append-only log, so this function just needs to build a record, encode it, and write it to the end of the file. Note that this means we also have to know where the end of the file is. Hence, after we write we update an instance variable that keeps track of the current file offset.

Next is the _load_next_record function. This function will get used in our initialization of a bitcask instance. If we ever kill the bitcask process, we’ll need a way to restore the file offset and re-build our index from the records on file. This is a helper function that we’ll use shortly to do just that.
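
Here’s a small example of the File class in action (assuming the db directory already exists; BitCask will take care of creating it later). We write a record, read it back by position, and then replay the file from the start with _load_next_record:

from bitcask_file import File

f = File('db', filename='example.db')
(timestamp, pos, size) = f.write('name', 'John')
print(pos, size)          # 0 32 on the first write
print(f.read(pos, size))  # b'John'

# replay the log from the beginning, the way initialization will
scanner = File('db', filename='example.db', offset=0)
while True:
    record = scanner._load_next_record()
    if not record:
        break
    print(record.key, '->', record.value)  # name -> John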

We’ve yet to actually see what the index data structure looks like in code. It’s just a dictionary wrapper class that’s explicit about what it stores.

# keydir.py

class KeyDir:
    def __init__(self):
        self.items = {}

    def put(self, file, timestamp, k, v, offset, size):
        self.items[k] = KeyDirItem(file, timestamp, v, offset, size)

    def get(self, k):
        if k in self.items:
            return self.items[k]

    def delete(self, k):
        # pop so that deleting a missing key isn't an error
        self.items.pop(k, None)


class KeyDirItem:
    def __init__(self, file, timestamp, v, offset, size):
        self.file_id = file
        self.timestamp = timestamp
        self.value = v
        self.size = size
        self.pos = offset

We now have all the pieces of our bitcask database, so it’s time to glue them together. The next code is our API for the database. It implements the put/get/delete operations and the initialization routine.

# bitcask.py


from bitcask_file import File
import os
from keydir import KeyDir

TOMBSTONE_VALUE = 'd49200c8-0a26-4f00-b4f0-7a9dffe0e288'

class BitCask:
    _instance = None

    def __new__(cls, dir):
        if cls._instance is None:
            cls._instance = super(BitCask, cls).__new__(cls)
            cls._instance.setup(dir)
        return cls._instance

    def setup(self, dir):
        self.dir = dir
        os.makedirs(self.dir, exist_ok=True)
        self.active_file = File(self.dir)
        self.filemap = {self.active_file.filename: self.active_file}
        self.keydir = KeyDir()
        self.populate_keys()

    def populate_keys(self):
        """
        rebuild keydir in memory from existing db by reading through every
        file and adding keys that we've never seen or are newer than the key
        """
        for filename in os.listdir(self.dir):
            file = File(self.dir, filename, 0)
            self.filemap[file.filename] = file
            while True:
                curr_offset = file.offset
                r = file._load_next_record()
                if r:
                    entry = self.keydir.get(r.key)
                    if entry and r.timestamp > entry.timestamp and r.value != TOMBSTONE_VALUE:
                        # key exists but the record we found is newer
                        size = file.offset - curr_offset
                        self.keydir.put(file.filename, r.timestamp, r.key, r.value, curr_offset, size)
                    elif entry and r.timestamp > entry.timestamp and r.value == TOMBSTONE_VALUE:
                        # a newer tombstone means the key was deleted
                        self.keydir.delete(r.key)
                    elif (not entry) and r.value != TOMBSTONE_VALUE:
                        # add new key to the keydir
                        size = file.offset - curr_offset
                        self.keydir.put(file.filename, r.timestamp, r.key, r.value, curr_offset, size)
                else:
                    break

    def put(self, key, value):
        (timestamp, offset, size) = self.active_file.write(key, value)
        self.keydir.put(self.active_file.filename,
                        timestamp, key, value, offset, size)

    def get(self, key):
        entry = self.keydir.get(key)
        if entry:
            return self.filemap[entry.file_id].read(entry.pos, entry.size).decode()

    def delete(self, key):
        self.keydir.delete(key)
        self.active_file.write(key, TOMBSTONE_VALUE)

First, note that the put and get functions basically just call out to the keydir and bitcask_file helper methods.

The first of a couple caveats that I neglected to mention before is that there could be multiple files. Each time we start a bitcask process, we create a new active file that we’ll write to.1 We maintain a map of all the other files, so we can reference data from any of them.

The second caveat is this TOMBSTONE_VALUE. To delete a record, we delete the key from the keydir index, but deleting it from the file is tricky because our bitcask files are append-only logs. Instead, we have a globally unique value that we write to the file to mark the deletion of a key.

This last function, populate_keys, goes back to that _load_next_record helper function we saw earlier. This is our algorithm for re-building our index from our bitcask files. In short, we read through all the files in a directory specified at instantiation. For each record, we do the following:

If the record’s key is already in the keydir and this record is newer, we put it in the keydir; if that newer value is the TOMBSTONE_VALUE, we delete the key from the keydir instead. Otherwise, if we’ve never seen the key before and its value isn’t a tombstone, we add it to the keydir.
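
To see this recovery path in action, we can simulate a restart by clearing the singleton by hand (a bit of a hack, but it forces setup and populate_keys to run again over the same directory):

>>> db = BitCask("db")
>>> db.put("message", "Hello, World!")
>>> db.delete("message")
>>> BitCask._instance = None   # simulate killing and restarting the process
>>> db = BitCask("db")         # populate_keys() rebuilds the keydir from disk
>>> db.get("message")          # the tombstone wins, so nothing comes back
>>> db.put("message", "Hello again!")
>>> db.get("message")
'Hello again!'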

That’s all the code! It’s quite a small project. You can see the full source code on my github.

Footnotes

1 Remember how updates are implemented by just writing the same key to disk with a newer timestamp? This wastes space, since older records with the same key are no longer needed. Actual log-structured key-value databases resolve this issue by doing compaction. Whenever the “active” file gets big enough (or we start a new database process), we make it immutable and start a new “active” file. To free up disk space, the compaction process looks at all the immutable files and merges them, keeping only the most recent version of each key. I didn’t implement compaction here, so the “active” file concept has less motivation, but there’s also less code.