Tuesday, October 09, 2007

I saved 80MB

It's a common question in the database community: where should one store large binary
objects, commonly known as BLOBs (binary large objects)? There are two
obvious options: store them in the filesystem or store them in a database. And by
twisting arms we get a third obvious approach, a mix of the two. There are
some studies in the academic database community on the tradeoff.

Thinking about it carefully, a database and a filesystem work in similar ways when
storing objects. Both store data in a tree fashion, providing fast access to
any object. Both can add or delete objects. And both suffer from
fragmentation.

Fragmentation is the key issue in my test case. So what is my test case?
It consists of lots and lots of small files (and I mean really small files)
and some very large files. Just to give an example, I have a total of 20000
files. Of them, about
500 are above 4K,
1000 are 2-4K,
2000 are 1-2K, and
16000 are less than 1K.

There is a reason for this extremely skewed distribution. The files are
textcaches in my home directory. The main contributing factors are about 8000
files (a mix of source code, TeX, PDF and HTML and oh ... lots of pictures)
and 19000 emails (without spam). Now, source code files are their own caches,
i.e. no separate textcache is stored for them. Same with plain text files.
Pictures and other media have no text to cache. So the bulk of the cached text
comes from HTML, PDF and office documents. Most such documents in my home
directory are pretty large, i.e. they have a significant amount of textual
content. This is in sharp contrast to the behaviour for emails. The textual
content of the emails is stored, and unlike the HTML files on disk, the HTML
parts in these emails are small, really small. Those contribute to the huge
number of extremely small files. If you are wondering how I ended up with 3000
absurdly small 20-byte files: I store a gzipped copy of the extracted text
in the textcache.
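
If the 20-byte figure looks oddly specific, that is basically the gzip framing
overhead: about 18 bytes of header and trailer, so even a near-empty input
compresses to roughly 20 bytes. A throwaway Python check (not part of the
textcache code itself):

    import gzip

    # 10-byte gzip header + 8-byte trailer + a couple of bytes of deflate data
    print(len(gzip.compress(b"")))    # typically 20 bytes
    print(len(gzip.compress(b"ok")))  # only a few bytes more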

So now the question is how best to store data with such a distribution. The
simplest thing is to use the filesystem. Filesystems are equipped with
sophisticated methods to avoid fragmentation. But with a fixed block size
(like 4K on my home directory partition), there is bound to be some
fragmentation when storing small files like the ones above, since every file
occupies at least one full block. On the other hand, databases were
historically concerned with storing small strings (with all sorts of querying
and update speedups, and they are pretty good at that). More recently, most
databases have added support for large binary objects.
There is also a small twist to this problem: the data is mostly static, but
there are deletions (and sometimes modifications). The last important factor
is that I am using the embedded SQLite database, which stores all its data in
a single file; so in essence, what I get is a "fast retrieval and
update"-enabled huge tar file.

The prevailing folklore is that a database is better for smaller objects and
the filesystem for larger objects. What counts as small and what counts as
large depends on the nature of the problem at hand. There are some papers
doing detailed analyses of the tradeoff, and they conclude somewhat along the
same lines. I found one paper showing that adding and deleting lots of large
objects over time causes enough fragmentation in a database that it can lag
behind a filesystem.

I am not much of an experimental person myself. With the above test case, I did
not even think twice about what I should do. Well, for other reasons (like
heap fragmentation in my application, since reading from a database would
require me to hold the entire object in memory) I was a bit biased towards the
filesystem approach. So I decided to take the middle path of a mix of
filesystem and database. Thanks to decent APIs, implementing the hybrid
approach did not turn out to be anything complicated: I accumulate the first
4K in a buffer; if more data arrives, I move everything to a file, otherwise
I save the buffer in the database.
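
A rough sketch of that logic, using Python and the sqlite3 module with the
hypothetical blobs table from above (the function and threshold names are
mine, and my actual implementation differs in the details):

    import sqlite3

    THRESHOLD = 4096  # one filesystem block on my home partition

    def store(conn, key, chunks, spill_path):
        # Buffer up to THRESHOLD bytes. Small objects end up as a BLOB row
        # in the database; anything larger spills to a plain file.
        buf = b""
        out = None
        for chunk in chunks:
            if out is not None:
                out.write(chunk)
                continue
            buf += chunk
            if len(buf) > THRESHOLD:
                # too big for the database: move what we have to a file
                out = open(spill_path, "wb")
                out.write(buf)
        if out is not None:
            out.close()
        else:
            conn.execute("INSERT OR REPLACE INTO blobs (path, data) VALUES (?, ?)",
                         (key, buf))
            conn.commit()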

All 19500 files smaller than 4K are now safely sleeping in the database; each
of them would otherwise have taken up a full 4K block.
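That is where the number in the title comes from: 19500 files times one 4K
block each comes to just under 80MB.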

(* The title is not entirely correct. I did not have to waste a 4K block on
each of the 19500 files, but now my SQLite database is 26MB instead of
something much smaller. But I am confident this is good for mankind at large,
and for the dolphins too.)
