Wednesday, October 24, 2007

KGiMAP

A not-so-beautiful Allston morning turned out to be a rather useful one with the Slashdot post [Free IMAP On Gmail]. I checked my Gmail settings and (apparently I am among the fortunate ones here) I saw the "Enable IMAP" setting there. Wee...

Fired up KMail and added a new account for Google as a disconnected IMAP. And on my way to sync ... I was curious how the labels were going to be handled; I have about 15 of them based on which mailing list or which bugzilla an email is from (I have mentioned this before - I prefer organization over search). It was nice to see the labels getting downloaded as IMAP folders. And there is a special folder [Gmail] which contains the regular Gmail folders - All Mail, Starred, Trash, Spam, Sent Mail. That's when it struck me!

All the emails are going to be duplicated or triplicated, or could end up with even more copies. Labels are not disjoint, so there are mails in my Inbox that are also in beagle-bugs, and of course every mail ever is also in All Mail. To add to the trouble, deleting an email locally is going to archive it (basically, it removes the Inbox label from the email) and not delete it. The huge number of mails in the Spam folder is not healthy for my hard disk space. The huge number of emails my beagle-bugzilla filter sends to Trash is similarly bad.

For now I have resolved the situation by asking all my filters (I have one filter for each label) to archive the email (Skip the Inbox) and then apply the label. I asked KMail not to subscribe to any of the [Gmail] folders. And I created a Gmail label (an IMAP folder in KMail) called Totrash, which I marked as the Trash folder in KMail for the Gmail account. Now whenever I delete a Gmail email from KMail, it goes to Totrash, which I can empty periodically from the web interface. What a kludge! I wish there were a rule to empty folders, or a filter to move emails from Totrash to Trash!
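
Until such a filter exists, a little script along these lines could probably do the periodic cleanup over IMAP. This is only a rough, untested sketch using Python's imaplib; the server, credentials and folder names are placeholders for my setup, not anything KMail or Gmail provides out of the box:

    import imaplib

    # Rough sketch: move everything labelled Totrash into Gmail's real Trash.
    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    imap.login("me@gmail.com", "secret")

    imap.select("Totrash")
    typ, data = imap.search(None, "ALL")
    for num in data[0].split():
        imap.copy(num, "[Gmail]/Trash")          # copy into the real Trash
        imap.store(num, "+FLAGS", "\\Deleted")   # flag the original for removal

    imap.expunge()   # purge the flagged messages from Totrash
    imap.logout()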

Saturday, October 13, 2007

There is no third way

(This post is not about beagle, though it is hosted on PlanetBeagle.
However, it is somewhat related to "search".)

I have seen two approaches to finding information, in the land of
software as well as real-ware. There are people who find it themselves
and there are those who use an external agent to get what they need.
Human memory is expensive, so these are the most obvious approaches. I
have seen an equal number of people on both sides.

In the first group are people who like to keep everything organized,
from physical files to computer files, books to RSS feeds, knives to
applications. Organization typically involves grouping the items into
categories and re-grouping them if the number of items in a category
grows too much. It's a pain to remember things, and there is so much
to remember. Care is taken so that the number of categories doesn't
grow too large and the nesting isn't too deep. Computer Science people
cashed in on such problems, analyzed them to death and presented us
with databases, shallow and deep trees, hashtables, sorting etc.
Whatever the approach, the objective remains that reaching one
particular item should take as little memory and labour as possible.
Memory and labour appear as a trade-off here, so there are variations
of this where the most frequently accessed items are optimized more
than others (compression, source coding, amortized algorithms,
cabinets with upper and lower shelves etc.).

The second group of people use an external agent to do this task. In
reality, not everyone belongs to only one group; people tend to use
whichever approach is suitable. Anyway, the external agent could be
one's mom, a secretary :), a database, a filesystem, a search engine
etc. They are nothing but efficient members of the first category. The
efficiency matters here; tolerance gets extremely low when I am not
doing something myself but hiring someone else to do it.

It is clear that when dealing with items on the start menu, the same
ideas come into play. OK, so the rest of the post is about the start
menu; you can stop reading now.

Either we have a menu, properly organized, or we have a dump of
everything with an extremely efficient way to get what we want. OK, we
can take the optimized variation of the first and add search to it to
make it work for the second group (e.g. Kickoff, the internet with
search engines and directories, the filesystem with desktop search).
But to get it right, both parts should be done with precision and
care. Which means the search has to be fast and accurate, and the
categorization should be neither too deep nor too flat. Geeshh... that
is difficult.

Actually it is not that difficult for applications. If done the right
way, they can be arranged in a start menu with categories (not Windows
style though!). Or they can just be there and you find them using a
search tool (Quicksilver, Katapult, Deskbar). Or both (Kickoff and the
Kickoff spinoffs).

This is where I found Mandriva to go wrong. They had a nice start menu
to begin with (because Mandrake had a nice start menu to begin with).
In the latest distribution, the Mandriva-specific categorization has
been replaced with a two-level menu (i.e. a top level with
network/tools/office and then the apps inside them). Some of the
submenus became too large, so parts of them have been pushed behind
another level of nesting called "More" (e.g. Firefox went behind More
because it is not a native KDE application). The end result is a menu
which is only two levels deep and pretty flat (so there is less to
remember about which submenu to go to), but once in a submenu, I have
to memorize the position in a pretty long vertical list. Add to that
the effort of dragging the mouse a significant vertical distance. I
have entries that require me to drag the mouse all the way from the
top (I have my start menu button at the top-right corner) to the
bottom - that's 1024 pixels. The earlier menu was the same with one
more level of nesting, like "Browsers", "Spreadsheet", "Terminals",
where I only had to memorize the levels and then look at the 2-3 items
in each immediate category. The vertical distance covered in the
current menu is larger than the horizontal distance added by that
extra submenu. Adding to the woes is the lack of an embedded search
option.

There is another aspect of menus which makes vertical movement much
harder than horizontal movement. It's related to Fitts's Law. But
first a small digression. I like to call it Fitts's Theory, because it
describes a certain operation within a particular model. Calling it a
law isn't wrong, except it makes it sound like anyone not following
the "law" is from Mars. Once again, it tries to model an observed
phenomenon, and does not aim to dictate how the phenomenon should take
place. Unfortunately most commentaries about Fitts's law seem to think
we should all abide by it or leave this planet.
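
For reference, the model in its usual (Shannon) form says the time to
hit a target grows with the distance and shrinks with the target size:

    MT = a + b \log_2\left(\frac{D}{W} + 1\right)

where MT is the movement time, D the distance to the target, W the
width of the target along the axis of motion, and a, b are empirically
fitted constants. That is all it claims; nothing in it dictates how a
menu must behave.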

Anyway, in menus the entries are stacked as horizontal layers. Menus
are designed to open a submenu when the mouse is on the corresponding
entry. The problem happens if I try to follow a straight line from an
entry to an entry in a submenu. If the target appears significantly
lower in the submenu, the mouse will inevitably cross over the next
entry in the menu, which will open the submenu for that next entry,
forcing you to start again. Vertical motion isn't unrestricted. It
happens to me all the time when I quickly try to go from the start
menu to some application whose position I know, trying to go straight
from the button to that entry. If I am not careful, I end up opening
the submenu of the next menu entry in my haste, then move left and up
to find the right menu entry, then move right _carefully_ to enter the
correct submenu horizontally and then rush down to the memorized spot.
All in the wink of an eye, and it has become second nature by now. But
every time this happens, I feel slightly irritated. If menus were
designed to take more advantage of the horizontal freedom of movement,
things would be much smoother. Deeper nesting has the mentioned side
effect. By the way, radial menus, or submenus that open at the middle
or at the first entry, should be better. But apart from a few radial
Firefox context menus, I have not found any other implementation. I
love radial menus.

That is all I had to say about start menus. Before finishing, I should
point out that even for the first group of self-organized users,
having someone or something to search is not totally out of the
question. Some forms of data are too vast to organize (bookmarks and
directories with respect to the internet), some have no single way of
organizing, and some have no meaningful grouping criterion at all. And
even if you arrange all your computer files in directories, emails in
mail folders, feeds in feed reader folders and websites in bookmark
folders, if you ever have to find this blog post (containing "Fitts's
Law" but unrelated to beagle), then good luck!

Tuesday, October 09, 2007

I saved 80MB

It's a common question in the database community where to store large
binary objects, commonly known as BLOBs (binary large objects). There
are two obvious options: store them in the filesystem or store them in
a database. And by twisting arms we get the third obvious approach, a
mix of the two. There are some studies in the academic database
community on the tradeoff.

Thinking about it carefully, a database and a filesystem work in a
similar way when storing objects. Both store data in a tree fashion,
providing fast access to any object. Both can add or delete objects.
And both suffer from fragmentation.

Fragmentation is the key issue in my test case. So what is my test
case? It consists of lots and lots of small files - and I mean really
small files - and some very large files. Just to give an example, I
have a total of 20000 files. Of them, about
500 are above 4K,
1000 are 2-4K,
2000 are 1-2K, and
16000 are less than 1K.

There is a reason for this extremely skewed distribution. The files
are textcaches for my home directory. The main contributing factors
are about 8000 files (a mix of source code, tex, pdf and html and oh
... lots of pictures) and 19000 emails (without spam). Now source code
files are self-caches, i.e. no separate textcache is stored for them.
Same with txt files. Pictures and other media have no text to cache.
So the bulk of the cached text comes from html, pdf and office
documents. Most such documents in my home directory are generally
pretty large, i.e. they have a significant amount of textual content.
This is in sharp contrast to the behaviour for emails. The textual
content of the emails is stored. Also, unlike the html files on the
disk, the HTML parts in these emails are small, really small. Those
contribute to the huge number of extremely small files. If you are
wondering how I ended up with 3000 absurdly small 20-byte files: I
store a gzipped copy of the extracted text in the textcache.
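
A gzip stream carries a 10-byte header and an 8-byte trailer, so even
an empty piece of extracted text costs about 20 bytes - presumably
that is where those files come from. A quick check, if you have Python
handy:

    import gzip

    # Even empty extracted text pays for the gzip header and trailer,
    # which is presumably where the ~20 byte textcache files come from.
    print(len(gzip.compress(b"")))      # 20
    print(len(gzip.compress(b"hi\n")))  # only a few bytes more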

So now the question is how best to store data with such a
distribution. The simplest thing is to use the filesystem. Filesystems
are equipped with sophisticated methods to avoid fragmentation. But
with a fixed block size (like 4K on my home directory partition),
there is bound to be some internal fragmentation (wasted block space)
when storing small files like the above. On the other hand, databases
were historically concerned with storing small strings (with all sorts
of querying and update speedups; and they are pretty good at that).
Recently, most databases have added support for large binary objects.
There is also a small twist to this problem. The data is mostly
static, but there are deletions (and sometimes modifications). The
last important factor is that I am using the embedded SQLite database,
which stores all its data in a single file; so in essence, what I get
is a "fast retrieval and update"-enabled huge tar file.

The prevailing folklore is that a database is better for smaller
objects and a filesystem for larger ones. What is small and what is
large depends on the nature of the problem at hand. There are some
papers doing detailed analysis of the tradeoff, and they conclude
somewhat along the same lines. I found one paper showing that adding
and deleting lots of large objects over time causes enough
fragmentation in a database that it can lag behind a filesystem.

I am not much of an experimental person myself. With the above
testcase, I did not even think twice about what I should do. Well, for
other reasons (like heap fragmentation in my application, since
reading from a database would require me to hold the entire data in
memory) I was a bit biased towards the filesystem approach. So I
decided to take the middle path: a mix of filesystem and database.
Thanks to decent APIs, implementing the hybrid approach did not turn
out to be complicated at all. I store the first 4K in a buffer; if
more data arrives, I move everything to a file, otherwise I save
everything in the database.
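
Sketched in Python rather than the actual code (the table, column and
function names below are made up for illustration), the idea is
roughly this:

    import sqlite3

    THRESHOLD = 4096  # up to one filesystem block goes into the database

    db = sqlite3.connect("textcache.db")
    db.execute("CREATE TABLE IF NOT EXISTS textcache (uri TEXT PRIMARY KEY, data BLOB)")

    def store_textcache(uri, chunks, spill_path):
        """Buffer the first 4K; small content stays in the database,
        anything bigger spills out to a file."""
        buf = b""
        spill = None
        for chunk in chunks:                 # chunks of extracted (gzipped) text
            if spill is None:
                buf += chunk
                if len(buf) > THRESHOLD:     # crossed 4K: move everything to a file
                    spill = open(spill_path, "wb")
                    spill.write(buf)
            else:
                spill.write(chunk)
        if spill is None:                    # never crossed 4K: keep it in the database
            db.execute("INSERT OR REPLACE INTO textcache VALUES (?, ?)",
                       (uri, buf))
            db.commit()
        else:
            spill.close()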

All the 19500 files smaller than 4K are now safely sleeping in the
database; each of them would otherwise have taken a 4K block.
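
That is roughly where the title comes from, assuming a full 4K block
per file:

    19500 \times 4\,\mathrm{KB} \approx 80\,\mathrm{MB}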

(* The title is not entirely correct. I did not have to waste a 4K
block for each of the 19500 files, but now my SQLite database is 26MB
instead of something much smaller. But I am confident this is good for
mankind at large, and for the dolphins too.)