Tuesday, January 29, 2008

FOSS meets ENG

Techkriti brings back fond memories. Techkriti is the annual technology fest at IIT-Kanpur (y'know, the place where they figured out how to decide whether a number is prime in provably polynomial time).

That was one fabulous event I actively took part in during my undergrad days. A perfect mix of all kinds of technology. I was there in its infancy; if I remember correctly, the first prize for the software contest in my first year went to a graphical calculator program written in Tcl/Tk. By the time I left, it was hugely popular, with participants from all over India. I was one of the organizers of the Tech Olympiad in my final year. It was equally fun to come up with challenging problems where various concepts tie together. The participants loved it.

This year Techkriti is even more exciting. I read in the news and on blogs about how FOSS is catching on in the subcontinent, and this time they are organizing a FOSS event at Techkriti, FOSSKriti, probably for the first time. I hope the event becomes a success, though the schedule page is a bit empty right now. Beagle Xesam adapter author Arun and web interface dude Nirbheek are among its organizers, so I am sure I will get first-hand information about all the exciting things that happen (how about a hackfest for Dashboard? ;-)

I will end with two simple brain-teasers.

The first one is mathematical, my favourite kind: show that among any 5 consecutive integers there is always one that is coprime to the other 4.
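If you want some evidence before hunting for a proof, here is a quick brute-force check in Python. It is only an empirical sanity check over a finite range, not a proof.

    from math import gcd

    def has_coprime_member(window):
        # True if some element of the window is coprime to every other one.
        return any(all(gcd(a, b) == 1 for b in window if b != a)
                   for a in window)

    # Check every window of 5 consecutive integers starting below 100000.
    for n in range(1, 100000):
        assert has_coprime_member(range(n, n + 5)), n

    print("holds for all windows starting below 100000")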

The second one relates to programming (somewhat), my hobby. You are given two arrays, A1 of size m+n and A2 of size n, where only the first m slots of A1 are filled with integers sorted in increasing order, and all the slots of A2 are filled with integers sorted in increasing order. Your goal is to merge the two arrays into A1, in linear time (no block array copying or other tricks) but, here is the twist, using no extra space (i.e. no placeholder variable to hold temporary values).

Good luck!
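(Spoiler warning for the second one. One classic approach is to merge from the back of A1, so that no element is overwritten before it has been consumed. A sketch in Python; the puzzle itself is language-agnostic, and whether the two index variables below count as "extra space" depends on how strictly you read the constraint.)

    def merge_in_place(a1, m, a2, n):
        i, j = m - 1, n - 1                 # last real elements of A1 and A2
        for k in range(m + n - 1, -1, -1):  # fill A1 from its last slot down
            if j < 0 or (i >= 0 and a1[i] >= a2[j]):
                a1[k] = a1[i]
                i -= 1
            else:
                a1[k] = a2[j]
                j -= 1

    a1 = [1, 4, 9, None, None]   # m = 3 values, n = 2 empty slots
    a2 = [2, 7]
    merge_in_place(a1, 3, a2, 2)
    print(a1)                    # [1, 2, 4, 7, 9]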

Friday, January 25, 2008

Open letter to OpenSUSE users

(long post warning)

Dear OpenSUSE users,

Recently I came across several threads on various OpenSUSE mailing lists
[1], [2], [3]. Reading the posts, I was both amused and sorry. No, really,
some of you write funny emails. That aside, people, especially those using
FOSS, don't make up things like this. I am sure the problems you faced
exist (or existed, in whatever version you were using).

I joined the project late, but I still feel responsible for the sleepless
nights some of you have had because of beagle, lying awake imagining what
beagle would have done to your computer by the time you woke up. I would
have felt the same in your position; in fact, I sometimes feel that way
about one of the browsers I use.

There were lots of suggestions and speculations, among them suggestions to
file bugs with us. While I would appreciate it if some of you filed bug
reports, I sympathise with those who don't want to open yet another account
just to file bugs or mail a list. I belong to that group myself, so instead
of replying to the threads, let me take a minute here to explain how we try
to be friendly to your computer's hard-disk space, memory and CPU.

* We nice the process and (try to) lower its I/O priority (the first
sketch after this list shows the general idea).

* Extracting text from binary files, without rewriting the app that deals
with files of that type, is an expensive operation. So we index them a few
at a time, with a sufficiently long wait in between. The wait is longer
when the system load average is high. If you are playing games or doing
other CPU-intensive work you will certainly notice the CPU spikes, but
normal use should not be hampered.

* During crawling (of files, emails or browser cache) we try not to
disturb the existing VM buffer cache.

* We believe that once the initial indexing is over there should be no
noticeable effect from beagle, so we crawl a bit faster when the
screensaver is on. We provide options for you to turn that off, too.

* We use a separate process to do the dirty job of actually reading files
and extracting data. As a failsafe measure, if the memory usage of that
helper process grows too much, we kill it and start a new one (the second
sketch after this list shows the idea). I would like to claim that for the
last several versions I have not seen or heard of the helper process being
killed due to memory problems.

* For certain files we need to store the indexable data in a temporary
file. We make sure to delete it as soon as the indexing is over. Some old
versions had problems where these files would not be deleted (they
definitely won't be deleted if you kill -9 the process), but I have not
heard of this problem recently.

* To show you snippets around your search terms, we store the text data of
some types of files (not text or C/C++ sources, whose text data is
basically the file itself, but files in binary formats). We try to be smart
here and avoid creating thousands of small files on disk (I have about 20K
mails, which would otherwise generate at least 10K snippet files). On top
of that, we provide ways to turn snippets off completely.
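Since the first few points above are easier to see in code, here is a minimal sketch of that politeness, in Python and Linux-only for brevity (beagle itself does all this in C#, so everything below, including the file list and the sleep formula, is illustrative, not what beagle actually runs):

    import os
    import subprocess
    import time

    os.nice(19)                                 # lowest CPU scheduling priority

    # Drop to "idle" I/O priority; Python has no portable ioprio_set
    # wrapper, so shelling out to ionice(1) is the simple route.
    subprocess.run(["ionice", "-c", "3", "-p", str(os.getpid())], check=False)

    def index_file(path):
        with open(path, "rb") as f:
            f.read()                            # real code extracts text here
            # Hint that we will not reuse these pages, so crawling does not
            # evict the user's warm buffer cache.
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)

    def crawl(paths, batch_size=5):
        # Index a few files at a time, waiting longer when the system is busy.
        for i in range(0, len(paths), batch_size):
            for path in paths[i:i + batch_size]:
                index_file(path)
            load1, _, _ = os.getloadavg()
            time.sleep(1 + 10 * load1)

    crawl(["/etc/hosts", "/etc/passwd"])        # toy input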
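And a toy version of the helper-process failsafe (again Python, again purely illustrative; the memory ceiling and the fake workload are made up):

    import os
    import time
    from multiprocessing import Process

    LIMIT_BYTES = 200 * 1024 * 1024     # made-up memory ceiling

    def helper_main():
        # Stand-in for the risky work: reading files, running extractors, etc.
        while True:
            time.sleep(1)

    def rss_bytes(pid):
        # Resident set size, read from /proc (Linux-specific).
        try:
            with open(f"/proc/{pid}/statm") as f:
                resident_pages = int(f.read().split()[1])
        except FileNotFoundError:       # process already gone
            return 0
        return resident_pages * os.sysconf("SC_PAGE_SIZE")

    while True:
        helper = Process(target=helper_main)
        helper.start()
        # Watch the helper; if it bloats past the limit (or dies), replace it.
        while helper.is_alive() and rss_bytes(helper.pid) < LIMIT_BYTES:
            time.sleep(5)
        helper.terminate()
        helper.join()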

We do care about your experience, and certainly about my own experience
while indexing my data. So where do we go wrong?

* Once in a while the indexer encounters a file on which it ends up in an
infinite loop. Most of the time it is a malformed file, but sometimes it is
our fault too.

* C# has lots of advantages, and one of them is that the developer does
not have to worry about freeing memory after use. Depending on someone else
(in this case the garbage collector, which frees memory for us) has its
pros and cons. One thing is for sure (assuming mono makes no mistake in
freeing): there will be no memory leaks of the kind we are afraid of in C
or C++, and no segmentation faults due to memory corruption. If you are
wondering how, then, some of you see beagle's memory growing, let me remind
you that to err is human. With sophisticated tools that prevent simple
errors come sophisticated errors. A simple example would be storing every
file beagle finds during crawling in a list, but forgetting to remove the
entries once the data is written to the index; the garbage collector cannot
free what we still reference. No, we never did exactly that, but sometimes
we make similar mistakes (the first sketch after this list shows the
pattern in miniature).

* We would be extremely happy if beagle used only C# for all its
operations. Unfortunately, we have to depend on a lot of C libraries to
index certain files. Sometimes memory leaks (the C type) and segmentation
faults happen in them. These are harder to spot, since mono knows nothing
about the memory allocated inside those C libraries.

* Beagle re-indexes a file as soon as possible after it is saved. It is in
general not possible to know whether it was the user pressing Ctrl-S in
KOffice or a torrent app saving the file after downloading one chunk of
data. As a result, beagle performs horribly, yes, horribly, if it
encounters a file that is being downloaded by a p2p/torrent app: you are
bound to see almost continuous indexing as beagle strives to index the
updated file for you in real time. The same goes for any large, active mbox
file in the home directory _not_ used by Thunderbird, Evolution or KMail
(for the mbox files of those apps, the corresponding backend is smart
enough to index only the changed data).

* NFS shares have their own share of problems with file locking, speed of
data access, etc. We have tried to deal with them in the past by copying
the beagle data directory to a local partition, indexing there, and then
copying the data directory back to the user's home directory. It is a
feature that is not continuously tested, and I am sure you can think of
lots of cases where it would fail.

* The first attempt to write a Thunderbird backend was a disaster. Well,
it was a good learning experience for us, but it caused headaches for most
users. We disabled it in the later 0.2 versions. There is a new one in the
0.3 series which reportedly works better.

* There was one design decision that backfired on us. Imagine you don't
have inotify and you have a large home directory. To show you changes in
real time, one option is to crawl the directories regularly (kind of what
the WinXP indexer does). You can imagine the rest. Though inotify is
present in the kernel these days, the default number of inotify watches
(the number of directories that can be watched) is pretty low for users
with a non-trivially sized home directory (the second sketch after this
list shows how to check your own numbers). In recent versions we have
disabled the regular recrawling.

* Besides continuous CPU usage and hard disk activity (for days and weeks
after the initial indexing is over), the recrawling had another effect: it
bloated the log file. Add to that the pretty verbose exceptions beagle
logs. We want to know about the errors, so we still print verbose
exceptions, but we no longer reprint the same errors over and over. (I have
been told that some of the OpenSUSE packages ship with the log level
reduced to errors-only, which automatically produces smaller log files.)

* This is a good excuse to end the list with. C# and the beagle
architecture allow us to add lots of goodies. After all, we (read: I) work
on beagle solely because I love to play with it. The more the features, the
more the lines of code, and the more the errors. The only good part is that
once spotted, they are easy to fix. Check our mailing list and wiki for the
available freebies.
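To make the "managed leak" point above concrete, here is the mistake in miniature, transliterated into Python (beagle is C#, and we never did exactly this; the names and the structure are mine, purely for illustration):

    import os

    processed = []                   # entries are appended but never removed

    def index(path):
        pass                         # stand-in for the real indexing work

    def crawl_and_index(root):
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                index(path)
                # BUG: the list keeps every path reachable forever, so the
                # garbage collector can never reclaim it. No leak in the C
                # sense, yet memory grows with the number of files crawled.
                processed.append(path)

    crawl_and_index(os.path.expanduser("~"))
    print(len(processed), "paths pinned in memory")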
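And if you want to check whether your own home directory would blow past the default inotify watch budget mentioned above, a quick Linux-only sketch (one watch is needed per watched directory):

    import os

    with open("/proc/sys/fs/inotify/max_user_watches") as f:
        budget = int(f.read())

    home = os.path.expanduser("~")
    dirs = 1 + sum(len(dirnames) for _, dirnames, _ in os.walk(home))

    print(f"{dirs} directories under {home}, watch budget {budget}")
    if dirs > budget:
        print("inotify cannot cover everything; raise the limit, e.g.")
        print("  sysctl fs.inotify.max_user_watches=<something larger>")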

So, in summary: we try to be nice to your computer (and to you? maybe, if
you are nice ;-) ... just kidding), but there are limitations that we are
constantly trying to improve on. Any of you can look at our bugzilla, our
mailing list archive or our wiki, or hang out in our IRC channel, to see
for yourself how we treat every problem with utmost importance. OK, I lied:
with as much importance as our time permits. There are lots of features in
beagle, and some of them rarely get regular testing, mostly because none of
us use them. I won't be surprised if there are major problems with those. I
assure you that if you bring any problem to our notice, it will be taken
care of, if not completely resolved.

Lastly, I read in one of the forum posts that beagle-0.3 will land in
Factory sometime soon. If any of you want to verify the facts above (the
good ones or the bad ones ;-), give it a spin. And a friendly suggestion:
if you only want to search for files with a certain name or extension, you
can do much, much better with find/locate.

Your friendly beagle developer,
- dBera

[1] http://lists4.suse.de/opensuse-factory/2008-01/msg00157.html
[2] http://lists.opensuse.org/opensuse/2007-12/msg01796.html
[3] http://lists.opensuse.org/opensuse/2008-01/msg01083.html (could not find
the parent of this thread)