Friday, January 25, 2008

Open letter to OpenSUSE users

(long post warning)

Dear OpenSUSE users,

Recently I came across several threads in various OpenSUSE mailing lists
[1], [2], [3]. I was both amused and saddened while reading the posts. No
really, some of you write funny emails. That aside, people, especially those
using FOSS, don't make up things like this. I am sure the problems you faced
exist (or existed in whatever version you were using).

I joined the project later, but I still feel responsible for the sleepless
nights some of you have had because of beagle, wondering what beagle would
have done to your computer by the time you woke up. I would have felt the
same in your position; in fact, I sometimes feel the same about one of the
browsers I use.

There were lots of suggestions and speculations in those threads, including
suggestions to file bugs with us. While I would appreciate it if some of you
filed bug reports, I sympathise with those who don't want to open yet another
account just to file bugs or email a mailing list. I belong to the latter
group myself, so instead of replying to the thread, let me take a minute here
to explain how we try to be friendly to your computer's hard-disk space,
memory and CPU.

* We nice the process and (try to) lower its I/O priority.

* Extracting text from binary files, without rewriting the application that
handles that file type, is an expensive operation. So we index them a few at
a time, with a sufficiently long wait in between. The wait period is longer
if the system load average is high (a rough sketch of this kind of throttling
follows the list). But if you are playing games or doing other CPU-intensive
work, you will still notice the CPU spikes. Normal use should not be hampered,
though.

* During crawling (for files, emails or browser cache) we try not to disturb
the existing vm buffer cache.

* We believe that once the initial indexing is over there should be no
noticeable effect from beagle, so we crawl a bit faster when the screensaver
is on. But we provide an option for you to turn that behaviour off.

* We use a separate process to do the dirty job of actually reading the
files and extracting data. As a failsafe measure, if the memory usage of that
helper process increases too much, we kill it and start a new one (a rough
sketch of such a watchdog also follows the list). I would like to claim that
in the last several versions I have not seen or heard of the helper process
being killed due to memory problems.

* For certain files we need to store the indexable data in a temporary file.
We make sure we delete it as soon as the indexing is over. There were
problems in some old versions where these files would not be deleted (and
they definitely won't be deleted if you kill -9 the process), but I have not
heard about this problem in recent times.

* To show you snippets containing your search terms, we store the text data
of some types of files (not plain-text or C/C++ sources, whose text data is
basically the file itself, but files in a binary format). We try to be smart
here and not create thousands of small files on disk (my roughly 20K mails
alone would otherwise generate at least 10K snippet files). In addition, we
provide ways for you to turn off snippets completely.
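
For those curious what the throttling mentioned above actually looks like,
here is a minimal sketch in C#. It is not Beagle's real code: the batch size,
the delays, the load threshold and the way files are discovered are all
invented for illustration. The only real pieces are /proc/loadavg and the
standard library calls.

    // A minimal sketch of load-aware throttled indexing -- NOT Beagle's
    // actual code.  Batch size, delays and the load threshold are made up.
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.IO;
    using System.Threading;

    class ThrottledIndexer {
        const int BatchSize = 5;

        // The first field of /proc/loadavg is the 1-minute load average.
        static double LoadAverage () {
            string first = File.ReadAllText ("/proc/loadavg").Split (' ')[0];
            return Double.Parse (first, CultureInfo.InvariantCulture);
        }

        static void IndexBatch (Queue<string> pending) {
            for (int i = 0; i < BatchSize && pending.Count > 0; i++)
                Console.WriteLine ("indexing {0}", pending.Dequeue ());
        }

        static void Main () {
            string home = Environment.GetEnvironmentVariable ("HOME");
            Queue<string> pending = new Queue<string> (Directory.GetFiles (home));

            while (pending.Count > 0) {
                IndexBatch (pending);

                // Wait a couple of seconds between batches, and wait
                // longer when the machine is already busy.
                int delay = 2000;
                if (LoadAverage () > 1.5)
                    delay *= 5;
                Thread.Sleep (delay);
            }
        }
    }

The real daemon does a lot more (I/O priority, screensaver detection and so
on), but the basic idea is the same: small batches with load-dependent pauses.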
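
The helper-process failsafe works along similar lines. Here is an equally
rough sketch, again not the real beagled code: watch the helper's resident
memory and replace the process if it grows past a limit. The helper name
"index-helper" and the 200 MB limit are invented for the example.

    // Rough sketch of a memory watchdog for a helper process -- not the
    // real beagled code.  "index-helper" and the limit are invented.
    using System;
    using System.Diagnostics;
    using System.Threading;

    class HelperWatchdog {
        const long MemoryLimit = 200L * 1024 * 1024;   // 200 MB, arbitrary

        static Process StartHelper () {
            return Process.Start ("index-helper");
        }

        static void Main () {
            Process helper = StartHelper ();

            while (true) {
                Thread.Sleep (5000);
                helper.Refresh ();

                // Failsafe: if the helper died or grew too large,
                // replace it with a fresh one.
                if (helper.HasExited || helper.WorkingSet64 > MemoryLimit) {
                    if (!helper.HasExited)
                        helper.Kill ();
                    helper = StartHelper ();
                }
            }
        }
    }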

We do care about your experience (and I certainly care about my own while
indexing my data). So where do we go wrong?

* Once in a while the indexer encounters a file on which it ends up in an
infinite loop. Most of the time it is a malformed file, but sometimes it is
our fault.

* C# has lots of advantages, and one of them is that the developer does not
have to worry about freeing memory after it is used. Depending on someone
else (in this case the garbage collector, which frees the memory for us) has
its pros and cons. But one thing is for sure (assuming mono is not making any
mistake in freeing): there is not going to be any memory leak of the kind we
are afraid of in C or C++. Neither are we afraid of segmentation faults due
to memory corruption. If you are wondering how, then, some of you see
beagle's memory growing, let me remind you that "to err is human". With
sophisticated tools that prevent simple errors come sophisticated errors. A
simple example would be storing in a list every file beagle finds during
crawling, but forgetting to remove the entries once the data is written to
the index (there is a tiny illustration of exactly this kind of mistake after
the list). No, we never did exactly that, but sometimes we make similar
mistakes.

* We would be extremely happy if beagle used only C# for all its operations.
Unfortunately, we have to depend on a lot of C libraries for indexing certain
files. Sometimes memory leaks (of the C kind) and segmentation faults happen
in them. These are harder to spot, since mono does not know about the memory
allocated inside the C libraries.

* Beagle re-indexes a file as soon as possible after it is saved. In general
it is not possible to know whether it was the user pressing Ctrl-S in KOffice
or a torrent application saving the file after downloading one chunk of data.
As a result, beagle performs horribly, yes horribly, if it encounters a file
that is being downloaded by a p2p/torrent application. You are bound to see
almost continuous indexing as beagle strives to index the updated file for
you in real time. The same goes for any large, active mbox file in the home
directory _not_ used by Thunderbird, Evolution or KMail (for the mbox files
of those applications, the corresponding backend is smart enough to index
only the changed data).

* NFS shares have their own share of problems with file locking, speed of
data access, etc. We have tried to deal with them in the past by copying the
beagle data directory to a local partition, performing the indexing there and
then copying the data directory back to the user's home directory. It is a
feature that is not continuously tested, and I am sure you can think of lots
of cases where this would fail.

* The first attempt at writing a Thunderbird backend was a disaster. Well, it
was a good learning experience for us, but it would cause headaches for most
users. We disabled it in the later 0.2 versions. There is a new one in the
0.3 series which reportedly works better.

* There was one design decision which backfired on us. Imagine you don't have
inotify and have a large home directory. To show you changes in real time,
one option is to crawl the directories regularly (roughly what the WinXP
indexer does). You can imagine the rest. Though inotify is present in the
kernel these days, the default number of inotify watches (the number of
directories that can be watched) is pretty low for users with a non-trivially
sized home directory (a small check of this limit against your own home
directory is sketched after the list). In recent versions we have disabled
the regular recrawling.

* Besides continuous CPU usage and hard-disk activity (for days and weeks
after the initial indexing is over), the above also had an effect on the log
file. Add to that the pretty verbose exceptions beagle logs. We want to know
about the errors, so we still print verbose exceptions, but we don't reprint
the same errors anymore. (I have been told that some of the OpenSUSE packages
have the log level reduced to error-only, which automatically produces
smaller log files.)

* This is a good excuse to end the list with. C# and the beagle architecture
allow us to add lots of goodies. After all, we (read: I) work on beagle
solely because I love to play with it. The more features, the more lines of
code, and the more errors. The good part is that, once spotted, they are easy
to fix. Check our mailing list and wiki for the available freebies.
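
To make the garbage-collector point above concrete, here is a tiny,
artificial illustration; it is not taken from Beagle's source. Nothing in it
is corrupt or dangling, the objects are simply kept reachable forever, so the
collector can never free them:

    // An artificial example of a managed-code "leak": the paths stay
    // reachable forever, so the garbage collector can never reclaim them.
    using System;
    using System.Collections.Generic;

    class LeakyCrawler {
        // Every file found during crawling is remembered here ...
        static List<string> found = new List<string> ();

        static void OnFileFound (string path) {
            found.Add (path);
            // ... and if nothing ever removes the entry once the file has
            // been indexed, memory grows with the size of the home directory.
        }

        static void Main () {
            for (int i = 0; i < 1000000; i++)
                OnFileFound ("/home/user/file-" + i);
            Console.WriteLine ("still holding {0} paths", found.Count);
        }
    }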
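
And for the inotify point, here is a quick, purely illustrative way to see
how your own home directory compares with the kernel's per-user watch limit:
count the directories under $HOME and read
/proc/sys/fs/inotify/max_user_watches. This is not what beagled does
internally; it just shows the numbers involved.

    // Compare the number of directories under $HOME with the kernel's
    // per-user inotify watch limit.  Illustrative only; it will stop with
    // an exception on directories it is not allowed to read.
    using System;
    using System.IO;

    class WatchCheck {
        static void Main () {
            string home = Environment.GetEnvironmentVariable ("HOME");
            int dirs = Directory.GetDirectories (home, "*",
                           SearchOption.AllDirectories).Length;

            int max = Int32.Parse (
                File.ReadAllText ("/proc/sys/fs/inotify/max_user_watches").Trim ());

            Console.WriteLine ("{0} directories to watch, limit is {1}", dirs, max);
            if (dirs > max)
                Console.WriteLine ("inotify alone cannot watch all of them");
        }
    }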

So, in summary, we try to be nice to your computer (and to you? maybe, if you
are nice ;-) ... just kidding), but there are limitations that we are
constantly trying to improve on. Any of you can look in our bugzilla, our
mailing list archive or our wiki, or hang out in our IRC channel, to see for
yourself how we try to treat every problem with the utmost importance. OK, I
lied: as much as our time permits. There are lots of features in beagle and
some of them rarely get regular testing, mostly because none of us use those
features. I won't be surprised if there are major problems with these. But I
assure you that if you bring any problem to our notice, it will be taken care
of, if not completely resolved.

Lastly, I read in one of the forum posts that beagle-0.3 will land in Factory
sometime soon. If any of you want to verify the facts above (the good ones or
the bad ones ;-), give that a spin. And a friendly suggestion: if you only
want to search for files by name or extension, you can do much, much better
with find/locate.

Your friendly beagle developer,
- dBera

[1] http://lists4.suse.de/opensuse-factory/2008-01/msg00157.html
[2] http://lists.opensuse.org/opensuse/2007-12/msg01796.html
[3] http://lists.opensuse.org/opensuse/2008-01/msg01083.html (could not find
the parent of this thread)

4 comments:

Anonymous said...

beagle in opensuse 10.3 is a disaster.

Anonymous said...

"We would be extremely happy if beagle only used C# for all its operations." -- Yes, please, this would also allow me to actually install Beagle, because I'm just not willing to fight the current dependency hell.

Anonymous said...

Couldn't you use Java libs for document conversion? This way the dependencies would be Mono plus Java and not Mono plus several native libs. All could be in one big package, easy to install. Also, I imagine Java might be easier to integrate into Mono apps because it's also bytecode (though a different one).

mdi said...

I, for one, thank you very much for your ongoing efforts and hacking on Beagle.

Am a laptop user, and am not being bitten by NFS, so this is not really a problem for me.

I like that you have articulated some of the issues, hopefully someone will be interested in addressing those.

What I have done is that I have listed my ~/downloads directory as do-not-index, to prevent indexing of the random junk that I get from the net.

If I actually end up using the software/docs, I move it to ~/docs or ~/software, so this is not a problem for me with torrents for example.

Email does kick in; it's not a major issue, and probably adding some "smarts" to it might help.