d-Tech-t: beagle

Showing posts with label beagle. Show all posts

Sunday, April 20, 2008

Takes Two to Release

The GMail backend I blogged about before is now available for mass abuse in Beagle 0.3.6(.1). We also tried to maintain our love of cutting edge technology by upgrading the Firefox extension to work with Firefox 3.0.

I noticed several forum posts where users wanted to use beagle like locate/find-grep. The desire was two pronged - no intention to run a daemon continuously and return files from everywhere doing basic searches in name and path. That is not how beagle is supposed to be used but users are the boss in a community project. So I added blocate, a wrapper to beagle-static-query. Currently it only matches the -d DBpath parameter of locate but works like a charm. Sample uses
$ blocate sondesh
$ blocate -d manpages locate

The other thing I added was a locate backend. I absolutely do not recommend using this one. Yet if you insist ... when enabled and used with the FileSystem backend, it will return search results from the locate program. Yes, results from eVeRyWhErE, as you wished.

You can use both the GMail and the locate backends in beagle-search as well. But both the new backends are rather primitive, so I have taken enough precautions againsts n00bs accidentally using them. So in summary, 0.3.6 is not going to do you any good. Oops... did I just say that ?!

The title is based on the empiricial count of the number of actual releases (including brown bag ones) needed for last few releases.

Saturday, March 08, 2008

Beyond Search: arrhh...dee...efff

If the news reports and blogs are to be believed, this is the age of Semantic Something. First people wanted to search web, then file contents, and then search emails and other user data. Everybody was talking about desktop search; along came Beagle, Spotlight, Google Desktop Search, Kat, MetaTracker, Pinot, Strigi etc. While desktop search at its core is nothing but a crawler which reads different file formats and stores them in a searchable database, searching is the most trivial and IMO, boring application built on Beagle's infrastructure.

These days the focus seems to have shifted to Semantic Desktop and Semantic Web. Most blog comments and mailing list posts about Semantic-Fu have a hint of it being vapourware. Its not totally their fault either; the ideas are around for a long time and people are working on it for many many years. But there is no glittering gold in sight. Only recently some interesting Semantic Web ideas have started taking shape. Semantic Desktop is a slightly different game but it should not be far behind. After taking about 40 developer years, Beagle is just about ready to take desktop search beyond simple file content search. Historians might want to take note of the dashboard project and how beagle came into being as a necessary requirement for that truly beyond-desktop-search application.

The core idea behind Semantic Desktop, upto my understanding, revolves around the buzzword jack-of-most-trades RDF. And for the impatient kind, here is a rude shock - RDF is not useful for human beings. Even further, it is not even meant for you, me and us; storing every conceivable data in the RDF format is not going to make our life any easier right away.

RDF or Resource Description Framework is a generic way to describe anything, to be accurate any description of anything. It is a fairly elaborate yet structured format; very easy for programs to analyze that information but extremely redundant to human eyes. Notwithstanding what the AI experts are claiming about the future of AI, human mind can work without immediate deductive reasoning and in fact does that a lot of time. It recognizes familiar words without reading the alphabets one at a time, it deduces the color by merely glancing at it, it conjures up strange connections; its a wonder that will be hard to completely characterize by any set of rules. At least at the current stage, algorithms have to be told the facts and the relations between them for them to do any kind of processing with its data. These are the things that we just know when we see something and is thus the reason why storing the description of something in an RDF format is not going to gain me anything immediately. On the other hand, this is also the reason why applications should be fed data in an RDF format to allow it unhindered access to the semantics of the data.

If that felt hand wavy, try to think about the difference between the semantics of a data and its syntax. An array could be used to represent a linked list, a queue, a stack, a tree or an heap - the latter are the different semantics of the representations, the array is one of the many syntactic representations of one of the latter concepts. A bunch of pairs could be stored in a database table; the table is a syntactic representation of the data which has the semantics of a bunch of name, phone-number pairs. It is hard to work with the semantics of an idea, in a sense it is something up in the air; on the other hand storing some data in a suitable working form could fail to capture some concept about the data. Also, once stored in a particular form it is easy to miss the bigger picture; thus limiting the scope of what we could do with that data.

Saying all that, for the time being think of the RDF format as a bunch of objects and facts where each object is related to some number of facts. The semantics of related could differ based on the context, and RDF is powerful enough to describe even that semantics and a whole bunch of other facts about the facts. With beagle pulling data from nooks and corners of a user's desktop and providing a service which allows applications to search this data, it is a shame if we cannot exploit the relationships in this data for a better mankind... err... dolphins... err... us.

Consider all the emails I have. Now I know that there some emails that are part of discussion threads. Beagle does not. With the beauty of N3 (a close cousin of Semantic-Fu and RDF), I can write this extra information as a set of rule (the single '.' represents end of one rule). I am using emails msgid to track emails in a thread. I could not help but notice the similarity of these rules with prolog or other logic programming languages.
/* an email with subject 'foobar' is in its own thread */ { ?email :title 'foobar' . ?email :msgid ?msg . } => { ?msg :inthread ?msg } . /* if any email refers to some email in thread, then this email is also in other email's thread */ { ?ref :inthread ?parent . ?email1 :reference ?ref . ?email1 :msgid ?msg .} => {?msg :inthread ?parent} .
Using the RDFAdapter of the beagle-rdf branch, I can use this to get all the emails in the thread with foobar in its subject. Note that I am able to write my set of rules only when I see this data as actual emails and not a bunch of lucene documents with fields. The latter carry no meaning. Further note that, I can also use the BeagleClient API to perform field specific queries to obtain the same results. The difference is that the process of using BeagleClient will require me to think about the relationships from scratch and then figure out the right sequence of queries. Instead I could store all the relationship among the emails in the email-index in an RDF format (and also related information not stored in the index e.g. saying a list of email addresses are all mine and should be treated as for one person). Then, whenever I want to extract some information, I can write the question (again in an RDF format) and let the RDF-Magic figure out the how to execute this question against that data given this set of inference rules. Isn't it cool ?

If I missed it earlier, this kind of data-mining operations are not for my everyday use (here my refers to usual computer users) and is not for everybody. Still it is can sometimes come in handy. Imagine the possibilities if you can write the relationships between a file in an mp3 playlist (playlist filter), its download link and how your arrived at that page (webhistory indexing), the email you sent with that file as an attachment in a zip file (email attachment and archive filter), its ratings and usage statistics in Amarok (amarok querydriver) and of course the actual file on the harddisk (user home directory indexing).

Warning: The RDF Adapter in beagle uses the sophisticated SemWeb library which allows anyone to perform graph operations (selecting subgraphs, walking on graphs, pruning nodes and edges etc.) on the RDF graph of the data. Unlike most RDF stores for desktop data, beagle is not optimized for RDF operations and could take quite a bit of time and heat up the CPU. It took me about 4 seconds to find all threads with the word beagle among 500 emails (my actual email index has about 20K emails! I refuse to imagine what will happen if I run it on the full index). If you are interested, checkout the rdf branch and take a look at the test SemWebClient.cs.

Saturday, February 02, 2008

Grab the 3rd of the third

I announced the release of Beagle 0.3.3 today. It is one of those releases with lots of features here and there which just makes me nervous. I found a problem (which incidentally was uncovered when we started using Sqlite prepared statements couple of weeks ago) today during. I quickly identified the problem and fixed it but I am hoping everything else works out OK with this one.

Apart from using Sqlite prepared statements, which should show some speed improvements during running beagle-build-index, there are few other goodies as well. Beagle-search includes a menu option to view the current index information. I would have liked it better if it kept refreshing but this is better than nothing.

Searching documentation is now enabled in beagle-search. It used to be disabled by default in the early days because apparently it returned a lot of results and messed be Best. The situation is better now but not by much; so you have to pass --search-docs to beagle-search to ask it to search the documentation index. That aside, system wide manpage index is now enabled by default and that includes lzma compressed manpages (so Mandriva users like yours truly will be extremely delighted). Beagle search happily searches manpages and it is a real pleasure to use it instead of man -k.

Another real pleasure is being able to create Qt GUI using C#. Its feels very good and I ended up creating beagle-setting-qt, a Qt GUI for beagle-settings. I never added the bindings for the new beagle-config to libbeagle. So I had to amend my fault by giving KDE users some GUI for beagle-settings.

You also get a fake implementation of searching in a directory, one of the popular requests. You can either search inside a directory by giving its full path or by giving a word in the directories name. One catch is that the search is not recursively under the directory but only in its contents.

Sadly almost all the currently known problems with beagle are outside our control. Fortunately, most of the problems in these dependent libraries or suites are fixed and will be released soon. It is surprising how bugs in these external programs, generally corner cases when used alone, are triggered when using beagle. A long running application for desktop users has to cover a lot of bad ground to be even slightly respectable.

If this release goes well, then we might try to fix all the horribly hacky property names (based on our own ontology) and come out with a 0.4.0. I am also hoping to merge the RDF branch to trunk before that. I should really blog about the RDF branch sometime; the experiment to overlay an RDF store on beagle's data is nearing a sure success.

Friday, January 25, 2008

Open letter to OpenSUSE users

(long post warning)

Dear OpenSUSE users,

Recently I came across several threads in various OpenSUSE mailing lists
[1], [2], [3]. I was both amused and felt sorry while reading the posts. No
really some of you write funny emails. That aside, people, especially those
using FOSS don't make up things like this. I am sure the problems that you
faced exist (or existed in whatever version you were using).

I joined the project later but I still feel responsible for the sleepless
nights some of you have had due to beagle, trying to imagine what you would
see beagle has done to your computer when you wake up. I would have felt the
same if I were in your position, in fact I sometimes feel the same for one of
the browsers that I use.

There were lots of suggestions and speculations. There were suggestions of
filing bugs with us. While I do appreciate if some you can file bug reports,
I sympathise with those who dont want to open yet another account to file
bugs or email the mailing list. I belong to the latter group, so instead of
replying to the thread, let me take a minute here explaining how we try to be
friendly to your computer hard-disk space, memory and CPU.

* We nice the process and (try to) lower the iopriority.

* Extracting text from the binary files, without rewriting the app which
deals with files of that type, is an expensive operation. So, we index them a
few at a time with sufficiently long wait in between. The wait period is
longer if the system load average is high. But if you are playing games or
doing other CPU intensive operation, you will not miss the CPU spikes. Normal
uses should not be hampered though.

* During crawling (for files, emails or browser cache) we try not to disturb
the existing vm buffer cache.

* We believe once the initial indexing is over there should not be noticable
effect from beagle, so we crawl a bit faster when the screensaver is on. But
we provide options for you to turn it off.

* We use a separate process to actually do the dirty job of reading the
files and extracting data. As a failsafe measure, if the memory usage of that
helper process increase too much we kill it and start a new helper process. I
would like to claim that for the last several versions I did not see/hear the
helper process being killed due to memory problems.

* For certain files we need to store indexable data in a temporary file. We
make sure we delete it as soon as the indexing is over. There were problems
in some old versions where the files would not be deleted (they definitely
wont be deleted if you kill-9 the process) but I have not heard about this
problem in recent times.

* To show you snippets of your searched words, we store the text data of
some types of files (not the text or c,c++ kind of files whose text data is
basically the file itself but the files in a binary format). We try to be
smart here to not create thousands of small files on the disk (I have about
20K mails generating at least 10K snippet files). In addition to it, we
provide ways for you to turn off the snippets completely.

We do care for your experience and certainly for my own experience while
indexing my data. So where do we go wrong:

* Once in a while the indexer encounters a file for which it ends up in an
infinite loop. Most of the times it is generally a malformed file but
sometimes it is also our fault.

* C# has lots of advantages and one of them is that the developer does not
have to worry about freeing memory after it is used. Depending on someone
else (in this case the garbage collector which frees the memory for us) has
its pros and cons. But one thing for sure (assuming mono is not making any
mistake in freeing), there is not going to be any memory leak of the kind we
are afraid of in C or C++. Neither are we afraid of segmentation faults due
to memory corruption. If you are wondering how do some of you see beagle's
memory growing, let me remind you that "to err is human". With sophisticated
tools to prevent simple errors, comes sophisticated errors. A simple example
could be like storing in a list all files beagle finds during crawling, but
forgetting to remove them once the data is written to the index. No, we never
did that but sometimes we make similar mistakes.

* We would be extremely happy if beagle only used C# for all its operations.
Unfortunately, we have to depend on a lot of C libraries for indexing certain
files. Sometimes memory leaks (the C type) and segmentation faults happen in
them. These are harder to spot since mono does not know about the memory
allocated in the C libraries.

* Beagle re-indexes a file as soon as possible once it is saved. It is in
general not possible to know whether it was the user pressing ctrl-s in
KOffice or a torrent app saving the file after downloading one chunk of data.
As a result, beagle performs horribly, yes horribly, if it encounters a file
that is being downloaded by a p2p/torrent app. You are bound to see almost
continuous indexing as beagle strives to index the updated file for you in
real time. Same goes for any large, active mbox file in the home directory
_not_ used by Thunderbird, Evolution or KMail (for mbox files of these apps,
the corresponding backend is smart to index only the changed data).

* NFS shares have their own share of problems with file locking, speed of
data access etc. We have tried to deal with them in the past by copying the
beagle data directory to a local partition, performing indexing and then
copying back the data directory to the users home directory. It is a feature
not continuously tested and I am sure you can think about lots of cases where
this would fail.

* The first attempt to write a Thunderbird backend was a disaster. Well, it
was good learning experience for us but it will cause headache to most users.
We disabled it in the later 0.2 versions. There is a new one in the 0.3
series which reportedly works better.

* There was one design decision which backfired on us. Imagine you dont have
inotify and have a large home directory. To present you changes in real time,
one option is to crawl the directories regularly (kind of what the WinXP
indexer does). You can imagine the rest. Though inotify is present in the
kernel these days, the number of default inotify watches (the number of
directories that can be watched) is pretty low for users with a non-trivial
sized home directory. In the recent versions, we disable the regular
recrawling.

* Besides continuous CPU usage and hard disk activity (for days and weeks
after the initial indexing is over) the above had another effect on the log
file. Add to it the pretty verbose exceptions beagle logs. We want to know
about the errors so we still print verbose exceptions but we dont reprint the
same errors anymore. (I have been told that some of the OpenSUSE packages
have the loglevel reduced to error-only which will automatically generate
smaller log files).

* This is a good excuse to end the list with. C# and beagle architecture
allows us to add lots of goodies. After all, we (read I) work on beagle
solely because I love to play with it. The more the features, more the lines
of code and more errors. The only good part is once spotted, they are easy to
fix. Check our mailing list and wiki for the available freebies.

So in summary, we try to be nice to your computer (and to you ? maybe, if you
are nice ;-) ... just kidding) but there are limitations that we are
constantly trying to improve on. Any of you can look in our bugzilla, our
mailing list archive, our wiki or hang out in our IRC channel to see for
yourself how we try to issue any problem with utmost importance. Ok, I lied,
as much as our time permits. There are lots of features in beagle and some of
them rarely get regular testing, mostly because none of us use those
features. I wont be surprised if there are major problem with these. I assure
you that if you bring any problem to our notice, they will be taken care of,
if not completely resolved.

Lastly, I read in one of the forum posts that beagle-0.3 will land in factory
sometime soon. If any of you wants to verify the facts above (the good ones
or the bad ones ;-), give that a spin. And a friendly suggestion, if you only
want to search for files with certain name or extension, you can do much much
better with find/locate.

Your friendly beagle developer,
- dBera

[1] http://lists4.suse.de/opensuse-factory/2008-01/msg00157.html
[2] http://lists.opensuse.org/opensuse/2007-12/msg01796.html
[3] http://lists.opensuse.org/opensuse/2008-01/msg01083.html (could not find
the parent of this thread)

Wednesday, December 12, 2007

Many reasons to like, what's yours ?

Beagle 0.3.0 was released beginning of December. It is nearly 2 years since 0.2.0, more than 10 months since the last feature release and it has been about 2 weeks since then. In the mean time we identified some problems upgrade problems with 0.3.0 and released 0.3.1 and Mono released 1.2.6.

In contrast to 0.1.0 and 0.2.0, beagle-0.3.0 did not have any single major-impact change. But there were lots of small changes, all over the summer months and the months following them. It was getting increasingly difficult to handle all the small changes without going through the "Release early" trick and at some point we paused development, did a test release and then finally released what we have as a major release. I am personally expecting a fair share of bugs and regressions.

What are these small changes anyway ? I will leave out the invisible ones, some of which I have blogged about before, and only explain the ones that will directly impact your desktop usage.

There are 3 new backends: the Thunderbird backend (newly written, much better than the earlier one), the Opera history backend and the Nautilus metadata backend. There is also the TeX filter, one of our most demanded ones and new audio filter based on Taglib-sharp. There are new Firefox and Epiphany extensions which do a lot more than indexing browsing history and bookmarks.

The UI got some love; specially a bunch of useful options were added to beagle-settings like the backend selecion list. For obvious reasons, users should disable the backends they are never going to use.

One of the side effects of the beagle textcache previously was the creation of thousands of small cache files on the disk. People reported that the external fragmentation was wasting a lot of space. The textcache module was redesigned to minimize the fragmentation; I am sure you will appreciate the recovered space. We also compacted the external attributes; besides other benefits that will save some more space.

Two major enhancements were made to the query syntax, which is already quite rich. Date queries are now possible; date queries do not make complete sense without date range query, so that too is possible. And a new "filetype:" keyword was added e.g. to search for images use "filetype:image", to search among documents use "filetype:document" etc.

The major complains against beagle are constantly high CPU load, high memory usage and improper termination (or not exiting at all). The first two are well known and oft discussed. The third problem is not directly brought up, but have been found to be the reason upon close investigation. I gained valuable experience trying to find my way through the web of signals, threads and events in beagle code; a number of key issues were spotted and fixed. Oh, and the first two issues were also dealth with, as much as we could diagnose, but that is nothing new. It will sound funny, but a few of the high CPU and memory problems are direct results of some of our decisions that backfired. Some of them were fixed and the others being worked on.

2 experimental features were also added. One is a web interface to search beagle from Firefox (gecko based browsers really). You can also create standard bookmarks for common search terms. The neat thing about this web interface unlike the earlier webservices based one is that there is no heavy weight server running on beagle's side. This one communicates with beagled using BeagleClient XML based API and builds the entire GUI on the client side; a pure Web 2.0 AJAX/XSLT/CSS webapp (ok, these are some cheap buzzwords).

The other fancy feature is searching other beagle daemons over the network. Using Avahi you can even publish your beagle daemon or discover other beagle daemons in the network. We haven't quite figured out how to handle security, authenication and some other issues. So the feature is disabled by default and marked as experimental but I believe it can be used in some innovative way.

We received request from some distributions about global config files; useful for both distributions and sysadmins. Some useful global configuration settings would be to exclude certain directory from indexing for all users, adding or removing file ignore patterns from the default list, disabling of KDE backends by default in pure Gnome distributions. Some of the options were moved from the code to the config files, so that they can be set globally and overriden by individual users.

These are only some of the major ones.

Lastly, the reason I got excited about mono-1.2.6 is because it has some fixes and improvements that will be directly visible when using beagle.

Monday, February 19, 2007

Faceless Bugs and Advanced Users

These are really two very different topics but they came to my mind while reading about Linus' Gnome patches and bugs.

The first is about creating a new account when I need to report a bug or submit a patch for some software. Most of the projects prefer attaching to bugzilla or they send it to their member only mailing list. I am extremely reluctant to create new accounts, so I have created bugzilla and mailing list accounts for KDE and Gnome. That covers a lot of ground. But still now and then I face a need to send something to somewhere else and bam! Sign up for an account sir! There is definitely merit in this approach, since otherwise bugzilla and mailing lists would be flooded with spam. But it definitely keeps me from submitting patches or commenting on something due my lack of interest in new accounts. Last week, a friend of mine (the inventor of Sperner's Game) was trying to install Kubuntu in his brand new Lenovo T60 when he spotted some typos in the installation windows. He was ready and willing to file a bug in Kubuntu and was told to create a new account for kubuntu bugzilla. As always, he was supposed to get a confirmation email.

The email came 12 hours later and I do not know if the bug was ever filed! Even if the email was prompt, the desire to report a bug has to be high enough to cross these technical potential barriers. *sigh*

This week I made extensive addition to beagle query syntax. There is an open bug in bugzilla asking for a visual way to add these advanced query expressions in beagle-search. I was thinking how best to achieve that; it is not easy to capture the power of beagle query expressions in a gui. I found the answer while reading some posts in desktop-architect mailing list about Linus' patch. There is nothing like an expert user or a novice user. Users always try to act as if they are smart and take the path of the expert user. Presenting different set of options for these different class of users does not work in practice.

In a similar style, there is no need for a GUI for advanced query expressions. Novice users i.e. users who will simply enter search terms will never know what a full boolean query expression does (with those OR and excluded expressions). On the other hand, expert users who know how to deal with the boolean expressions, the different keywords to do property search and other advanced syntax can anyway write it by hand. In fact, it is much easier for them to write it by hand than to do it visually. In this matter, I like the approach taken by
Google. I think I will push towards a simpler advanced search UI for beagle-search and Kerry, some simple choices like choosing type of file, extension, date range etc. Write the query by hand if you need that extra ounce.

Sunday, February 11, 2007

beagle memory usage

Setup: Fresh run of beagled with only the kmail backend. IndexInfo report about 13700 items i.e. mails and indexed attachments. Beagle version is post 0.2.16, so that includes the individual items in the archive attachments as well. I started beagled as exercise_the_dog, indexing finished within an hour and this is the state after indexing is over.


VIRT      RES   SHR     COMMAND
--------+------+------+--------------------
167m    55m  11m     mozilla-firefox
248m    29m  2856    X
137m    20m  15m     amarokapp
72812   19m  6884    beagled-helper
89560   18m  13m     kmail
35816   15m  11m     konsole
42780   15m  14m     konqueror
49088   12m  5860    beagled
40320   11m  9524    kdesktop
42004   11m  9.9m    basket
32620   10m  9588    kmix
43896   9252 6220    kicker
37904   5884 3908    kded
34560   5600 2680    net_applet

Remember the rule: an approximate idea of the memory usage is given by RES-SHR.

Thursday, February 08, 2007

beagle:Eat less, talk less be smart

Yesterday, beagle 0.2.16 was released. A couple of weeks back, we released 0.2.15 but I did not write about it. 0.2.15 came with a lot of performance and memory improvements, new backends, new features, lots of important changes . In the process, it also broke a few things. Those were fixed and 0.2.16 is a purely bugfix release for 0.2.15. I am considering 0.2.16 the best ever beagle release. Incidentally, 0.2.13+ releases somehow or the other had some nasty problems.

Combining 0.2.15 and 0.2.16, these are the major improvements:

* Very important, the looping bug is fixed. I would even like to claim, fixed forever. I happened to find some important clue while scanning the logs and other information provided by some of our very friendly and helpful users. Eventually our 3 year old database schema was found to be incorrect. Joe finally cleared the mess. Thanks Brian and Rick! This also means an end to the "log file filling hard disk" or "beagle indexing even after a week" type problems.

* Beagle uses some external tools to filter files e.g. pdfinfo, pdftotext, mplayer yada yada. These programs are well written and almost always work. Except some very malformed or wrongly detected mimetype file is sent to them and they go berseck taking up insane amout of memory or CPU time. Since the early release, we used to maintain that there is no way we can control the external processes. After all, we just use 'em. Joe finally put an end to that excuse by using some smart rlimit tricks to limit the resources used by these external processes. We still cannot control how mplayer might behave if given an word doc file, but if it behaves badly it will be killed before too long.

* Indexing data is a strenuous job. Think about all those heavy applications which process or generate these files. But people want indexing to be as silent as possible. There are frequently recommendations that beagle should use high nice, low system priority. low IO priority etc means to be as unobtrusive as possible. The fact is, beagle already does that. However, now we even go one more step by using SCHED_BATCH scheduler policy.

There are other side improvements too, RTF filter is new. The current one is based on the legendary RTF parser by Paul Dubois. Image filters are almost new; we now have Konversation (KDE IRC client) and KOrganizer (KDE tasks and eve nts scheduler) backends. By the way, soon after 0.2.16 was released, Opera webhistory backend was added to trunk. You can just drop the binary from here into your 0.2.16 /usr/lib/beagle/Backends folder and start using it, err... trying it. I do not know how complete it is.

I would like to end by thanking the excellent user base that beagle has developed. Without them, it would not be possible to fix a whole lot of these problems. Beagle would not be what it is today without them.

Thursday, January 11, 2007

Seekable LineReader

Recently I need a way in StreamReader to get the position of a line which can be stored and later skip to that line directly. C# StreamReader API does not have any way of doing this, except calling Read() repeatedly and then doing the processing yourself. Which is clumsy. Note that, StreamReader.BaseStream.Position might be wrong due to underlying bufferring. I thought this should be a fairly common requirement and indeed, many people have same question on google groups or other forums. No good answer though. One reason is, such a thing does not really make sense for arbitrary Streams, since it might not be possible to seek in them.

I needed such a thing desparately, so I created an interface:

namespace System.IO {

 // A linereader interface
 public interface LineReader {

   // Returns a position marker, which can be used to navigate the lines.
   // Some implementations may only allow moving in the forward direction.
   // Might be different from line number or file offset.
   // Should only be used for traversal.
   long Position {
     get;
     set;
   }

   // Reads and returns the next line, null if EOF
   string ReadLine ();

   // Reads the next line and returns a stringbuilder containing the line
   // The StringBuilder returned could be the same one used while reading,
   // so it should not be modified and its content might change when readline
   // is next called.
   // This is the most worst horriblest API I ever designed, for sake of speed
   // And thats why this should not be a public API.
   StringBuilder ReadLineAsStringBuilder ();

   // Skips the next line, return true if successful
   bool SkipLine ();

   // Skips required number of lines; returns actual number of lines skipped
   long SkipLines (long n);

   // Close the reader
   void Close ();
 }
}

Beagle source contains the interface and several implementations.

Sunday, December 31, 2006

Fasten your seatbelts; we are ready to ship

There was a discussion going on in beagle mailing list sometime ago where I made a comment that I dont think beagle is newbie ready i.e. plug-and-play yet. Beagle fans did not understand my comment and people replied why they think I am wrong. I would be glad to be proved wrong; but all of their arguments were how they were using beagle since version x.y, how beagle was shipped and enabled by default since last z releases of some distribution and how someone is able to install beagle in a large enterprise. Duh! None of these prove that beagle is newbie ready. All they do is show that beagle works and even I know that. 0/100.

I have the feeling that some of the beagle devs and followers live in the garden of Eden surrounded by a high wall of reality. Sometimes they should go out in the streets, check the bugzilla of other distros, go through user blogs (which mostly contain complaints about how beagle does not work and how to disable it), and visit some user forums where a lot of questions are how to disable beagle from starting at startup. These are laborious jobs and not pleasing. A lot of them contain flames and invalid reasons. But almost always they are started by someone who found beagle causing trouble.

Here are some links which can make your task easy:

I sometimes make the rounds and all I see are I make a point of uninstalling beagle on all my machines and The first thing I did after ... was to uninstall beagle and now my machine is happy. Silly men, how can they not like the doggy!

Saturday, December 30, 2006

Subversion arrives. Finally!

KDE uses SVN for their source code management. GNOME used to use CVS till yesterday, which means beagle too was managed with CVS. I do not know the technical details. but time a again we did face technical problems with CVS. I was mostly told that life would be easier with SVN and lo behold! GNOME has switched to Subversion as its SCM (actually, still switching).

The last time this was tried by the awesome GNOME guys, they later found a glitch and had to cancel the migration. As a result I lost a commit that I made within hours of SVN migration. This time I will play safe and watch it for a few days before committing anything. If everything works out, life should be easier. Joe already cleaned up quite a bit of the unused files and directories, renamed the Evo-mail backend correctly and updated the links et al. A New Year with a clean, new repo. Sweet.

PS: There is one downside though. Joe (and others too) would like to use the SVN commit messages for creating Changelog files during creating a tarball. Which basically means others cannot observe the Changelog file between releases to figure out what was changed (neither I nor Joe updated the Changelog while committing, so this is a lame excuse). The real trouble is now I cannot write any lame jokes in my commit messages. Life would be serious now. Boo hoo.

Sunday, December 17, 2006

My time with the doggie

Today I read about the Ohloh project (http://ohloh.net) and added my favourite project beagle (http://beagle-project.org) to it. I was curious what it actually does.

It took them nearly 4 hours to download and analyze the source code. But it was worth the wait. It showed some interesting statistics, like 122,885 LOC codebase, 82 direct contributors (committing in CVS) and 13 of them in last 12 months.

Just for a light comparison, Firefox has a codebase of 157,207 LOC, Amarok has 169,288 LOC and (take this) PHP 6.0 has 599,805 LOC.

It was also amusing to see my share in the project: http://ohloh.net/projects/3826/contributors/21154

Thursday, December 14, 2006

beagle 0.2.14

Joe announced the release of beagle 0.1.4 few hours ago. Its a fascinating and shining new release containing exciting new features, lots of memory/speed optimizations and many bug-fixes as well. Here are the major ones which are readily visible:

Indexes tar, gzipped-tar, bzipped-tar, gzipped and bzipped files, in the filesystem as well as in email attachments. The results show you the exact file in the archive that matched the query.
Do some smart tokenizing to allow matching 001234 to a query of 1234 and better matching of file names. No more missing files.
Beagle can find and extract data itself using its dozen or more backends. But sometimes its better for other applications to send data to beagle for indexing. Beagle had the infrastructure to act as a search/indexing service provider. The release contains an example C code to show how to do that; its pretty simple actually. Obviously python can also be used.
Some cool signal mechanism which help to figure out what file in being currently indexed and for how long. This will be helpful if you feel beagle is taking ages to index some file.
Use Xdg autostart mechanism to auto-start beagle. KDE4 will also implement xdg autostart mechanism. One more step towards being DE agonistic.
The indexing information now explicitly mentions if the initial indexing is in progress. Also clients now have the option of being notified when the initial indexing ends.
Lots of memory fixes. bhale just mentioned in the irc channel holy crap, startup RSS for beagled is 15m... beagled is below nautilus in mem usage... im not believing my eyes :) Thank you for your myth on how beagle is a bloatware.
API and beage-search support to know the total number of documents that matched any query. Not the superficial imposed limit of 100 documents.

Go, get it!

d-Tech-t