Sunday, April 20, 2008
Takes Two to Release
I noticed several forum posts where users wanted to use beagle like locate/find-grep. The desire was two-pronged: no daemon running continuously, and files returned from everywhere based on basic searches in name and path. That is not how beagle is supposed to be used, but users are the boss in a community project. So I added blocate, a wrapper around beagle-static-query. Currently it only matches the -d DBpath parameter of locate, but it works like a charm. Sample uses:
$ blocate sondesh
$ blocate -d manpages locate
The other thing I added was a locate backend. I absolutely do not recommend using this one. Yet if you insist ... when enabled and used with the FileSystem backend, it will return search results from the locate program. Yes, results from eVeRyWhErE, as you wished.
You can use both the GMail and the locate backends in beagle-search as well. But both of the new backends are rather primitive, so I have taken enough precautions against n00bs accidentally using them. So in summary, 0.3.6 is not going to do you any good. Oops... did I just say that?!
The title is based on the empirical count of the number of actual releases (including brown bag ones) needed for the last few releases.
Saturday, March 08, 2008
Beyond Search: arrhh...dee...efff
These days the focus seems to have shifted to the Semantic Desktop and the Semantic Web. Most blog comments and mailing list posts about Semantic-Fu hint at it being vapourware. It's not entirely their fault either; the ideas have been around for a long time and people have been working on them for many, many years, but there is no glittering gold in sight. Only recently have some interesting Semantic Web ideas started taking shape. The Semantic Desktop is a slightly different game, but it should not be far behind. After taking about 40 developer-years, Beagle is just about ready to take desktop search beyond simple file content search. Historians might want to take note of the dashboard project and how beagle came into being as a necessary requirement for that truly beyond-desktop-search application.
The core idea behind the Semantic Desktop, as I understand it, revolves around the buzzword jack-of-most-trades RDF. And for the impatient kind, here is a rude shock: RDF is not useful for human beings. Even further, it is not even meant for you, me and us; storing every conceivable piece of data in RDF is not going to make our lives any easier right away.
RDF, or Resource Description Framework, is a generic way to describe anything, or to be accurate, any description of anything. It is a fairly elaborate yet structured format; very easy for programs to analyze, but extremely redundant to human eyes. Notwithstanding what the AI experts claim about the future of AI, the human mind can work without explicit deductive reasoning, and in fact does so a lot of the time. It recognizes familiar words without reading the letters one at a time, it names a color at a mere glance, it conjures up strange connections; it is a wonder that will be hard to completely characterize by any set of rules. Algorithms, at least at the current stage, have to be told the facts and the relations between them before they can do any kind of processing. These are things we just know when we see something, and that is why storing the description of something in RDF is not going to gain me anything immediately. On the other hand, this is also why applications should be fed data in RDF: it gives them unhindered access to the semantics of the data.
If that felt hand-wavy, think about the difference between the semantics of data and its syntax. An array can be used to represent a linked list, a queue, a stack, a tree or a heap; those are the different semantics, while the array is just one of many syntactic representations of any of those concepts.
Having said all that, for the time being think of RDF as a bunch of objects and facts, where each object is related to some number of facts, e.g. (email1, hasSubject, "foobar"). The semantics of "related" can differ based on the context, and RDF is powerful enough to describe even that semantics and a whole bunch of other facts about the facts. With beagle pulling data from the nooks and corners of a user's desktop and providing a service which allows applications to search this data, it would be a shame if we could not exploit the relationships in this data for a better mankind... err... dolphins... err... us.
Consider all the emails I have. I know that some of them are part of discussion threads. Beagle does not. With the beauty of N3 (a close cousin of Semantic-Fu and RDF), I can write down this extra information as a set of rules (a single '.' marks the end of each rule). I am using each email's msgid to track the emails in a thread.
# An email with subject "foobar" is in its own thread.
{ ?email :title "foobar" . ?email :msgid ?msg . } => { ?msg :inthread ?msg } .
# If an email refers to some email in a thread, then it is in that thread too.
{ ?ref :inthread ?parent . ?email1 :reference ?ref . ?email1 :msgid ?msg . } => { ?msg :inthread ?parent } .
Using the RDFAdapter of the beagle-rdf branch, I can use this to get all the emails in the thread with foobar in its subject. Note that I am able to write my set of rules only because I see this data as actual emails and not as a bunch of lucene documents with fields; the latter carry no meaning. Also note that I could use the BeagleClient API to perform field-specific queries and obtain the same results. The difference is that with BeagleClient I would have to think through the relationships from scratch and then figure out the right sequence of queries (one such step is sketched below). Instead, I can store all the relationships among the emails in the email-index in RDF (along with related information not stored in the index, e.g. that a list of email addresses are all mine and should be treated as one person). Then, whenever I want to extract some information, I write the question (again in RDF) and let the RDF-Magic figure out how to execute that question against the data, given this set of inference rules. Isn't it cool?
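For the curious, one step of that hand-rolled approach might look like the following minimal sketch against the BeagleClient API. The property key fixme:msgid and the message-id value are assumptions on my part, so treat them as illustrative only.

// A sketch of one step of the manual approach: ask beagled for hits
// whose msgid property matches a known value.
using System;
using Beagle;

class ThreadStep {
    static void Main() {
        Query query = new Query();

        QueryPart_Property part = new QueryPart_Property();
        part.Type = PropertyType.Keyword;
        part.Key = "fixme:msgid";            // assumed property name
        part.Value = "<12345@example.org>";  // hypothetical message-id
        query.AddPart(part);

        query.HitsAddedEvent += delegate (HitsAddedResponse response) {
            foreach (Hit hit in response.Hits)
                Console.WriteLine(hit.Uri);
        };

        // Fire the query and wait; the next query in the sequence would be
        // built from the hits collected above.
        query.SendAsyncBlocking();
    }
}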
In case I did not say it earlier: this kind of data-mining is not for my everyday use (where "my" refers to the usual computer user) and is not for everybody. Still, it can come in handy sometimes. Imagine the possibilities if you could write down the relationships between a file in an mp3 playlist (playlist filter), its download link and how you arrived at that page (webhistory indexing), the email you sent with that file as an attachment in a zip file (email attachment and archive filter), its ratings and usage statistics in Amarok (amarok querydriver) and, of course, the actual file on the harddisk (user home directory indexing).
Warning: The RDF Adapter in beagle uses the sophisticated SemWeb library, which allows anyone to perform graph operations (selecting subgraphs, walking the graph, pruning nodes and edges etc.) on the RDF graph of the data. Unlike most RDF stores for desktop data, beagle is not optimized for RDF operations and could take quite a bit of time and heat up the CPU. It took me about 4 seconds to find all threads with the word beagle among 500 emails (my actual email index has about 20K emails! I refuse to imagine what will happen if I run it on the full index). If you are interested, check out the rdf branch and take a look at the test SemWebClient.cs.
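If you want a taste before checking out the branch, a minimal SemWeb sketch of loading an N3 dump and walking the graph looks roughly like this (the file name emails.n3 and the example.org predicate are made up; this is not the actual SemWebClient.cs):

// A minimal SemWeb sketch: load an N3 dump and walk part of the graph.
// "emails.n3" and the example.org predicate are hypothetical.
using System;
using SemWeb;

class GraphWalk {
    static void Main() {
        MemoryStore store = new MemoryStore();
        store.Import(new N3Reader("emails.n3"));

        Entity inthread = new Entity("http://example.org/schema#inthread");

        // Select all (?msg :inthread ?parent) statements; null is a wildcard.
        foreach (Statement s in store.Select(new Statement(null, inthread, null)))
            Console.WriteLine("{0} is in thread {1}", s.Subject, s.Object);
    }
}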
Sunday, February 11, 2007
beagle memory usage
VIRT RES SHR COMMAND
--------+------+------+--------------------
167m 55m 11m mozilla-firefox
248m 29m 2856 X
137m 20m 15m amarokapp
72812 19m 6884 beagled-helper
89560 18m 13m kmail
35816 15m 11m konsole
42780 15m 14m konqueror
49088 12m 5860 beagled
40320 11m 9524 kdesktop
42004 11m 9.9m basket
32620 10m 9588 kmix
43896 9252 6220 kicker
37904 5884 3908 kded
34560 5600 2680 net_applet
Remember the rule: an approximate idea of the memory usage is given by RES-SHR.
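By that rule, beagled above accounts for roughly 12m RES - 5860k SHR, i.e. about 6m of its own memory, and beagled-helper for roughly 19m - 6884k, about 12m.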
Thursday, February 08, 2007
beagle: Eat less, talk less, be smart
Yesterday, beagle 0.2.16 was released. A couple of weeks back we released 0.2.15, but I did not write about it. 0.2.15 came with a lot of performance and memory improvements, new backends, new features and lots of important changes. In the process, it also broke a few things. Those were fixed, and 0.2.16 is a purely bugfix release on top of 0.2.15. I consider 0.2.16 the best beagle release ever. Incidentally, the 0.2.13+ releases all somehow or other had some nasty problems.
Combining 0.2.15 and 0.2.16, these are the major improvements:
* Very important: the looping bug is fixed. I would even like to claim, fixed forever. I happened to find some important clues while scanning the logs and other information provided by some of our very friendly and helpful users. Eventually our 3-year-old database schema was found to be incorrect, and Joe finally cleared up the mess. Thanks Brian and Rick! This also means an end to the "log file filling the hard disk" and "beagle still indexing after a week" type problems.
* Beagle uses some external tools to filter files, e.g. pdfinfo, pdftotext, mplayer, yada yada. These programs are well written and almost always work. Except when some very malformed file, or one with a wrongly detected mimetype, is sent to them and they go berserk, taking up insane amounts of memory or CPU time. Since the early releases we used to maintain that there was no way we could control the external processes; after all, we just use 'em. Joe finally put an end to that excuse by using some smart rlimit tricks to limit the resources used by these external processes (see the first sketch after this list). We still cannot control how mplayer might behave if given a Word doc file, but if it behaves badly it will be killed before too long.
* Indexing data is a strenuous job; think about all those heavy applications which process or generate these files. But people want indexing to be as silent as possible. There are frequent recommendations that beagle should use a high nice value, low system priority, low IO priority and other means of being as unobtrusive as possible. The fact is, beagle already does all that. But now we go one step further by using the SCHED_BATCH scheduling policy (see the second sketch after this list).
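For the curious, the rlimit trick amounts to something like the following sketch (not beagle's actual code) using Mono.Posix: fork, clamp the child's CPU time and address space, then exec the external filter so the kernel kills any runaway helper. The 90-second and 100 MB limits are numbers I picked for illustration.

// A sketch of the rlimit idea: limit the child before exec'ing the
// external filter, so a berserk helper dies on its own.
using System;
using Mono.Unix.Native;

class LimitedHelper {
    static void Main(string[] args) {
        int pid = Syscall.fork();
        if (pid == 0) { // child: apply limits, then exec e.g. "pdftotext"
            Rlimit cpu = new Rlimit();
            cpu.rlim_cur = cpu.rlim_max = 90;                // CPU seconds
            Syscall.setrlimit(Resource.RLIMIT_CPU, ref cpu);

            Rlimit mem = new Rlimit();
            mem.rlim_cur = mem.rlim_max = 100 * 1024 * 1024; // address space, bytes
            Syscall.setrlimit(Resource.RLIMIT_AS, ref mem);

            Syscall.execvp(args[0], args);
            Environment.Exit(1); // reached only if exec failed
        }
        int status;
        Syscall.wait(out status); // reap the helper
    }
}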
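And the scheduling bit, again only a sketch: as far as I know Mono.Posix does not wrap sched_setscheduler, so one has to declare the libc call by hand.

// A sketch of opting into SCHED_BATCH on Linux via a libc P/Invoke.
using System;
using System.Runtime.InteropServices;

class BatchScheduler {
    const int SCHED_BATCH = 3; // from <sched.h> on Linux

    [StructLayout(LayoutKind.Sequential)]
    struct SchedParam {
        public int sched_priority; // must be 0 for SCHED_BATCH
    }

    [DllImport("libc", SetLastError = true)]
    static extern int sched_setscheduler(int pid, int policy, ref SchedParam param);

    static void Main() {
        SchedParam p = new SchedParam();
        p.sched_priority = 0;
        if (sched_setscheduler(0, SCHED_BATCH, ref p) != 0) // pid 0 == self
            Console.Error.WriteLine("sched_setscheduler failed");
    }
}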
There are other side improvements too. The RTF filter is new; the current one is based on the legendary RTF parser by Paul DuBois. The image filters are almost new, and we now have Konversation (KDE IRC client) and KOrganizer (KDE tasks and events scheduler) backends. By the way, soon after 0.2.16 was released, an Opera webhistory backend was added to trunk. You can just drop the binary from here into your 0.2.16 /usr/lib/beagle/Backends folder and start using it, err... trying it. I do not know how complete it is.
I would like to end by thanking the excellent user base that beagle has developed. Without them, it would not have been possible to fix a whole lot of these problems, and beagle would not be what it is today.