Sunday, April 20, 2008
Takes Two to Release
I noticed several forum posts where users wanted to use beagle like locate/find-grep. The desire was two-pronged: no daemon running continuously, and files returned from everywhere based on basic searches in name and path. That is not how beagle is supposed to be used, but users are the boss in a community project. So I added blocate, a wrapper around beagle-static-query. Currently it only matches the -d DBpath parameter of locate, but it works like a charm. Sample uses:
$ blocate sondesh
$ blocate -d manpages locate
The other thing I added was a locate backend. I absolutely do not recommend using this one. Yet if you insist ... when enabled and used with the FileSystem backend, it will return search results from the locate program. Yes, results from eVeRyWhErE, as you wished.
You can use both the GMail and the locate backends in beagle-search as well. But both of the new backends are rather primitive, so I have taken enough precautions against n00bs accidentally using them. So in summary, 0.3.6 is not going to do you any good. Oops... did I just say that?!
The title is based on the empirical count of the number of actual releases (including brown bag ones) needed for the last few releases.
Saturday, March 08, 2008
Beyond Search: arrhh...dee...efff
These days the focus seems to have shifted to the Semantic Desktop and the Semantic Web. Most blog comments and mailing list posts about Semantic-Fu hint at it being vapourware. It's not entirely their fault either; the ideas have been around for a long time and people have been working on them for many, many years, but there is no glittering gold in sight. Only recently have some interesting Semantic Web ideas started taking shape. The Semantic Desktop is a slightly different game, but it should not be far behind. After taking about 40 developer-years, Beagle is just about ready to take desktop search beyond simple file content search. Historians might want to take note of the dashboard project and how beagle came into being as a necessary requirement for that truly beyond-desktop-search application.
The core idea behind the Semantic Desktop, as I understand it, revolves around the buzzword jack-of-most-trades RDF. And for the impatient kind, here is a rude shock: RDF is not useful for human beings. Even further, it is not even meant for you, me and us; storing every conceivable piece of data in RDF is not going to make our lives any easier right away.
RDF, or Resource Description Framework, is a generic way to describe anything, or to be accurate, any description of anything. It is a fairly elaborate yet structured format; very easy for programs to analyze, but extremely redundant to human eyes. Notwithstanding what the AI experts claim about the future of AI, the human mind can work without explicit deductive reasoning, and in fact does so a lot of the time. It recognizes familiar words without reading the letters one at a time, it names a color at a mere glance, it conjures up strange connections; it is a wonder that will be hard to completely characterize by any set of rules. Algorithms, at least at the current stage, have to be told the facts and the relations between them before they can do any kind of processing. These are things we just know when we see something, and that is why storing the description of something in RDF is not going to gain me anything immediately. On the other hand, this is also why applications should be fed data in RDF: it gives them unhindered access to the semantics of the data.
If that felt hand-wavy, think about the difference between the semantics of data and its syntax. An array can be used to represent a linked list, a queue, a stack, a tree or a heap; those are the different semantics, while the array is just one of many syntactic representations of any of those concepts.
Having said all that, for the time being think of RDF as a bunch of objects and facts, where each object is related to some number of facts, e.g. (email1, hasSubject, "foobar"). The semantics of "related" can differ based on the context, and RDF is powerful enough to describe even that semantics and a whole bunch of other facts about the facts. With beagle pulling data from the nooks and corners of a user's desktop and providing a service which allows applications to search this data, it would be a shame if we could not exploit the relationships in this data for a better mankind... err... dolphins... err... us.
Consider all the emails I have. I know that some of them are part of discussion threads. Beagle does not. With the beauty of N3 (a close cousin of Semantic-Fu and RDF), I can write down this extra information as a set of rules (a single '.' marks the end of each rule). I am using each email's msgid to track the emails in a thread.
# An email with subject "foobar" is in its own thread.
{ ?email :title "foobar" . ?email :msgid ?msg . } => { ?msg :inthread ?msg } .
# If an email refers to some email in a thread, then it is in that thread too.
{ ?ref :inthread ?parent . ?email1 :reference ?ref . ?email1 :msgid ?msg . } => { ?msg :inthread ?parent } .
Using the RDFAdapter of the beagle-rdf branch, I can use this to get all the emails in the thread with foobar in its subject. Note that I am able to write my set of rules only because I see this data as actual emails and not as a bunch of lucene documents with fields; the latter carry no meaning. Also note that I could use the BeagleClient API to perform field-specific queries and obtain the same results. The difference is that with BeagleClient I would have to think through the relationships from scratch and then figure out the right sequence of queries (one such step is sketched below). Instead, I can store all the relationships among the emails in the email-index in RDF (along with related information not stored in the index, e.g. that a list of email addresses are all mine and should be treated as one person). Then, whenever I want to extract some information, I write the question (again in RDF) and let the RDF-Magic figure out how to execute that question against the data, given this set of inference rules. Isn't it cool?
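For the curious, one step of that hand-rolled approach might look like the following minimal sketch against the BeagleClient API. The property key fixme:msgid and the message-id value are assumptions on my part, so treat them as illustrative only.

// A sketch of one step of the manual approach: ask beagled for hits
// whose msgid property matches a known value.
using System;
using Beagle;

class ThreadStep {
    static void Main() {
        Query query = new Query();

        QueryPart_Property part = new QueryPart_Property();
        part.Type = PropertyType.Keyword;
        part.Key = "fixme:msgid";            // assumed property name
        part.Value = "<12345@example.org>";  // hypothetical message-id
        query.AddPart(part);

        query.HitsAddedEvent += delegate (HitsAddedResponse response) {
            foreach (Hit hit in response.Hits)
                Console.WriteLine(hit.Uri);
        };

        // Fire the query and wait; the next query in the sequence would be
        // built from the hits collected above.
        query.SendAsyncBlocking();
    }
}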
In case I did not say it earlier: this kind of data-mining is not for my everyday use (where "my" refers to the usual computer user) and is not for everybody. Still, it can come in handy sometimes. Imagine the possibilities if you could write down the relationships between a file in an mp3 playlist (playlist filter), its download link and how you arrived at that page (webhistory indexing), the email you sent with that file as an attachment in a zip file (email attachment and archive filter), its ratings and usage statistics in Amarok (amarok querydriver) and, of course, the actual file on the harddisk (user home directory indexing).
Warning: The RDF Adapter in beagle uses the sophisticated SemWeb library, which allows anyone to perform graph operations (selecting subgraphs, walking the graph, pruning nodes and edges etc.) on the RDF graph of the data. Unlike most RDF stores for desktop data, beagle is not optimized for RDF operations and could take quite a bit of time and heat up the CPU. It took me about 4 seconds to find all threads with the word beagle among 500 emails (my actual email index has about 20K emails! I refuse to imagine what will happen if I run it on the full index). If you are interested, check out the rdf branch and take a look at the test SemWebClient.cs.
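If you want a taste before checking out the branch, a minimal SemWeb sketch of loading an N3 dump and walking the graph looks roughly like this (the file name emails.n3 and the example.org predicate are made up; this is not the actual SemWebClient.cs):

// A minimal SemWeb sketch: load an N3 dump and walk part of the graph.
// "emails.n3" and the example.org predicate are hypothetical.
using System;
using SemWeb;

class GraphWalk {
    static void Main() {
        MemoryStore store = new MemoryStore();
        store.Import(new N3Reader("emails.n3"));

        Entity inthread = new Entity("http://example.org/schema#inthread");

        // Select all (?msg :inthread ?parent) statements; null is a wildcard.
        foreach (Statement s in store.Select(new Statement(null, inthread, null)))
            Console.WriteLine("{0} is in thread {1}", s.Subject, s.Object);
    }
}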
Sunday, February 11, 2007
beagle memory usage
VIRT RES SHR COMMAND
--------+------+------+--------------------
167m 55m 11m mozilla-firefox
248m 29m 2856 X
137m 20m 15m amarokapp
72812 19m 6884 beagled-helper
89560 18m 13m kmail
35816 15m 11m konsole
42780 15m 14m konqueror
49088 12m 5860 beagled
40320 11m 9524 kdesktop
42004 11m 9.9m basket
32620 10m 9588 kmix
43896 9252 6220 kicker
37904 5884 3908 kded
34560 5600 2680 net_applet
Remember the rule: an approximate idea of the memory usage is given by RES-SHR.
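By that rule, beagled above accounts for roughly 12m RES - 5860k SHR, i.e. about 6m of its own memory, and beagled-helper for roughly 19m - 6884k, about 12m.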
Thursday, February 08, 2007
beagle: Eat less, talk less, be smart
Yesterday, beagle 0.2.16 was released. A couple of weeks back we released 0.2.15, but I did not write about it. 0.2.15 came with a lot of performance and memory improvements, new backends, new features and lots of important changes. In the process, it also broke a few things. Those were fixed, and 0.2.16 is a purely bugfix release on top of 0.2.15. I consider 0.2.16 the best beagle release ever. Incidentally, the 0.2.13+ releases all somehow or other had some nasty problems.
Combining 0.2.15 and 0.2.16, these are the major improvements:
* Very important: the looping bug is fixed. I would even like to claim, fixed forever. I happened to find some important clues while scanning the logs and other information provided by some of our very friendly and helpful users. Eventually our 3-year-old database schema was found to be incorrect, and Joe finally cleared up the mess. Thanks Brian and Rick! This also means an end to the "log file filling the hard disk" and "beagle still indexing after a week" type problems.
* Beagle uses some external tools to filter files, e.g. pdfinfo, pdftotext, mplayer, yada yada. These programs are well written and almost always work. Except when some very malformed file, or one with a wrongly detected mimetype, is sent to them and they go berserk, taking up insane amounts of memory or CPU time. Since the early releases we used to maintain that there was no way we could control the external processes; after all, we just use 'em. Joe finally put an end to that excuse by using some smart rlimit tricks to limit the resources used by these external processes (see the first sketch after this list). We still cannot control how mplayer might behave if given a Word doc file, but if it behaves badly it will be killed before too long.
* Indexing data is a strenuous job; think about all those heavy applications which process or generate these files. But people want indexing to be as silent as possible. There are frequent recommendations that beagle should use a high nice value, low system priority, low IO priority and other means of being as unobtrusive as possible. The fact is, beagle already does all that. But now we go one step further by using the SCHED_BATCH scheduling policy (see the second sketch after this list).
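For the curious, the rlimit trick amounts to something like the following sketch (not beagle's actual code) using Mono.Posix: fork, clamp the child's CPU time and address space, then exec the external filter so the kernel kills any runaway helper. The 90-second and 100 MB limits are numbers I picked for illustration.

// A sketch of the rlimit idea: limit the child before exec'ing the
// external filter, so a berserk helper dies on its own.
using System;
using Mono.Unix.Native;

class LimitedHelper {
    static void Main(string[] args) {
        int pid = Syscall.fork();
        if (pid == 0) { // child: apply limits, then exec e.g. "pdftotext"
            Rlimit cpu = new Rlimit();
            cpu.rlim_cur = cpu.rlim_max = 90;                // CPU seconds
            Syscall.setrlimit(Resource.RLIMIT_CPU, ref cpu);

            Rlimit mem = new Rlimit();
            mem.rlim_cur = mem.rlim_max = 100 * 1024 * 1024; // address space, bytes
            Syscall.setrlimit(Resource.RLIMIT_AS, ref mem);

            Syscall.execvp(args[0], args);
            Environment.Exit(1); // reached only if exec failed
        }
        int status;
        Syscall.wait(out status); // reap the helper
    }
}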
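And the scheduling bit, again only a sketch: as far as I know Mono.Posix does not wrap sched_setscheduler, so one has to declare the libc call by hand.

// A sketch of opting into SCHED_BATCH on Linux via a libc P/Invoke.
using System;
using System.Runtime.InteropServices;

class BatchScheduler {
    const int SCHED_BATCH = 3; // from <sched.h> on Linux

    [StructLayout(LayoutKind.Sequential)]
    struct SchedParam {
        public int sched_priority; // must be 0 for SCHED_BATCH
    }

    [DllImport("libc", SetLastError = true)]
    static extern int sched_setscheduler(int pid, int policy, ref SchedParam param);

    static void Main() {
        SchedParam p = new SchedParam();
        p.sched_priority = 0;
        if (sched_setscheduler(0, SCHED_BATCH, ref p) != 0) // pid 0 == self
            Console.Error.WriteLine("sched_setscheduler failed");
    }
}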
There are other side improvements too. The RTF filter is new; the current one is based on the legendary RTF parser by Paul DuBois. The image filters are almost new, and we now have Konversation (KDE IRC client) and KOrganizer (KDE tasks and events scheduler) backends. By the way, soon after 0.2.16 was released, an Opera webhistory backend was added to trunk. You can just drop the binary from here into your 0.2.16 /usr/lib/beagle/Backends folder and start using it, err... trying it. I do not know how complete it is.
I would like to end by thanking the excellent user base that beagle has developed. Without them, it would not have been possible to fix a whole lot of these problems, and beagle would not be what it is today.