Friday, March 28, 2008

Better late than never

Beagle 0.3.4 was released into the wild last week. We tried to fix the build problems (new Gnome-sharp, Mono 1.9, missing files in the last release), but building Beagle with Ndesk-DBus 0.5.2 remained broken, as per tradition.

Other than that, this version builds nicely with Mono 1.9 and contains #ifdef-ed code that uses the Mono.Unix.UnixSignal API when built against Mono 1.9. That should ensure that beagled and index-helper actually quit when asked to. Yes, this is the 21st century.

There are a lot of mappings in Beagle, and many of them are hardcoded. We are gradually moving them into user-configurable files. The config files were moved out earlier; in 0.3.4 we have moved out the query mappings (e.g., mapping "ext:html" to the right internal property name). If you want to add a mapping for some property you repeatedly query, just add it to the local ~/.beagle/query-mapping.xml.
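
A minimal way to get started, assuming your distribution installs Beagle's data files under /usr/share/beagle (that path is an assumption; adjust it to your install prefix): copy the shipped file and add entries following the pattern already in it, instead of writing the XML from scratch.

$ mkdir -p ~/.beagle
$ cp /usr/share/beagle/query-mapping.xml ~/.beagle/query-mapping.xml
$ $EDITOR ~/.beagle/query-mapping.xml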

I added another handy way to use Beagle as updatedb/locate. Use beagle-build-index or beagled to create the indexes (like updatedb), and then use beagle-static-query to query those indexes (like locate) with no long-running beagled daemon required.

E.g., I created an index of the system files in /usr/bin, /bin and other global directories (--disable-filtering disables the filtering of file contents, since here we only care about the file names and such):

$ beagle-build-index --recursive --disable-filtering --target ~/.systembeagle /usr/bin/ /usr/local/bin/ /bin/ /etc /usr/local/etc/

Then I can query it just like locate:

$ beagle-static-query --add-static-backend ~/.systembeagle 'net*' --backend none

(--backend none tells beagle not to search any other backends.) I could add ~/.systembeagle to beagle using beagle-config so that I don't have to pass this path every time, or I could even create an alias for it, as sketched below.
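
The alias route is a one-liner, using only the flags shown above (the name syslocate is my own invention):

$ alias syslocate='beagle-static-query --backend none --add-static-backend ~/.systembeagle'
$ syslocate 'net*'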

Why do this when locate/updatedb already does it? Because I can :). OK, I actually use this to search monodocs. I am not a big fan of this mouse-point-click business, and I stick to the terminal with mod and monop2 at my disposal. Those tools are great once you know the fully qualified name of the method or the class; when you don't, use this jack-of-all-trades beagle.

Step-1: Enable the system-wide monodoc index. It's one of the crawl rules shipped with beagle, but it is disabled by default.

Step-2: Let cron build it, or run the cron job yourself. Building the monodoc index takes time, definitely longer than any special-purpose indexer for monodoc files would take, but that's only a one-time cost.

Step-3: Use beagle-static-query, as sketched below. You can also use phrases and the '*' wildcard, and search only in methods or classes or properties (just look at the fields in the returned hits and use a beagle property query).
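
For instance, assuming the crawl rule writes its index under /var/cache/beagle/indexes/monodoc (that location is a guess from my setup; check where your crawl rule actually puts it), the queries look exactly like the locate-style ones above:

$ beagle-static-query --backend none --add-static-backend /var/cache/beagle/indexes/monodoc 'String.Split'
$ beagle-static-query --backend none --add-static-backend /var/cache/beagle/indexes/monodoc 'Get*'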

Saturday, March 08, 2008

Beyond Search: arrhh...dee...efff

If the news reports and blogs are to be believed, this is the age of Semantic Something. First people wanted to search the web, then file contents, and then emails and other user data. Everybody was talking about desktop search; along came Beagle, Spotlight, Google Desktop Search, Kat, MetaTracker, Pinot, Strigi, etc. While desktop search at its core is nothing but a crawler that reads different file formats and stores them in a searchable database, searching is the most trivial and, IMO, most boring application built on Beagle's infrastructure.

These days the focus seems to have shifted to the Semantic Desktop and the Semantic Web. Most blog comments and mailing list posts about Semantic-Fu have a hint of it being vapourware. It's not totally their fault either; the ideas have been around for a long time and people have been working on them for many, many years, but there is no glittering gold in sight. Only recently have some interesting Semantic Web ideas started taking shape. The Semantic Desktop is a slightly different game, but it should not be far behind. After about 40 developer-years, Beagle is just about ready to take desktop search beyond simple file content search. Historians might want to take note of the dashboard project and how beagle came into being as a necessary requirement for that truly beyond-desktop-search application.

The core idea behind the Semantic Desktop, to my understanding, revolves around the jack-of-most-trades buzzword RDF. And for the impatient kind, here is a rude shock: RDF is not useful for human beings. Even further, it is not even meant for you, me and us; storing every conceivable piece of data in the RDF format is not going to make our lives any easier right away.

RDF, or Resource Description Framework, is a generic way to describe anything (to be accurate, any description of anything). It is a fairly elaborate yet structured format: very easy for programs to analyze, but extremely redundant to human eyes. Notwithstanding what the AI experts claim about the future of AI, the human mind can work without immediate deductive reasoning, and in fact does so a lot of the time. It recognizes familiar words without reading the letters one at a time, it deduces a color by merely glancing at it, it conjures up strange connections; it is a wonder that will be hard to completely characterize by any set of rules. At the current stage, at least, algorithms have to be told the facts and the relations between them before they can do any kind of processing with the data. These are the things we just know when we see something, and that is why storing the description of something in an RDF format is not going to gain me anything immediately. On the other hand, this is also why applications should be fed data in an RDF format: it gives them unhindered access to the semantics of the data.

If that felt hand-wavy, think about the difference between the semantics of some data and its syntax. An array could be used to represent a linked list, a queue, a stack, a tree or a heap; those are the different semantics, while the array is one of the many syntactic representations of any one of those concepts. A bunch of pairs could be stored in a database table; the table is a syntactic representation of data whose semantics is a collection of (name, phone-number) pairs. It is hard to work with the semantics of an idea, in a sense something up in the air; on the other hand, storing data in a convenient working form can fail to capture some concept about it. Also, once data is stored in a particular form it is easy to miss the bigger picture, which limits the scope of what we can do with that data.

Having said all that, for the time being think of the RDF format as a bunch of objects and facts, where each object is related to some number of facts. The semantics of 'related' can differ based on the context, and RDF is powerful enough to describe even that semantics, along with a whole bunch of other facts about the facts. With beagle pulling data from the nooks and corners of a user's desktop and providing a service that allows applications to search this data, it would be a shame if we could not exploit the relationships in this data for a better mankind... err... dolphins... err... us.

Consider all the emails I have. I know that some of them are part of discussion threads; Beagle does not. With the beauty of N3 (a close cousin of Semantic-Fu and RDF), I can write this extra information as a set of rules (a single '.' marks the end of each rule). I am using each email's msgid to track the emails in a thread. I could not help noticing the similarity of these rules to Prolog and other logic programming languages.

@prefix : <http://example.org/mail#> .  # assumed namespace, just for illustration

# an email with subject "foobar" is in its own thread
{ ?email :title "foobar" . ?email :msgid ?msg . } => { ?msg :inthread ?msg } .

# if an email references some email in a thread, then it is in that email's thread too
{ ?ref :inthread ?parent . ?email1 :reference ?ref . ?email1 :msgid ?msg . } => { ?msg :inthread ?parent } .
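
To see the rules fire, here are two made-up input facts in the same vocabulary (the namespace and msgids are invented for the example). Feeding them through the rules derives "m1@x" :inthread "m1@x" by the first rule, and then "m2@x" :inthread "m1@x" by the second.

:email1 :title "foobar" ; :msgid "m1@x" .
:email2 :reference "m1@x" ; :msgid "m2@x" .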

Using the RDFAdapter of the beagle-rdf branch, I can use this to get all the emails in the thread with foobar in its subject. Note that I am able to write my set of rules only when I see this data as actual emails and not as a bunch of Lucene documents with fields; the latter carry no meaning. Further note that I could also use the BeagleClient API to perform field-specific queries and obtain the same results. The difference is that with BeagleClient I would have to think through the relationships from scratch and then figure out the right sequence of queries. Instead, I can store all the relationships among the emails in the email-index in an RDF format (along with related information not stored in the index, e.g. that a list of email addresses are all mine and should be treated as belonging to one person). Then, whenever I want to extract some information, I can write the question (again in an RDF format) and let the RDF-magic figure out how to execute that question against the data, given the set of inference rules. Isn't it cool?

In case I did not make it clear earlier, this kind of data-mining operation is not for my everyday use ('my' here refers to the usual computer user) and is not for everybody. Still, it can sometimes come in handy. Imagine the possibilities if you could write down the relationships between a file in an mp3 playlist (playlist filter), its download link and how you arrived at that page (webhistory indexing), the email you sent with that file as an attachment in a zip file (email attachment and archive filter), its ratings and usage statistics in Amarok (amarok querydriver) and, of course, the actual file on the hard disk (user home directory indexing).

Warning: The RDF Adapter in beagle uses the sophisticated SemWeb library, which allows anyone to perform graph operations (selecting subgraphs, walking the graph, pruning nodes and edges, etc.) on the RDF graph of the data. Unlike most RDF stores for desktop data, beagle is not optimized for RDF operations, so these can take quite a bit of time and heat up the CPU. It took me about 4 seconds to find all threads with the word beagle among 500 emails (my actual email index has about 20K emails! I refuse to imagine what would happen if I ran it on the full index). If you are interested, check out the rdf branch and take a look at the test SemWebClient.cs.