Wednesday, December 01, 2010

Internet to TV

This is the age of parallel processing.

Multiple Firefox profiles running simultaneously (--no-remote / MOZ_NO_REMOTE=1).
Multiple tabs opened in yakuake.
Multiple CPUs (one for dealing with flash/firefox, other for life).
Now added: multiple sound outputs from the computer.

I have one pair of speakers which I use in the dining area. Helps me get past barely edibles.

My work area has now moved close to the TV. This TV has HDMI (and is DLNA-compatible), so it made perfect sense when the geeky fairy whispered something about using the TV speakers for listening to Magic 106.7FM, Jango and Youtubish songs off of my not-so-great-sounding laptop. minidlna doesn't quite stream non-file content, so I was left pleading with ALSA to route specially chosen audio to TV@dBera via HDMI.

$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: Intel [HDA Intel], device 0: AD198x Analog [AD198x Analog]
  Subdevices: 0/1
  Subdevice #0: subdevice #0
card 0: Intel [HDA Intel], device 3: INTEL HDMI [INTEL HDMI]
  Subdevices: 0/1
  Subdevice #0: subdevice #0
$ cat > ~/.asoundrc.tv
pcm.!default {
    type plug
    slave {
        pcm {
            type hw
            card 0
            device 3
        }
    }
}
$ ALSA_CONFIG_PATH=~/.asoundrc.tv firefox -P music -no-remote

Hitting the right button on the TV remote sets its Energy Saving to Picture Off. Voila! I just got myself an awesome pair of speakers.

Edit: If you get no sound, make sure the IEC958 interface is unmuted (you can check/set/unset it using alsamixer). IEC958 is the digital-output interface on a lot of soundcards.
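
For the terminal-inclined, a minimal sketch of the same fix (the control name and card number vary per soundcard):

$ amixer -c 0 scontrols | grep -i iec958   # list the digital output controls on card 0
$ amixer -c 0 sset 'IEC958' unmute         # unmute, using the exact name printed above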

Friday, September 03, 2010

forever very-vapor

WHAT! Duke Nukem Forever is not vapourware? But I thought that was part of the definition of vap...

I am feeling weak in the knees, being compelled to write about hearsay read on blogs and Twitter feeds. It's just that hard to keep cool.

Let's Nuke'em Duke!

Sunday, June 07, 2009

this is his 4th computer

A thin man with a pointed beard loved honeysuckle. He had some honeysuckle trees in an orchard across a huge river. He built a boat and used it to cross the river. The fruits were good. He ate some and sold the rest.
...
10 years later, the boat showed signs of age. He built a new boat. Life was good again.

1 millennium, 6 centuries and 5 decades later, a very smart man told us how to create a very smart machine. One machine which could do everything. Of course, the machine needed sophisticated instructions, but there were a lot of people who could figure out the right instructions for any operation.

6 decades later. A company that sells computers made a wonderful laptop. A fat man who earns a lot of money bought one. The laptop performed top-notch, looked solid and meant business. He was happy.
...
2 years later, the computer slowed down. He bought a new one. The new one felt as fast as the old one had when he first bought it. He considered the purchase a terribly insightful investment.

200 years later. A group of bespectacled visionaries will figure out every detail of how our body works. A lot of companies will appear that promise a live-long and live-happy life. "Don't like your middle ear? Here, we will replace it with a machine. 1-day money-back guarantee". You can hear a lot in 1 day, so everyone will be satisfied. A young rich business couple will modify themselves to become what they want. They will get married. They will make it to the news as the first perfect couple.
...
7 days later, the man will get bored of his wife. He will find her pose unexciting and her actions banal. He will find a new woman for a huge sum of money.
...
After uranium, monitors and PCBs, the world will face the challenge of a new type of landfill: people.

Wednesday, February 04, 2009

Better than my best

Wow! "Toshiba handheld hits 1GHz"
http://news.cnet.com/8301-13924_3-10155730-64.html

The news isn't that tomorrow's phones will beat my current best computer. Technology, like leaking water, always finds a way to go down ... err ... advance. Whether I like it or not.

What surprises me is the abusive intent of the folks behind the technology. Like it or not, designing is more art than science. The kind of rationality and precision that goes into using technology to create something more than crapware is nothing short of the artistic choices a painter makes when rendering the perfect sunset over an island he has never visited.

The minds that control the coding hands do not seem to realise this ... no matter what better hardware I come across, I also come across a newer version of the same software which makes it crawl. Even the upcoming dual-core 1.5GHz is not going to impress me much, not any more. Perfection is the key.

Wednesday, July 23, 2008

< Insert Your Favourite > Desktop Search Hackfest

Good news - a (read: the first ever) Desktop Search Hackfest is being
planned after the Maemo Summit, in Berlin.
http://wiki.maemo.org/Desktop_Search_Hackfest

Not so good news - I will not be able to attend. It's not easy to
sneak out as a grad student from a US university (bonus if you are an
_alien_). Joe is not going either. Our in-house Xesam guru is about to
join a real job, the money-paying kind. It might be hard for him to
attend as well. I dearly wish Beagle could somehow benefit from this
meeting. Good luck to the other projects. Get some work done and make
the users happy. It is nice to see that Strigi/Nepomuk devs are
attending. And thanks to Nokia for making this happen.

On a side note, look at the number of participants from the Tracker
project! Wow, they're booming. Double kudos to them.

Wednesday, July 16, 2008

Creating new backend - help wanted

Three years ago this month, the above was the subject of my first post to the
dashboard-hackers mailing list. I was writing a backend for Akregator. It's
immaterial what Akregator is; I was mostly pattern matching back then, and
that is how I got familiar with C#.

I remember submitting a patch to the beagle bugzilla a few days before that. It
was to filter JPEG JFIF comments, because that is where digikam stored all the
descriptions. Now it uses IPTC tags. It does not really matter, since beagle
indexes both.

Backtracking further, I got interested in beagle several months before that. I
had this incredible urge to add /author/, /title/, /subject/ tags to the PDF
research papers that I download, and to use beagle to search among them. The
first time I visited the beagle website and downloaded the tarball, the huge
list of dependencies really, really scared me off. I used to allow more Gnome
on my desktop back then than I do now, but nevertheless building beagle seemed
like a daunting task. I later freed beagle from many of its dependencies; I
feel that is my biggest contribution to this project.

I made a second attempt later; I soon realized that even though I could get
beagle going, there was (is?) no easy way to modify PDF document properties
(in Linux). Then I switched my goal to indexing my pictures, which I had
started tagging and adding comments to using my newfound love digikam.
Thankfully the one-liner JFIF patch did not require much C# knowledge. It
would not have mattered anyway; C# is darn easy to pattern match.

After all these months, I still haven't managed to build my repository of
well-tagged PDF papers (if I wait for several more months and stop wasting
time, then the need will be gone forever). I do keep a static index for my
pictures but rarely search them. What a waste of time :-). Silly me!

As I am about to sign off this blog entry, I notice that I was signing off
as "Bera" in my initial few emails. I wonder when I switched to
my "signature sign" dBera.

Did I mention that beagle 0.3.8 was released some 48 hours ago? Go go get it.

Thursday, May 08, 2008

One way to get things fixed in Ubuntu

Become famous and then blog about it. It's easy (the blogging part). And it
works [1a, 1b].

Now if only someone famous blogged about some other literal showstoppers in
beagle and KNetworkManager, some more Kubuntu users would be happy [2].
Beagle is a second-class citizen in Ubuntopia, but I thought KNetworkManager
was important.

One thing I noted though: since Beagle was moved from Main to Universe, it is
getting better treatment. Periodic syncs from Debian are actually happening,
unlike when it was in Main and the core Ubuntu developers rarely found time to
update the Beagle package. Kudos to the Masters of the Universe!

[1a] https://bugs.launchpad.net/ubuntu/+source/galago-sharp/+bug/186049/comments/9
[1b] http://arstechnica.com/reviews/os/hardy-heron-review.ars/3
[2] https://bugs.launchpad.net/ubuntu/+source/beagle/+bug/207157

Monday, April 28, 2008

I knew searching would fail

There is a great report about a usability test on the web: a guy
gave some common computer tasks to his girlfriend on a fresh Hardy
Heron installation. I found it via Slashdot, but since fewer and fewer
people are slashdotting these days, here is a link [1].

The tasks are well chosen and the user is _not_ a first-time
grandma-type user; she approaches each task in the obvious possible
way. "Obvious possible" - for people not used to a Linux distribution,
where "doing stuff my way" rules over "getting stuff done".

As I started reading it, I knew at some point the user was going to
"search for something". And I knew she was going to fail. Which she
did.

The problem with most (all?) of the Linux desktop search applications
is that they are cut out for a particular task and are (hopefully)
pretty good at it. Indexing is the keyword there - how to index all
kinds of data in the best possible way and then allow users to search
the indexed data. And there is a lot of sophistication there.
Unfortunately, the common search tasks of a user are not quite that.

- Search for a file by name - most common
- Search for files of certain types
- Search for files in home directory containing some text - slightly
advanced usage
- Search among browsed websites etc.
- As a computer "user", it is not clear why I would search for
websites in the desktop search tool and not in the browser. Of course,
once I am told this can be done in the desktop search tool _too_, I
would be extremely glad and nod in appreciation.

It takes time to write a desktop indexing and searching system. I
didn't believe it when I first heard of it, and my friends asked me
what is so difficult about it other than implementing inotify. For
some reason it is. So a lot of effort is invested behind that. But
there has been less effort in presenting a failsafe, minimum-capability
search experience. What do I mean by failsafe and minimum capability?

- One obvious way to launch the search tool (there could be more, but
there should be one which may not be the best but works in the worst
case).
- The obvious tool should never fail on basic searching - never,
never, never. By basic searching I mean searching for non-file-content
information - name, size, type (what on earth is a _mimetype_ to a
non-CS/IT person? searching by extension is what I mean; broad
classifications like music and picture help).
- Repeat the above. Let me say it this way - if the user knows a file
exists, she should be able to find it by name. And matches by name
should come _first_. Same for search by type.
- Anything else is a bonus. When we have complete semantic desktops,
where a file is the same as an email and the same as an addressbook
entry, maybe then users would want to search for everything or specify
what exactly they want. Not now.

So where does beagle fall behind (or some of the other tools, judging
by what I read about them and their screenshots)?
- Users want to use them to search for files. The tools return pretty
much everything.
- Give them an option on where to search. There is no need to
include an option for "application/rdf+xul", but list the common
options. A search service has to work for the minimum; a GUI has to
cater to the average. I would be sad if it didn't have ways to cater to
the advanced crowd too, but I don't mind if that requires one extra
step.
- Users want to search everywhere (in the filesystem).
- That's definitely not what beagle was designed for. A beagle search
tool is not expected to do that. But when it is presented as the
_main_ search tool to the users, it will be used to search everywhere.
And it will fail.
- I don't quite know how to design a failsafe GUI search tool, but a
good start would be to use the indexing service for home-directory
files and a brute-force 'find /usr/bin' for the non-home partitions
(see the sketch after this list).
- Some users would never need to search by content.
- If searching content were cheap, it would not be a big deal to have
content searching enabled. But content searching is expensive, from my
experience. It would be better if users were allowed to opt in to
content searching.
- Content searching is not supposed to be expensive. As far as
beagle is concerned, it is halfway to meeting this goal. It still
needs some fault-tolerance features to detect problems before too much
damage has been done.

There are lots of other ways to make searching "just work". The user
does not even need to know there is any indexing in the background.
The sad part is that a lot of what I suggested (or could have suggested)
is already possible with the current beagle infrastructure. What is
lacking is someone with good GUI knowledge to work on improving the
search experience. I am defending the base by fixing the occasional
simple bugs, but a real developer is needed. And needed urgently,
otherwise yet another distribution will be released with a lacklustre
search experience.

[1] http://contentconsumer.wordpress.com/2008/04/27/is-ubuntu-useable-enough-for-my-girlfriend/

Sunday, April 20, 2008

Takes Two to Release

The GMail backend I blogged about before is now available for mass abuse in Beagle 0.3.6(.1). We also tried to maintain our love of cutting-edge technology by upgrading the Firefox extension to work with Firefox 3.0.

I noticed several forum posts where users wanted to use beagle like locate/find-grep. The desire was two-pronged: no daemon running continuously, and files returned from everywhere by basic searches on name and path. That is not how beagle is supposed to be used, but users are the boss in a community project. So I added blocate, a wrapper to beagle-static-query. Currently it only matches the -d DBpath parameter of locate, but it works like a charm. Sample uses:
  $ blocate sondesh
  $ blocate -d manpages locate

The other thing I added was a locate backend. I absolutely do not recommend using this one. Yet if you insist ... when enabled and used with the FileSystem backend, it will return search results from the locate program. Yes, results from eVeRyWhErE, as you wished.

You can use both the GMail and the locate backends in beagle-search as well. But both the new backends are rather primitive, so I have taken enough precautions against n00bs accidentally using them. So in summary, 0.3.6 is not going to do you any good. Oops... did I just say that?!

The title is based on the empirical count of the number of actual releases (including brown-bag ones) needed for the last few releases.

Tuesday, April 08, 2008

You said Google owns your life

And I nodded my head and felt sympathy for you. I also used the I_doubt_its_maintained_anymore and just_barely_works xemail-net to write a live GMail backend for beagle. It does not index the emails as of now, but uses the IMAP search protocol to directly search the emails on the GMail IMAP server. And searching followed by retrieving the headers of the matched emails is really slow; the delay is clearly perceptible. It could be due to xemail, but I could not find a better .Net IMAP library.
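
For the curious, the round trips the backend makes can be reproduced by hand against the Gmail IMAP server (credentials elided; 1234 stands for one of the UIDs the search returns):

$ openssl s_client -quiet -connect imap.gmail.com:993
a1 LOGIN you@gmail.com your-password
a2 SELECT INBOX
a3 UID SEARCH TEXT "beagle"
a4 UID FETCH 1234 (BODY.PEEK[HEADER])

The search alone only returns matching UIDs; every header then has to be fetched separately, which is where the perceptible delay comes from.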

It should not be hard to take the current backend and add the ability to download a batch of emails and index them locally. Google also publishes a nice set of GData .NET APIs for accessing Google documents, calendars and a lot of other services. A backend for them would at least make our beloved maintainer happy.

A basic GMail query-only backend had been on my TODO list for more than a year. And now it's finally done. Proper GMail indexing and Google services are also on my TODO list. So hope to see them sometime... say, by next year.

Friday, March 28, 2008

Better late than never

Beagle 0.3.4 was released into the wild last week. We tried to fix the build problems (new Gnome-sharp, Mono 1.9, missing files from last release) but building Beagle with Ndesk-DBus 0.5.2 remained broken, as per tradition.

Other than that, this version builds nicely with Mono 1.9 and contains #ifdef-ed code to use the Mono.Unix.UnixSignal API when built with Mono 1.9. That should ensure that beagled and index-helper will quit when asked to quit. Yes, this is the 21st century.

There is a lot of mapping in Beagle, and much of it is hardcoded. We are gradually moving it to user-configurable files. The config files were moved out earlier; in 0.3.4 we have moved out the query mappings (mapping "ext:html" to the right internal property name). If you want to add a mapping for some property you repeatedly query, just add it to the local ~/.beagle/query-mapping.xml.

I added another handy way to use Beagle as updatedb/locate. Use beagle-build-index or beagled to create indexes (like updatedb). Then use beagle-static-query to query the indexes (like locate) - no long-running beagled daemon required.

E.g. I created an index for system files in /usr/bin, /bin and other global directories (--disable-filtering disables the filtering of file contents, since here we only care about the file names and such),

$ beagle-build-index --recursive --disable-filtering --target ~/.systembeagle /usr/bin/ /usr/local/bin/ /bin/ /etc /usr/local/etc/

Then I could query just like locate,

$ beagle-static-query --add-static-backend ~/.systembeagle 'net*' --backend none

(--backend none tells beagle not to search any other backends.) I could have added ~/.systembeagle to beagle using beagle-config so that I don't have to add this path every time, or I could have even created an alias for this.
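
For example, a hypothetical alias (the name is my own), which works because the query can be kept as the last argument:

$ alias syslocate='beagle-static-query --backend none --add-static-backend ~/.systembeagle'
$ syslocate 'net*'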

Why do this when locate/updatedb already does it? Because I can :). Ok, I actually use this to search monodocs. I am not a big fan of this mouse, point, click business and I stick to the terminal with mod and monop2 at my disposal. Those tools are great once you know the fully qualified name of the method or the class. Until then, use this jack-of-all-trades beagle.

Step-1: Enable the system-wide monodoc index. It's one of the crawl-rules shipped with beagle but disabled by default.

Step-2: Let cron build it, or run the cron job yourself. Building the monodoc index takes time though - definitely longer than any special indexer for monodoc files. But that's only a one-time cost.

Step-3: Use beagle-static-query. You can also use phrases and the wildcard '*', and search only in methods or classes or properties (just look at the returned fields and use a beagle property query).
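
Putting the three steps together, a query session could look like this (the index path is a guess on my part - use whatever location your distribution's crawl-rule writes to):

$ beagle-static-query --add-static-backend /var/cache/beagle/indexes/monodoc 'String.Split' --backend none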

Saturday, March 08, 2008

Beyond Search: arrhh...dee...efff

If the news reports and blogs are to be believed, this is the age of Semantic Something. First people wanted to search the web, then file contents, and then emails and other user data. Everybody was talking about desktop search; along came Beagle, Spotlight, Google Desktop Search, Kat, MetaTracker, Pinot, Strigi etc. While desktop search at its core is nothing but a crawler which reads different file formats and stores them in a searchable database, searching is the most trivial and, IMO, most boring application built on Beagle's infrastructure.

These days the focus seems to have shifted to the Semantic Desktop and the Semantic Web. Most blog comments and mailing list posts about Semantic-Fu have a hint of it being vapourware. It's not totally their fault either; the ideas have been around for a long time and people have been working on them for many, many years. But there is no glittering gold in sight. Only recently have some interesting Semantic Web ideas started taking shape. The Semantic Desktop is a slightly different game, but it should not be far behind. After taking about 40 developer-years, Beagle is just about ready to take desktop search beyond simple file-content search. Historians might want to take note of the dashboard project and how beagle came into being as a necessary requirement for that truly beyond-desktop-search application.

The core idea behind the Semantic Desktop, up to my understanding, revolves around the buzzword jack-of-most-trades RDF. And for the impatient kind, here is a rude shock - RDF is not useful for human beings. Even further, it is not even meant for you, me and us; storing every conceivable datum in the RDF format is not going to make our life any easier right away.

RDF or Resource Description Framework is a generic way to describe anything - to be accurate, any description of anything. It is a fairly elaborate yet structured format; very easy for programs to analyze, but extremely redundant to human eyes. Notwithstanding what the AI experts are claiming about the future of AI, the human mind can work without immediate deductive reasoning and in fact does so a lot of the time. It recognizes familiar words without reading the letters one at a time, it deduces a color by merely glancing at it, it conjures up strange connections; it is a wonder that will be hard to completely characterize by any set of rules. At the current stage at least, algorithms have to be told the facts and the relations between them to do any kind of processing with the data. These are the things that we just know when we see something, and that is the reason why storing the description of something in an RDF format is not going to gain me anything immediately. On the other hand, this is also the reason why applications should be fed data in an RDF format, to allow them unhindered access to the semantics of the data.

If that felt hand-wavy, try to think about the difference between the semantics of data and its syntax. An array could be used to represent a linked list, a queue, a stack, a tree or a heap - the latter are the different semantics of the representation, and the array is one of the many syntactic representations of one of those concepts. A bunch of pairs could be stored in a database table; the table is a syntactic representation of data whose semantics is a bunch of (name, phone-number) pairs. It is hard to work with the semantics of an idea - in a sense it is something up in the air; on the other hand, storing data in a suitable working form could fail to capture some concept about the data. Also, once it is stored in a particular form it is easy to miss the bigger picture, thus limiting the scope of what we could do with that data.

Saying all that, for the time being think of the RDF format as a bunch of objects and facts where each object is related to some number of facts. The semantics of related could differ based on the context, and RDF is powerful enough to describe even that semantics and a whole bunch of other facts about the facts. With beagle pulling data from the nooks and corners of a user's desktop and providing a service which allows applications to search this data, it would be a shame if we could not exploit the relationships in this data for a better mankind... err... dolphins... err... us.

Consider all the emails I have. I know that some of these emails are part of discussion threads. Beagle does not. With the beauty of N3 (a close cousin of Semantic-Fu and RDF), I can write this extra information as a set of rules (the single '.' represents the end of one rule). I am using the emails' msgid to track emails in a thread.
I could not help but notice the similarity of these rules to Prolog and other logic programming languages.

/* an email with subject 'foobar' is in its own thread */
{ ?email :title 'foobar' . ?email :msgid ?msg . } => { ?msg :inthread ?msg } .
/* if any email refers to some email in thread, then this email is also in other email's thread */
{ ?ref :inthread ?parent . ?email1 :reference ?ref . ?email1 :msgid ?msg .} => {?msg :inthread ?parent} .

Using the RDFAdapter of the beagle-rdf branch, I can use this to get all the emails in the thread with foobar in its subject. Note that I am able to write my set of rules only when I see this data as actual emails and not as a bunch of lucene documents with fields; the latter carry no meaning. Further note that I can also use the BeagleClient API to perform field-specific queries to obtain the same results. The difference is that the process of using BeagleClient will require me to think about the relationships from scratch and then figure out the right sequence of queries. Instead, I could store all the relationships among the emails in the email-index in an RDF format (and also related information not stored in the index, e.g. saying a list of email addresses are all mine and should be treated as one person). Then, whenever I want to extract some information, I can write the question (again in an RDF format) and let the RDF-Magic figure out how to execute this question against that data given this set of inference rules. Isn't it cool?

If I missed it earlier, this kind of data-mining operation is not for my everyday use (here my refers to usual computer users) and is not for everybody. Still, it can sometimes come in handy. Imagine the possibilities if you can write the relationships between a file in an mp3 playlist (playlist filter), its download link and how you arrived at that page (webhistory indexing), the email you sent with that file as an attachment in a zip file (email attachment and archive filter), its ratings and usage statistics in Amarok (amarok querydriver) and of course the actual file on the harddisk (user home directory indexing).

Warning: The RDF Adapter in beagle uses the sophisticated SemWeb library, which allows anyone to perform graph operations (selecting subgraphs, walking on graphs, pruning nodes and edges etc.) on the RDF graph of the data. Unlike most RDF stores for desktop data, beagle is not optimized for RDF operations and could take quite a bit of time and heat up the CPU. It took me about 4 seconds to find all threads with the word beagle among 500 emails (my actual email index has about 20K emails! I refuse to imagine what will happen if I run it on the full index). If you are interested, check out the rdf branch and take a look at the test SemWebClient.cs.

Tuesday, February 12, 2008

beagle searchhandler for khelpcenter

I realized that it is extremely easy to build searchhandlers for KHelpcenter. I have always wanted to use beagle for searching in KHelpcenter, so I quickly made a searchhandler in python. To use it, put these files in the right places, chmod khc_beagle.py to make it executable and restart KHelpcenter. It only searches the manpage index for now and can easily be extended to search docbook. And of course you have to remove /usr/share/apps/khelpcenter/searchhandlers/man.desktop. I would submit it to KDE bugzilla, but I don't know if this works with the KDE4 version of KHelpcenter.

Saturday, February 02, 2008

Grab the 3rd of the third

I announced the release of Beagle 0.3.3 today. It is one of those releases with lots of features here and there, which just makes me nervous. I found a problem today (which incidentally was uncovered when we started using Sqlite prepared statements a couple of weeks ago). I quickly identified the problem and fixed it, but I am hoping everything else works out OK with this one.

Apart from the use of Sqlite prepared statements, which should show some speed improvement when running beagle-build-index, there are a few other goodies as well. Beagle-search includes a menu option to view the current index information. I would have liked it better if it kept refreshing, but this is better than nothing.

Searching documentation is now enabled in beagle-search. It used to be disabled by default in the early days because apparently it returned a lot of results and messed up Best. The situation is better now but not by much, so you have to pass --search-docs to beagle-search to ask it to search the documentation index. That aside, the system-wide manpage index is now enabled by default, and that includes lzma-compressed manpages (so Mandriva users like yours truly will be extremely delighted). Beagle-search happily searches manpages and it is a real pleasure to use it instead of man -k.
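
In other words, the man -k replacement is a one-liner:

$ beagle-search --search-docs fork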

Another real pleasure is being able to create a Qt GUI using C#. It feels very good, and I ended up creating beagle-settings-qt, a Qt GUI for beagle-settings. I never added the bindings for the new beagle-config to libbeagle, so I had to make amends by giving KDE users some GUI for beagle-settings.

You also get a fake implementation of searching in a directory, one of the popular requests. You can search inside a directory either by giving its full path or by giving a word in the directory's name. One catch: the search is not recursive under the directory; it covers only the directory's immediate contents.

Sadly, almost all the currently known problems with beagle are outside our control. Fortunately, most of the problems in these dependent libraries or suites are fixed and will be released soon. It is surprising how bugs in these external programs, generally corner cases when the programs are used alone, are triggered when using beagle. A long-running application for desktop users has to cover a lot of bad ground to be even slightly respectable.

If this release goes well, then we might try to fix all the horribly hacky property names (based on our own ontology) and come out with a 0.4.0. I am also hoping to merge the RDF branch to trunk before that. I should really blog about the RDF branch sometime; the experiment to overlay an RDF store on beagle's data is nearing a sure success.

Tuesday, January 29, 2008

FOSS meets ENG

Techkriti brings back fond memories. Techkriti is the annual technology fest at IIT-Kanpur (y'know, the place where they figured out how to decide if a number is prime in a theoretically fast way).

That was one fabulous event I actively took part in during my undergrad days. A perfect mix of all kinds of technology. I was there in its days of infancy; if I remember correctly, the first prize for the software contest in my first year went to a graphical calculator program in Tcl/Tk. By the time I left it was hugely popular and there were participants from all over India. I was one of the organizers of the Tech Olympiad in my final year. It was equally fun to come up with challenging problems where various concepts tie in together. The participants loved it.

This year Techkriti is even more exciting. I read in the news and on blogs about how FOSS is catching up in the subcontinent. This time they are organizing a FOSS event for Techkriti, probably for the first time: FOSSKriti. I am hoping the event becomes a success, though the schedule page is a bit empty now. Beagle Xesam adapter author Arun and web interface dude Nirbheek are among its organizers, so I am sure I will get first-hand information about all the exciting things that will happen (how about a hackfest for Dashboard? ;-)

I will end with two simple brain-stormers.

The first one is of the mathematical type, my favourite: show that any 5 consecutive numbers always contain some number which is coprime to the other 4.

The second one relates to programming (somewhat), my hobby. You are given two arrays, A1 of size m+n and A2 of size n, where only the first m slots of A1 are filled with sorted integers (increasing order) and all the slots of A2 are filled with sorted integers (increasing order). Your goal is to merge the two arrays into A1, in linear time (no block array copying or other tricks) but (here is the twist) using no extra space (i.e. no placeholder variable to hold temporary values).

Good luck!

Friday, January 25, 2008

Open letter to OpenSUSE users

(long post warning)

Dear OpenSUSE users,

Recently I came across several threads in various OpenSUSE mailing lists
[1], [2], [3]. I was both amused and sorry while reading the posts. No,
really, some of you write funny emails. That aside, people, especially those
using FOSS, don't make up things like this. I am sure the problems that you
faced exist (or existed in whatever version you were using).

I joined the project late, but I still feel responsible for the sleepless
nights some of you have had due to beagle, wondering what you would find
beagle had done to your computer by the time you woke up. I would have felt
the same in your position; in fact I sometimes feel the same about one of
the browsers that I use.

There were lots of suggestions and speculations. There were suggestions of
filing bugs with us. While I do appreciate it when some of you file bug
reports, I sympathise with those who don't want to open yet another account
to file bugs or email the mailing list. I belong to the latter group, so
instead of replying to the threads, let me take a minute here to explain how
we try to be friendly to your computer's hard-disk space, memory and CPU.

* We nice the process and (try to) lower the iopriority (you can verify
this yourself; see the sketch right after this list).

* Extracting text from binary files, without rewriting the app which
deals with files of that type, is an expensive operation. So we index them a
few at a time with a sufficiently long wait in between. The wait period is
longer if the system load average is high. But if you are playing games or
doing other CPU-intensive work, you will not miss the CPU spikes. Normal
use should not be hampered, though.

* During crawling (for files, emails or browser cache) we try not to disturb
the existing vm buffer cache.

* We believe once the initial indexing is over there should be no
noticeable effect from beagle, so we crawl a bit faster when the screensaver
is on. But we provide options for you to turn that off.

* We use a separate process to do the dirty job of actually reading the
files and extracting data. As a failsafe measure, if the memory usage of that
helper process increases too much, we kill it and start a new helper process.
I would like to claim that for the last several versions I have not seen or
heard of the helper process being killed due to memory problems.

* For certain files we need to store indexable data in a temporary file. We
make sure we delete it as soon as the indexing is over. There were problems
in some old versions where the files would not be deleted (they definitely
won't be deleted if you kill -9 the process), but I have not heard about this
problem in recent times.

* To show you snippets of your searched words, we store the text data of
some types of files (not text or C/C++ kinds of files, whose text data is
basically the file itself, but files in a binary format). We try to be
smart here and not create thousands of small files on the disk (I have about
20K mails generating at least 10K snippet files). In addition, we
provide ways for you to turn off snippets completely.
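
(About the first point: here is how you could verify the niceness and I/O
priority on a running system; the exact output format varies.)

$ ps -o pid,ni,comm -C beagled    # the NI column should show a positive nice value
$ ionice -p $(pidof beagled)      # prints beagled's I/O scheduling class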

We do care for your experience, and certainly for my own experience while
indexing my data. So where do we go wrong?

* Once in a while the indexer encounters a file on which it ends up in an
infinite loop. Most of the time it is a malformed file, but
sometimes it is also our fault.

* C# has lots of advantages, and one of them is that the developer does not
have to worry about freeing memory after it is used. Depending on someone
else (in this case the garbage collector, which frees the memory for us) has
its pros and cons. But one thing is for sure (assuming mono is not making any
mistake in freeing): there is not going to be any memory leak of the kind we
are afraid of in C or C++. Neither are we afraid of segmentation faults due
to memory corruption. If you are wondering how, then, some of you see beagle's
memory growing, let me remind you that "to err is human". With sophisticated
tools to prevent simple errors come sophisticated errors. A simple example
would be storing in a list all the files beagle finds during crawling, but
forgetting to remove them once the data is written to the index. No, we never
did that, but sometimes we make similar mistakes.

* We would be extremely happy if beagle used only C# for all its operations.
Unfortunately, we have to depend on a lot of C libraries for indexing certain
files. Sometimes memory leaks (the C type) and segmentation faults happen in
them. These are harder to spot, since mono does not know about the memory
allocated in the C libraries.

* Beagle re-indexes a file as soon as possible once it is saved. It is in
general not possible to know whether it was the user pressing ctrl-s in
KOffice or a torrent app saving the file after downloading one chunk of data.
As a result, beagle performs horribly, yes horribly, if it encounters a file
that is being downloaded by a p2p/torrent app. You are bound to see almost
continuous indexing as beagle strives to index the updated file for you in
real time. The same goes for any large, active mbox file in the home directory
_not_ used by Thunderbird, Evolution or KMail (for the mbox files of these
apps, the corresponding backend is smart enough to index only the changed
data).

* NFS shares have their own share of problems with file locking, speed of
data access etc. We have tried to deal with them in the past by copying the
beagle data directory to a local partition, performing the indexing and then
copying the data directory back to the user's home directory. It is a feature
not continuously tested, and I am sure you can think of lots of cases where
this would fail.

* The first attempt to write a Thunderbird backend was a disaster. Well, it
was a good learning experience for us, but it caused headaches for most
users. We disabled it in the later 0.2 versions. There is a new one in the
0.3 series which reportedly works better.

* There was one design decision which backfired on us. Imagine you don't have
inotify and have a large home directory. To present changes to you in real
time, one option is to crawl the directories regularly (kind of what the
WinXP indexer does). You can imagine the rest. Though inotify is present in
the kernel these days, the default number of inotify watches (the number of
directories that can be watched) is pretty low for users with a non-trivially
sized home directory (see the note after this list). In the recent versions,
we have disabled the regular recrawling.

* Besides continuous CPU usage and hard disk activity (for days and weeks
after the initial indexing is over), the above had another effect, on the log
file. Add to it the pretty verbose exceptions beagle logs. We want to know
about the errors, so we still print verbose exceptions, but we don't reprint
the same errors anymore. (I have been told that some of the OpenSUSE packages
have the loglevel reduced to error-only, which will automatically generate
smaller log files.)

* This is a good excuse to end the list with. C# and the beagle architecture
allow us to add lots of goodies. After all, we (read: I) work on beagle
solely because I love to play with it. The more the features, the more the
lines of code and the more the errors. The only good part is that once
spotted, they are easy to fix. Check our mailing list and wiki for the
available freebies.

So in summary, we try to be nice to your computer (and to you? maybe, if you
are nice ;-) ... just kidding), but there are limitations that we are
constantly trying to improve on. Any of you can look in our bugzilla, our
mailing list archive or our wiki, or hang out in our IRC channel, to see for
yourself how we try to treat any problem with utmost importance. Ok, I lied:
as much as our time permits. There are lots of features in beagle and some of
them rarely get regular testing, mostly because none of us use those
features. I won't be surprised if there are major problems with these. I
assure you that if you bring any problem to our notice, it will be taken care
of, if not completely resolved.

Lastly, I read in one of the forum posts that beagle-0.3 will land in factory
sometime soon. If any of you want to verify the facts above (the good ones
or the bad ones ;-), give that a spin. And a friendly suggestion: if you only
want to search for files with a certain name or extension, you can do much,
much better with find/locate.

Your friendly beagle developer,
- dBera

[1] http://lists4.suse.de/opensuse-factory/2008-01/msg00157.html
[2] http://lists.opensuse.org/opensuse/2007-12/msg01796.html
[3] http://lists.opensuse.org/opensuse/2008-01/msg01083.html (could not find
the parent of this thread)

Friday, December 28, 2007

klik2 klik beagle

Yay! I managed to make a klik2 recipe for beagle. In principle this should enable anyone to just do
$ klik klik2://beagle
and happily run beagle. All the dependency packages will be automatically downloaded and managed in the background. Or to download once and use many times, you can do
$ klik get beagle
$ alias runbeagle='klik run ~/Desktop/beagle.cmg'
$ runbeagle

Not all of the above works right away; klik2 is under development and is looking promising, but it is not completely done yet. If you already have mono and want to run beagle, though, you can sort of do it now. This works for any distribution that klik2 works on, which is basically almost all the major distributions.

  1. Get and install klik2
  2. Download http://cs-people.bu.edu/dbera/blogdata/beagle.xml
  3. $ klik get beagle.xml
    • It will do a lot of stuff and end up failing since there is no single app called "beagle" in the beagle package. It will however create the file ~/Desktop/beagle_0.3.1-2.cmg
  4. Run beagled.
    • klik run beagled ~/Desktop/beagle_0.3.1-2.cmg --fg --backend manpages --backend Files
  5. Browse to http://localhost:4000/ to access beagle using the web interface. You can use it to search, check indexing status and shutdown beagled.
  6. To run the other beagle tools, the pattern is the same
    • $ klik run beagle-command ~/Desktop/beagle_0.3.1-2.cmg beagle-command-params
    • For example, to shutdown from command line
      • $ klik run beagle-shutdown ~/Desktop/beagle_0.3.1-2.cmg
  7. The next time you want to run beagled, you need not run the recipe again; start from step-4 straight away!

Thursday, December 27, 2007

one zero-three-two sailed today

Beagle 0.3.2 was released today. On one hand we are still catching up with the regressions and new bugs that were introduced in the mighty beagle-0.3.0; on the other hand, new features are streaming in. While yelp remains broken with beagle, I was amazed at how easily I can search within the manpages and double-click on the results to open them in yelp. A much better alternative to man -K.

In other news, Lukas has started working on providing spelling suggestions ("Did you mean ...?"). There are some technical limitations which are not fully resolved yet, so it did not make it into 0.3.2. It is currently housed in a branch and I hope to release it into the wild soon.

Beagle was not designed as an RDF store at its inception, and it will take quite some work to make it a genuine RDF store. But what if there were an RDF adapter that sat between an RDF client and beagle and talked to each of them in their corresponding language, while maintaining sanity? There is ongoing work to overlay a SemWeb selectable source on top of beagle. We will see how that goes.

A post on klik2 rekindled my desire to create a klik package for beagle. It will be easier this time since klik2 handles command-line programs. I tried the automatically generated Debian recipe and it mostly worked. Mostly, because one of the tools in beagle ran with --version and --help but failed to find some libraries to do anything more. I think all I need to do is teach klik how to set certain PATHs and environment variables. Pretty exciting, what do you think?

Thursday, December 13, 2007

Enterprise search OR How to index on-demand

If you are like me and keep your filesystem organized, have a relatively
unchanging home directory or simply do not want realtime indexing, you
can use beagle-build-index to meet your needs.


Beagle-build-index builds static indexes from files. Static indexes are
created and updated on demand when beagle-build-index is run; the
directories are otherwise not monitored for changes. The next time the
command is run, the changes are registered in the index. Once the static
index is created, you can ask beagled to search it (by
passing --add-static-backend /location/of/static/index). beagled need not be
stopped while running beagle-build-index; it will automatically use the
updated index for searching once beagle-build-index finishes.


That was for files. If you want to do the same for anything else, say emails
or notes or the addressbook, and you do not want realtime monitoring, start
beagled normally and let the indexing finish. Then stop beagled and restart
it with the parameter --disable-scheduler. Unfortunately, to update the index
with changes, beagled needs to be stopped, started normally, allowed to
run until the index update is done, stopped, and then started again with
that parameter.


If you are a system administrator managing lots of users and you don't want
to run beagled in realtime indexing mode for all of them, you can use the
above procedure to create/update static indexes, say once a day.


Or if you don't like mono, you can use Recoll for files and mairix for emails.
There are probably many more such tools, but these are the two I know. Just in
case you have not heard about Mono, it is an open source implementation of an
ECMA-standard-compliant C# compiler and a Common Language Runtime. And some
more goodies; all in all pretty useful.

Wednesday, December 12, 2007

Many reasons to like, what's yours?

Beagle 0.3.0 was released at the beginning of December. It is nearly 2 years since 0.2.0, more than 10 months since the last feature release, and about 2 weeks since 0.3.0 itself. In the meantime we identified some upgrade problems with 0.3.0 and released 0.3.1, and Mono released 1.2.6.

In contrast to 0.1.0 and 0.2.0, beagle-0.3.0 did not have any single major-impact change. But there were lots of small changes, over the summer months and the months following them. It was getting increasingly difficult to handle all the small changes without going through the "Release early" trick, so at some point we paused development, did a test release and then finally released what we had as a major release. I am personally expecting a fair share of bugs and regressions.

What are these small changes anyway? I will leave out the invisible ones, some of which I have blogged about before, and only explain the ones that will directly impact your desktop usage.

There are 3 new backends: the Thunderbird backend (newly written, much better than the earlier one), the Opera history backend and the Nautilus metadata backend. There is also the TeX filter, one of our most demanded ones, and a new audio filter based on Taglib-sharp. There are new Firefox and Epiphany extensions which do a lot more than indexing browsing history and bookmarks.

The UI got some love; in particular, a bunch of useful options were added to beagle-settings, like the backend selection list. For obvious reasons, users should disable the backends they are never going to use.

One of the side effects of the beagle textcache previously was the creation of thousands of small cache files on the disk. People reported that the external fragmentation was wasting a lot of space. The textcache module was redesigned to minimize the fragmentation; I am sure you will appreciate the recovered space. We also compacted the extended attributes, which, among other benefits, will save some more space.

Two major enhancements were made to the query syntax, which is already quite rich. Date queries are now possible; date queries do not make complete sense without date-range queries, so those too are possible. And a new "filetype:" keyword was added, e.g. to search for images use "filetype:image", to search among documents use "filetype:document" etc.
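
For example, with the command-line client (the filetype: forms are from this release; check the documentation for the exact date-range spelling):

$ beagle-query "vacation filetype:image"
$ beagle-query "thesis filetype:document"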

The major complaints against beagle are constantly high CPU load, high memory usage and improper termination (or not exiting at all). The first two are well known and oft discussed. The third problem is not directly brought up, but has been found, upon close investigation, to be the underlying reason. I gained valuable experience trying to find my way through the web of signals, threads and events in beagle code; a number of key issues were spotted and fixed. Oh, and the first two issues were also dealt with, as much as we could diagnose, but that is nothing new. It will sound funny, but a few of the high CPU and memory problems are direct results of some of our decisions that backfired. Some of them were fixed and the others are being worked on.

2 experimental features were also added. One is a web interface to search beagle from Firefox (gecko-based browsers, really). You can also create standard bookmarks for common search terms. The neat thing about this web interface, unlike the earlier webservices-based one, is that there is no heavyweight server running on beagle's side. This one communicates with beagled using the XML-based BeagleClient API and builds the entire GUI on the client side; a pure Web 2.0 AJAX/XSLT/CSS webapp (ok, these are some cheap buzzwords).

The other fancy feature is searching other beagle daemons over the network. Using Avahi you can even publish your beagle daemon or discover other beagle daemons on the network. We haven't quite figured out how to handle security, authentication and some other issues, so the feature is disabled by default and marked as experimental, but I believe it can be used in some innovative ways.

We received requests from some distributions for global config files, useful for both distributions and sysadmins. Some useful global configuration settings would be excluding certain directories from indexing for all users, adding or removing file ignore patterns from the default list, and disabling KDE backends by default in pure Gnome distributions. Some of the options were moved from the code to the config files, so that they can be set globally and overridden by individual users.

These are only some of the major ones.

Lastly, the reason I got excited about mono-1.2.6 is that it has some fixes and improvements that will be directly visible when using beagle.