Thursday, May 17, 2007

Silence of the Dog

Lately beagle releases have slowed down quite a bit; there were a few bug-fix 0.2.16.x releases and another bug-fix release, 0.2.17 (it was supposed to be 0.2.16.4, but the changelog was too large for a point release). The underlying goal is to get ready for 0.3.0. svn trunk is changing so rapidly these days that it is difficult to isolate the simple changes and make them into a 0.2.x release, while the remaining changes are too major to be put into a 0.2.x release (they would also need extensive real-life testing).

Recently I moved beagle from entagged-sharp to taglib-sharp for filtering music files. I was told entagged is no longer actively maintained and taglib is definitely seeing a lot of rapid development. My timing was not quite right: the 4th March news "Entagged is unmaintained" was followed by the 28th March news "Entagged is maintained". I came to know about it only after I made the transition. Too late! On the plus side, taglib# supports a larger number of formats and is being used by Muine and Banshee, so expect sharing of taglib-sharp libraries. Unfortunately, there are no taglib-sharp packages out there yet (there is a proposal for a Debian package), so all the Mono apps currently include taglib-sharp as source. We too initially included the source, then removed it and linked against the package instead. But if there are no packages for the major distributions, it might make sense to include the source again; compiling Beagle is pretty demanding anyway.

In other news, I used the extremely handy heap-shot to discover that instances of IndexReader were not being GC-ed even long after the method that created them had returned. Explicitly setting them to null freed them immediately. I suspect some thread-local storage magic is happening behind my back. Note to self: set IndexReaders to null immediately after they are closed. Did I say heap-shot is amazing?!

There are several more improvements to the speed and memory performance of IndexHelper and BuildIndex. One notable feature I added reduces re-indexing of files which could not be filtered before. Due to the inherently distributed nature of beagle indexing, the crawler is always separated from the indexer. So if the crawler finds a file which was not filtered before, it has to re-submit it to the indexer. Who knows! There might be a suitable filter now. The downside was that a lot of files were being repeatedly re-tried by the indexer, slowing down the whole process. I decided to store the files containing the filters and their last modified times in a filterver.dat (akin to Mozilla's pluginreg.dat); if the filters have not changed since the last run, assume that there is no newer filter. Fair guess, I would say.
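
To make the idea concrete, here is a rough sketch in Python rather than Beagle's own C#. It is not the actual filterver.dat code: the JSON storage format and the function names are my assumptions for illustration only. It records the last-modified time of every file in a filter directory and reports whether anything changed since the saved snapshot.

```python
import json
import os


def snapshot_filters(filter_dir):
    """Map each file name in filter_dir to its last-modified time (ns)."""
    return {
        entry.name: entry.stat().st_mtime_ns
        for entry in os.scandir(filter_dir)
        if entry.is_file()
    }


def filters_changed(filter_dir, cache_path):
    """True if the filter files differ from the snapshot saved in cache_path.

    A missing or unreadable cache counts as "changed", so previously
    unfiltered files get retried at least once.
    """
    try:
        with open(cache_path) as fh:
            previous = json.load(fh)
    except (OSError, ValueError):
        return True
    return snapshot_filters(filter_dir) != previous


def save_snapshot(filter_dir, cache_path):
    """Store the current snapshot (hypothetical JSON format, not Beagle's)."""
    with open(cache_path, "w") as fh:
        json.dump(snapshot_filters(filter_dir), fh)
```

The crawler would re-submit previously unfiltered files only when filters_changed() returns True, and call save_snapshot() after that run. The cache file should live outside the filter directory, otherwise it would show up in its own snapshot.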

Beagle has known how to index email attachments for quite some time; some months ago it also got the ability to index archives. However, all along this was done by extracting each included file to a temporary file and then indexing that. This was done primarily because of the way included content (aka child indexables) was handled, and also because some of the filters only worked on physical files and not streams. This whole temporary file business never pleased me: there were race conditions which could leave undeleted temporary files in the system, even small included files had to be written to disk, and extracting the contents of an archive to index it defeated the whole purpose of archiving it. Last week, I added the infrastructure to allow indexing of archives and email attachments without extracting them, if the filter permits of course. The infrastructure is there; the archive and email filters still need to be modified to take advantage of it.
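
For illustration, this is roughly what stream-based indexing of an archive looks like, sketched in Python with the standard zipfile module rather than in Beagle's C#. The text_filter function is a hypothetical stand-in for a stream-capable filter, not a Beagle API; the point is only that each member is read from a stream and nothing is written to a temporary file.

```python
import io
import zipfile


def text_filter(stream):
    """Hypothetical stream-capable filter: decode the stream as UTF-8 text."""
    return stream.read().decode("utf-8", errors="replace")


def index_archive(archive, filter_func=text_filter):
    """Yield (member_name, extracted_text) for each file in a zip archive.

    `archive` may be a path or a file-like object. Members are opened as
    streams, so nothing is extracted to disk.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as member:
                yield info.filename, filter_func(member)
```

A filter that really needs a physical file would still have to fall back to extraction, which is why the "if the filter permits" condition matters.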

Finally, one feature I personally would like to see in 0.3 is support for XMP sidecars. XMP sidecars allow users to add a separate file.ext.xmp file containing arbitrary metadata (but in the XMP format) about file.ext. It is a really extensible solution for metadata. The main part of the code is in svn trunk; it still does not handle renaming or deleting of xmp files. Hopefully it will be finished in time.
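
The naming convention is simple enough to sketch. The Python helpers below are hypothetical (not Beagle code) and only show the mapping between file.ext and file.ext.xmp, which is the lookup needed when either file of the pair is renamed or deleted.

```python
from pathlib import Path

SIDECAR_SUFFIX = ".xmp"


def sidecar_for(path):
    """Return the sidecar path for a file: file.ext -> file.ext.xmp."""
    p = Path(path)
    return p.with_name(p.name + SIDECAR_SUFFIX)


def target_of(sidecar):
    """Return the file a sidecar describes, or None if it is not a sidecar."""
    p = Path(sidecar)
    if p.suffix.lower() != SIDECAR_SUFFIX or not p.stem:
        return None
    return p.with_name(p.stem)
```

For example, sidecar_for("photo.jpg") gives photo.jpg.xmp, and target_of("photo.jpg.xmp") gives back photo.jpg.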

This will probably be my last post before my annual break to the land of mangoes ("fruit of the gods"). Sadly, I have (knowingly) tasted only about a dozen varieties of mangoes, out of over 300. I will definitely try to increment the number this time. Next post, July.
