
I’m making this post to elucidate some conversations I had late last night; none of this is particularly rocket science, or necessarily even model rocket science. One hilarious thing that keeps coming up is federating search: combining search results from multiple datastores. It is a moderately hard problem to solve in general, but it is frequently easy to solve for a particular purpose.

Complicated general solutions exist (such as the one found in GeoNames for a lot of content information, although it doesn’t federate that with other results and it uses a couple of other data sources that aren’t relevant to this).

Here’s the (relatively trivial) code that does ‘content’ searching (normally called fulltext searching, but called ‘content search’ in content management systems).

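module Contentsearch
  # Declared up front so the included hook below can refer to it.
  module ClassMethods; end

  # Standard Ruby idiom: including this module also adds ClassMethods
  # to the including class as class-level methods.
  def self.included(klass)
    klass.extend(ClassMethods)
  end

  # Queue a background job (via Bj) that adds this record to the fulltext index.
  def ft_index
    logger.debug("[contentsearch::ftindex] submitting #{self.id} #{self.name}")
    Bj.submit "./script/runner jobs/add_to_consearch.rb -t #{self.class.name.downcase} -i #{self.id}"
  end

  # Queue a background job that removes this record from the fulltext index.
  def ft_deindex
    logger.debug("[contentsearch::ftdeindex] removing #{self.id} #{self.name}")
    Bj.submit "./script/runner jobs/remove_from_consearch.rb -t #{self.class.name.downcase} -i #{self.id}"
  end

  module ClassMethods
    # Join the model's own table to the consearches table and return the
    # six best fulltext matches for kw_string, ordered by relevance.
    def ft_search(kw_string)
      # The table name is derived from the class name by appending "s".
      clsname = self.name.downcase + "s"
      # kw_string is bound twice: once for the relevance column, once for the WHERE clause.
      return self.find_by_sql(["select distinct #{clsname}.id, lots of stuff i deleted here, MATCH(content) against (?) as relevance FROM #{clsname},consearches WHERE consearches.ftable_id = #{clsname}.id and consearches.ftable_type='#{self.name}' and match(content) against (?) order by relevance limit 6",kw_string,kw_string])
    end
  end
end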

This example is written in Ruby (on Rails), and the first part is just a convention for putting class methods into a Ruby class. How Ruby (and Smalltalk and similar languages) handles methods is a fascinating but different discussion, but essentially it’s a metaprogramming party and everyone’s invited.

Ruby makes this relatively simple to add as a module to pretty much any class. The reason that ft_index and ft_deindex run in a background process is that taking a document in or out of a fulltext-indexed MySQL database is way slower than you would want to present to the user in an interactive process. This is common in web applications and is part of why you see things like “your [whatever] may not appear in searches right away” from a lot of applications. If you leave the jobs to run on their own they’re fast enough, but run inline they would generally make the user unhappy.

But basically, what’s going on here is that there are two separate tables (and different table types in MySQL: one does fulltext searching and the other has ACID properties). By joining these two tables together, you can search against the content table and get results back from the main table that stores domain objects. This is probably the simplest version of federating two different search types together. It works pretty smoothly and this sort of thing is in a number of different products.

(And for Rails people, the relevant line in the model classes for this is has_one :consearch, :as => :ftable)

But this is obviously trivially simple: the objects in the content search table are representations of the objects in the main table, and the two tables have entirely separate semantics. (Unfortunately I deleted the examples using both the main and consearch tables, but it’s a join and you get the idea.) One of the tables does ‘field operator value’ type searching (i.e. relational) and the other does the kind referred to these days as ‘Google searching.’

Things get progressively more difficult when one of these things isn’t true: when there isn’t a store that has the single unified version of the document, or when the semantics are related but neither identical nor entirely different. For example, if I’m searching two different instantiations of my own product, it’s fairly easy, because all the fields mean the same thing in the two databases.

If the products differ by schema or by the meaning of the schema, you have to make a (semantic) translation between the two to make the search work, and you also have to translate the search results so they are displayed to the user in a way that makes sense. This might be as simple as ‘things in one repository have names and things in the other have titles’ or it might be more complex (names versus ids, names in particular formats, URLs versus descriptive strings, date formats that give seconds versus those that are accurate to the day, etc…)
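To make the ‘names versus titles’ and date-precision cases concrete, here is a minimal sketch of that translation step. The repository labels, the field names and the FIELD_MAPS table are all made up for illustration; a real mapping would come from whatever the two products actually expose.

```ruby
require "date"

# Hypothetical field mappings: each repository's own field name on the left,
# the common name we want to show the user on the right.
FIELD_MAPS = {
  repo_a: { "name"  => :title, "modified"   => :updated_on },
  repo_b: { "title" => :title, "last_saved" => :updated_on }
}.freeze

# Translate one raw result hash from the given repository into the common shape.
def normalize_result(repo, raw)
  result = { source: repo }
  FIELD_MAPS.fetch(repo).each do |theirs, ours|
    result[ours] = raw[theirs]
  end
  # One repository reports seconds and the other only days, so keep just the
  # calendar date, the coarsest precision both can supply.
  result[:updated_on] = Date.parse(result[:updated_on])
  result
end

a = normalize_result(:repo_a, "name"  => "Alpha", "modified"   => "2009-03-14T01:59:26Z")
b = normalize_result(:repo_b, "title" => "Able",  "last_saved" => "2009-03-14")
puts a[:updated_on] == b[:updated_on]
```

The same table can be read in the other direction to translate a query on the common field into each repository’s own field name before the search is sent.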

It’s when you start combining these sorts of things that stuff starts getting more complex. (This is also leaving aside the issue that the protocols to access all of this information are different, although these days more and more of this becomes an adventure in XML parsing and not DLL hell.) Let’s take a simple example, sorting.

Say I have three different datastores, Repo1 through Repo3, and they all return objects with titles on them, and I’m sorting the titles:

Repo1: [Alpha, Bravo, Charlie, Delta, Echo, Foxtrot ]
Repo2: [Able, Baker, Charlie, Dog, Easy, Fox]
Repo3: [Alligator, Crocodile, Pterosaur]

It’s fairly easy to implement this sort. There are a few small issues (like paging the combined results versus the page sizes of the underlying repositories), but regardless, alphabetic sorts are well understood in most locales.
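As a rough sketch of that merge, using the three lists above: the method name and the page/per_page parameters are mine, and it assumes each repository has already handed back its titles in sorted order.

```ruby
# Merge already-sorted title lists from several repositories and return one
# page of the combined ordering. All names here are hypothetical.
def merged_page(sorted_lists, page:, per_page:)
  cursors = Array.new(sorted_lists.size, 0)  # next unread index in each list
  merged  = []
  wanted  = page * per_page                  # only merge as far as this page needs

  while merged.size < wanted
    # Find the list whose next unread title sorts first (ignoring case).
    best = nil
    sorted_lists.each_with_index do |list, i|
      next if cursors[i] >= list.size
      if best.nil? || list[cursors[i]].casecmp(sorted_lists[best][cursors[best]]) < 0
        best = i
      end
    end
    break if best.nil?                       # every list is exhausted
    merged << sorted_lists[best][cursors[best]]
    cursors[best] += 1
  end

  merged[(page - 1) * per_page, per_page] || []
end

repo1 = %w[Alpha Bravo Charlie Delta Echo Foxtrot]
repo2 = %w[Able Baker Charlie Dog Easy Fox]
repo3 = %w[Alligator Crocodile Pterosaur]

p merged_page([repo1, repo2, repo3], page: 1, per_page: 5)
# prints ["Able", "Alligator", "Alpha", "Baker", "Bravo"]
```

The paging wrinkle shows up in `wanted`: to build page N of the combined list you may need up to N pages’ worth of results from every repository, because you can’t know in advance which repository the next title will come from.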

However, if you’re searching by something more like ‘relevance’, you get back a number associated with each result (so the document that had the title ‘Alpha’ before might have a score of ‘0.91’). It’s simple to order numbers as well, but how do you tell that a number from one datastore corresponds to a number from another? For one thing, those numbers are calculated (mostly) with regard to the particular collection of documents on a given datastore; for another, one repository may just tend to return a higher number for documents that are theoretically equally relevant (there isn’t any agreement about what 0.91 means, it’s just what a function returns).

So those two things are where it starts to get more complex and to need actual customization and specialization.
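For the relevance case, one common stopgap (not something from the code above, and not a real fix) is to rescale each repository’s scores onto the same 0..1 range before merging. The sketch below does that with made-up scores; the method names and the result hashes are assumptions for illustration.

```ruby
# Rescale one repository's relevance scores to the 0..1 range (min-max
# scaling). This only lines the ranges up; it does not make a 0.91 from one
# repository mean the same thing as a 0.91 from another.
def rescale(results)
  return [] if results.empty?
  lo, hi = results.map { |r| r[:score] }.minmax
  span = (hi - lo).to_f
  results.map do |r|
    # If every score is identical, treat them all as equally (fully) relevant.
    scaled = span.zero? ? 1.0 : (r[:score] - lo) / span
    r.merge(scaled: scaled)
  end
end

# Rescale each repository separately, then merge best-first. Ties on the
# rescaled score are broken by title so the order is deterministic.
def merge_by_relevance(results_by_repo)
  combined = results_by_repo.flat_map do |repo, results|
    rescale(results).map { |r| r.merge(repo: repo) }
  end
  combined.sort_by { |r| [-r[:scaled], r[:title]] }
end

# Made-up scores on deliberately different scales.
hits = {
  repo1: [{ title: "Alpha", score: 0.91 }, { title: "Bravo", score: 0.40 }],
  repo2: [{ title: "Able", score: 12.0 }, { title: "Baker", score: 9.0 },
          { title: "Dog", score: 3.0 }]
}

merge_by_relevance(hits).each do |r|
  puts format("%-6s %-6s %.2f", r[:repo], r[:title], r[:scaled])
end
```

Note what this papers over: the best hit from each repository always comes out as 1.0, whether or not it is any good, which is exactly the ‘no agreement about what the number means’ problem in a different coat.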

In conclusion, this is way too (f) long for the blog, but I was typing it up to explain things that I was talking about yesterday anyway. HTH. Feel free to comment, but if you’re the person I’m proximately writing this for, you should probably send email.
