infosci


Hey folks, I’ve been working on a couple iPhone applications recently, and things are at the point where I could use a couple beta users. If you’re interested in any of these, let me know by commenting or emailing me — email temp200911@corprew.org. You can also use the contact page at corprew.org.

  • I’m looking for some users for App A, and it would help if you lived in the Seattle or Portland area and had Celiac disease for this one to be helpful for you (and for you to provide useful data for me.)
  • I’m looking for some users for App B, mostly for people who travel a lot. If you’re traveling by air this holiday season, you’re welcome to give it a shot.
  • I’m also testing a game for the iPhone. For this, it would be handy if you lived in the Capitol Hill region of Seattle and liked fun. As strange as it seems, some people don’t like fun. This should be fun.

All of these are useful and/or fun. Android versions will be coming relatively soon after the iPhone versions, ideally.

I’m making this post to elucidate some conversations I had late last night; none of this is particularly rocket science or necessarily even model rocket science. One hilarious thing that keeps coming up is federated search: combining search results from multiple datastores. It’s a moderately hard problem to come up with a general solution for, but (frequently) relatively easy to solve for a particular purpose.

Complicated general solutions exist (such as the one found in GeoNames for a lot of content information, although it doesn’t federate that with other results and uses a couple of other data sources that aren’t relevant to this).

Here’s the (relatively trivial) code that does ‘content’ (normally: fulltext but in content management systems called ‘content search’) searching.

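module Contentsearch
  # Convention for adding class methods: declare ClassMethods up front,
  # then extend any class that includes this module with it.
  module ClassMethods; end
  def self.included(klass)
    klass.extend(ClassMethods)
  end

  # Queue a background job (Bj) that adds this record to the fulltext index.
  def ft_index
    logger.debug("[contentsearch::ftindex] submitting #{self.id} #{self.name}")
    Bj.submit "./script/runner jobs/add_to_consearch.rb -t #{self.class.name.downcase} -i #{self.id}"
  end

  # Queue a background job (Bj) that removes this record from the fulltext index.
  def ft_deindex
    logger.debug("[contentsearch::ftdeindex] removing #{self.id} #{self.name}")
    Bj.submit "./script/runner jobs/remove_from_consearch.rb -t #{self.class.name.downcase} -i #{self.id}"
  end

  module ClassMethods
    # Join the domain table against consearches and return the six most
    # relevant domain objects for the keyword string.
    def ft_search(kw_string)
      # Naive pluralization of the class name to get the table name.
      clsname = self.name.downcase + "s"
      return self.find_by_sql(["select distinct #{clsname}.id, lots of stuff i deleted here, MATCH(content) against (?) as relevance FROM #{clsname},consearches WHERE consearches.ftable_id = #{clsname}.id and consearches.ftable_type='#{self.name}' and match(content) against (?) order by relevance limit 6",kw_string,kw_string])
    end
  end
end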

This example is written in Ruby (on Rails), and the first part is just a convention for putting class methods into a Ruby class. How Ruby (and Smalltalk and similar languages) handle methods is a fascinating but different discussion, but essentially it’s a metaprogramming party and everyone’s invited.

Ruby makes this relatively simple to add as a module to pretty much any class. The reason that ft_index and ft_deindex run in a background process is that taking a document in or out of a fulltext-indexed MySQL database is way slower than you would want to present to the user in an interactive process. This is common in web applications and is part of why you see things like “your [whatever] may not appear in searches right away” from a lot of applications. Left to run on their own, the jobs are fast enough, but run inline they would generally make the user unhappy.

But basically, what’s going on here is that there are two separate tables (and different table types in MySQL: one of which does fulltext searching and the other of which has ACID properties). By joining these two tables together, you can search against the content tables and get results back from the main table that stores domain objects. This is probably the simplest version of federating two different search types together. It works pretty smoothly and this sort of thing is in a number of different products.

(And for rails people, the relevant string in the model classes for this is has_one :consearch, :as => :ftable)

But this is obviously trivially simple: the objects in the content search table are representations of the objects in the main table, and there are entirely separate semantics between the two tables (and unfortunately I deleted the examples using both the main and consearch table, but it’s a join and you get the idea). One of the tables does ‘field operator value’ type searching (i.e., relational) and the other is the kind referred to these days as ‘google searching.’

Things get progressively more difficult when one of these things isn’t true: when there isn’t a store that has the single unified version of the document, or when the semantics are related but neither identical nor entirely different. For example, if I’m searching two different instantiations of my own product, it’s fairly easy, because all the fields mean the same thing between the two different databases.

If the products differ by schema or meaning of the schema, you have to make a (semantic) translation between the two to make the search work, and you also have to make some sort of translation on the search results so that they are displayed to the user in a way that makes sense. This might be as simple as ‘one repository has names and the other has titles’ or it might be more complex (names versus ids, names in particular formats, URLs versus descriptive strings, date formats that give seconds versus those that are accurate to the day, etc.).
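A minimal sketch of that kind of translation, in plain Ruby. Everything named here (repo_a, repo_b, the FIELD_MAPS table, the field names) is made up for illustration; a real mapping would come from whatever the two products actually store. It rewrites a unified query into each repository’s own field names, then maps results back, truncating dates to the day since that is the coarsest precision both sides share.

```ruby
require "date"

# Hypothetical per-repository field maps: unified field => native field.
FIELD_MAPS = {
  repo_a: { title: "name",  modified: "updated_at" }, # timestamps to the second
  repo_b: { title: "title", modified: "mod_date" }    # dates accurate to the day
}.freeze

# Rewrite a unified 'field => value' query into a repository's own field names.
def translate_query(repo, query)
  map = FIELD_MAPS.fetch(repo)
  query.each_with_object({}) do |(field, value), out|
    out[map.fetch(field)] = value
  end
end

# Map a native result row back to the unified shape used for display.
# Dates are truncated to the day so both repositories look the same.
def normalize_result(repo, row)
  map = FIELD_MAPS.fetch(repo)
  {
    title: row.fetch(map[:title]),
    modified: Date.parse(row.fetch(map[:modified])).iso8601,
    source: repo
  }
end
```

So translate_query(:repo_a, { title: "Alpha" }) asks the first repository about its "name" field, while the same call with :repo_b asks about "title", and the user never has to know the difference.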

It’s when you start combining these sorts of things that stuff starts getting more complex. (This is also leaving aside the issue that the protocols to access all of this information are different, although these days more and more of this becomes an adventure in XML parsing and not DLL hell.) Let’s take a simple example, sorting.

Say I have three different datastores, Repo1 through Repo3, and they all return objects with titles on them, and I’m sorting by title:

Repo1: [Alpha, Bravo, Charlie, Delta, Echo, Foxtrot ]
Repo2: [Able, Baker, Charlie, Dog, Easy, Fox]
Repo3: [Alligator, Crocodile, Pterosaur]

It’s fairly easy to implement this sort. There are a few small issues (like paging results versus the page sizes of the underlying repositories), but regardless, alphabetic sorts are well-understood in most locales.
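For the curious, here is roughly what that merge looks like as a sketch in plain Ruby. The arrays stand in for result lists that each repository has already sorted; merge_sorted and its limit parameter are names I made up for this example, and a real version would fetch further pages from each repository as its cursor runs off the end.

```ruby
# Each array stands in for one repository's already-sorted result list.
REPOS = {
  "Repo1" => %w[Alpha Bravo Charlie Delta Echo Foxtrot],
  "Repo2" => %w[Able Baker Charlie Dog Easy Fox],
  "Repo3" => %w[Alligator Crocodile Pterosaur]
}.freeze

# Merge sorted lists by repeatedly taking the smallest unread title.
# Returns [title, repo_name] pairs so the caller knows where each hit came from.
def merge_sorted(repos, limit)
  cursors = repos.transform_values { 0 } # next unread index per repository
  merged = []
  while merged.size < limit
    # Only consider repositories that still have unread results.
    live = cursors.select { |name, i| i < repos[name].size }
    break if live.empty?

    name, i = live.min_by { |n, idx| repos[n][idx].downcase }
    merged << [repos[name][i], name]
    cursors[name] += 1
  end
  merged
end

first_page = merge_sorted(REPOS, 6)
# first_page.map(&:first) => ["Able", "Alligator", "Alpha", "Baker", "Bravo", "Charlie"]
```

Note that the page size the user sees (6 here) has nothing to do with the page sizes of the underlying repositories, which is exactly the small issue mentioned above.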

However, if you’re searching by something more like ‘relevance’, you get back a number associated with each result (so the document that had the ‘Alpha’ title before might have a score of ‘0.91’). It’s simple to order numbers as well, but how do you tell that a number in one datastore corresponds to a number in another? For one thing, those numbers are calculated (mostly) with regard to the particular collection of documents on a given datastore, and for another, one repository may just tend to return a higher number for documents that are theoretically as relevant (because there isn’t any agreement about what 0.91 means; it’s just what a function returns).
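One common workaround (and not necessarily a good one) is to rescale each repository’s scores into a shared 0 to 1 range before merging. The sketch below does that with min-max scaling, which is a technique I’m swapping in purely for illustration; the scores are made up, and rescale and federated_rank are hypothetical names. It makes the numbers sortable together, but it does not make them mean the same thing.

```ruby
# Made-up raw relevance scores per repository, as [title, score] pairs.
# Repo2 deliberately uses a completely different scale.
RAW = {
  "Repo1" => [["Alpha", 0.91], ["Bravo", 0.40], ["Charlie", 0.10]],
  "Repo2" => [["Able", 12.0], ["Baker", 9.0], ["Dog", 3.0]]
}.freeze

# Min-max scale one repository's scores into 0.0..1.0.
def rescale(results)
  lo, hi = results.map { |_, score| score }.minmax
  span = (hi - lo).to_f
  results.map do |title, score|
    # If every score is identical, treat them all as equally (fully) relevant.
    [title, span.zero? ? 1.0 : (score - lo) / span]
  end
end

# Rescale per repository, then merge and sort by rescaled score, best first.
def federated_rank(raw)
  raw.flat_map { |repo, results| rescale(results).map { |t, s| [t, repo, s] } }
     .sort_by { |_, _, s| -s }
end
```

With this data, the best hit from each repository ends up tied at 1.0 no matter how relevant either one really is, which is a decent illustration of why this needs customization per pair of datastores.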

So those two things are where it starts to get more complex and to need actual customization and specialization.

In conclusion, this is way too (f) long for the blog, but I was typing it up to explain things that I was talking about yesterday anyway. HTH. Feel free to comment, but if you’re the person I’m proximately writing this for, you should probably send email.


Antelope-as-document is a famous article in information science/librarianship, which this song seems to be ‘about,’ more or less.



“Honking Antelope”

Why dont you go photograph
Everything that ever passed in time,
Indigenous traces, tribal chiefs,
Vanishing hereditary lines,
Poets gone wild on the muse
Prophets all destroying the Tao,

When you see that honking antelope,
The secret dance of snakes, the tales of it all,

the rest of the lyrics


This is the monthly get-together for the organization whose regional chapter I’m the secretary/treasurer of. If this sort of thing interests you, c’mon by.

Once a month, we get together to have drinks, chat, network, and geek out with fellow information architects, librarians, usability experts, user experience designers, and other like-minded user-centered professionals and students. It’s open to anyone, so bring a friend — especially those in other local organizations! The format will be casual, but all are encouraged to bring something to discuss — recent work, an interesting topic, or even your resume. This event is organized by the Pacific Northwest chapter of the American Society for Information Science & Technology.

What: Seattle Monthly Information Architecture Meetup

http://ia.meetup.com/57

Where: Elysian Pub, 1221 E. Pike St., Seattle, WA
When: 7-10pm, May 13th (2nd Tuesday of every month)

This meetup and the next several are most likely when the students will come in a big drove, so if you’re looking to hire new grads in the information professions, it’s a good bet.


This last week was the InfoCamp 2008 kickoff meeting. After a successful 2007 event, we’ve decided to do it again and to expand it further. This year we have Aaron, Kristen, Andy, Rachel, and me back again, and we’re also joined by Genevieve, a librarian from PLU, and Joshua, a student at the UW Information School.

Aaron and Kristen graciously cooked food for the lot of us, and we had a great initial planning meeting in which we identified roles and people responsible for roles, and then talked for a bit about the future of the ASIS&T PNW. I’m very much looking forward to doing another InfoCamp with this team; it should be a lot of fun.

We’re looking to have a much easier time this year since we have the experience of doing the conference last year, and are also starting much earlier in the year with our planning. It will continue to be an unconference serving (primarily) the PNW Information Science community.


InfoCamp 2007 was a great success, and I am really happy about the way that it went. There were roughly 50 sessions over 2 days, and roughly 85 participants. It was a great relief: everyone seemed hopeful that there’d be another one next year and there wasn’t too much negative feedback, which is about as high praise as you can expect.

My session was called “Thesaurus, Ontology, and Inference” and was about the benefits you get from having a minimal amount of semantic data associated with documents — mostly exposing metadata that’s already present in the system and some things you can do with it. I think my presentation confused a bunch of people because I had to cut so much information out (my session was compressed from 60 minutes to 10 for various reasons), but we’ll see how it plays in the long run. I think there’s a lot of difference in the assumptions that I make based on previous work experience and what the IAs/IxDs in the audience have from their work experience. At any rate, got some decent feedback for refining the presentation.

I made it to about half the sessions I wanted; keeping folks moving in the right direction was a time-consuming project. Low time between interrupts, but a lot of good fun. It was especially fun talking to an audience and getting people to introduce their sessions, and I also had a lot of valuable hallway conversations with people. One particular thing I found was a good venue for publishing professional work other than the one I have already, and the fact that there’s a difference in focus between the two helps, so what isn’t wanted by one may be by another.

All in all, I really enjoyed being one of five people running this conference; it was a great time with a great group of people… I look forward to this community growing.

For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds and then using Bayesian[1] classifiers to sort each item into one of a number of categories for summarization and display. (The system that I’m using to do this is available as a sample website, but really needs more data in the training sets before it’s ready to entertain all of you.) The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever).

There’s a relation, though, between the different terms in this sort of classification and the training data used to build the Bayesian classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else)[2], then the training data used for classifying terms can be shared.

For example, all positive training data that belongs to the child terms can also be used for the parent. So, for (a constructed) example, positive training data for Tamiflu also belongs in the positive data for bird flu vaccines. The reverse is true of negative training data: the negative data for the parent can also be used for the child terms.
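Here’s a small sketch of that sharing rule in plain Ruby. The hierarchy, the document ids, and the method names are all placeholders I invented for the example (this is not how Classifier4J stores anything); the only point is the direction of flow, with positives going up the tree and negatives going down.

```ruby
# Hypothetical hierarchy, stored as child term => parent term.
PARENT = { "tamiflu" => "bird flu vaccines" }.freeze

# Hand-labelled training examples per term (document ids are placeholders).
POSITIVE = { "tamiflu" => [:doc1, :doc2], "bird flu vaccines" => [:doc3] }.freeze
NEGATIVE = { "tamiflu" => [:doc7], "bird flu vaccines" => [:doc8, :doc9] }.freeze

# All ancestors of a term, nearest first.
def ancestors_of(term)
  chain = []
  while (term = PARENT[term])
    chain << term
  end
  chain
end

# All descendants of a term (children, grandchildren, and so on).
def descendants_of(term)
  children = PARENT.select { |_, parent| parent == term }.keys
  children + children.flat_map { |child| descendants_of(child) }
end

# Positives flow up: a term can reuse the positives of its descendants.
def positives_for(term)
  ([term] + descendants_of(term)).flat_map { |t| POSITIVE.fetch(t, []) }.uniq
end

# Negatives flow down: a term can reuse the negatives of its ancestors.
def negatives_for(term)
  ([term] + ancestors_of(term)).flat_map { |t| NEGATIVE.fetch(t, []) }.uniq
end
```

So the parent term trains on three positive documents instead of one, and the child trains on three negative documents instead of one, without anyone labelling anything extra.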

This is highly useful information when you’re making a large-scale text classifier (and having it classify texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear). It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.

Currently, I’ve been using Classifier4J for doing the classification and text summarization[3]. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences which occasionally picks up datelines and partial phrases because of what’s ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and only select sentences of certain types or completeness because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and slowly driving me nuts.)

So, another day in the life.

[1] Although I’m also using a vector space classifier for a related, larger project, and it’s driving me less nuts training it.
[2] This is called a meronymous (‘part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember this.
[3] And will probably eventually switch to jBNC http://jbnc.sourceforge.net/ before I go nuts.

This is a social night for the local chapter of the American Society for Information Science and Technology. You should come by and check it out if that’s the sort of thing you find interesting. I’m chapter secretary for this year, and we’re doing all sorts of neat stuff that we’ll tell you about at the meeting or post updates about on our site as relevant.



Join us for some good company and geeky conversation next Thursday (5/10) at the Elysian Pub in Capitol Hill!

What: Seattle Monthly Meet-up, organized by the Pacific Northwest chapter of the American Society for Information Science & Technology.

http://asistpnw.org

Where: Elysian Pub, 1221 E. Pike St., Seattle, WA
When: 7-10pm, 2nd Thursday of every month

Once a month, we’ll get together to have drinks, chat, network, and geek out with fellow information architects, librarians, usability experts, user experience designers, and other like-minded people. It’s open to anyone, so bring a friend — especially those in other local organizations! The format will be casual, but all are encouraged to bring something to discuss — recent work, an interesting topic, or even your resume.

See you there!
-Aaron

The first meeting (that I attended) concerning this fall’s ASIST PNW chapter meeting was held yesterday. Aaron Louie (chapter president) and I met with the UW iSchool student chapter president and vice-president. A year ago, I was just turning over the control of the student chapter to the next year’s president, and these people are one down the line from that (the officers are typically 2nd year students in one of the Masters programs at the iSchool.) Now, the students seem pretty young to me, but in some ways they did at the time as well.

So, I’m not going to go into details yet — that would be jumping the gun, but I think that we’ve got something very exciting lined up. We’d talked previously about how we can best revitalize the chapter, which had been faltering somewhat in recent years. I think we’re doing a pretty good job so far — the ‘Information People Get-Together’ that we’ve had the last two months at the Elysian has been going well, and the way we’re planning to run this conference (unconference style, with ask later sessions similar to Ignite Seattle’s) will be fun, exciting, and informative for folks. We’re on target for our goals, consistent in message, and serving our identified audience.

About that last sentence. (One of) The real benefit(s) from being at the iSchool for me was getting more in line with the User Centered view of the universe. Before the school, I had been largely feature/product/use case oriented (largely as a result of many years of dev background with light project management), and I think the iSchool helped better my sense of the overall context — both social and technical — in which systems exist and are created.

The last several months have been integrative of all the different things I’ve learned in different periods of my life. Someone remarked to me the other day (@ the ASIST Info Social Hour) that I sound like a consultant, but it sounds to me like I’ve integrated all the different things I know.

Enterprise Content Management (ECM) Team Blog : Taxonomy/Tagging Starter Kit for SharePoint Server, also at the SharePoint blog

Microsoft has made a kit available for SharePoint that makes it easier to have taxonomy and tagging. The tagging allows authors to tag items and to also have controlled vocabularies on particular multi-valued properties. Users can incorporate the controlled vocabularies into searches and also search by tags.

In the default configuration, users cannot tag items on the fly (although I suspect that they could change taxonomy values if they have permissions.)

I used to work (engineering) at an ECM company, so using the phrase ‘controlled vocabulary’ in place of taxonomy for this is somewhat second nature. Since I took a lot of classification classes at the Information School, it’s interesting to see how companies implement these concepts. It could be interesting if these features became widely available in SharePoint.

