search engines

Today’s session has a lot of stuff based on the concept of “the distance between two arbitrary faces of a hyperdimensional cube,” where they actually mean something more like hyperdimensional rectangles, from what I can tell. There’s apparently some benefit to this, but I like to think of it as ‘rocking out on the hypercubes.’

Because the hypercubes, they rock out. Aside from that, it’s straight-up XML Element Retrieval. Feel free to observe the exhibits carefully, but don’t touch the points of the <; they’re quite sharp. Please take extreme care not to become entangled in the forest of literal references: the & are quite difficult to detach from one’s clothing once they become caught.

The tendency to use Σ and Π as iteration operators has made for crazy space madness algorithm writing on the slides over the last couple of days. I’m going to be going back and looking at a lot of the math that I’ve forgotten over the last couple of years. Intense. Learned a lot. It’s about over now; a couple of hours are left.

I’ve learned a lot at this, and gotten a couple new ideas about things I should be working on learning. Fun. Now, on to the Semantic Grail meeting tonight.

Adapting Ranking SVM to Document Retrieval
Yunbo Cao, Microsoft Research Asia
Jun Xu, Nankai University
Tie-Yan Liu, Hang Li, Microsoft Research Asia
Yalou Huang, Nankai University
Hsiao-Wuen Hon, Microsoft Research Asia

traditional: tf, idf, document length
currently: page rank, structural features of document, others
future: ??? Relevance SVM ???

Check it out, they gave a similar talk on Relevance SVM at the WWW conference. The talk itself was hard for me to understand due to speaker ESL issues, so your guess is as good as mine.

Document modeling is important to any IR approach. The bag-of-words approach assumes word independence, which is simple but inappropriate for natural language. There have been a bunch of approaches to this sort of thing in the past, but here’s a relatively new one that does well on various TREC collections.

Here’s a link to the paper: LDA-based Document Models for Ad-hoc Retrieval.

The presentation was largely a crawl of the paper section by section, and I’m going to emulate that approach by just referring to the paper so you can have that experience.

However: it beats previous models because it maps { document vs. topic } for all topics and documents. The cluster approaches, by contrast, largely assume that each document belongs to one cluster or, in many practical implementations, to whichever cluster it matches best. Because a document d[i] can belong to n topics, each with probability p(d[i], n), this does better than searching against bag-of-words models.
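For a rough feel of what “documents belong to several topics with some probability” buys you, here is a minimal Python sketch. It assumes the per-document topic mixture and the per-topic word distributions have already been estimated somehow (the numbers below are made up by hand), and it mixes the topic-based word probability with a plain maximum-likelihood bag-of-words estimate using an arbitrary weight `lam`. The function names and the data are mine, not the paper’s; see the paper for the actual model and its smoothing.

```python
from collections import Counter

# Hypothetical, hand-picked numbers: p(word | topic) for two toy topics.
P_WORD_GIVEN_TOPIC = {
    "music": {"rock": 0.5, "guitar": 0.4, "stone": 0.1},
    "geology": {"rock": 0.4, "guitar": 0.0, "stone": 0.6},
}


def ml_word_prob(word, doc_tokens):
    """Bag-of-words estimate: relative frequency of the word in the document."""
    counts = Counter(doc_tokens)
    return counts[word] / len(doc_tokens)


def topic_word_prob(word, doc_topics):
    """Topic-based estimate: sum over topics of p(word | topic) * p(topic | doc)."""
    return sum(
        p_topic * P_WORD_GIVEN_TOPIC[topic].get(word, 0.0)
        for topic, p_topic in doc_topics.items()
    )


def doc_word_prob(word, doc_tokens, doc_topics, lam=0.7):
    """Linear mix of the two estimates; lam is an assumed, untuned weight."""
    return (lam * ml_word_prob(word, doc_tokens)
            + (1.0 - lam) * topic_word_prob(word, doc_topics))


def query_likelihood(query_tokens, doc_tokens, doc_topics, lam=0.7):
    """Score a document by the product of its word probabilities for the query."""
    score = 1.0
    for word in query_tokens:
        score *= doc_word_prob(word, doc_tokens, doc_topics, lam)
    return score


# A toy document that is mostly about geology but a little about music.
doc_tokens = ["rock", "rock", "stone"]
doc_topics = {"music": 0.2, "geology": 0.8}
print(doc_word_prob("guitar", doc_tokens, doc_topics))
```

The point of the toy: “guitar” never occurs in the document, so the pure bag-of-words estimate is zero, but the document’s small share of the “music” topic still gives the word a non-zero probability.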

All papers in this section are pretty oriented towards the idea that searching on autogenerated topics is better than word-based searching. See the papers in question for the differentiators, as a lot of it is math that I’m not going to break out the LaTeX for on the fly. I will also note that most presentations in this area are pretty high on the UMLS fetishism.

Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR
Xiaohua Zhou, Xiaohua Hu, Xiaodan Zhang, Xia Lin, Il-Yeol Song

It is possible to disambiguate homonyms in a probabilistic manner by using Topic Signatures, which let you identify which of the candidate topics the homonym in question is actually retrieving. Using Topic Signatures is also more effective for finding documents than the ‘bag of words’ model.

So “‘terms’ -> ‘topics’ -> ‘find documents for topic’” is more effective for both precision and relevance than “‘terms’ -> ‘find documents for terms.’” Doing this topic modeling is called ‘smoothing’ or ‘semantic smoothing.’
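To make the two pipelines concrete, here is a toy Python sketch. This is not the authors’ smoothing model: the documents, the term-to-topic table, and the function names are all hypothetical, and the “topic signatures” here are just hand-written probabilities rather than anything learned from a corpus. It only shows the difference in shape between matching on terms and matching through topics.

```python
# Hypothetical toy corpus.
DOCS = {
    "d1": ["jaguar", "engine", "speed"],
    "d2": ["jaguar", "jungle", "prey"],
    "d3": ["leopard", "jungle", "cat"],
}

TOPICS = ("cars", "big_cats")

# Hypothetical hand-built table of p(topic | term); "jaguar" is the homonym.
TERM_TOPICS = {
    "jaguar": {"cars": 0.5, "big_cats": 0.5},
    "engine": {"cars": 1.0},
    "speed": {"cars": 1.0},
    "jungle": {"big_cats": 1.0},
    "prey": {"big_cats": 1.0},
    "leopard": {"big_cats": 1.0},
    "cat": {"big_cats": 1.0},
}


def search_by_terms(query):
    """'terms' -> 'find documents for terms': any shared word counts as a hit."""
    return sorted(
        doc_id for doc_id, tokens in DOCS.items() if set(query) & set(tokens)
    )


def topic_profile(tokens):
    """Average the per-term topic probabilities into one topic mixture."""
    profile = {topic: 0.0 for topic in TOPICS}
    for token in tokens:
        for topic, prob in TERM_TOPICS.get(token, {}).items():
            profile[topic] += prob / len(tokens)
    return profile


def search_by_topics(query):
    """'terms' -> 'topics' -> 'find documents for topic': rank by topic overlap."""
    query_profile = topic_profile(query)
    scores = {}
    for doc_id, tokens in DOCS.items():
        doc_profile = topic_profile(tokens)
        scores[doc_id] = sum(query_profile[t] * doc_profile[t] for t in TOPICS)
    return sorted(scores, key=scores.get, reverse=True)


query = ["jaguar", "jungle"]
print(search_by_terms(query))   # ['d1', 'd2', 'd3']
print(search_by_topics(query))  # ['d3', 'd2', 'd1']
```

In this toy, the term search treats the car document d1 as just as much of a hit as the others, while the topic search uses “jungle” to pull the ambiguous “jaguar” toward the big-cat reading. It ranks d1 last and even surfaces d3, which never mentions “jaguar” at all.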

My reflection on this is that it’s a lot like automatically building a controlled vocabulary and then algorithmically mapping both documents and terms to it. Strangely, this presentation reminds me a lot of a math-intensive version of Jens-Erik’s classes on Indexing, but, I suspect, only if you’ve already heard JEM talking and have that context.

It works better than WordNet (according to a person who asked a question), because it uses math to eliminate ambiguity of meaning.

Anyway: I plan to read the rest of their stuff. It looks interesting. It would be interesting to see what sort of ontologies (InfoSci sense) can work with it. However, it has nothing to do with Genomic IR that I can see, other than that being, probably, the undescribed domain they’re using.

I will be at SIGIR 2006 this Monday through Wednesday on the University of Washington campus. Although I’m going to be doing a lot of networking and attending conference sessions, this would be a pleasant time for lunch and hanging out if you happen to be around the campus/UD. SIGIR is the ACM special interest group on information retrieval, and is interesting and fun if you find that sort of thing interesting and fun. At any rate, I hope to learn a lot. I’ll be around as a conference volunteer on Thursday as well; they were low on volunteers and I wasn’t doing anything in particular that day.

A lot of writing in general, and a lot of web pages specifically, concerns some topic. That writing is about a topic. There are a bunch of different ways of figuring out what something is about, and many of these are hilariously wrong. But that isn’t what this post is about. This post is about ‘aboutness assertions,’ which are how you say what things are about once you’ve decided that something is about something.

Sounds confusing? It gets worse, but more interesting…