<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>corprewland &#187; search engines</title>
	<atom:link href="http://www.corprew.org/categories/infosci/search-engines/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.corprew.org</link>
	<description>(dis)information organization</description>
	<lastBuildDate>Fri, 19 Feb 2010 23:38:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>SIGIR06: rocking the hypercube</title>
		<link>http://www.corprew.org/blog/2006/08/10/sigir-4-rocking-the-hypercube/</link>
		<comments>http://www.corprew.org/blog/2006/08/10/sigir-4-rocking-the-hypercube/#comments</comments>
		<pubDate>Thu, 10 Aug 2006 22:24:39 +0000</pubDate>
		<dc:creator>corprew</dc:creator>
				<category><![CDATA[infosci]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.corprew.org/2006/08/10/sigir-4-rocking-the-hypercube/</guid>
		<description><![CDATA[the session today has a lot of stuff based on the concept of &#8220;the distance between two arbitrary faces of a hyperdimensional cube,&#8221; where they actually mean more like hyperdimensional rectangles from what i can tell. There&#8217;s apparently some benefit to this, but I like to think of it as &#8216;rocking out on the hypercubes.&#8217; [...]]]></description>
			<content:encoded><![CDATA[<p>the session today has a lot of stuff based on the concept of &#8220;<strong>the distance between two arbitrary faces of a hyperdimensional cube</strong>,&#8221; where they actually mean more like hyperdimensional rectangles from what i can tell.  There&#8217;s apparently some benefit to this, but I like to think of it as &#8216;rocking out on the hypercubes.&#8217;</p>
<p>
because the hypercubes, they rock out. Aside from that, <strong>it&#8217;s straight up XML Element Retrieval</strong>.  Feel free to observe the exhibits carefully, don&#8217;t touch the points of the &lt;, they&#8217;re quite sharp.  Please take extreme care not to become entangled in the forest of literal references, the &amp; are quite difficult to detach from one&#8217;s clothing once they become caught.
</p>
<p>
<strong>The tendency to use &Sigma; and &Pi; as iteration operators has made for crazy space madness algorithm writing on the slides over the last couple of days</strong>.  I&#8217;m going to be going back and looking at a lot of the math that I&#8217;ve forgotten over the last couple of years.  Intense.  Learned a lot.  It&#8217;s about over now, a couple of hours are left.
</p>
<p>
I&#8217;ve learned a lot at this, and gotten a couple new ideas about things I should be working on learning.  Fun.  Now, on to the Semantic Grail meeting tonight.
</p>
<p>Technorati Tags: <a href="http://www.technorati.com/tag/informationretrieval" rel="tag">informationretrieval</a>, <a href="http://www.technorati.com/tag/informationscience" rel="tag">informationscience</a>, <a href="http://www.technorati.com/tag/sigir" rel="tag">sigir</a>, <a href="http://www.technorati.com/tag/sigir2006" rel="tag">sigir2006</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.corprew.org/blog/2006/08/10/sigir-4-rocking-the-hypercube/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR06:  Formal Models:  Adapting Ranking SVM to Document Retrieval</title>
		<link>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-adapting-ranking-svm-to-document-retrieval/</link>
		<comments>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-adapting-ranking-svm-to-document-retrieval/#comments</comments>
		<pubDate>Mon, 07 Aug 2006 23:55:41 +0000</pubDate>
		<dc:creator>corprew</dc:creator>
				<category><![CDATA[infosci]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.corprew.org/2006/08/07/sigir06-formal-models-adapting-ranking-svm-to-document-retrieval/</guid>
		<description><![CDATA[Adapting Ranking SVM to Document Retrieval Yunbo Cao, Microsoft Research Asia Jun Xu, Nankai University Tie-Yan Liu, Hang Li, Microsoft Research Asia Yalou Huang, Nankai University Hsiao-Wuen Hon, Microsoft Research Asia traditional: tf, idf, document length currently: page rank, structural features of document, others future: ??? Relevance SVM ??? Check it out, they gave a [...]]]></description>
			<content:encoded><![CDATA[<p>Adapting Ranking SVM to Document Retrieval<br />
Yunbo Cao, Microsoft Research Asia<br />
Jun Xu, Nankai University<br />
Tie-Yan Liu, Hang Li, Microsoft Research Asia<br />
Yalou Huang, Nankai University<br />
Hsiao-Wuen Hon, Microsoft Research Asia</p>
<p>traditional: tf, idf, document length<br />
currently:  page rank, structural features of document, others<br />
future: ??? Relevance SVM ???</p>
<p>Check it out, <a href="http://portal.acm.org/citation.cfm?id=1062761&amp;dl=GUIDE&amp;coll=GUIDE">they gave a similar talk on Relevance SVM</a> at the WWW conference.  The talk itself was hard for me to understand due to speaker ESL issues, so your guess is as good as mine.</p>
<p>Technorati Tags: <a href="http://www.technorati.com/tag/relevance" rel="tag">relevance</a>, <a href="http://www.technorati.com/tag/relevanceranking" rel="tag">relevanceranking</a>, <a href="http://www.technorati.com/tag/semantics" rel="tag">semantics</a>, <a href="http://www.technorati.com/tag/sigir" rel="tag">sigir</a>, <a href="http://www.technorati.com/tag/sigir2006" rel="tag">sigir2006</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-adapting-ranking-svm-to-document-retrieval/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR06: formal models: LDA-based Document Models for Ad-hoc Retrieval</title>
		<link>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-lda-based-document-models-for-ad-hoc-retrieval/</link>
		<comments>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-lda-based-document-models-for-ad-hoc-retrieval/#comments</comments>
		<pubDate>Mon, 07 Aug 2006 23:19:51 +0000</pubDate>
		<dc:creator>corprew</dc:creator>
				<category><![CDATA[computer programming]]></category>
		<category><![CDATA[infosci]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.corprew.org/2006/08/07/sigir06-formal-models-lda-based-document-models-for-ad-hoc-retrieval/</guid>
		<description><![CDATA[Document modeling is important to any IR approach &#8212; the bag of words approach assumes word independence, and this is simple, but inappropriate to natural language. There have been a bunch of approaches to this sort of thing in the past, but here&#8217;s a relatively new one that does well versus various TREC collections. Here&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Document modeling is important to any IR approach &#8212; the bag of words approach assumes word independence, and this is simple, but inappropriate to natural language.  There have been a bunch of approaches to this sort of thing in the past, but here&#8217;s a relatively new one that does well versus various TREC collections.</p>
<p>Here&#8217;s a link to the paper: <a href="http://ciir.cs.umass.edu/pubfiles/ir-464.pdf">LDA-based Document Models for Ad-hoc Retrieval</a>.</p>
<p>The presentation was largely a crawl of the paper section by section, and I&#8217;m going to emulate that approach by just referring to the paper so you can have that experience.</p>
<p>However:  it beats previous models because it maps { document vs. topic } for all topics and documents, as opposed to the cluster approaches, for example, which largely assume that all documents belong to one cluster, or for many practical approaches, belong to whatever cluster it matches best.  Because documents belong to n topics with probability p(d[i], n), this is better than searching against bag of words models.</p>
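<p>To make the mixture idea concrete, here is a minimal sketch, not the paper's exact model.  It assumes the topic mixture P(word | doc) = sum over topics of P(word | topic) * P(topic | doc) and ranks by query likelihood.  All numbers, topic names, and document ids are invented for illustration, and there is no smoothing, so a word outside the toy vocabulary would break the log.</p>
<pre>
```python
import math

# Hypothetical toy numbers, invented for illustration only.
# P(word | topic) for two made-up topics.
word_given_topic = {
    "genetics": {"gene": 0.5, "protein": 0.4, "rank": 0.1},
    "retrieval": {"gene": 0.1, "protein": 0.1, "rank": 0.8},
}

# P(topic | document): every document is a mixture over all topics,
# rather than a member of exactly one cluster.
topic_given_doc = {
    "doc_a": {"genetics": 0.9, "retrieval": 0.1},
    "doc_b": {"genetics": 0.5, "retrieval": 0.5},
}


def word_prob(word, doc):
    """P(word | doc) as a sum over topics of P(word | topic) * P(topic | doc)."""
    return sum(
        word_given_topic[topic].get(word, 0.0) * weight
        for topic, weight in topic_given_doc[doc].items()
    )


def query_log_likelihood(query, doc):
    """Score a document by the log-probability that it generates the query."""
    return sum(math.log(word_prob(word, doc)) for word in query)


def rank(query):
    """Return document ids ordered from best to worst score."""
    return sorted(
        topic_given_doc,
        key=lambda doc: query_log_likelihood(query, doc),
        reverse=True,
    )
```
</pre>
<p>With these toy numbers, rank(["gene"]) puts doc_a first, while rank(["gene", "rank"]) puts doc_b first, because doc_b spreads its probability over both topics.</p>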
<p>All papers in this section are pretty oriented towards the idea that searching against automatically generated topics is better than word-based searching.  See the papers in question for the differentiators, as a lot of it is math that I&#8217;m not going to break out the LaTeX for on the fly.  I will also note that most presentations in this area are pretty high on the <a href="http://www.nlm.nih.gov/pubs/factsheets/umls.html">UMLS</a> fetishism.</p>
<p>Technorati Tags: <a href="http://www.technorati.com/tag/DocumentModels" rel="tag">DocumentModels</a>, <a href="http://www.technorati.com/tag/puppy" rel="tag">puppy</a>, <a href="http://www.technorati.com/tag/sigir" rel="tag">sigir</a>, <a href="http://www.technorati.com/tag/sigir2006" rel="tag">sigir2006</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-lda-based-document-models-for-ad-hoc-retrieval/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR06: formal models: Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR</title>
		<link>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-context-sensitive-semantic-smoothing-for-the-language-modeling-approach-to-genomic-ir/</link>
		<comments>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-context-sensitive-semantic-smoothing-for-the-language-modeling-approach-to-genomic-ir/#comments</comments>
		<pubDate>Mon, 07 Aug 2006 23:02:48 +0000</pubDate>
		<dc:creator>corprew</dc:creator>
				<category><![CDATA[classification]]></category>
		<category><![CDATA[infosci]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.corprew.org/2006/08/07/sigir06-formal-models-context-sensitive-semantic-smoothing-for-the-language-modeling-approach-to-genomic-ir/</guid>
		<description><![CDATA[Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR Xiaohua Zhou, Xiaohua Hu, Xiaodan Zhang, Xia Lin, Il-Yeol Song It is possible to disambiguate homonyms in a probabilistic manner by using Topic Signatures that let you identify which of the topics that the questionable-homonym is actually retrieving. Using Topic Signatures is also more [...]]]></description>
			<content:encoded><![CDATA[<p><em>Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR </em><em><br />
</em>Xiaohua Zhou, Xiaohua Hu, Xiaodan Zhang, Xia Lin, Il-Yeol Song </p>
<p>It is possible to disambiguate homonyms in a probabilistic manner by using Topic Signatures that let you identify which of the topics that the questionable-homonym is actually retrieving.  Using Topic Signatures is also more effective for finding documents than the &#8216;bag of words&#8217; model.</p>
<p>So &#8220;&#8216;terms&#8217; -&gt; &#8216;topics&#8217; -&gt; &#8216;find documents for topic&#8217;&#8221; is more effective for both precision and relevance than &#8220;&#8216;terms&#8217; -&gt; &#8216;find documents for terms&#8217;&#8221;.  Mapping terms through topics this way is called &#8216;smoothing&#8217; or &#8216;semantic smoothing.&#8217;</p>
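<p>The two pipelines can be compared with a rough sketch.  The paper does this probabilistically; the version below swaps in a crude set-overlap score as a stand-in, and every topic signature, term, and document id in it is hypothetical.</p>
<pre>
```python
# Hypothetical data, invented for illustration only.
# Each topic has a "signature": terms that tend to signal that topic.
topic_signatures = {
    "common cold": {"cold", "virus", "cough"},
    "low temperature": {"cold", "ice", "winter"},
}
docs_by_topic = {
    "common cold": ["doc_rhinovirus", "doc_cough_remedies"],
    "low temperature": ["doc_frostbite"],
}
docs_by_term = {
    "cold": ["doc_rhinovirus", "doc_frostbite"],
    "virus": ["doc_rhinovirus"],
}


def best_topic(query_terms):
    """Pick the topic whose signature shares the most terms with the query."""
    terms = set(query_terms)
    return max(
        topic_signatures,
        key=lambda topic: len(topic_signatures[topic].intersection(terms)),
    )


def search_via_topics(query_terms):
    """terms, then topic, then documents indexed under that topic."""
    return docs_by_topic[best_topic(query_terms)]


def search_via_terms(query_terms):
    """terms, then every document containing any of the terms."""
    found = []
    for term in query_terms:
        for doc in docs_by_term.get(term, []):
            if doc not in found:
                found.append(doc)
    return found
```
</pre>
<p>For the query ["cold", "virus"], the term route returns doc_rhinovirus and doc_frostbite (the homonym drags in a wrong hit), while the topic route returns doc_rhinovirus and doc_cough_remedies.</p>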
<p>My reflection on this is that it&#8217;s a lot like automatically building a controlled vocabulary and then algorithmically mapping both documents and terms to it.  Strangely, this presentation reminds me a lot of a math-intensive version of Jens-Erik&#8217;s classes on Indexing, but, I suspect, only if you&#8217;ve already heard JEM talking and have that context.</p>
<p>It works better than <a href="http://en.wikipedia.org/wiki/WordNet">WordNet</a> (according to a person who asked a question), because it uses math to eliminate ambiguity of meaning.</p>
<p>Anyway:  I plan to read the rest of their stuff.  It looks interesting.  It would be interesting to see what sort of ontologies (InfoSci sense) can work with it.  However, I can see nothing in it that is specific to Genomic IR, other than that being the otherwise undescribed domain they&#8217;re probably using.</p>
<p>Technorati Tags: <a href="http://www.technorati.com/tag/folksonomy" rel="tag">folksonomy</a>, <a href="http://www.technorati.com/tag/semantics" rel="tag">semantics</a>, <a href="http://www.technorati.com/tag/sigir" rel="tag">sigir</a>, <a href="http://www.technorati.com/tag/sigir2006" rel="tag">sigir2006</a>, <a href="http://www.technorati.com/tag/subjective" rel="tag">subjective</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.corprew.org/blog/2006/08/07/sigir06-formal-models-context-sensitive-semantic-smoothing-for-the-language-modeling-approach-to-genomic-ir/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SIGIR2006 @ UW Campus</title>
		<link>http://www.corprew.org/blog/2006/08/07/sigir2006-uw-campus/</link>
		<comments>http://www.corprew.org/blog/2006/08/07/sigir2006-uw-campus/#comments</comments>
		<pubDate>Mon, 07 Aug 2006 16:34:23 +0000</pubDate>
		<dc:creator>corprew</dc:creator>
				<category><![CDATA[infosci]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.corprew.org/2006/08/07/sigir2006-uw-campus/</guid>
		<description><![CDATA[I will be at SIGIR 2006 this Monday through Wednesday on the University of Washington Campus. Although I&#8217;m going to be doing a lot of networking and attending conference sessions, this would be a pleasant time for lunch and hanging out if you happen to be around the campus/UD. SIGIR is the ACM special interest [...]]]></description>
			<content:encoded><![CDATA[<div style="float: right; margin-left: 10px; margin-bottom: 10px"><img src="http://www.sigir2006.org/images/sigir.png" /></div>
<p>I will be at <a href="http://www.sigir2006.org">SIGIR 2006</a> this Monday through Wednesday on the University of Washington Campus.  Although I&#8217;m going to be doing a lot of networking and attending conference sessions, this would be a pleasant time for lunch and hanging out if you happen to be around the campus/UD.  SIGIR is the ACM special interest group on information retrieval, and is interesting and fun if you find that sort of thing interesting and fun.  At any rate, I hope to learn a lot.  I&#8217;ll be around as a conference volunteer on Thursday as well &#8212; they were low on volunteers and I wasn&#8217;t doing anything in particular that day. <br clear="all" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.corprew.org/blog/2006/08/07/sigir2006-uw-campus/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>aboutness</title>
		<link>http://www.corprew.org/blog/2005/11/20/aboutness/</link>
		<comments>http://www.corprew.org/blog/2005/11/20/aboutness/#comments</comments>
		<pubDate>Sun, 20 Nov 2005 20:34:10 +0000</pubDate>
		<dc:creator>corprew</dc:creator>
				<category><![CDATA[classification]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.corprew.org/2005/11/20/aboutness/</guid>
		<description><![CDATA[A lot of writing in general, and a lot of web pages specifically, concerns some topic. That writing is about a topic. There are a bunch of different ways of figuring out what something is about, and many of these are hilariously wrong. But that isn&#8217;t what this post is about, this post is about [...]]]></description>
			<content:encoded><![CDATA[<p>A lot of writing in general, and a lot of web pages specifically, concerns some topic.  That writing is about a topic.  There are a bunch of different ways of figuring out what something is about, and many of these are hilariously wrong.  But that isn&#8217;t what this post is about, this post is about &#8216;aboutness assertions,&#8217; which is how you say what things are about once you&#8217;ve decided that something is about something.</p>
<p>sounds confusing?  it gets worse but more interesting&#8230;<br />
<span id="more-34"></span><br />
[note that this post will get re-edited from time to time based on other posts in this category.]</p>
<p>A topic is a concept.  It can be a simple concept like &#8216;whales&#8217; or &#8216;Wales&#8217; that physically exists in space-time, or it can be a complex thing like &#8216;disintermediation&#8217; or &#8216;velocity of light&#8217; that describes properties of matter or ways that people understand the universe.  There is a set of people who are skilled in figuring out how to make these assertions, and they are generally called things like Information Architects or Taxonomists.  People also have an intuitive understanding of what things are about, and this is the basis of folksonomy.</p>
<p>An aboutness assertion is a link between some resource and one of these topics.  A &#8216;resource&#8217; can be virtually anything that can be identified, but for the sake of argument let&#8217;s say that it is a URL.<sup>1</sup>  The assertion is then the link between the topic and the resource.</p>
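<p>As a data structure, an assertion like this is just a pair.  The sketch below is one possible shape, not a standard: the class name, the example.org URLs, and the extra source field (recording where an assertion came from) are all assumptions made for illustration.</p>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AboutnessAssertion:
    """A link saying that a resource is about a topic."""

    resource: str  # a URL (strictly, a URI) identifying the resource
    topic: str  # the concept the resource is asserted to be about
    source: str  # hypothetical extra field: where the assertion came from


# Made-up assertions for illustration only.
assertions = [
    AboutnessAssertion("http://example.org/cetaceans", "whales", "taxonomy"),
    AboutnessAssertion("http://example.org/cardiff", "Wales", "folksonomy"),
    AboutnessAssertion("http://example.org/cetaceans", "oceans", "folksonomy"),
]


def topics_for(resource):
    """All topics that some assertion links to this resource."""
    return sorted({a.topic for a in assertions if a.resource == resource})


def resources_about(topic):
    """All resources that some assertion links to this topic."""
    return sorted({a.resource for a in assertions if a.topic == topic})
```
</pre>
<p>Because the link can be read in either direction, the same list answers both &#8216;what is this page about?&#8217; and &#8216;which pages are about whales?&#8217;</p>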
<p>There are a bunch of different things that you can derive aboutness assertions from, but the three that I&#8217;m going to talk about for the next couple of posts in this blog are content, taxonomy, and folksonomy.  There&#8217;s also a fourth interesting source of aboutness assertions, which is re-mining existing aboutness assertions.</p>
<p><strong>Content</strong>  A grossly simplified view of what a search engine does is that it takes words in, and returns links to documents that it asserts are about those words.  Now, the kind of assertion it&#8217;s making &#8212; that the document contains those words &#8212; is very weak.  The user had something in mind when they made that query, though &#8212; for the most part people aren&#8217;t just looking for documents that contain random words, they have a query formulated in their mind that they then turn into words when they enter it into an engine.  This is as close to aboutness as you can get from the simplest search engines (things like Google and MSN Search are beyond this.)</p>
<p><strong>Taxonomy</strong>  A Taxonomy (or Controlled Vocabulary generally) is an organized system of topics that people can use to make assertions about resources.  This is &#8216;taxonomy&#8217; as used as a term in information architecture, not &#8216;Taxonomy&#8217; in the purer Information Science sense.  These taxonomies can be general (like the Library of Congress Subject Headings or the Dewey Decimal System, which are both used in libraries), large and specific to a focus area (like the NIH&#8217;s MeSH headers for describing health issues), or small and specific to a single company or organization&#8217;s concerns or line of business.  These taxonomies are generally maintained by skilled professionals called taxonomists, and what they do is analyze the concept areas that the vocabulary needs to cover, and then build a system.  When people work on these systems, they use particular terms to index a document.</p>
<p><strong>Folksonomy</strong>  A folksonomy, in contrast, lets people make aboutness assertions using whatever terms they want.  For that reason, it&#8217;s frequently called &#8216;democratic classification,&#8217; or &#8216;open source classification.&#8217;  It actually has no relation to open source as a concept whatsoever that I can tell.  Folksonomy is useful for two main reasons.  First, an individual person is usually able to figure out why they made their own assertions.  Second, if you have enough people making assertions, the words they use to describe any particular topic form big &#8216;clouds&#8217; of tags that people can navigate.</p>
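<p>A tag cloud of this kind is just a frequency count over everyone&#8217;s assertions.  Here is a minimal sketch, assuming tagging events are stored as (person, resource, tag) triples; the people, URLs, and tags are made up.</p>
<pre>
```python
from collections import Counter

# Hypothetical tagging events as (person, resource, tag) triples.
taggings = [
    ("ann", "http://example.org/post", "folksonomy"),
    ("bob", "http://example.org/post", "tagging"),
    ("cat", "http://example.org/post", "folksonomy"),
    ("dan", "http://example.org/post", "classification"),
    ("eve", "http://example.org/post", "folksonomy"),
    ("ann", "http://example.org/other", "whales"),
]


def tag_cloud(resource):
    """Count how often each tag was applied to a resource, most used first."""
    counts = Counter(tag for _, res, tag in taggings if res == resource)
    return counts.most_common()


def tags_by(person):
    """The tags one person has used, for revisiting their own assertions."""
    return sorted({tag for who, _, tag in taggings if who == person})
```
</pre>
<p>The count for each tag is what a tag cloud would typically turn into font size, so here &#8216;folksonomy&#8217; (used three times on the first URL) would be drawn largest.</p>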
<p><ttag>folksonomy</ttag><br />
<ttag>classification</ttag><br />
<ttag>aboutness</ttag></p>
<p><sup>1</sup>  Strictly speaking, it&#8217;s a URI and not a URL, but for these purposes the distinction between the two doesn&#8217;t matter.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.corprew.org/blog/2005/11/20/aboutness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

