classification


InfoCamp 2007 was a great success, and I am really happy with the way it went. There were roughly 50 sessions over 2 days and roughly 85 participants. It was a great relief: everyone seemed hopeful that there’d be another one next year, and there wasn’t too much negative feedback, which is about as high praise as you can expect.

My session was called “Thesaurus, Ontology, and Inference” and was about the benefits you get from having a minimal amount of semantic data associated with documents — mostly exposing metadata that’s already present in the system and some things you can do with it. I think my presentation confused a bunch of people because I had to cut so much information out (my session was compressed from 60 minutes to 10 for various reasons), but we’ll see how it plays in the long run. I think there’s a lot of difference in the assumptions that I make based on previous work experience and what the IAs/IxDs in the audience have from their work experience. At any rate, got some decent feedback for refining the presentation.

I made it to about half the sessions I wanted; keeping folks moving in the right direction was a time-consuming project. There was little time between interrupts, but it was a lot of good fun. It was especially fun talking to an audience and getting people to introduce their sessions, and I also had a lot of valuable hallway conversations. One particular thing I found was a good venue for publishing professional work other than the one I already have; the difference in focus between the two helps, since what isn’t wanted by one may be wanted by the other.

All in all, I really enjoyed being one of five people running this conference. It was a great time with a great group of people, and I look forward to this community growing.

For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds and then using Bayesian[1] classifiers to sort each item into one of a number of categories for summarization and display. (The system that I’m using to do this is available as a sample website, but it really needs more data in the training sets before it’s ready to entertain all of you.) The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever).

There’s a relation, though, between the different terms in this sort of classification and the training data used to build the Bayesian classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else)[2], then the training data used for classifying terms can be shared.

For example, all positive training data that belongs to the child terms can also be used for the parent. So, for (a constructed) example, positive training data for tamiflu also belongs in the positive data for bird flu vaccines. Negative training data flows the other way: the negative data for the parent can also be used for the child terms.
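Here is a minimal sketch of that sharing rule. The two-term hierarchy and the document ids are made up for illustration, and none of these names come from the actual system:

```python
from collections import defaultdict

# Hypothetical hierarchy: child term -> parent term (None marks a root).
PARENT = {
    "bird flu vaccines": None,
    "tamiflu": "bird flu vaccines",
}

def ancestors(term, parent=PARENT):
    """Yield every ancestor of a term, nearest first."""
    current = parent.get(term)
    while current is not None:
        yield current
        current = parent.get(current)

def share_training_data(positive, negative, parent=PARENT):
    """Return (positive, negative) example sets per term after sharing.

    Positive examples flow up: a child's positives count for each ancestor.
    Negative examples flow down: an ancestor's negatives count for the child.
    """
    shared_pos = defaultdict(set)
    shared_neg = defaultdict(set)
    for term in parent:
        shared_pos[term] |= positive.get(term, set())
        shared_neg[term] |= negative.get(term, set())
        for ancestor in ancestors(term, parent):
            shared_pos[ancestor] |= positive.get(term, set())
            shared_neg[term] |= negative.get(ancestor, set())
    return dict(shared_pos), dict(shared_neg)

# Made-up training documents, identified only by id.
positive = {"tamiflu": {"doc1"}, "bird flu vaccines": {"doc2"}}
negative = {"tamiflu": {"doc3"}, "bird flu vaccines": {"doc4"}}
pos, neg = share_training_data(positive, negative)
```

After sharing, the parent term has picked up the child’s positive document, and the child has picked up the parent’s negative document, so each classifier trains on more data than was labeled for it directly.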

This is highly useful information when you’re making a large-scale text classifier (and having it classify texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear). It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.

Currently, I’ve been using Classifier4J for doing the classification and text summarization[3]. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences, which occasionally picks up date-lines and partial phrases because of what’s ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and select only sentences of certain types or completeness because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and slowly driving me nuts.)
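To show why this kind of summarizer grabs fragments, here is a rough sketch of frequency-based sentence selection in general. It is not Classifier4J’s actual code; the function names, stopword list, and sample text are all invented for illustration:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1, top_words=2):
    """Keep the first sentences that mention any of the most frequent words."""
    frequent = [w for w, _ in Counter(content_words(text)).most_common(top_words)]
    picked = []
    for sentence in split_sentences(text):
        if len(picked) >= max_sentences:
            break
        if set(content_words(sentence)) & set(frequent):
            picked.append(sentence)
    return picked

sample = (
    "Seattle flu update. "
    "Officials ordered more flu vaccine today. "
    "The flu vaccine ships next week."
)
summary = summarize(sample, max_sentences=1, top_words=2)
```

With this sample, the chosen “summary” is the headline-like fragment at the top, because it contains the most frequent word and nothing in the scoring cares whether a sentence has a verb.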

So, another day in the life.

[1] Although I’m also using a vector space classifier for a related, larger project, and it’s driving me less nuts training it.
[2] This is called a meronymous (‘part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember this.
[3] And will probably eventually switch to jBNC (http://jbnc.sourceforge.net/) before I go nuts.

Enterprise Content Management (ECM) Team Blog: Taxonomy/Tagging Starter Kit for SharePoint Server, also at the SharePoint blog

Microsoft has made a kit available for SharePoint that makes it easier to set up taxonomy and tagging. The kit lets authors tag items and also supports controlled vocabularies on particular multi-valued properties. Users can incorporate the controlled vocabularies into searches and also search by tags.

In the default configuration, users cannot tag items on the fly (although I suspect that they could change taxonomy values if they have permissions.)

I used to work (engineering) at an ECM company, so using the phrase ‘controlled vocabulary’ in place of taxonomy for this is somewhat second nature. Since I took a lot of classification classes at the Information School, it’s interesting to see how companies implement these concepts. It could be interesting if these features became widely available in SharePoint.


Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR
Xiaohua Zhou, Xiaohua Hu, Xiaodan Zhang, Xia Lin, Il-Yeol Song

It is possible to disambiguate homonyms probabilistically by using Topic Signatures, which let you identify which topic an ambiguous term is actually pointing to. Using Topic Signatures is also more effective for finding documents than the ‘bag of words’ model.

So “‘terms’ -> ‘topics’ -> ‘find documents for topic’” is more effective for both precision and relevance than “‘terms’ -> ‘find documents for terms.’” Going through this topic model is called ‘smoothing’ or ‘semantic smoothing.’
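One common way to write this kind of smoothing is as a mixture of the document’s own word probabilities and word probabilities reached through its topics. The sketch below is my own illustration of that idea, not code or numbers from the paper; the mixing weight `lam`, the function name, and every probability are assumptions:

```python
def smoothed_word_prob(word, doc_word_probs, doc_topic_probs, topic_word_probs, lam=0.5):
    """Mix the direct 'terms -> documents' estimate with the
    'terms -> topics -> documents' estimate.

    doc_word_probs:   P(word | document) from the words in the document
    doc_topic_probs:  P(topic | document)
    topic_word_probs: P(word | topic), one dict per topic
    lam:              hypothetical weight given to the topic route
    """
    direct = doc_word_probs.get(word, 0.0)
    via_topics = sum(
        p_topic * topic_word_probs.get(topic, {}).get(word, 0.0)
        for topic, p_topic in doc_topic_probs.items()
    )
    return (1 - lam) * direct + lam * via_topics

# Made-up document that never uses the word "movie".
doc_word_probs = {"star": 0.5, "wars": 0.5}
doc_topic_probs = {"star wars (film)": 1.0}
topic_word_probs = {"star wars (film)": {"movie": 0.25, "star": 0.25, "wars": 0.5}}

unsmoothed = smoothed_word_prob("movie", doc_word_probs, doc_topic_probs, topic_word_probs, lam=0.0)
smoothed = smoothed_word_prob("movie", doc_word_probs, doc_topic_probs, topic_word_probs, lam=0.5)
```

Without the topic route, the document gets zero probability for a query word it doesn’t contain; with it, the document can still match through the topic it is about.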

My reflection on this is that it’s a lot like using an automatically built controlled vocabulary and mapping both documents and terms to it algorithmically. Strangely, this presentation reminds me a lot of a math-intensive version of Jens-Erik’s classes on Indexing, but, I suspect, only if you’ve already heard JEM talking and have that context.

It works better than WordNet (according to a person who asked a question), because it uses math to eliminate ambiguity of meaning.

Anyway: I plan to read the rest of their stuff. It looks interesting. It would be interesting to see what sort of ontologies (InfoSci sense) can work with it. However, I can’t see that it has anything to do with Genomic IR, other than that probably being the undescribed domain they’re using.


Indexing schemes are largely similar in their outputs. The right answers to many questions, such as ‘controlled vs. non-controlled,’ how many terms to use, and how expressive or deep the vocabulary should be, are matters of indexing policy considered on a per-collection basis and more or less well defined in the specification (ISO, 1985). The biggest difference between indexing methods is whether they take an objective or subjective view of the world; much of the remaining difference comes down to how far they lean in either direction and what they consider the right approach to dealing with that.

The most subjective approach to indexing is ‘social indexing,’ which makes no assumptions about an underlying reality beneath the words it uses for indexing. In fact, many of its original proponents felt that this was a benefit, because each user may mean something different by tags (Guy & Tonkin, January 2006). Folksonomy proponents are backing down from this extreme approach over time, but denying a possible shared reality is as far towards subjective as possible.

A large distinction is between document-oriented and request-oriented indexing, with document-oriented indexing using the properties of the document to describe the document and request-oriented indexing considering what the requests of the user community are likely to be (Fidel, 1994). Another formulation of request-oriented indexing is tying indexing to supporting the information-seeking behavior of the users; indexing in this sense is about making it possible for users to find documents that satisfy their information needs (Hjørland, 1997).

Document-oriented indexing and request-oriented indexing are present in all collections to varying extents. In general, the more general the collection (the broader the scope of the collection and of the population the collection serves), the more oriented towards the properties of the document itself the indexer has to be. This is because the broader the scope of the collection, the less adapted the indexing can be to particular topics. Both the depth of indexing and the specificity of meaning of terms have to be limited in a more general collection (Hjørland, 1997). For a collection more limited in scope, more attention can be paid to supporting the sorts of requests that will be coming in from a user community (Fidel, 1994; Swift, Winn, & Bramer, 1979). Document-oriented indexing is more objective than request-oriented indexing, because it presupposes (or at least pretends) that there is a single reality that the documents are being described in relation to. Request-oriented indexing is more subjective, because it admits the concept of communities of practice (or discourse communities) with their own conceptions of the world (Hjørland, 1997; Mai, 2001).

The process of indexing is in fact more subjective than this, because the subjectivity of the indexer must be taken into account: there is a series of interpretations that the indexer makes in moving from text to a representation in the subject index(es) (Mai, 2001). The indexer’s interpretation can be informed by the domain, which is called domain-centered indexing. When analyzing a document, domain-centered indexing starts with an analysis of the domain and then moves on to the users’ needs (and indexer interpretation) and then the role of the document with regard to these (Mai, 2005).

Indexing approaches vary in a number of ways, but one of the most important is how subjective their view of the world is, and how granular that subjectivity is. Document-centered indexing is least granular (most objective), with a single representation claiming to serve all needs; after that is domain indexing, which considers documents in light of their role in a domain (although other information is taken into account). User (or request)-oriented indexing considers the information needs of the users as primary, assuming that the users of the catalog have specific purposes in mind for the catalog. Democratic indexing goes further in assuming the fragmentation of reality, assuming that there isn’t consensus in meaning among users, except possibly in aggregate.

For indexing practice, which of these methods should be considered is tied to the collection. The more focused a collection is on particular users or domains, the more tied to requests or domain analysis the indexing can be. The more general the collection, the more likely it is to need an assertion that there is a single objective view of reality (document-centered) or that there isn’t a meaningful consensus (social indexing). The decision between document-centered and social indexing may be more usefully made on other grounds, like whether the collection is developed (like a library or a set of resources chosen for a purpose) or arbitrary (like pages taken off the web).

bibliography

  • Fidel, R. (1994). User-Centered Indexing. Journal of the American Society for Information Science, 45(8), 572-576.
  • Guy, M., & Tonkin, E. (January 2006). Folksonomies: Tidying up Tags? D-Lib Magazine, 12(1).
  • Hjørland, B. (1997). Chapter 2: Subject Searching and Subject Representation Data. In Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science (pp. 11-37). Westport, CT: Greenwood.
  • ISO. (1985). 5963: Documentation — Methods for Examining Documents, Determining their Subjects and Selecting Indexing Terms. (No. ISO 5963-1985): International Organization for Standardization.
  • Mai, J.-E. (2001). Semiotics and Indexing: An Analysis of the Subject Indexing Process. Journal of Documentation, 57(5), 591-622.
  • Mai, J.-E. (2005). Analysis in Indexing: Document and Domain Centered Approaches. Information Processing and Management, 41(3), 599-611.
  • Swift, D. F., Winn, V. A., & Bramer, D. A. (1979). A Sociological Approach to the Design of Information Systems. Journal of the American Society for Information Science, 30, 215-223.


ISO 5963 (ISO, 1985) states that indexing ‘extracts concepts from documents by a process of intellectual analysis.’ Any sort of intellectual analysis involves context, at least the context the indexer personally operates in. Hjørland (1992) distinguishes between ‘content-oriented indexing’ and ‘request-oriented indexing.’ This division also describes what additional contexts are important to the indexing process.

‘Request-oriented indexing’ means indexing documents in a collection in relation to the requests that will be made against that collection, using the terminology of whatever field is the object of study. Request-oriented indexing has a high return on investment when the purpose to which the collection is to be put and the user population are well known: both the context of the collection and the user population can be used to develop indexing, both for particular documents and for the overall collection policy. Using large amounts of context in this manner for subject indexing would seem to be the clear winner, but it fails to work generally.

For request-oriented indexing to work well, several suppositions must hold. First, the purpose of the collection must remain stable over time. This may fail for several reasons, such as the gradual evolution of a field or members of a different field using the collection (ISO, 1985). Second, the terminology and object of study of a field must remain constant over the lifespan of a collection; but most fields change over time, and it is generally hard to know when creating something how long it will last. Third, there must be a single useful context shared among all the users of a document that makes retrieval in the index system possible. Although it would be possible to index documents within many possible contexts, this becomes an exponential problem as the number of contexts and documents grows (Hjørland, 1992).

Subject indexing is a representation of a document; the purpose for that representation is finding and using that document at some future time – ‘search for items with potential’ for various purposes (problem solving, meeting information needs, gaining general subject understanding) in Hjørland’s formulation (Hjørland, 1997). If the document representations aren’t effective, the searching based on the representation will be of low quality (Mai, 2000). Considering searching as the goal, subject indexing can be analyzed in terms of searching’s metrics, recall and precision. When the indexer is able to focus specifically on analyzing the user’s domain, the user can have high precision and recall. However, as the indexer’s analysis of the documents in a collection moves out of alignment with the user’s context, either precision or recall will drop. Recall drops when the indexer chooses terms that are different than terms the user chooses for searching on a particular subject. Precision drops when the user and indexer categorize the subjects differently.
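As a small illustration of how a vocabulary mismatch shows up in these metrics, here is a sketch with a made-up four-document index. The document ids, index terms, and relevance judgments are all invented:

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical subject index: document id -> terms the indexer assigned.
index = {
    "d1": {"avian influenza"},
    "d2": {"bird flu"},  # same subject, different term
    "d3": {"avian influenza", "poultry"},
    "d4": {"poultry"},
}

# The user wants everything on the subject and searches with their own term.
relevant = {"d1", "d2", "d3"}
retrieved = {doc for doc, terms in index.items() if "avian influenza" in terms}
precision, recall = precision_recall(retrieved, relevant)
```

Here everything retrieved is relevant, so precision stays at 1.0, but the document indexed under ‘bird flu’ is missed, so recall falls to two thirds.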

What determines the right amount of context to use? With a more general collection, the domain will be less specific and the collection less purposively unified, so the indexer must make less use of context, leaning more towards a document-centered approach to serve the searches of a broader audience. More specific collections, however, can use a more domain-oriented approach, with indexing more closely aligned to the context that the documents exist in, to better serve their specific users.

Bibliography

Hjørland, B. (1992). The Concept of “Subject” in Information Science. Journal of Documentation, 48(2), 172-200.

Hjørland, B. (1997). Chapter 2: Subject Searching and Subject Representation Data. In Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science (pp. 11-37). Westport, CT: Greenwood.

ISO. (1985). Documentation — Methods for Examining Documents, Determining their Subjects and Selecting Indexing Terms. (No. ISO 5963-1985): International Organization for Standardization.

Mai, J.-E. (2000). Deconstructing the Indexing Process. Advances in Librarianship, 23, 269-298.

Malcolm Gladwell writes in the New Yorker about how generalizations should be used to make categories. It’s important to be able to make accurate generalizations when one is building categories, he claims, because for real-world things like detecting terrorists or identifying dangerous dogs, there are real-world consequences.

I found this interesting; it reminded me a lot of the book Women, Fire, and Dangerous Things by George Lakoff. In it, Lakoff writes about prototype categories, where category membership is a fuzzy concept. Explicit rules for category membership, like ‘terrorists buy one way tickets,’ ‘pit bulls are dangerous dogs,’ and other concrete rules like that are less useful in establishing membership in categories than general rules like ‘is there something suspicious about this person?’ or ‘does this dog (or the dog’s owner, apparently more useful) have a history of violent behavior?’

The key, apparently, is that some generalizations are stable and some are unstable. Unstable generalizations are things like ‘drug smugglers buy one way tickets’ and ‘pit bulls are dangerous dogs.’ Drug smugglers can change their behavior, and it used to be other dogs that were the dangerous ones. (It turns out, according to Gladwell, that the most dangerous kind of dog is the kind that people buy to seem dangerous themselves, and this has varied from era to era.) So, the key to building a category is to figure out which generalizations are stable and which are unstable. Gladwell gives examples ranging from terrorists, to NYC subway searches, to dogs.
Anyway, I thought it made for interesting reading. I recommend it.

The Intellectual Foundation of Information Organization (Digital Libraries and Electronic Publishing) is somewhat heavy going, but it’s the definitive work about a lot of areas in information organization. A lot of people encounter information organization issues professionally in the technical fields, but many of these issues have been around for ages, appearing in business and libraries for a long time.

This book has a huge amount of information in a very small amount of space, and can be somewhat heavy going. It has something to say about almost every issue having to do with organizing information. I highly advise reading it for anyone in an MLIS/MSIM program; I went and looked this book up today for someone whose class wasn’t reading it for some reason. To use this book, an Information Architect or other web designer needs to be able to think in basic principles about what they’re doing: looking for information about controlled vocabularies instead of whatever the latest buzzword is. However, the payoff from having done so is high due to the clarity of the information presented.

The writing in this book is in the ‘little red schoolhouse’ academic style from the University of Chicago. I found it very easily digestible and understandable, and it had a profound effect on how I thought about information organization; I credit doing very well in my classes on the subject and being able to speak intelligibly on the subject outside of class to having started out by reading this book. I recommend it highly.

So, just as a helpful hint for people doing Taxonomy type stuff, or whatever controlled vocabulary: the difference between pre-coordinate and post-coordinate terms is pretty obvious when you can actually remember it, but here’s a reminder:

Precoordinate: concepts are combined into terms before the thesaurus is created.
Postcoordinate: concepts are combined into terms after the thesaurus is created, i.e., usually at time of use.
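The two definitions above can be sketched with a pair of toy indexes. The terms and document ids are invented for illustration:

```python
# Pre-coordinate: the compound concept is a single term, fixed when the
# vocabulary is built.
precoordinate_index = {
    "bird flu vaccines": {"doc1", "doc2"},
}

# Post-coordinate: the vocabulary holds single concepts, and the searcher
# combines them at time of use.
postcoordinate_index = {
    "bird flu": {"doc1", "doc2", "doc3"},
    "vaccines": {"doc1", "doc2", "doc4"},
}

def lookup_precoordinated(term, index=precoordinate_index):
    """Look up one ready-made compound term."""
    return index.get(term, set())

def search_postcoordinated(terms, index=postcoordinate_index):
    """Combine single-concept terms with a boolean AND at search time."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

pre_result = lookup_precoordinated("bird flu vaccines")
post_result = search_postcoordinated(["bird flu", "vaccines"])
```

Both routes reach the same two documents here; the difference is whether the combination was made by whoever built the vocabulary or by the searcher at the moment of the query.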

I know that someday someone who needs this information will find it searching the web, and that makes me (relatively) happy.

A lot of writing in general, and a lot of web pages specifically, concerns some topic. That writing is about a topic. There are a bunch of different ways of figuring out what something is about, and many of these are hilariously wrong. But that isn’t what this post is about. This post is about ‘aboutness assertions,’ which is how you say what things are about once you’ve decided that something is about something.

Sounds confusing? It gets worse but more interesting…