infosci


This is the monthly get-together for the organization whose regional chapter I serve as secretary/treasurer. If this sort of thing interests you, c’mon by.

Once a month, we get together to have drinks, chat, network, and geek out with fellow information architects, librarians, usability experts, user experience designers, and other like-minded user-centered professionals and students. It’s open to anyone, so bring a friend — especially those in other local organizations! The format will be casual, but all are encouraged to bring something to discuss — recent work, an interesting topic, or even your resume. This event is organized by the Pacific Northwest chapter of the American Society for Information Science & Technology.

What: Seattle Monthly Information Architecture Meetup
http://ia.meetup.com/57
Where: Elysian Pub, 1221 E. Pike St., Seattle, WA
When: 7-10pm, May 13th (2nd Tuesday of every month)

This meetup and the next several are most likely when students will show up in droves, so if you’re looking to hire new grads in the information professions, it’s a good bet.


This last week was the InfoCamp 2008 kick off meeting. After a successful 2007 event, we’ve decided to do it again and to expand it further. This year we have Aaron, Kristen, Andy, Rachel and myself back again, and we’re also joined by Genevieve, a librarian from PLU, and Joshua, a student at the UW Information School.

Aaron and Kristen graciously cooked food for the lot of us, and we had a great initial planning meeting in which we identified roles and people responsible for roles, and then talked for a bit about the future of the ASIS&T PNW. I’m very much looking forward to doing another InfoCamp with this team; it should be a lot of fun.

We’re looking to have a much easier time this year since we have the experience of doing the conference last year, and are also starting much earlier in the year with our planning. It will continue to be an unconference serving (primarily) the PNW Information Science community.


InfoCamp 2007 was a great success, and I am really happy about the way that it went. There were roughly 50 sessions over 2 days, and roughly 85 participants. It was a great relief. Everyone seemed hopeful that there’d be another one next year, and there wasn’t too much negative feedback, which is about as high praise as you can expect.

My session was called “Thesaurus, Ontology, and Inference” and was about the benefits you get from having a minimal amount of semantic data associated with documents — mostly exposing metadata that’s already present in the system and some things you can do with it. I think my presentation confused a bunch of people because I had to cut so much information out (my session was compressed from 60 minutes to 10 for various reasons), but we’ll see how it plays in the long run. I think there’s a lot of difference in the assumptions that I make based on previous work experience and what the IAs/IxDs in the audience have from their work experience. At any rate, got some decent feedback for refining the presentation.

I made it to about half the sessions I wanted; keeping folks moving in the right direction was a time-consuming project. Low time between interrupts, but a lot of good fun. It was especially fun talking to an audience and getting people to introduce their sessions, and I also had a lot of valuable hallway conversations with people. One particular thing I found was a good venue for publishing professional work other than the one I have already, and the fact that there’s a difference in focus between the two helps, so what isn’t wanted by one may be by the other.

All in all, I really enjoyed being one of five people running this conference. It was a great time with a great group of people… I look forward to this community growing.

For the last while, I’ve been working on a project that involves scanning large numbers of RSS/Atom feeds, and then using Bayesian[1] classifiers to sort each item into one of a number of categories for summarization and display (the system that I’m using to do this is available as a sample website, but really needs more data in the training sets before it’s ready to entertain all of you). The categories are pretty straightforward, and they fit into a somewhat neat controlled vocabulary (ontology/thesaurus/whatever).

There’s a relation, though, between the different terms in this sort of classification and the training data used to build the Bayesian classifier. If the terms are arranged in a hierarchy (and certain assumptions are made about that hierarchy, like subterms encompassing part of the range of meaning of their parent term and nothing else)[2], then the training data used for classifying terms can be shared.

For example, all positive training data that belongs to the child terms can also be used for the parent. So, for (a constructed) example, positive training data for tamiflu also belongs in the positive data for bird flu vaccines. The reverse is true of negative training data. For negative data, the negative data for the parent can also be used for the child terms.
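Here is a minimal sketch of that sharing rule in Python. The three-term hierarchy, the document ids, and the helper names are all made up for illustration; this is not the code from the project, and it doesn't touch Classifier4J at all. It only shows positives flowing up to parents and negatives flowing down to children.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: child term -> parent term (None marks a root).
PARENT = {
    "bird flu": None,
    "bird flu vaccines": "bird flu",
    "tamiflu": "bird flu vaccines",
}

def ancestors(term):
    """Yield every ancestor of `term`, nearest first."""
    parent = PARENT[term]
    while parent is not None:
        yield parent
        parent = PARENT[parent]

def descendants(term):
    """Yield every term below `term` in the hierarchy."""
    for child, parent in PARENT.items():
        if parent == term:
            yield child
            yield from descendants(child)

def share_training_data(positive, negative):
    """Copy positives up to ancestors and negatives down to descendants."""
    shared_pos = defaultdict(set)
    shared_neg = defaultdict(set)
    for term in PARENT:
        shared_pos[term] |= positive.get(term, set())
        shared_neg[term] |= negative.get(term, set())
    for term, docs in positive.items():
        for anc in ancestors(term):
            shared_pos[anc] |= docs
    for term, docs in negative.items():
        for desc in descendants(term):
            shared_neg[desc] |= docs
    return dict(shared_pos), dict(shared_neg)

# Made-up training sets keyed by term.
positive = {"tamiflu": {"doc1"}, "bird flu vaccines": {"doc2"}}
negative = {"bird flu": {"doc9"}}
pos, neg = share_training_data(positive, negative)
print(sorted(pos["bird flu vaccines"]))  # includes doc1, inherited from tamiflu
print(sorted(neg["tamiflu"]))            # includes doc9, inherited from bird flu
```

The enlarged sets would then be fed to whatever classifier you train per term. The sharing only holds under the hierarchy assumptions described above.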

This is highly useful information when you’re making a large-scale text classifier (and having it classify texts as belonging to categories or not, as opposed to just clustering texts into the categories that actually appear). It’s easier to use things like Bayesian classifiers to do this if you’re looking for somewhat fine-grained detail.

Currently, I’ve been using Classifier4J for doing the classification and text summarization[3]. The text summarization is sort of annoying, though, because it’s based on a simple statistical choice of sentences which occasionally picks up date-lines and partial phrases because of what’s ‘important.’ I’m resisting the urge to go completely POS-tagging nuts on the whole thing and only select sentences of certain types or completeness because this is, after all, a side project. (The number of times I see things like ‘this sentence no verb.’ is astounding, though, and slowly driving me nuts.)

So, another day in the life.

[1] although i’m also using a vector space classifier for a related, larger project and it’s driving me less nuts training it.
[2] this is called a meronymous (‘part-of’) relationship, and given that half the people who regularly read this blog were in LIS530 or its equivalent at some point, you should remember this.
[3] and will probably eventually switch to jBNC http://jbnc.sourceforge.net/ before i go nuts

This is a social night for the local chapter of the American Society for Information Science and Technology. You should come by and check it out if that’s the sort of thing you find interesting. I’m chapter secretary for this year, and we’re doing all sorts of neat stuff that we’ll tell you about at the meeting or post on our site as it becomes relevant.



Join us for some good company and geeky conversation next Thursday (5/10) at the Elysian Pub in Capitol Hill!

What: Seattle Monthly Meet-up, organized by the Pacific Northwest chapter of the American Society for Information Science & Technology.
http://asistpnw.org
Where: Elysian Pub, 1221 E. Pike St., Seattle, WA
When: 7-10pm, 2nd Thursday of every month

Once a month, we’ll get together to have drinks, chat, network, and geek out with fellow information architects, librarians, usability experts, user experience designers, and other like-minded people. It’s open to anyone, so bring a friend — especially those in other local organizations! The format will be casual, but all are encouraged to bring something to discuss — recent work, an interesting topic, or even your resume.

See you there!
-Aaron

The first meeting (that I attended) concerning this fall’s ASIST PNW chapter meeting was held yesterday. Aaron Louie (chapter president) and I met with the UW iSchool student chapter president and vice-president. A year ago, I was just turning over the control of the student chapter to the next year’s president, and these people are one down the line from that (the officers are typically 2nd year students in one of the Masters programs at the iSchool.) Now, the students seem pretty young to me, but in some ways they did at the time as well.

So, I’m not going to go into details yet — that would be jumping the gun, but I think that we’ve got something very exciting lined up. We’d talked previously about how we can best revitalize the chapter, which had been faltering somewhat in recent years. I think we’re doing a pretty good job so far — the ‘Information People Get-Together’ that we’ve had the last two months at the Elysian has been going well, and the way we’re planning to run this conference (unconference style, with ask later sessions similar to Ignite Seattle’s) will be fun, exciting, and informative for folks. We’re on target for our goals, consistent in message, and serving our identified audience.

About that last sentence. (One of) The real benefit(s) from being at the iSchool for me was getting more in line with the User Centered view of the universe. Before the school, I had been largely feature/product/use case oriented (largely as a result of many years of dev background with light project management), and I think the iSchool helped better my sense of the overall context — both social and technical — in which systems exist and are created.

The last several months have been integrative of all the different things I’ve learned in different periods of my life. Someone remarked to me the other day (@ the ASIST Info Social Hour) that I sound like a consultant, but it sounds to me like I’ve integrated all the different things I know.

Enterprise Content Management (ECM) Team Blog : Taxonomy/Tagging Starter Kit for SharePoint Server, also at the Sharepoint blog

Microsoft has made a kit available for SharePoint that makes it easier to set up taxonomy and tagging. The tagging allows authors to tag items and also to have controlled vocabularies on particular multi-valued properties. Users can incorporate the controlled vocabularies into searches and also search by tags.

In the default configuration, users cannot tag items on the fly (although I suspect that they could change taxonomy values if they have permissions.)

I used to work (engineering) at an ECM company, so using the phrase ‘controlled vocabulary’ in place of taxonomy for this is somewhat second nature. Since I took a lot of classification classes at the Information School, it’s interesting to see how companies implement these concepts. It could be interesting if these features became widely available in SharePoint.


In ISO 5963-1985, sections 1.3 and 1.4 define the scope of the document in a couple of important ways. Mostly, they say that it’s designed to help indexers index documents in ways that are helpful for users. It does this by providing a consistent set of guidelines for analysis, which indexers use to promote useful indexes inside organizations and between organizations that exchange indexes. (A note from before: this document specifically deals with humans doing indexing, and not algorithmic indexing done by computers.)

So, the goal of the document is consistent subject indexing through making a guide to the document analysis and concept identification stages of the indexing process. To what degree is consistency possible, likely, or desirable?

Certainly some consistency is useful. As users of this index are (presumably) part of a domain or community of practice that actually exists out in the world somewhere, they probably have a common shared vocabulary that they use to describe things and the index should reflect that as much as possible. But, if you interchange between different groups of people, they will probably have different vocabularies for the same (or similar) things, and parts of documents analyzed may be more or less important, causing the subject of the document to change (with regard to the other groups.)

Depending on the type and scope of documents and the vocabulary used, indexer consistency may be quite low. In some ways, this mirrors the usual problems of recall versus precision when trying to retrieve information from a system. If the indexer has a relatively small set of terms that they’re choosing from, or is only trying to cover things in the broadest of terms, then it is easier to come up with common terms than if they’re creating their own terms on the fly or are trying to be very specific.
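To make “indexer consistency” a bit more concrete, here’s a small Python sketch that scores two indexers’ term sets for the same document by their overlap. The overlap measure (shared terms divided by all terms used) is my own choice for illustration and is not something ISO 5963 prescribes. The term lists are invented.

```python
def consistency(terms_a, terms_b):
    """Shared terms divided by all terms used by either indexer (Jaccard overlap)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term sets agree trivially
    return len(a & b) / len(a | b)

# Hypothetical assignments for the same document by two indexers.
# Broad terms drawn from a small shared list:
broad_a = ["medicine", "influenza"]
broad_b = ["medicine", "influenza"]
# Very specific terms, partly coined on the fly:
specific_a = ["avian influenza", "antiviral stockpiling", "H5N1"]
specific_b = ["bird flu", "antiviral drugs", "H5N1"]

print(consistency(broad_a, broad_b))
print(consistency(specific_a, specific_b))
```

With these made-up lists the broad terms agree completely, while the specific ones share only one term out of five, which is the effect described above.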

It isn’t clear, however, that lack of consistency between indexers is actually a bad thing. As long as the users are well-supported in their searches, which is the point of this exercise, why does it matter if the results are nonstandard? In section 1.4, they indicate that they’re specifically trying to standardize practice rather than results.

This entry is part of my ongoing blog entry series on specs and standards, done largely so I have reference to my thoughts later on. It’s also put up with the expectation that it will be helpful to other people. You’re welcome to comment on this, and I may make new versions of this document later that incorporate other remarks or just reflect my changing understanding. One particular note is that I won’t send you a copy of this spec — you should buy it or get it from your local (university) library.

I went to the gaming session of Seattle dorkbot last night. I went to two of the sessions, and then spent the third in the bar. The third session was a real yawner (correctly identified as such by Ario), which made me sad, as I’m very interested in the topic of “Games for Social Change.” Here are my notes from the thing, as they might be of interest to readers. I mainly took notes on Jordan Weisman’s session on Alternate Reality Games.

To some extent, it was a marketingish presentation, although as you can see JW is particularly rife with geek cred.

So, the notes:

ARGs tell stories interactively. The premise began for them with the Kubrick/Spielberg movie AI: they’d been licensed to make a number of games based on the product, but the movie wasn’t particularly given to making games. Instead of making games based on the movie itself, they based them on the universe that the movie took place in.

Their question was, how to tell that story. But they came up with an idea based on the narrative structure organic to the web.

I’m not sure I didn’t replace Jordan’s point with one of my own here: Different models of disseminating information have different methods of telling stories that are organic to them; bards, epic poetry (e.g. Iliad, Odyssey); books, novels; television, sitcoms; movies, summer blockbusters; What is the native activity of the internet at time circa now? JW says looking through a ton of crap looking for relevant partial pieces of information.

What if one were to tell stories through scattered shards of information? Deconstruct a narrative, create all the evidence that the story had taken place, and then hide the evidence and throw away the story.

What is the device on which this story will be told? The ‘media sphere,’ which JW describes as ‘all devices with electricity and some without,’ but I think it’s easier to say all information-bearing objects, here, which is a metaphor from InfoSci that is similar in scope.

This is essentially a community effort: the people who take part in the exploration form a ‘hivemind’ in response to finding shards, and tell stories to each other. The story goes from being the original narrative to being a consensus narrative that comes from the audience’s experience.

The community effect produced is that the hivemind has essentially every skill on the planet, and it can go everywhere and do anything. It is also, by that same factor, smarter than the people writing the game.

Sample: Ilovebees

  • Use life as a game board — it took place all over the world.
  • radio drama told on payphones — fragments of the story were released as people talked
  • Name of game… campaign for prerelease of Halo 2

This is, in its essence, pop culture hacking: it’s about the audience creating fiction and inseminating your references into their everyday consciousness. However, this is against the everyday experience of marketing staff, who want to put up as much collateral as possible and advertise its existence as widely as possible to get as many people to notice as possible. But that turns out not to work well for getting people to want to experience this; what you want to do is draw people down the rabbit hole.

How to get audience in? Spend time creating content, not telling them about it.
Allow communication about shards of content to draw people in… People will start looking with a few small clues.

highlights of their work
All of their big campaigns have led to marriages, because players collaborate and share rather than compete: story drives communities, competition drives individualism. This is, to a large extent, their goal: the building of a temporary community, possibly tied to awareness of some product or service that a client has them make the game for. It’s an interesting balance between entertainment, advertising, and ‘using the real world as the gameboard.’

William Gibson’s Pattern Recognition was a tip of the hat to Ilovebees. I’m wondering if WG’s perception of ARGs makes this a must-read for anyone interested in ARGs. I’m probably going to pick up the book in the next couple of weeks to find out. If anyone has any opinions on that, please feel free to let me know via email or comment.

One thing that JW mentioned at the end of his talk, and I suspect that this was a deliberate seed effort of his, was to say that if you were in front of the Bellagio during CES on 1/6/2007, you might see something interesting in the fountains. Anyway, in the spirit of thanking him for coming to the event, I thought I’d pass this on.



Narrativization is a process by which we help provide the context that takes things from being a mere series of events to being story and history. However, this narrativization may not be limited to effects strictly within the text. It may, in fact, function as a version of hypnosis, according to Scott Adams, when it works in appeals to different senses and sense perceptions.

Certainly, there are some pretty good arguments for the unconscious mind affecting the conscious, but how far can you really take this sort of thing?

Neal Stephenson, who has used Bicameralism as a plot device, was the first person who came to mind, so I flipped through Cryptonomicon looking for stuff that met Scott Adams’s description of the techniques. A bunch of the lengthy digressions that sort of litter Cryptonomicon are full of the sort of appeal to the senses that Adams describes, which makes me wonder whether it was a deliberate technique for manipulation or just an accident of style. Some apparently would claim that Tolkien did something similar.

It makes me wonder how much of all this is tied to the overall topic of framing the message, though.


One of the better Google mashups I’ve seen recently. It remixes information available from a variety of epidemiology sources with Google’s now ubiquitous mapping program.

Get Your Daily Plague Forecast:
A new website mashes health data with Google Maps to track global disease outbreaks. By Seán Captain.

It’s an interesting mashup because epidemiology information is very hard to assemble into a coherent picture, being based (as it is) on data about people in particular locations and suchlike. Reports linked to maps are probably clearer than the old style of agglomerated reports, especially given that you’re talking about locations on a global scale.

the session today has a lot of stuff based on the concept of “the distance between two arbitrary faces of a hyperdimensional cube,” where they actually mean more like hyperdimensional rectangles from what i can tell. There’s apparently some benefit to this, but I like to think of it as ‘rocking out on the hypercubes.’

because the hypercubes, they rock out. Aside from that, it’s straight up XML Element Retrieval. Feel free to observe the exhibits carefully, don’t touch the points of the <, they’re quite sharp. Please take extreme care not to become entangled in the forest of literal references, the & are quite difficult to detach from one’s clothing once they become caught.

The tendency to use Σ and Π as iteration operators has made for crazy space madness algorithm writing on the slides over the last couple of days. I’m going to be going back and looking at a lot of the math that I’ve forgotten over the last couple of years. Intense. Learned a lot. It’s about over now, a couple hours are left.

I’ve learned a lot at this, and gotten a couple new ideas about things I should be working on learning. Fun. Now, on to the Semantic Grail meeting tonight.


Adapting Ranking SVM to Document Retrieval
Yunbo Cao, Microsoft Research Asia
Jun Xu, Nankai University
Tie-Yan Liu, Hang Li, Microsoft Research Asia
Yalou Huang, Nankai University
Hsiao-Wuen Hon, Microsoft Research Asia

traditional: tf, idf, document length
currently: page rank, structural features of document, others
future: ??? Ranking SVM ???

Check it out, they gave a similar talk on Ranking SVM at the WWW conference. The talk itself was hard for me to understand due to speaker ESL issues, so your guess is as good as mine.
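For reference, the “traditional” features in the list above (tf, idf, document length) fit in a few lines of Python. This is a bare-bones sketch over a made-up three-document collection with a simple length-normalized tf × idf sum. It is not the scoring used in the paper, and it has nothing to do with the SVM part.

```python
import math
from collections import Counter

# Hypothetical three-document collection, already tokenized.
docs = {
    "d1": "ranking svm for document retrieval".split(),
    "d2": "document length and term frequency".split(),
    "d3": "page rank uses link structure".split(),
}

def idf(term):
    """Inverse document frequency: rarer terms get a larger weight."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def score(query, doc_id):
    """Sum of tf * idf over query terms, with tf divided by document length."""
    tokens = docs[doc_id]
    counts = Counter(tokens)
    return sum((counts[t] / len(tokens)) * idf(t) for t in query)

query = ["document", "retrieval"]
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranked)
```

Here d1 matches both query terms and ranks first, d2 matches only the more common term, and d3 matches neither.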


Document modeling is important to any IR approach — the bag of words approach assumes word independence, and this is simple, but inappropriate to natural language. There have been a bunch of approaches to this sort of thing in the past, but here’s a relatively new one that does well versus various TREC collections.

Here’s a link to the paper: LDA-based Document Models for Ad-hoc Retrieval.

The presentation was largely a crawl of the paper section by section, and I’m going to emulate that approach by just referring to the paper so you can have that experience.

However: it beats previous models because it maps { document vs. topic } for all topics and documents, as opposed to the cluster approaches, for example, which largely assume that each document belongs to one cluster or, in many practical approaches, to whatever cluster it matches best. Because documents belong to n topics with probability p(d[i], n), this is better than searching against bag of words models.
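A rough Python sketch of the idea, not the paper’s exact formula: a word’s probability in a document is a mix of the plain word-count estimate and a topic-based estimate, sum over topics of P(word|topic) × P(topic|document). The distributions below are made up by hand (a real system would get them by fitting LDA to the collection), and the mixing weight `lam` is an arbitrary assumption.

```python
import math

# Hypothetical, hand-made distributions standing in for a fitted LDA model.
p_word_given_topic = {
    "flu":    {"virus": 0.5, "vaccine": 0.4, "index": 0.1},
    "search": {"virus": 0.1, "vaccine": 0.1, "index": 0.8},
}
p_topic_given_doc = {
    "d1": {"flu": 0.9, "search": 0.1},
    "d2": {"flu": 0.2, "search": 0.8},
}
# Maximum-likelihood word probabilities from the raw text of each document.
p_word_given_doc_ml = {
    "d1": {"virus": 1.0},
    "d2": {"index": 1.0},
}

def p_word_topics(word, doc):
    """P(w|d) through topics: sum over z of P(w|z) * P(z|d)."""
    return sum(p_word_given_topic[z].get(word, 0.0) * p_z
               for z, p_z in p_topic_given_doc[doc].items())

def p_word(word, doc, lam=0.7):
    """Interpolate the word-based estimate with the topic-based one."""
    ml = p_word_given_doc_ml[doc].get(word, 0.0)
    return lam * ml + (1 - lam) * p_word_topics(word, doc)

def query_log_likelihood(query, doc):
    """Log-probability of the query words under the mixed document model."""
    return sum(math.log(p_word(w, doc)) for w in query)

query = ["vaccine"]
best = max(p_topic_given_doc, key=lambda d: query_log_likelihood(query, d))
print(best)
```

Neither toy document contains the word “vaccine,” so a pure bag of words model can’t tell them apart. The topic mix can, because d1 is mostly about the flu topic.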

All papers in this section are pretty oriented towards the whole ‘topic searching autogenerated’ is better than word-based. See the papers in question for the differentiators, as a lot of it is math that I’m not going to break out the LaTeX for on the fly. I will also note that most presentations in this area are pretty high on the UMLS fetishism.


Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR
Xiaohua Zhou, Xiaohua Hu, Xiaodan Zhang, Xia Lin, Il-Yeol Song

It is possible to disambiguate homonyms in a probabilistic manner by using Topic Signatures, which let you identify which of the topics a questionable homonym actually belongs to. Using Topic Signatures is also more effective for finding documents than the ‘bag of words’ model.

So “‘terms’ -> ‘Topics’ -> ‘find documents for topic’” is more effective for both precision and relevance than “‘terms’ -> ‘find documents for terms.’” Doing this topic model is called ‘smoothing’ or ‘semantic smoothing.’

My reflection on this is that it’s a lot like using an automatically built controlled vocabulary and mapping both documents and terms to it algorithmically. Strangely, this presentation reminds me a lot of a math-intensive version of Jens-Erik’s classes on Indexing, but, I suspect, only if you’ve already heard JEM talking and have that context.

It works better than WordNet (according to a person who asked a question), because it uses math to eliminate ambiguity of meaning.

Anyway: I plan to read the rest of their stuff. It looks interesting. It would be interesting to see what sort of ontologies (InfoSci sense) can work with it. However, it has nothing to do with Genomic IR that I can see, other than that’s probably the non-described domain they’re using.


I will be at SIGIR 2006 this Monday through Wednesday on the University of Washington campus. Although I’m going to be doing a lot of networking and attending conference sessions, this would be a pleasant time for lunch and hanging out if you happen to be around the campus/UD. SIGIR is the ACM special interest group on information retrieval, and is interesting and fun if you find that sort of thing interesting and fun. At any rate, I hope to learn a lot. I’ll be around as a conference volunteer on Thursday as well (they were low on volunteers and I wasn’t doing anything in particular that day).

Indexing schemes are largely similar in their outputs – the right answers to many questions, such as ‘controlled vs. non-controlled,’ how many terms to use, and how expressive or deep the vocabulary should be, are all matters of indexing policy considered on a per-collection basis and more or less well-defined in the specification (ISO, 1985). The biggest difference between indexing methods is how objectively or subjectively they view the world, and what they take the right approach to dealing with that to be.

The most subjective approach to indexing is ‘social indexing,’ which makes no assumptions about underlying reality beneath the words it uses for indexing, and in fact many of the original proponents of it as a method felt that this was a benefit, because each user may mean something different by tags (Guy & Tonkin, January 2006). Folksonomy proponents are backing down from this extreme approach over time, but denying a possible shared reality is as far towards subjective as possible.

A large distinction is between document-oriented and request-oriented indexing, with document-oriented indexing using the properties of the document to describe the document and request-oriented indexing considering what the requests of the user community are likely to be (Fidel, 1994). Another formulation of request-oriented indexing is tying indexing to supporting the information-seeking behavior of the users; indexing in this sense is about making it possible for users to find documents that satisfy their information needs (Hjørland, 1997).

Document-oriented indexing and request-oriented indexing are present in all collections to varying extents. In general, the more general the collection (the broader the scope of the collection and of the population the collection serves), the more oriented toward the properties of the document itself the indexer has to be. This is because the broader the scope of the collection, the less adapted the indexing can be to particular topics. Both the depth of indexing and specificity of meaning of terms have to be limited in a more general collection (Hjørland, 1997). For a collection more limited in scope, more attention can be paid to supporting the sorts of requests that will be coming in from a user community (Fidel, 1994; Swift, Winn, & Bramer, 1979). Document-oriented indexing is more objective than request-oriented indexing, because it presupposes (or at least pretends) that there is a single reality that the documents are being described in relation to. Request-oriented indexing is more subjective, because it admits the concept of communities of practice (or discourse communities) with their own conceptions of the world (Hjørland, 1997; Mai, 2001).

The process of indexing is in fact more subjective than this, because the subjectivity of the indexer must be taken into account – there is a series of interpretations that the indexer makes in moving from text to a representation in the subject index(es) (Mai, 2001). The indexers’ interpretation can be informed by the domain, which is called domain-centered indexing. When analyzing a document, domain-centered indexing starts with an analysis of the domain and then moves on to the users’ needs (and indexer interpretation) and then the role of the document with regard to these (Mai, 2005).

Indexing approaches vary in a number of ways, but one of the most important is how subjective their view of the world is, and how granular that subjectivity is. Document-centered indexing is least granular (most objective), with a single representation claiming to serve all needs; after that is domain indexing, which considers documents in light of their role in a domain (although other information is taken into account). User (or request)-oriented indexing considers the information needs of the users as primary, assuming that the users of the catalog have specific purposes in mind for the catalog. Democratic indexing goes further in assuming the fragmentation of reality, assuming that there isn’t consensus in meaning among users, except possibly in aggregate.

For indexing practice, which of these methods should be considered is tied to the collection. The more focused a collection is on particular users or domains, the more tied to requests or domain analysis the indexing can be. The more general the collection, the more likely it is to need an assertion that there is a single objective view of reality (document-centered) or that there isn’t a meaningful consensus (social indexing) – and the decision between document-centered and social indexing may be more usefully made on other grounds, like whether the collection is developed (like a library or a set of resources chosen for a purpose) or arbitrary (like pages taken off the web).

bibliography

  • Fidel, R. (1994). User-Centered Indexing. Journal of the American Society for Information Science, 45(8), 572-576.
  • Guy, M., & Tonkin, E. (January 2006). Folksonomies: Tidying up Tags? D-Lib Magazine, 12(1).
  • Hjørland, B. (1997). Chapter 2: Subject Searching and Subject Representation Data. In Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science (pp. 11-37). Westport, CT: Greenwood.
  • ISO. (1985). 5963: Documentation — Methods for Examining Documents, Determining their Subjects and Selecting Indexing Terms. (No. ISO 5963-1985): International Organization for Standardization.
  • Mai, J.-E. (2001). Semiotics and Indexing: An Analysis of the Subject Indexing Process. Journal of Documentation, 57(5), 591-622.
  • Mai, J.-E. (2005). Analysis in Indexing: Document and Domain Centered Approaches. Information Processing and Management, 41(3), 599-611.
  • Swift, D. F., Winn, V. A., & Bramer, D. A. (1979). A Sociological Approach to the Design of Information Systems. Journal of the American Society for Information Science, 30, 215-223.


Section 1.2 of ISO 5963 just discusses the relevance of the document to the community that might want to use it. It’s common for standards documents to have this sort of information in them. A lot of them specify the duration of the standard, what standards they replace or augment, and how generally applicable the standards are. This is important information when you make a standard, because it’s important to say what the standard is, who it is for, and how long it is in effect.

However, I’ve always found the wording ‘can be employed by any agency by which human indexers…’ to be pretty funny, because ‘agency’ can also mean ‘means.’ So, it’s also a play on words, although I doubt that that was their intention. Personally, I am subject to compulsive misprisions, which is what that sort of misreading is technically called.

Being prone to verbal misunderstanding at a somewhat compulsive level makes doing the indexing thing fun, probably more fun than it should be.


I thought I would post a section by section discussion of ISO5963, which is “Methods for examining documents, determining their subjects, and selecting indexing terms.” I’m going to be doing this for a few different specifications that I worked with either at the iSchool (Information School @ UW (University of Washington)) or in other venues. This may be helpful to other people, but is to some extent a mnemonic for me. One of the ways in which it may not be helpful for other people is that the text of ISO specifications is ©opyrighted, and therefore it isn’t available on the web. (All references to sections are to ISO5963:1985.)

So, to begin…

ISO5963 deals with indexing, which is the process of representing a document by a couple of subject terms. The specific parts of the indexing process it deals with are “examining documents, determining [the documents’] subjects, and selecting appropriate indexing terms.” (1.1) It doesn’t deal with the other parts of the indexing process, which are usually called ‘indexing policy’ and reflect broader issues: what sort of vocabulary to use (thesauri, folksonomies, simple word lists, or what have you), how many terms to use, and so on. There are also other things, like how the system will be presented to the users, that are driven by the underlying system. This document is just about determining the terms to use.

One of the important points about this is that documents in a collection are represented by their subjects in an index. Why would you want to do this? It’s really hard to read all the documents in a collection when you want to find a specific one, or to find a document on a specific topic. To make this faster, the subjects of a document are ‘extracted’ from the document, and these subjects are put into an index. This index is then presented to the user in some form; common forms are hierarchies, alphabetized lists, and search engines that include things like ‘subject’ or ‘topic’ as a field that one can search on.
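To make the idea of a subject index concrete, here is a minimal sketch in Python. It is not anything ISO5963 prescribes: the document ids, the subject terms, and the function name `build_subject_index` are all made up for illustration, and it assumes the subjects have already been chosen by an indexer.

```python
from collections import defaultdict


def build_subject_index(documents):
    """Map each subject term to the sorted list of document ids indexed under it."""
    index = defaultdict(set)
    for doc_id, subjects in documents.items():
        for subject in subjects:
            index[subject].add(doc_id)
    return {subject: sorted(ids) for subject, ids in index.items()}


# Hypothetical collection: ids and subject terms are invented for this example.
documents = {
    "doc1": ["indexing", "thesauri"],
    "doc2": ["indexing", "folksonomies"],
    "doc3": ["search engines"],
}

index = build_subject_index(documents)

# Present the index as an alphabetized list, one of the common forms.
for subject in sorted(index):
    print(subject, "->", ", ".join(index[subject]))
```

Looking up a subject is then a single dictionary access instead of a read through every document, which is the whole point of building the index.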


ISO5963 (ISO, 1985) states indexing ‘extracts concepts from documents by a process of intellectual analysis.’ Any sort of intellectual analysis involves context – at least the context the indexer personally operates in. Hjørland (Hjørland, 1992) distinguishes between ‘content-oriented indexing’ and ‘request-oriented indexing.’ This division also describes what additional contexts are important to the indexing process.

‘Request-oriented indexing’ means indexing documents in a collection in relation to requests that will be made against that collection – using the terminology of whatever field is the object of study. Request-oriented indexing has a high return on investment when the purpose to which the collection will be put and the user population are well known – both the context of the collection and the user population can be used to develop indexing, both for particular documents and for the overall collection policy. Using large amounts of context in this manner for subject indexing would seem to be the clear winner, but it fails to work generally.

For request-oriented indexing to work well, several suppositions must hold. First, the purpose of the collection must remain stable over time. This may fail for several reasons, such as the gradual evolution of a field or members of a different field using the collection (ISO, 1985). Second, the terminology and object of study of a field must remain constant over the lifespan of a collection; but most fields change over time, and it is generally hard to know when creating something how long it will last. Third, there must be a single useful context shared among all the users of a document that makes retrieval in the index system possible. Although it would be possible to index documents within many possible contexts, it becomes an exponential problem as the number of contexts and documents grows (Hjørland, 1992).

Subject indexing is a representation of a document; the purpose for that representation is finding and using that document at some future time – ‘search for items with potential’ for various purposes (problem solving, meeting information needs, gaining general subject understanding) in Hjørland’s formulation (Hjørland, 1997). If the document representations aren’t effective, the searching based on the representation will be of low quality (Mai, 2000). Considering searching as the goal, subject indexing can be analyzed in terms of searching’s metrics, recall and precision. When the indexer is able to focus specifically on analyzing the user’s domain, the user can have high precision and recall. However, as the indexer’s analysis of the documents in a collection moves out of alignment with the user’s context, either precision or recall will drop. Recall drops when the indexer chooses terms that are different than terms the user chooses for searching on a particular subject. Precision drops when the user and indexer categorize the subjects differently.
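For readers who want the two metrics spelled out, here is a small sketch using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document ids and the function name `precision_recall` are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, given sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so the function never divides by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four documents are actually relevant to the user,
# but the indexer's terms only led the search to two of them, plus one miss.
relevant = {"doc1", "doc2", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc9"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up search, two of the three retrieved documents are relevant, and two of the four relevant documents were found.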

What determines the right amount of context to use? With a more general collection, the domain will be less specific and the collection less purposively unified and the indexer must make less use of context, leaning more towards a document-centered approach to serve the searches of a broader audience. More specific collections, however, can use a more domain-oriented approach to indexing more aligned to the context that the documents exist in to better serve their specific users.

Bibliography

Hjørland, B. (1992). The Concept of “Subject” in Information Science. Journal of Documentation, 48(2), 172-200.

Hjørland, B. (1997). Chapter 2: Subject Searching and Subject Representation Data. In Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science (pp. 11-37). Westport, CT: Greenwood.

ISO. (1985). Documentation — Methods for Examining Documents, Determining their Subjects and Selecting Indexing Terms. (No. ISO 5963-1985): International Organization for Standardization.

Mai, J.-E. (2000). Deconstructing the Indexing Process. Advances in Librarianship, 23, 269-298.

Someone asked me to move a post about what I thought of the IA Summit onto a public forum, presumably so they could refer to what I said, but you never know. Here goes…

I was sort of struck by the way that the conference sessions were running, in that lots of people were grabbing onto particular strategies (design patterns, for example) and running with them. It struck me as both a good thing and a bad thing. Good, because it will improve the overall quality of IA; bad, because it seems to be an easy hit for toolmakers. Coming back down, for example, I figured out a way to make tools that generate a company’s commonly used design patterns in a relatively straightforward fashion given a couple of assumptions about their site architecture and database storage (but only assumptions found in LIS 540-543).

What that means is that at some point a tools vendor will come along and make a lot of the things that we’re seeing now available as part of some suite. So, if I have to guess based on previous examples, it means that there will be a large growth in the number of IAs over the next two years, and then it will probably go back to around the number that we have now as things move from art to craft. I dunno, I’m guessing that’s not what I was supposed to take away from the conference. I did learn a lot about a variety of different things that I hadn’t been exposed to much, like design patterns for websites and a lot of things about tagging that I hadn’t considered.

Personally, I think that the most insightful presentation, with the biggest implications for the design of projects whose information organization deliverables have long expected lifecycles, was the presentation by Campbell and Fast, described here: http://www.iasummit.org/2006/conferencedescrip.htm#164 I think that their work had insights into the nature of the lifecycle of websites versus the use of information organization that were pretty useful, and I’m hoping to get more details on it over time.

There are some references to iSchool specific stuff here, LIS54# classes are all databases and information retrieval systems.

Malcolm Gladwell writes in the New Yorker about how generalizations should be used to make categories. It’s important to be able to make accurate generalizations when one is building categories, he claims, because for real-world things like detecting terrorists or identifying dangerous dogs, there are real-world consequences.

I found this interesting; it reminded me a lot of the book Women, Fire, and Dangerous Things by George Lakoff. In it, Lakoff writes about prototype categories, where category membership is a fuzzy concept. Explicit rules for category membership, like ‘terrorists buy one way tickets,’ ‘pit bulls are dangerous dogs,’ and other concrete rules like that are less useful in establishing membership in categories than general rules like ‘is there something suspicious about this person?’ or ‘does this dog (or the dog’s owner, apparently more useful) have a history of violent behavior?’

The key, apparently, is that some generalizations are stable and some are unstable. Unstable generalizations are things like ‘drug smugglers buy one way tickets’ and ‘pit bulls are dangerous dogs.’ Drug smugglers can change their behavior, and it used to be other dogs that were the dangerous ones. (It turns out, according to Gladwell, that the most dangerous kind of dog is the kind that people buy to seem dangerous themselves, and this has varied from era to era.) So, the key to building a category is to figure out which generalizations are stable and which are unstable. Gladwell gives examples ranging from terrorists, to NYC subway searches, to dogs.
Anyway, I thought it made for interesting reading. I recommend it.

on hackfests (cont)

reposted from a private forum by request, we’re talking here about code4lib, but it also applies to other things like Seattle Mind Camp.

About participating in hackfests: I would advise everyone to participate in things like this as an acculturational process if nothing else. Technical conferences (and technical communication generally) work in different ways than what a lot of us see day to day, and it’s better to learn how to participate and communicate in them when the stakes are low than otherwise.

I’ve seen an LIS professor (not from here) go down in flames really hard when trying to communicate with a small group of technical people. The individual was making the point very strongly that users didn’t understand boolean search, when everyone else in the conversation was talking about something else entirely that happened to contain the words ‘user,’ ‘boolean,’ and ‘search.’

(insert statement about ‘training’ programs versus the theoretical aspects of the mlis program that the ischool uses to justify a lot of its skulduggery.)

There’s every reason that the professor could have figured out what was going on, but he/she was clearly communicating in his/her own world, without reference to the context of the people around him/her, and being entirely condescendingly ignorant of the fact that the people around him/her actually knew what they were talking about.

One[1] of the reasons that the Roman Legions got more or less paved directly into their beautifully constructed roads in the 4th C. was that the technical superiority they’d enjoyed over the ‘barbarians’ had faded through years of communication and trade over the Danube. The recent acquisitions of Flickr and Delicious (among other things) indicate that people outside of library science realize Even More Than Before there is immense value in metadata and books like Ambient Findability and Information Architecture indicate that the understanding gap between ‘information professionals’ and ‘people who want to exploit information like it was liquid dinosaur’ is closing sharply.

One plan for dealing with this sort of thing is to go fetal and hope that nothing bad happens to you and that ‘civilization’ wins eventually. In general, gothic cathedrals are very nice and lovely, but the sack of Rome kind of sucked for the Romans. Other ideas might involve meeting people half-way, and admittedly 540 is a retarded way to start down that path, but hey, nothing’s perfect.

So, overall, I’m ranting and rambling and you know how that sort of thing goes, but my advice is to participate in the hack-fest type things.

[1] it’s over-simplistic analysis day. whoopee!
hackfest
librarianship
technical communication

originally from a private forum, reproduced here by request. it’s commentary on this article about librarians as coders.

In general, I disagree with that diagram. There’s a level *below* application called ‘system,’ and it isn’t in that diagram because IBM wants to sell you systems that you write your applications on.

A relatively large number of systems suffer from lack of LIS-type knowledge at their cores, which then bubbles up through the various levels of architecture, and causes things like current OPAC implementations to be produced. Librarians (and people with librarian-type knowledge) are needed at all the different levels.

It also doesn’t discriminate between the different sorts of programmability that you might build into a system. Let’s divide programming (or ‘development’) into two fields, which we’ll call ‘Programming’ and ‘Scripting.’[1] ‘Programming’ belongs at the lowest levels of the system, and, as you might imagine, largely in the ‘Development’ phase of construction.

However, the graph is also missing a column called ‘Deployment,’ which comes after Development. This is where ‘Scripting’ comes in. ‘Scripting’ is the process of writing code to retrieve and manipulate information[2] from the system, and also to change the state of the system. By extension, it’s linking together systems made through the process of the three columns given in the article’s graphs. Why doesn’t it have this column? Because IBM would like you to pay lots of money to have all that nasty system linking done by their tools and IBM Global Services.

This nasty system linking is to a large extent easy to do on a system that makes hooks for it possible[3], and it’s a place where people who understand the information that they’re working with can add huge amounts of value to a business, institution, or whatever.

Following this logic, ‘Scripting’ is highly useful for librarians, etc… to know how to do. To the extent that it isn’t harmful to your business, keeping money and not paying it to large consulting companies is handy[4], but that’s a whole other discussion.

[1] note that the distinction between ‘programming’ and ‘scripting’ is to some extent informed by my own prejudices and may offend some people, namely most people with programming ‘degrees’ and ‘certifications’ from [institution name elided] and also, to some extent, students in an undergraduate major at UW. People who don’t have ideological stakes in being thought of as ‘real programmers’ usually don’t care.[1.5]
[1.5] I suggest that having an ideological stake in whether someone thinks you are a ‘real programmer’ is a stupid thing to do, even for ‘real programmers.’
[2] antelopes mostly.
[3] unlike, say, most OPACs.
[4] see how fast i change my tune if i get a job at a large consulting company.

development process
sei level negative infinity
antelope theory
system analysis
librarianship
opac

The Intellectual Foundation of Information Organization (Digital Libraries and Electronic Publishing) is somewhat heavy going, but it’s the definitive work about a lot of areas in information organization. A lot of people encounter information organization issues professionally in the technical fields, but many of these issues have been around in business and libraries for a long time.

This book packs a huge amount of information into a very small amount of space. It has something to say about almost every issue having to do with organizing information. I highly advise reading it for anyone in an MLIS/MSIM program; I went and looked this book up today for someone whose class wasn’t reading it for some reason. To use this book, an Information Architect or other web designer has to be able to think in basic principles about the stuff that they’re doing: looking for information about controlled vocabularies instead of what the latest buzzword is. However, the payoff from having done so is high due to the clarity of the information presented.

The writing in this book is in the ‘little red schoolhouse’ academic style from the University of Chicago. I found it very easily digestible and understandable, and it had a profound effect on how I thought about information organization; I credit doing very well in my classes on the subject and being able to speak intelligibly on the subject outside of class to having started out by reading this book. I recommend it highly.

So, just as a helpful hint for people doing taxonomy-type stuff, or whatever controlled vocabulary: the difference between pre-coordinate and post-coordinate terms is pretty obvious when you can actually remember it, so here’s a reminder:

Precoordinate: concepts are combined into terms before the thesaurus is created.
Postcoordinate: concepts are combined into terms after the thesaurus is created — ie: usually at time of use.
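As a rough illustration of the difference, here is a sketch in Python. The vocabulary terms, document ids, and function names are all invented for the example; real thesauri are a good deal richer than a dictionary of sets.

```python
# Pre-coordinate: the compound term was built into the vocabulary ahead of
# time, so documents are indexed directly under the combined term.
precoordinate_index = {
    "library automation": {"doc1", "doc3"},
    "library funding": {"doc2"},
}

# Post-coordinate: only single concepts are in the vocabulary; the searcher
# combines them at time of use.
postcoordinate_index = {
    "libraries": {"doc1", "doc2", "doc3"},
    "automation": {"doc1", "doc3", "doc4"},
    "funding": {"doc2"},
}


def precoordinate_lookup(index, term):
    """Look up one ready-made compound term."""
    return index.get(term, set())


def postcoordinate_lookup(index, *concepts):
    """Combine single concepts at search time by intersecting their postings."""
    postings = [index.get(concept, set()) for concept in concepts]
    return set.intersection(*postings) if postings else set()


print(sorted(precoordinate_lookup(precoordinate_index, "library automation")))
print(sorted(postcoordinate_lookup(postcoordinate_index, "libraries", "automation")))
```

Both lookups find the same documents here; what differs is when the combination of ‘libraries’ and ‘automation’ happened.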

I know that someday someone who needs this information will find it searching the web, and that makes me (relatively) happy.

A lot of writing in general, and a lot of web pages specifically, concerns some topic. That writing is about a topic. There are a bunch of different ways of figuring out what something is about, and many of these are hilariously wrong. But that isn’t what this post is about. This post is about ‘aboutness assertions,’ which is how you say what things are about once you’ve decided that something is about something.

sounds confusing? it gets worse but more interesting…

Kiva.org is an organization that focuses on making rural microloans. A microloan for these purposes is a smaller development loan, useful for buying, say, a couple of goats or a truck instead of a large infrastructure loan for buying schools or new highways. It’s useful on more of a personal level, and it’s a smaller need that doesn’t get served well by traditional NGOs. Kiva is a platform for these rural focused loans, with an initial focus in Uganda.

This is significant because Kiva has managed to remove several levels of loan aggregation in order to more effectively reach both the people with the needs and the people with the money, and to connect them directly. Removing layers like this is called disintermediation.

Lazerow Lecture

Last week was the annual Lazerow lecture at the Information School. This year’s was given by Gary Marchionini on Human Computer Information Retrieval. It was an interesting talk, about building search engines and how people who are more involved in searching can get better results.

Leilani (the vice chair of UW ASIS&T) taped the lecture, and you can watch it here. Due to the way that the site pages are organized, you’ll have to scroll down to the lecture if you’re viewing this page in the future. It was given on 2005-10-16. The slide deck from Gary’s similar HCIR talk at MIT is online, in case you want to find out what the talk is about before jumping in.

I’m going to ASIS&T’s national conference this year. I really enjoyed it last year, and it changed my perceptions of the field a great deal — both what I was interested in and what I wanted to get out of school. I’m hoping that this conference is as interesting.

It has an Upcoming reference here, for those of you who are fond of Upcoming. If you’re going, feel free to drop me a line.

asist2005

tagging

I’ve added, with some reluctance, tagging to this journal. To commemorate this fact, I somewhat less than helpfully will tag this post with the tag tagging


Today, we were talking for a while about why Capitalization might be important in an information retrieval system. One reason is that for different words that are spelled the same (or the same word that has different definitions, depending on your dictionary), there can be different meanings depending on the Caps, and you might want to factor that in.

Here are some examples:

china: In capitalized form, it refers to a country; in lowercase form, it refers to a form of porcelain or dishware made thereof.
aids: Aids are helpful, AIDS certainly is not.
ira: Ira is a male first name (as in Ira Glass), IRA is the Irish Republican Army.
it: ‘it’ is a pronoun in English and IT is an abbreviation for Information Technology.
lox: ‘Lox’ is smoked salmon, and LOX is liquid oxygen. Only one of these is at all good for bagels.
dos: DOS is a primordial operating system, and ‘dos’ is Spanish for two.

This is complicated somewhat by the fact that words are capitalized at the beginning of sentences, obviously. The customary solution is to search against lowercase but to display to the user the results as they exist in the original document.
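Here is a minimal sketch of that customary solution, under the assumption that a plain dictionary stands in for the retrieval system. The sample sentences and the names `build_index` and `search` are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Index every word under its lowercase form, keeping the original spelling."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for match in re.finditer(r"\w+", text):
            word = match.group()
            index[word.lower()].append((doc_id, word))
    return index


def search(index, query):
    """Fold the query to lowercase, but return the words as originally written."""
    return index.get(query.lower(), [])


# Hypothetical two-document collection.
docs = {
    "a": "China exports fine china.",
    "b": "It is an IT problem.",
}
index = build_index(docs)

print(search(index, "CHINA"))
print(search(index, "it"))
```

Because the original spelling is stored next to each posting, a smarter system could still use the capitalization as a ranking hint after the case-insensitive match.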

A couple of months ago, I wrote some sample stemmers for a class I was taking in the iSchool. I’ve put some of them up on the website.

The stemmers on this site are the Porter and the Lovins stemmers, both implemented in PHP. The Porter stemmer was downloaded from one of the several sites on the internet that have it; the Lovins stemmer was converted to PHP from the Java version available at SourceForge. Copies of the source are available on request.

The stemmers are available here. There’s also a call into the PHP implementation of the Soundex algorithm, which I added to demonstrate a few points.
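For anyone curious what Soundex does, here is a sketch in Python of the common American Soundex rules. It is not the code behind the site, and it is not guaranteed to match PHP’s built-in function on every edge case; the name `soundex` and the table `CODES` are just this example’s choices.

```python
# Letters that sound alike share a digit; vowels, h, w, and y get no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return a four-character code: the first letter plus up to three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


for name in ["Robert", "Rupert", "Ashcraft", "Pfister"]:
    print(name, soundex(name))
```

Where a stemmer groups words by shared root, Soundex groups them by rough pronunciation, which is why ‘Robert’ and ‘Rupert’ land on the same code.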


The Wilkins rampage continues with these modifiers for English’s modal auxiliary verbs. These would be attached to a verb (built through Wilkins’ system). Notice how the closely related verbs are similar in shape, with just the tip of the tail changed, and how the overall collection of verbs consists of rotations of a single character.

For more detail on modal verbs, go here. It may also amuse you to read RFC 2119: Key words for use in RFCs to Indicate Requirement Levels.

wilkins numbering system

Part of Wilkins’ Essay towards a Real Character… was a numerical system organized along the principles of his classification system. Note that the basic character in all these columns is made up of variances on Wilkins’ glyph representing category ‘Measure’, a subtype of ‘Quantity.’

Here’s a picture of the top-level categories again for your amusement; Measure is in the middle category about halfway down.

Also note that the power of 10 for the particular number being represented is given as a modifier in the lower right corner of the number.

Wilkins

The top level of John Wilkins’ classification (claffification syftem) from his An Essay Towards a Real Character, and a Philosophical Language

wilkin's categories