How much context to use when indexing a document? It depends

ISO 5963 (ISO, 1985) states that indexing ‘extracts concepts from documents by a process of intellectual analysis.’ Any sort of intellectual analysis involves context, at the very least the context the indexer personally operates in. Hjørland (1992) distinguishes between ‘content-oriented indexing’ and ‘request-oriented indexing.’ This division also describes which additional contexts matter to the indexing process.

‘Request-oriented indexing’ means indexing the documents in a collection in relation to the requests that will be made against that collection, using the terminology of whatever field is the object of study. Request-oriented indexing has a high return on investment when the purpose to which the collection will be put and the user population are both well known: the context of the collection and of its users can then inform indexing decisions, both for particular documents and for the overall collection policy. Using large amounts of context in this manner for subject indexing would seem to be the clear winner, but it does not work in general.

Request-oriented indexing works well only if several suppositions hold. The first is that the purpose of the collection remains stable over time. This may fail for several reasons, such as the gradual evolution of a field or the use of the collection by members of a different field (ISO, 1985). The second is that the terminology and object of study of a field remain constant over the lifespan of the collection; most fields change over time, however, and it is generally hard to know at creation how long a collection will last. The third is that a single useful context is shared among all the users of a document, so that retrieval through the index system is possible. Although documents could be indexed within many possible contexts, this becomes an exponential problem as the number of contexts and documents grows (Hjørland, 1992).

Subject indexing produces a representation of a document; the purpose of that representation is finding and using the document at some future time, a ‘search for items with potential’ for various purposes (problem solving, meeting information needs, gaining general subject understanding) in Hjørland’s formulation (Hjørland, 1997). If the document representations are not effective, searching based on them will be of low quality (Mai, 2000). With searching as the goal, subject indexing can be analyzed in terms of the metrics of searching, recall and precision: recall is the share of the relevant documents that a search retrieves, and precision is the share of the retrieved documents that are relevant. When the indexer is able to focus specifically on analyzing the user’s domain, the user can get high precision and high recall. However, as the indexer’s analysis of the documents in a collection moves out of alignment with the user’s context, either precision or recall will drop. Recall drops when the indexer chooses terms that differ from the terms the user chooses when searching on a particular subject. Precision drops when the user and the indexer categorize subjects differently.
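The two failure modes can be made concrete with a small sketch. Everything in it is hypothetical and chosen only for illustration: the four-document collection, the index terms, and the helper names `search` and `precision_recall` come from none of the cited sources. It assumes the simplest possible retrieval model, in which a search returns exactly those documents whose index terms include the user's search term.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def search(index, term):
    """Return the ids of documents whose index terms include `term`."""
    return {doc_id for doc_id, terms in index.items() if term in terms}


# Hypothetical user need: documents d1-d3 are relevant to "heart attack".
relevant = {"d1", "d2", "d3"}

# Indexer and user share vocabulary and categories.
aligned_index = {
    "d1": {"heart attack"},
    "d2": {"heart attack"},
    "d3": {"heart attack"},
    "d4": {"stroke"},
}

# Indexer chose a different term for two relevant documents.
vocabulary_mismatch_index = {
    "d1": {"heart attack"},
    "d2": {"myocardial infarction"},
    "d3": {"myocardial infarction"},
    "d4": {"stroke"},
}

# Indexer files a subject the user considers distinct under the same heading.
category_mismatch_index = {
    "d1": {"heart attack"},
    "d2": {"heart attack"},
    "d3": {"heart attack"},
    "d4": {"heart attack"},
}

for name, index in [
    ("aligned", aligned_index),
    ("vocabulary mismatch", vocabulary_mismatch_index),
    ("category mismatch", category_mismatch_index),
]:
    p, r = precision_recall(search(index, "heart attack"), relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setting the aligned index gives perfect precision and recall, the vocabulary mismatch leaves precision intact but cuts recall to one third, and the category mismatch leaves recall intact but lowers precision to 0.75.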

What determines the right amount of context to use? In a more general collection, the domain is less specific and the collection less purposively unified, so the indexer must make less use of context and lean towards a document-centered approach that serves the searches of a broader audience. A more specific collection, by contrast, can use a domain-oriented approach, with indexing aligned to the context in which the documents exist, to better serve its specific users.

Bibliography

Hjørland, B. (1992). The Concept of “Subject” in Information Science. Journal of Documentation, 48(2), 172-200.

Hjørland, B. (1997). Chapter 2: Subject Searching and Subject Representation Data. In Information Seeking and Subject Representation: An Activity-Theoretical Approach to Information Science (pp. 11-37). Westport, CT: Greenwood.

ISO. (1985). Documentation — Methods for Examining Documents, Determining their Subjects and Selecting Indexing Terms (No. ISO 5963-1985). International Organization for Standardization.

Mai, J.-E. (2000). Deconstructing the Indexing Process. Advances in Librarianship, 23, 269-298.