Aardy R. DeVarque (aardy) wrote,

Dealing with overreacting idiots

Sometimes I wish I could fire a taser through the Internet at certain people. (Given that RFC 2321 exists, where's the RFC for tasing unruly net.folk?)

In this case, I'm on a mailing list for our library catalog software, and a known issue came up with the way a particular field is handled by the indexing software (which is on a lifetime license from a third-party company that has long since been bought by someone else).

The issue is that, when dealing with tables of contents, there's a way of tagging the data so that authors are flagged and indexed as authors, and titles are flagged and indexed as titles, but this only works for Google-like keyword searches; if you try to browse through the title index or author index, all of a record's titles are run together into a single "title" and all of its authors into a single "author".

(For those who know HTML or XML, the basic cause of the problem is somewhat similar to the following. Given a normal tag format like this:

    <span name="myid1" value="myvalue1">

or this:

    <span name="myid1" value="myvalue1,myvalue2,myvalue3">

references to myid1 return that span tag, and a list of all name/value pairs on the page returns myid1=myvalue1 (or myid1="myvalue1,myvalue2,myvalue3") for this tag. However, given a tag format like this, which is allowed in this case by the ANSI/NISO standard involved:

    <span name="myid1" value="myvalue1" value="myvalue2" value="myvalue3">

references to myid1 return that span tag, but an attempt to generate a list of all name/value pairs on the page returns "myid1=myvalue1 myvalue2 myvalue3" for this tag rather than "myid1=myvalue1" "myid1=myvalue2" "myid1=myvalue3". That is, one entry for each span tag, not one entry for each repetition of a name/value pair within a single span tag--and I hope you can understand from that example why it's difficult to alter the search engine software to behave otherwise, or even to at least alter the indexing routine that feeds information to the search engine software. For the truly geeky out there, the standard in question is ANSI/NISO Z39.2, as implemented in MARC 21.)
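For the code-minded, here is a rough Python sketch of the same distinction. It is not the vendor's indexing routine, just an illustration under my own assumptions: the field is modeled as a plain list of (subfield code, value) pairs, the codes "t" (title) and "r" (responsibility) and the function names are made up for the example, and the author names are placeholders.

```python
# Hypothetical, simplified stand-in for one tagged "contents" field:
# a list of (subfield_code, value) pairs, where the same code may repeat.
contents_field = [
    ("t", "Heading for a heartbreak"),
    ("r", "Author A"),
    ("t", "Hotel California"),
    ("r", "Author B"),
]


def entries_per_field(field, code):
    """One browse entry per field: every value with this code is run together."""
    values = [value for sub_code, value in field if sub_code == code]
    return [" ".join(values)] if values else []


def entries_per_subfield(field, code):
    """One browse entry per repetition of the subfield within the field."""
    return [value for sub_code, value in field if sub_code == code]


run_together = entries_per_field(contents_field, "t")
separate = entries_per_subfield(contents_field, "t")

print(run_together)  # a single entry holding both titles
print(separate)      # one entry per title
```

The first function is roughly what the browse index does today; the second is what people expect it to do. The sketch makes the change look trivial, but it says nothing about how hard it is to make inside software that was built around one entry per field.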

So this has been a known issue for a long time now, though it is not widely advertised, and thus sites are still just learning about it. When someone recently inquired on said mailing list about changing some local settings for the "contents" field, someone else pointed out this issue, even though it wasn't really related to the actual question, and complained that it makes indexing the "contents" field useless. A third person, who obviously hadn't known about it yet, replied with (lightly paraphrased for comprehension, not for content):
"So, basically, [The Company] doesn't support the formatted contents field for title searching. That means if I want to be able to search by titles, I will be entering individual titles either in "added title" or "added author/title" fields. That strikes me as a lot more work than merely entering the "title" subfield-delimiter symbol in the "contents" field(s). I thought I had understood the sales reps to say that the system supported MARC."
-sigh- Overreact much? The MARC standard simply controls how database records are tagged, and other than that has nothing to do with whether or how a particular field is indexed. Also, my site has had title keyword searching operational for titles in the "contents" field for at least a year now, with no complaints from either patrons or staff. For example, try our music album search. (It's still possible to do a search for "Heartbreak hotel" and have the resulting list of possible hits include an album that contains both "Heading for a heartbreak" and "Hotel California", but given that this doesn't happen very frequently, the results list does include the records you want, and it's possible to get around this by surrounding the title with single quotes to do a "phrase" search, is that really worth getting one's panties in a knot?)
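
To make the "Heartbreak hotel" case concrete, here is a small Python sketch of keyword versus phrase matching. It is only a guess at the general behavior, not our catalog's actual search code: the album labels, the second song on Album B, and the function names are all invented for the example, and I'm assuming a phrase search looks for the words consecutively within one title.

```python
import re


def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical records: album label -> titles tagged in its "contents" field.
records = {
    "Album A": ["Heading for a heartbreak", "Hotel California"],
    "Album B": ["Heartbreak hotel", "Some other song"],
}


def keyword_search(query, records):
    """Match records whose titles, taken together, contain every query word."""
    wanted = set(words(query))
    hits = []
    for label, titles in records.items():
        seen = {w for title in titles for w in words(title)}
        if wanted <= seen:
            hits.append(label)
    return hits


def phrase_search(query, records):
    """Match records where the query words appear consecutively in one title."""
    q = words(query)
    hits = []
    for label, titles in records.items():
        for title in titles:
            t = words(title)
            if any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1)):
                hits.append(label)
                break
    return hits


keyword_hits = keyword_search("Heartbreak hotel", records)
phrase_hits = phrase_search("Heartbreak hotel", records)
print(keyword_hits)
print(phrase_hits)
```

The keyword search returns both albums, including the false hit on Album A, while the phrase search returns only Album B. The record you wanted is in the list either way.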

Also, YES, IT'S MORE WORK TO ADD MORE DATA. But up until the standard was changed not too long ago to allow author/title tagging in the "contents" field, that was the only way to get author/title access to that data. And even now, that's still the only way to ensure that author & title searches can always retrieve all results for a specific author or a specific title, when the author or title either has a relatively common name (e.g. Jim Smith or "Symphony Number 9") or is commonly known by more than one name (e.g. Cat Stevens = Yusuf Islam). Knowing that, and knowing how to work within those constraints, is one of the reasons a library degree is supposed to be essential to doing this work. And you know what, IT WOULD BE EVEN MORE WORK to create individual full database records for each song/story/etc. in the collection, but that would provide catalog users with even more access.

Providing more access requires putting in more work and more data. Until this, I wouldn't have thought this was a particularly difficult concept to master.

At the time, I really wanted to smack this person upside the head. (It may not read like it, but it's now two days later, and I'm much calmer now.) The company is far from perfect and the software has a long list of major bugs (any upgrade that fixes more than it breaks is considered a "good" upgrade by the user base), and this particular situation could use some improvement, but if you're going to go onto a company mailing list and blow something up into a major problem, pick a problem that's at least a grassy knoll rather than a barely-used molehill.

Moving on: when I mentioned on the mailing list that my library has this search up, running, and working pretty well, the person who had originally complained that this situation makes title indexing useless replied with their own setup. They have a service that provides them with an electronic list of the titles of every presentation at every IEEE conference (apparently usually 10-15K titles per conference) whenever the library has the conference's published papers on file. They dump that list into the "contents" field of the record for the conference's papers, and because one then has to wade through all of those conference records, it was indeed useless to index titles.

Well, gee, is that really a problem with the indexing software & the search engine, or, when you dump umpteen thousand lines of what you consider to be garbage into your indexes, are you more than likely to get thousands of lines of garbage out? (And how else is someone doing research at that library going to discover that the library has a particular paper from a particular conference on file that is exactly what they need?)

Where are the virtual clue-by-fours when you need them?

Feudalism: Serf & Turf
Tags: idiots, rant, work