Taking a Stand on the Semantic Web

Sometimes I wonder about the stands I take. I was recently on a panel at the World Wide Web Conference that posed the provocative question: “Will the Semantic Web Scale?” I agreed to assume the contrarian position. “No, of course not.” It was roughly comparable to being an atheist in a Southern Baptist tent revival. Members of the audience were popping up and down as if the hotel chairs were designed by William Castle (who, as you recall, installed electrified theater seats for his 1959 classic, The Tingler); you could hear the frantic key-tapping of wireless IM-ing; some W3C people were smirking; Tim Berners-Lee was frowning. The consensus was: “That chick just doesn’t get it.”

The obvious analogy was trotted out: “Some people didn’t see that the Web was going to take off. This proves that the Semantic Web will take off despite your criticism, which is so lacking in insight that I can only snort derisively.”

Flowbee in use

But to me, the Semantic Web can best be viewed as a product you’d see advertised on late-night TV, much like the Flowbee haircutting system, which “uses the suction power of your household vacuum to draw the hair up to the desired length, and then gives it a perfect cut… every time.” Even given the pitchman’s confidence and his well-mown Smokey and the Bandit-style hair, one is forced to ask oneself: “Will it really work? Who needs it? Is it safe?”

To answer that all-important question “Will it work?”, I’m always tempted to fall back on the old highly-polished arguments that skeptics have used for the past twenty years as an indictment against Artificial Intelligence (arguments Frank Shipman and I revisited in our paper “Which Semantic Web”), or the delightfully nasty digs the technology journalists came up with when Apple first introduced its dorky Knowledge Navigator envisionment. But instead, this time I’ll take the high road and compare the Semantic Web to another ambitious semi-universal metadata scheme, the MARC record. I know there are other cataloging schemes in use in the world’s libraries besides the MARC record, but because it’s widely used, I’m going to take it as the foil for my “Will it work?” argument.

The MARC (MAchine Readable Cataloging) record is what you see when you use an online library catalog. MARC records have several useful characteristics. First, the catalogers who create MARC records are members of a community and are trained to create the specialized metadata that describes library holdings. They go to school to learn this practice (just check the course catalog of any library and information science school); they don’t simply start using a representation language and beaver away.

Second, the whole idea of the MARC record is to reduce the cataloger’s interpretive load. While cataloging is still a difficult job, catalogers have guidance in their choices. How? Through the careful choice of attributes; through established authority lists to constrain the values that a cataloger may assign (for example, the Library of Congress Subject Headings can assist a cataloger in choosing among a standard set of subject headings); and through a set of practitioner-negotiated rules (for example, the second edition of the Anglo-American Cataloguing Rules). All of these things help catalogers make consistent, sensible metadata assignments.

Finally, MARC records are controlled for interoperability and consistency through institutions (for example, through the Library of Congress or through catalog record clearinghouses like OCLC, the Online Computer Library Center). OCLC is an interesting institution. My understanding is that it’s the principal metadata middleman. A research library can purchase catalog records from OCLC. It can also contribute records for credit. And — what I find the most interesting — if an institution contributes records others deem to be inaccurate, it can find itself participating merely as a buyer, not as a seller too.

This social and intellectual infrastructure is substantial. And, when libraries use protocols like Z39.50 to interchange descriptions of what they have, the interchange works. Experienced library patrons know what to expect when they access the New York Public Library’s online catalog from their living rooms in California. Catalogers have confidence in their ability to describe (and purchase descriptions of) what the library owns. It’s a system that works pretty well.

“Wait!” I can hear you Semantic Web aficionados saying. “Wait just a minute here! The Semantic Web has comparable scaffolding to MARC. We’ve got emerging standards and representation schemes like RDF, OWL, DAML+OIL, and so on. We’ve already developed extensive ontologies. We’ve got smart, motivated people invested in doing this. We’ve got fabulous success stories in the Knowledge Management arena. You’re just not getting it. In fact, we suspect you might be a bit thick.”

They’re right in some way: I don’t get it. The Flowbee works in theory. Yep. A vacuum cleaner will pull all of those hairs uniformly away from the scalp and the attached clipper will snip them as advertised. Vacuum cleaners and hair clippers are mature technologies, ripe for integration. And it’s even plausible that a regular helmet of hair will result from a felicitous encounter with a Flowbee. Of course, we do have to assume that an even helmet of hair is what’s desired and that theory turned to practice won’t result in user disfigurement. But those are our other two questions, waiting in the wings.

To make my Semantic Web argument work, it does seem that I’d best justify why it’s going to be harder to scale Semantic Web metadata than it was to scale the MARC record. After all, what has been characterized to me as “lightweight semantics” will require even less metadata creation effort than producing a full MARC record, which has a frightening number of fields.

Well, let’s take a look at what I claimed are MARC’s essential features: the social structures and training required for catalogers to create good metadata, MARC’s emphasis on reducing the interpretive load, and the institutional clearinghouses and sources for ready-made MARC-format metadata.

Certainly the social structures for creating universal Semantic Web metadata are missing. Most of the successes that are cited by Semantic Web proponents rely on local culture, local practices, and local needs. The “how-to” training I’ve encountered for the Semantic Web infrastructure, its standards and representation schemes, is available (some of it is offered by the W3C, an organization that’s quite invested in the Semantic Web’s success). But it’s treated in a way more akin to other Web mark-up skills, the fodder for a short course, not the basis for a profession; it’s not part of a discipline.

It’s crucial to acknowledge that Semantic Web metadata — no matter how lightweight — requires substantial interpretation and application of domain knowledge; any underlying assumptions about use are highly situated. How would I describe myself? Well, it depends on whether the description is for the Department of Motor Vehicles or for Match.com, even though portions of the description would involve the same basic attributes, constrained by the same range of values. I might make myself a couple of inches taller on Match.com; but it wouldn’t do to say I was 50 feet tall (even though that’s an impressive height and might attract a certain kind of suitor). I might modify my age too — I wouldn’t want to be using the one that’s on my official documents. It makes me seem old, and I’d probably be screened right out of most matches.

“That’s what you don’t understand, Cathy. This isn’t metadata that’s assigned by hand. We’ll assign it with algorithms and heuristics.” I think Tim Berners-Lee may have made that objection. But I am not placated. Where did those algorithms and heuristics come from? At some point, a software developer has indelibly cast his or her own interpretation into the code. They’ve selected salient characteristics or features, expressed rules, decided what’s important and what can be neglected. The metadata assignment may seem wholly automatic, but it’s not magic; it doesn’t originate from divine forces. The interpretive act is written in the lines of Perl.

And what harm would that be? Why be such a spoilsport? Let a thousand flowers bloom. That’s the attitude of the Semantic Web crowd. Here’s where the analogy to the Web itself is made with great vehemence. People will contribute Semantic Web-compatible data (or metadata) because they want to be participants, to be part of a grassroots movement. They want to be noticed. They want their small measure of fame. But there’s no OCLC to act as a clearinghouse; there’s no way of ensuring interoperability, consistency, or accuracy. As Cory Doctorow has so ably pointed out on his craphound website, one can get a terrific deal on Plam Pilots on eBay. Try it. I did. I found one. And if you want to be dismayed, look at what’s between the title tags on the average Web page. Or “view source” and find out just how people have used the newer, more sophisticated Web mark-up standards.

Classic Beehive

Mind you, I’m not trying to be an elitist here and suggest that the world is full of poor spellers and sloppy taggers. I’m just saying that where library catalog users encounter a world of order, Semantic Web users are guaranteed to be surprised. There are beehives and beehives to behold.

We haven’t even gotten to questions two and three yet. As one would most certainly ask with a Flowbee, “Who needs it?” Which of us, when we’re watching late-night TV, hasn’t asked, “Who on earth buys these things?”

Depending on who you ask in the Semantic Web crowd, lots of people need it: scientists sharing their data, shoppers comparing items in the Banana Republic and L.L. Bean catalogs, travelers making complicated plans, and friends trying to coordinate their calendars, to name just a few. At first blush, any of these applications seems plausible. The data is highly structured and the transactions are relatively straightforward.

You might say this is simply data, and not much effort should be involved in sharing data, but for it to interoperate in the predictable ways envisioned in these examples, it must be well-described. Hence, I’m going to relegate it to the realm of metadata, and I have no reservations about saying that it’s expensive and requires choices to be made. In some cases, there are various standards to choose from, and they don’t map all that well (in practice) from one standard into the other.

Let’s say you were describing the contents of a collection you were contributing to the National Science Digital Library (NSDL), a large-scale effort funded by the National Science Foundation. The NSDL uses an XML-based description called OAI; using OAI allows your collection to be federated with the other collections in the library through metadata harvesting. If you wanted to buy into a Semantic Web world as well, other metadata work would be required. Who has the resources to conform to two different sets of metadata creation practices? Wouldn’t it be tempting just to slap it up there on the Web and let Google bring an audience to your work?

As long as we’re asking, “Who needs it?”, we may as well ask if the cost is borne by the parties who benefit from the expenditure. In some cases, this may be a relatively subtle question, but it’s still worth asking. For example, how much work will clothing retailers do to allow you to do a direct comparison of their wares? They expend so much effort now on being distinctive. Is celadon the same color as cool mint green? I’m not sure.

Of course, we have to factor in Google (or the next great search engine, given potential Google-beating strategies). Given its social evaluation through link analysis, a Google-like approach works well enough much of the time. In 1989, Marcia Bates proposed a berry-picking model of how people go after information; people reformulate their information needs and supply the missing bits as they go. Unlike the relatively fragile part of the Semantic Web (the rules that do something with the data), a search engine-based approach is, as computer scientists are fond of saying, highly robust, and has demonstrated its scalability.

Googling for Mohawks

In keeping with my hair-raising theme, I used Google’s image search to find a Mohawk. Would I have found something as good through the Semantic Web? Probably not. I didn’t know exactly what I wanted until I saw it. And this was it. It’s the fellow’s eyes, not the quality of the Mohawk. Nor would the photo pass muster if I’d said I wanted a headshot; it’s cropped very strangely. But it showed up on the very first page of a Google image search.

My favorite question to ask of both the Flowbee and the Semantic Web is: Is it safe? Certainly we’re afraid to look for Flowbee horror stories on the Web; there are sure to be some. But the Semantic Web? It might not work; it might not be cost-effective; but surely it’s safe. Not so. Any computer technology that works like magic, that does too much on your behalf, will raise issues of trust.

We might as well ask ourselves right now how porn sites, creative spammers, identity thieves, and oppressive government regimes will use the Semantic Web. Clever people with something to sell have always found new ways to get through spam filters (and make at least some of the message recipients open the mail). Remember the first time you received an email with the subject line “Re: The information you requested”? And the spammers whose creative spelling brought us ads for “V.i.a.ggg.r.a.” Just yesterday I got a message that looked to be legitimate from eBay that was really a clever phishing technique, looking for information that’d allow them to defraud me, and quickly too. Phony metadata? A porn site technique that’s been around since day one: we can all remember when people learned to game Alta Vista by putting a dictionary’s worth of words in the metadata tags, or in the background in hidden letters.

Why would this be any different?

Let’s look at a competitive market situation, like online pharmacies. Online pharmacies are a great testing ground for deception, because often both buyer and seller are doing something that’s mildly illegal. How do people comparison-shop? Do they just go for the best prices? No. They look for hidden costs (for example, is it a one-time doctor consultation for $130 or does the examination fee crop up for each purchase?). They research generic equivalents. They talk to each other in discussion forums about reliability, scams, and processing times. Certainly they wouldn’t rely on their Semantic Web agent to get them the best deal on Mexican Xanax.

While we’re at it, it’s wise to ask how institutions like governments (or even insurance companies or direct marketers) will use the Semantic Web. It’s not hard for me to envision unexpected federations of data that will cause me to lose my insurance, be arrested, or even just receive coupons for some embarrassing products. After all, safety is not always an inherent property of the technology; some safety factors may only be revealed by use.

So is the Semantic Web safe? Is safety ever even discussed by the Semantic Web’s most enthusiastic proponents?

Let’s take a look at a hyperbolic – but possibly accurate – description: “Every so often, an industry will experience a technological breakthrough that revolutionizes the way it operates. In order to achieve such a breakthrough, the new technology must out perform the old, it must introduce new efficiency never before known, and it must save money.” I lifted this quote from the Flowbee website, not from the Scientific American article about the Semantic Web. But it could’ve been referring to either revolutionary new technology. And there’s not a word about safety.

The Mullet: Unsafe at any Speed

If we delve into the Flowbee site’s FAQ, we can see that questions of safety are lurking just below the surface. For example, one inquisitive Flowbee user asks if a Flowbee may be used to trim the coat of her beloved pet. Unsurprisingly, the answer is yes, but that she should be certain to “use the pet attachment. This will keep the pets [sic] skin in place.” Does this sound safe to you? What happens when a pet’s skin is out of place? It’s not a pleasant exercise for the imagination. Furthermore, it’s clear that the Flowbee can be used to intentionally create aesthetically unsafe hairstyles. I’m certainly not the first to realize the Flowbee’s potential for styling one’s own mullet.

Like the Flowbee, the Semantic Web can be used in unsafe ways either through user naiveté (analogous to our unsuccessful pet grooming scenario) or by design (analogous to our postulated self-styled mullet). The naïve Semantic Web user might find himself rerouted through Salt Lake City Airport when the agent that organizes his itinerary executes unexpected combinations of rules. Meanwhile, the Government, using the data that the Semantic Web intentionally provides, might be discovering discrepancies in our traveler’s finances that identify him as a potential terrorist.

Picture this: you are stuck in Salt Lake City in a summer thunderstorm as the Government systematically denies you access to your bank accounts. You can’t even use your last bit of pocket change to buy a stiff drink. It evokes a sense of horror more profound than skinless pets and their mullet-sporting owners.

The Semantic Web: Unworkable, unnecessary, and unsafe at any speed.


copyright 2004 Catherine C. Marshall