One of the unsung wonders of the modern world is the Internet Archive, especially its astonishing "Wayback Machine."
I had fresh occasion to marvel at the power of this tool when I was looking for something I thought I remembered writing, not that long ago. In the process, I ran across a complimentary blog post concerning another article from (pardon the expression) way back in 2002, and I wondered what I’d written that someone had liked that much.
Of course, the link to that article was no longer valid — since the content hosting arrangements of that publication have gone through several upheavals since that time — but I went to the Wayback Machine and simply pasted in the URL. A few seconds later, I had my choice of multiple archive snapshots that had captured the original content, even preserving the links to other contemporaneous material that the archive had also retained. I didn’t even need to supply the original publication date.
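That lookup can also be done programmatically. The Internet Archive exposes a public Availability API at archive.org/wayback/available that takes a URL (and, optionally, a target date) and returns the closest archived snapshot. The sketch below builds such a query and parses the API's JSON response shape; the sample response and article URL are illustrative, not a real capture.

```python
# Sketch: finding an archived copy of a dead link via the Internet
# Archive's public Availability API (https://archive.org/wayback/available).
# The sample JSON below is an illustrative response, not a real capture.
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://archive.org/wayback/available"

def build_query(url, timestamp=None):
    """Build an Availability API query URL. The timestamp (YYYYMMDD) is
    optional -- as in the article, you need not know the publication date."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API_ENDPOINT + "?" + urlencode(params)

def closest_snapshot(response_text):
    """Extract the closest available snapshot URL from the API's JSON,
    or None if the page was never archived."""
    data = json.loads(response_text)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# Illustrative response, shaped like the API's documented output:
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20020615000000/"
                   "http://example.com/article",
            "timestamp": "20020615000000",
        }
    }
})

print(build_query("http://example.com/article"))
print(closest_snapshot(sample))
```

In a live script you would fetch `build_query(...)` with any HTTP client and pass the body to `closest_snapshot`; here the response is stubbed so the sketch runs offline.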
That’s a neat hack that never loses its charm — and I value it all the more for my inside knowledge of how the trick gets done. It was an interesting coincidence, though, that this happened on the same day that I noted a story about Composite Software’s new "Composite Discovery" appliance, a combined hardware/software package aimed at giving people faster access to more relevant information.
The following language in the product announcement especially caught my eye: "Currently, managers spend an average of two hours per day looking for information, with nearly half of the information, once found, to be useless… According to Framingham, Mass.-based industry analyst firm IDC, enterprise data is growing by a factor of ten every five years, or at a compound annual growth rate of nearly 60 percent… Further, business executives are often unable to access the information in a form that is useful to them."
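The two growth figures in that quote are consistent with each other: tenfold growth over five years implies an annual multiplier of 10^(1/5), or a compound annual growth rate of about 58.5 percent, which is indeed "nearly 60 percent."

```python
# Sanity check on the quoted IDC figures: tenfold growth over five
# years corresponds to a compound annual growth rate of 10**(1/5) - 1.
cagr = 10 ** (1 / 5) - 1
print(f"{cagr:.1%}")  # 58.5%, i.e. "nearly 60 percent"
```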
I don’t doubt any of these statements for a moment, but I wonder if the most powerful tool for improving the relevance of content retrieval is always going to be the human mind rather than any formal algorithm.
My all-time favorite data-mining anecdote concerns the spaceflight engineers who were researching possible emergency landing sites for the Space Shuttle in case it failed to reach orbit. The characteristics of a particular isolated island, it turned out, were documented in the records of a 19th-century ornithological expedition that noted (among other things) the merits of the island as a landmark easily seen from near-orbital altitude.
No, of course the information wasn’t labeled as such, but data like the island’s deposits of guano (bright white) and the sheer cliffs all around the island (size doesn’t change with the tides) were highly relevant. It wasn’t an algorithm, though, that yielded this information, but the associative wetware of a librarian.
Similar powers of retrieval, I suggest, are also to be found in Salesforce Content, which does all the things users have been promised for years: associative search, freedom from folder hierarchies, automatic versioning, and version/comment subscription. That's a lot of good stuff that has somehow failed to find its way to user desktops despite more than a decade of heavily hyped promises.
Data mining in an appliance is seriously cool, but data mining in the cloud — spanning both virtual space and the additional dimension of time — is the kind of thing that will ultimately define the true value of the Web.