Tuesday, August 28, 2007

Caching, Ancestry, Archive, and Google

I have a personal blog where I talk about everything that comes to my mind. This is what it looks like now.

This is what it looked like in March 2005, July 2004, December 2003, and a full five years ago in August of 2002. All the links are courtesy of the WayBack Machine. Archive.org has been archiving the internet for several years now. Here's what another website I maintain looked like in October of 2000.

There has been some talk about Ancestry’s caching of genealogical websites, such as USGenNet and genealogy blogs. See, for example, Genea-Musings, About.com’s Guide to Genealogy, and Genealogue.

When I blog I know that what I write may appear elsewhere. I consider myself a poet, and have included some poetry in some of my blog posts. I’ve had some of this poetry appear on other sites without credit. (In these instances I emailed the owners and asked them to include a byline…which they did.) I’ve also had poetry I’ve written appear on websites, credited, but without people asking first, which legally they are required to do…but I’m not wealthy enough to take them to court, and I don’t really mind, usually. I now have a Creative Commons copyright notice on my personal blog which allows people to distribute the content as long as they don’t make any money off of it, and as long as they give me credit. I don't have that notice on this blog. It's probably not going to appear here.

Of course, USGenNet doesn’t have a Creative Commons copyright notice on their site. And if you search for their archives at archive.org you will be able to access their archived homepage, but when you try to follow a link, you will receive the error message: "We're sorry, access to [url] has been blocked by the site owner via robots.txt." Basically, robots.txt files are files webmasters put on their sites to tell search bots which pages they shouldn’t crawl or archive. I could put one on my site, but I don’t. USGenNet does. Understandably, too. Bots are still physically able to ignore the requests and archive the pages; it's just that respectable search engines and archives (such as Google and Archive.org) don’t ignore them, probably in part out of fear of legal retribution. Ancestry, apparently (key word: I'm still stating an opinion here), is ignoring these electronic requests. Note: I've been assured they didn't ignore robots.txt files.
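
For the curious, here is a minimal sketch of how a well-behaved bot might check a robots.txt file before fetching a page, using Python's standard urllib.robotparser module. The robots.txt contents, the bot names (ExampleArchiveBot, OtherBot), and the example.org URLs are all made up for illustration. This is not USGenNet's actual file, and it is not a claim about how any particular crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only).
# The first record shuts one named bot out of the whole site; the second
# record asks every other bot to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named bot is asked to stay away from every page.
    print(may_fetch("ExampleArchiveBot", "http://example.org/index.html"))
    # Other bots may read public pages...
    print(may_fetch("OtherBot", "http://example.org/index.html"))
    # ...but are asked to skip anything under /private/.
    print(may_fetch("OtherBot", "http://example.org/private/notes.html"))
```

Note that the check happens entirely on the bot's side. The file is a request, and a crawler that never asks the question is not stopped by anything in it.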

As others have stated, what this means legally is beyond me. I’m not a lawyer. I took a media law course in college over ten years ago, and have some clues, and this looks suspicious, but I am certainly not an expert. It should be interesting to watch if Ancestry does insist that what they appear to be doing is legitimate, as there are a whole bunch of companies, completely outside of the genealogy industry, who might justifiably be worried about the results of a court case in Ancestry's favor. If a court decides Ancestry can cache pages on sites with robots.txt files specifically requesting that pages not be cached … will Google and Archive.org decide to still be nice? I suspect every newspaper in the country has a stake in the answer to that question.

And while it certainly feels more reprehensible for Ancestry to charge for viewing their cached files, I suspect that newspapers or any other website which wishes to protect their content hope that's not the deciding factor, as archival websites making their content available for free likely isn't an acceptable solution from their perspective.


Becky Wiseman said...

I don't object to Ancestry creating a cache file or archive of my website or blog. I don't have a robots.txt file to prohibit them or any other search engines from caching those pages - it would defeat my purpose of putting family research on the web for other people to find. I want them to find it. I want them to use it. I expect them to use it. But I also would like to have some considerations, like proper attribution, as to where they got that information!

What I objected to with Ancestry was the fact that this collection was in a subscription database. They were charging people to view my website. Legal or not, it's not right. The fact that they have now made it part of their "free" records is a step in the right direction, but one must still register to access it, so is it really free?

John said...

Probably a step in the right direction in the minds of most bloggers, but I doubt it is in the minds of USGenNet or any professional site that's been cached.

Few businesses will ever do business in a 'moral, just' manner for virtue's sake alone; they have to see a business reason to do so. Which is why the law sometimes enforces what is 'just', and where the law doesn't, angry customers do.

I am curious what will happen next in the drama, as it could have repercussions elsewhere on the net.