Saturday, November 26, 2011

The Internet Wayback Machine - and Blogspot

Bloggers occasionally discuss whether to host a blog for free at Blogspot, WordPress, etc., or to buy their own domain name.

One factor I haven't heard mentioned in these discussions is: How Often Will the Internet Wayback Machine at Archive.org 'crawl' your site?

Archive.org has been preserving webpages for over a decade.  This can be very useful.  Bloggers may wonder what will happen to their blog posts if their blog disappears -- but if it's being archived somewhere else, survival isn't completely dependent upon the blogger's backup regimen.

I recently discovered that:

1) This blog, which currently has over 200 subscribers according to Google Reader, has been 'crawled' by the archival spiders at the IWM a grand total of 2 times, both in 2008, and a total of 67 pages have been preserved.

2) A blog I have maintained since 2002 on my personal domain, which currently has 6 subscribers according to Google Reader, has been 'crawled' 52 times since 2006, and a total of 5359 pages have been preserved.  (Note: This is a WordPress blog with separate 'pages' for comments, trackbacks, and RSS feeds, so the number is probably closer to an equivalent of 1000 pages.)

3) I decided to look at the results for some other blogs, which I've decided not to name.  Those who are curious about their own blogs can follow the links above and replace the URLs for my blogs with any other site they wish to test.

I looked at three other popular genealogy blogs maintained on Blogspot, all with more Google Reader subscribers than I have.  Two have been blogging since 2008, and one since 2006.  The former two have been crawled twice each, with 15 and 23 pages preserved.  The one blogging since 2006 has been crawled 7 times, and has 346 pages preserved.

Then I looked at two popular geneabloggers, both blogging since 2006, who switched to a personal domain back in 2008.  Their Blogspot blogs were crawled 2 and 7 times, with 60 and 783 pages preserved respectively.  Their personal domains have been crawled 37 and 40 times since 2008, with 427 and 2118 pages preserved respectively.  While the numbers are different, moving to a personal domain clearly benefited both on this measurement.

    4) The last page preserved for each Blogger blog has the exact same filename, robots.txt, which may be part of the reason why so few pages are preserved.

    Following some links on the archived pages results in this error:

    From what I have found researching so far, Google added the robots.txt files to Blogger blogs in 2007, which perhaps explains why those blogging since 2006 were crawled a little more. This file, which cannot be changed, prevents search 'robots' from following certain links on the blog.  I'm not entirely certain which links are blocked and which aren't. It's certainly not stopping Google from indexing Blogspot blogs; Google has owned Blogger and Blogspot since 2003, and certainly wouldn't block its own search engine.  But it appears to affect how other robots crawl the site.

    Some references to the Blogspot robots.txt suggest its primary purpose is to prevent the 'duplicate' pages that might otherwise result, as exemplified by the more than 5000 pages the Internet Wayback Machine has preserved for my WordPress blog.  But it appears to be having a larger impact than that.

    The robots.txt file is present on Blogger custom domains as well, so it's not the entire explanation.  The Internet Wayback Machine might treat Blogspot in general differently.
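
    For anyone who wants to see which links a robots.txt file blocks, here is a minimal sketch using Python's standard urllib.robotparser module. The rules in SAMPLE_ROBOTS_TXT are an illustrative assumption, not a copy of what Blogger serves; the example.blogspot.com addresses are hypothetical; and "ia_archiver" is assumed here as the crawler name to test. Paste in the contents of your own blog's robots.txt to check real rules.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative rules only, not copied from any real blog.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /
"""

# Hypothetical blog address and paths, used only for illustration.
URLS = [
    "http://example.blogspot.com/2011/11/some-post.html",
    "http://example.blogspot.com/search/label/genealogy",
    "http://example.blogspot.com/feeds/posts/default",
]


def blocked_paths(robots_txt, user_agent, urls):
    """Return the URLs from `urls` that `user_agent` may not fetch."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


if __name__ == "__main__":
    # ASSUMPTION: "ia_archiver" stands in for the archive's crawler name.
    for url in blocked_paths(SAMPLE_ROBOTS_TXT, "ia_archiver", URLS):
        print("blocked:", url)
```

    With these sample rules, only the label-search address is reported as blocked; individual posts and the feed remain crawlable. That would be consistent with a file meant to cut down on duplicate pages, but it says nothing about how any particular crawler chooses to behave.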


    Why did I originally set my genealogy blog up on Blogspot?

    I didn't at first.  For the first few months all my genealogy-related posts were a subset of the personal blog referenced in (2) above.  But as I grew more obsessed with genealogy, I knew I needed a separate space devoted to the one topic.  So many other geneabloggers were using Blogspot, and it was easy to use, so that's the direction I went.

    It wasn't a mistake, per se. Blogspot has been a fine home.  But I've considered moving the blog back 'home' before, and this was just the proverbial last straw for me.



    All of this explains why, as of this post, this blog is no longer located at http://transylvaniandutch.blogspot.com but is now at http://blog.transylvaniandutch.com

    All links to the former Blogspot version should forward automatically to the new page.

      5 comments:

      Greta Koehl said...

      Thank you for this information. I'll have to think about it, but I may be taking this path as well.

      Lisa Wallen Logsdon said...

      I have never had my own domain and I would like someone to blog the steps about how you get your own domain and how you move an entire blog over to one. And, if you buy a domain, is it a one time fee? This is not an area I am at all familiar with.

      John said...

      Blogger provides some step-by-step instructions on how to set up a "custom domain"

      http://www.google.com/support/blogger/bin/static.py?page=ts.cs&ts=1233381

      Buying a domain is an annual fee. Currently, through Google, the cost is $10/yr.

      Susan Clark said...

      Food for thought, John. Blogger will publish your blog in perpetuity (or until the terms of service change) but isn't easily accessed by The Wayback Timemachine. A personal website will go dark once the annual fee isn't paid, but is accessible. I wonder how often Weebly sites will be recorded?

      John said...

      If the Internet Wayback Machine robot crawlers are being blocked, I wonder what other search engines are experiencing similar issues.

      I'm still using Blogger, which is owned by Google, which doesn't appear likely to be going under any time soon. If I at some point in time go with a different domain I can just enter the new domain in the settings and everything moves with it.

      I can leave passwords and instructions for next of kin to transfer it back to Blogspot upon my demise, or maintain the domains, as they desire.

      Of course, by the time that happens, there might no longer be an internet, as something else may have replaced it.