Google Sitemaps for Pebble 2

Hansar Cafe / Bar, Sukhumvit Soi 1, Bangkok, Thailand, 2006-12-23

#software_engineering #webdev

Google Sitemaps is a mechanism that gives you some level of control over how Google indexes your website, including:

  • Which URLs to index (this would be especially important if you had pages that had few or no incoming links)
  • How often you expect the content of those pages to change
  • The ranking of pages within your site

In November 2006, Google, Microsoft and Yahoo announced that they would collaborate on the sitemap standard, so the sitemap.xml you create for Google can now be used for similar benefits from the other two big search engines, although at the time of writing MSN Search doesn’t seem to have implemented support for sitemaps. I go through submitting your sitemap.xml to the search engines at the end of this article.

Information on the last Google-only sitemap standard is here, and information on the new Google / Yahoo / MSN standard (which is identical to the old Google standard) can be found here.

Since all your content in a blog server like Pebble is already in a machine-readable format, it makes sense to generate the sitemap.xml from that content rather than hand-coding it. So I’ve written the PebbleSitemapServlet to do just that.

UPDATE 2006-12-30: Rather than continuously update this blog post, I’ve created a project page for Pebble Addons where the latest versions of the software and documentation will be placed.

How It Works

It is in fact bog-simple. The Pebble API provides ready access to all of the information needed to generate the sitemap. All the generator does is iterate through the blog entries and produce XML that conforms to the sitemaps.org protocol.
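
As a rough illustration of the kind of output this produces, here is a hand-written sketch of a sitemap with one home URL and one entry URL. The host name, the permalink and the lastmod dates are made up for the example, and whether the servlet emits lastmod at all is an assumption. The changefreq and priority values are the defaults described under Configuration, and the namespace is the Google 0.84 one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: example.com, the permalink and the dates are hypothetical -->
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <!-- The blog's home URL -->
  <url>
    <loc>http://www.example.com/pebble/</loc>
    <lastmod>2006-12-23</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.3</priority>
  </url>
  <!-- One url element per blog entry -->
  <url>
    <loc>http://www.example.com/pebble/2006/12/23/example_entry.html</loc>
    <lastmod>2006-12-23</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```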

Getting the Sitemap Generator

The PebbleSitemapServlet is available as open source under the terms of the GNU GPL. It can be downloaded in source and binary form from SourceForge:

UPDATE 2006-12-30: See the Pebble Addons website for the latest version

Either grab the source distro and run the Krypton build, or just grab the binary distro.

Installing the Servlet

First, copy the JAR file into your Pebble deployment’s WEB-INF/lib directory.

Next, open up WEB-INF/web.xml, and look for the first <servlet> tag. Right before that tag, add the following servlet declaration:

  <!-- Sitemap Generator for Pebble -->
  <servlet>
    <servlet-name>PebbleAddonsSitemapServlet</servlet-name>
    <servlet-class>com.brendonmatheson.pebbleaddons.sitemap.PebbleSitemapServlet</servlet-class>
  </servlet>

And finally, still in web.xml, look for the first <servlet-mapping> tag, and add the following servlet-mapping right before it:

  <!-- Sitemap Generator for Pebble -->
  <servlet-mapping>
    <servlet-name>PebbleAddonsSitemapServlet</servlet-name>
    <url-pattern>/sitemap.xml</url-pattern>
  </servlet-mapping>

Testing the Servlet

Depending on your container, you may have to restart the webapp or the entire container to get the servlet going. Point your browser at the sitemap.xml servlet in your blog. For example:

http://localhost:8080/pebble/sitemap.xml

You should see a bunch of XML code that looks like the samples in the Google documentation. Google Sitemaps requires that the character encoding of your sitemap is UTF-8. The servlet sets the encoding and you can check it by going to the View / Character Encoding menu in Mozilla Firefox or the View / Encoding menu in Internet Explorer to make sure it’s set to Unicode.

Configuration

The sitemap servlet has a number of init-parameters that you can optionally set to tune its output. The following excerpt from web.xml shows a fully re-configured version of the servlet:

  <!-- Sitemap XML Generator for Pebble -->
  <servlet>

    <servlet-name>PebbleAddonsSitemapServlet</servlet-name>
    <servlet-class>com.brendonmatheson.pebbleaddons.sitemap.PebbleSitemapServlet</servlet-class>

    <init-param>
      <param-name>schemaUrl</param-name>
      <param-value>http://www.sitemaps.org/schemas/sitemap/0.9</param-value>
    </init-param>

    <init-param>
      <param-name>blogChangeFreq</param-name>
      <param-value>daily</param-value>
    </init-param>

    <init-param>
      <param-name>blogPriority</param-name>
      <param-value>0.1</param-value>
    </init-param>

    <init-param>
      <param-name>blogEntryChangeFreq</param-name>
      <param-value>monthly</param-value>
    </init-param>

    <init-param>
      <param-name>blogEntryPriority</param-name>
      <param-value>0.9</param-value>
    </init-param>

  </servlet>

The meaning of these parameters is as follows:

  • schemaUrl - The URL for the sitemap XML namespace. By default it refers to 0.84, the last Google version, which still works with Google and appears to be accepted by Yahoo. The code fragment above configures the servlet to use the latest public namespace.
  • blogChangeFreq - The change frequency that the blog’s home URL will be marked with. Default: “weekly”. If you post often you might want to set this to “daily”.
  • blogPriority - The priority that the blog’s home URL will be marked with. Default: 0.3
  • blogEntryChangeFreq - The change frequency that the blog entry URLs will be marked with. Default: “monthly”.
  • blogEntryPriority - The priority that the blog entry URLs will be marked with. Default: 0.8

Note: By default the blog’s home URL is given a lower priority (0.3) than blog entry URLs (0.8). This is to make it more likely that entry permalinks, rather than the blog’s home page, will appear in search engine results.

See http://www.sitemaps.org/protocol.html for more information on the meaning of these parameters.
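
Since every parameter is optional, a smaller override should also work. The sketch below is an assumption based on that description rather than something tested here: it changes only blogChangeFreq and leaves the other parameters at their defaults.

```xml
<!-- Sketch: override a single parameter and rely on the defaults for the rest -->
<servlet>
  <servlet-name>PebbleAddonsSitemapServlet</servlet-name>
  <servlet-class>com.brendonmatheson.pebbleaddons.sitemap.PebbleSitemapServlet</servlet-class>
  <init-param>
    <param-name>blogChangeFreq</param-name>
    <param-value>daily</param-value>
  </init-param>
</servlet>
```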

To make the servlet log its configuration parameter loading to the log4j appenders, make sure it is running with debug enabled by adding the following line to log4j.properties:

log4j.logger.com.brendonmatheson.pebbleaddons.sitemap=DEBUG

Telling the Search Engines About Your Sitemap

Google

The final step is to tell the GoogleBot about your sitemap.xml descriptor, so that it can use it alongside its standard crawling.

If you haven’t logged into Google’s Webmaster Tools before, you’ll need to add and verify your website first. To access Google Webmaster Tools, all you need is a Google account, such as a Gmail account.

After that, you can go to the Sitemaps tab, click “Add a new Sitemap”, and point it at the dynamically generated sitemap.xml you now have in your blog. GoogleBot is a busy piece of software, so after you submit your sitemap you’ll probably have to wait a little while, possibly a few hours, before it’s accessed.

Yahoo!

Yahoo has Site Explorer, a management UI quite similar to Google’s Webmaster Tools, which allows you to submit your sitemap.xml’s URL. To access Yahoo’s Site Explorer app, you need a Yahoo account.

If you’re watching your log to see when the bot accesses your sitemap.xml, Yahoo seems to use the following User-Agent header:

Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)

Cheeky. Or maybe there really is a guy in some Yahoo basement on an old Win 98 box who has the job of manually entering all sitemap.xml information into the Yahoo index. Poor feller.

I’ve looked all over and haven’t been able to find any place where you can submit a sitemap.xml to MSN, so I guess they haven’t implemented it yet. If anyone has any info on this please let me know so I can update this post.