Generating a Site Map for OnSmalltalk

OK, so any website that wants to be indexed well by Google (and those other guys) should generate an XML sitemap for the search engines to read. A sitemap is nothing fancy, though it can get more complex if you choose to take advantage of more of its features; I prefer a simple version with every URL marked as updated weekly.

I also prefer to invoke the generation of the sitemap manually and to generate it as a static file that Apache can serve up, rather than having Seaside build one dynamically (though I'll probably change my mind later). My blog has an admin panel with a menu option to generate the sitemap, which invokes...

generateSiteMap
    | siteMap |
    siteMap := SBSiteMapGenerator blogRoot: 'http://onsmalltalk.com/'.
    siteMap generateFromItems: {  (SBPost new)  } , 
        (SBPost findAll: [ :e | e isPublished ]) , SBTag findAll.
    (siteMap pingGoogleWithMap: 'http://onsmalltalk.com/sitemap.xml') 
        ifTrue: [ self message: 'Map generated and Google notified successfully.' ]
        ifFalse: [ self message: 'Map generated but Google notification failed.' ]

The first item in the list, the empty new post, creates an item without a slug, which represents the root of the site. I don't bother pinging the other search engines; the vast majority of my traffic comes from Google, and the rest will find me eventually. So let's run through the generation of this sitemap; it's only a few methods. The class declaration...

Object subclass: #SBSiteMapGenerator
    instanceVariableNames: 'document root blogRoot'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'OnSmalltalkBlog-Config'

A couple of accessors for the blog root...

blogRoot
    ^ blogRoot

blogRoot: aRoot
    blogRoot := aRoot

And a class-side constructor that uses it...

blogRoot: aRootUrl 
    ^ self new
        blogRoot: aRootUrl;
        yourself

Since I'm going to write the sitemap to disk, I'll need to know where to put it, and I'll want it configurable...

siteMapPath
    ^ (FileDirectory
        on: (SSConfig at: #blogWebRoot default: FileDirectory default fullName))
        fullNameFor: 'sitemap.xml'

Now a method to generate the document, add the items to it, and write the file to disk...

generateFromItems: someItems
    document := XMLDocument new
        version: '1.0';
        encoding: 'UTF-8';
        yourself.
    root := (XMLElement named: 'urlset').
    root attributeAt: 'xmlns' put: 'http://www.sitemaps.org/schemas/sitemap/0.9'.
    root attributeAt: 'xmlns:xsi' put: 'http://www.w3.org/2001/XMLSchema-instance'.
    root attributeAt: 'xsi:schemaLocation' put: 'http://www.sitemaps.org/schemas/sitemap/0.9 
     http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd'.
    document addElement: root.
    someItems do: [ :e | self addItem: e ].
    FileStream forceNewFileNamed: self siteMapPath
        do: [ :f | f nextPutAll: document asString ]
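
The way the file is opened matters here: forceNewFileNamed:do: replaces whatever is already on disk, so each run writes a complete, fresh document. A minimal workspace sketch of that behaviour, assuming a Squeak image whose FileStream provides forceNewFileNamed:do: and readOnlyFileNamed:do: (the file name sitemap-demo.xml is made up for illustration and is not the real sitemap)...

```smalltalk
| path contents |
"Hypothetical scratch file next to the image; not the real sitemap."
path := FileDirectory default fullNameFor: 'sitemap-demo.xml'.
"Write the same content twice; the second write replaces the first."
2 timesRepeat: [
    FileStream forceNewFileNamed: path do: [ :f | f nextPutAll: '<urlset/>' ] ].
"Read the file back to inspect what ended up on disk."
FileStream readOnlyFileNamed: path do: [ :f | contents := f upToEnd ].
"contents holds a single copy of the document, not two appended copies."
contents
```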

For each item, I'll want to generate an entry. The item is expected to respond to two methods, #updatedOn and #slug. All of my posts and tags respond to these, so I can just toss them into a single list of items...

addItem: anItem
    | url location lastModification isoString changeFreq |
    url := root addElement: (XMLElement named: 'url').
    location := url addElement: (XMLElement named: 'loc').
    location addContent: (XMLStringNode string: self blogRoot , anItem slug).
    changeFreq := url addElement: (XMLElement named: 'changefreq').
    changeFreq addContent: (XMLStringNode string: 'weekly').
    lastModification := url addElement: (XMLElement named: 'lastmod').
    isoString := String streamContents: 
        [ :stream | anItem updatedOn printOn: stream withLeadingSpace: false ].
    lastModification addContent: (XMLStringNode string: isoString).
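
addItem: only ever sends two messages to an item, slug and updatedOn, so anything that answers both can be listed in the sitemap. As a rough sketch, a class for a standalone page might look like this; SBStaticPage and its instance variables are hypothetical and not part of the blog's code...

```smalltalk
"Hypothetical class: any object answering #slug and #updatedOn would do."
Object subclass: #SBStaticPage
    instanceVariableNames: 'slug updatedOn'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'OnSmalltalkBlog-Config'

slug
    "The path fragment appended to the blog root."
    ^ slug

slug: aString
    slug := aString

updatedOn
    "Default to the current time if no modification date was set."
    ^ updatedOn ifNil: [ updatedOn := DateAndTime now ]

updatedOn: aDateAndTime
    updatedOn := aDateAndTime
```

An instance of such a class could then be added to the list passed to generateFromItems: alongside the posts and tags.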

With the file generated, we're ready to let Google know we've updated it...

pingGoogleWithMap: aMap 
    ^ (WAUrl new
        hostname: 'www.google.com';
        addToPath: 'webmasters/tools/ping';
        addParameter: 'sitemap' value: aMap;
        yourself) asString asUrl retrieveContents content 
        includesSubString: 'Sitemap Notification Received'
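
The method above assumes the HTTP request itself goes through. If the request raises an error instead of answering a page, the caller never reaches its ifFalse: branch. A minimal sketch of a guard, assuming a failed request surfaces as an Error in your image (the selector pingGoogleSafelyWithMap: is hypothetical)...

```smalltalk
pingGoogleSafelyWithMap: aMap
    "Hypothetical wrapper around pingGoogleWithMap:.
     Answers false instead of raising when the request fails."
    ^ [ self pingGoogleWithMap: aMap ]
        on: Error
        do: [ :error | false ]
```

With that in place, a network failure would show the same 'notification failed' message as an unexpected response.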

And that's it. Google now knows the site has changed and has all of its valid URLs, and most of the time it's crawling the site within minutes, if not instantly.

I've got to say, I'm not missing WordPress at all; it's a lot more fun just building your own blog.

Comments (automatically disabled after 1 year)

P Bertels 5795 days ago

I really like your blog and I read it often, when I tried to check out the sitemap you generated at http://onsmalltalk.com/sitemap.xml ...

I got: XML Parsing Error: not well-formed Location: http://onsmalltalk.com/sitemap.xml Line Number 438, Column 60:<lastmod>2008-11-30T23:00:17+00:00</lastmod></url></urlset>/urlset>

You might have a small bug that is duplicating part of the closing tag.

Ramon Leon 5795 days ago

Thanks for the heads up.

Ah, dumb mistake on my part: I wasn't creating a new sitemap every time, but opening up the existing one and appending to it.
