Canonical URLs and duplicate content

by james on February 13, 2009

Duplicate content is something that many webmasters agonise over: ideally each piece of content lives at exactly one URL, rather than having several URLs that point to the same content. The issue is often complicated further when you have an analytics package that appends campaign-specific query strings to the end of URLs – you can, for example, end up with a URL something like this:

http://jamesmorell.com/bristol-seo?cid=rss&attr=news

Which points to exactly the same location as the ‘regular’ URL of

http://jamesmorell.com/bristol-seo

But it presents the search engines with the same content at two separate locations. This means that the search engines have to try to work out which is the most important or relevant link to show to people searching for your content – something that is not always easy, particularly if, as in the example above, you’re syndicating your content through RSS and including query strings for analytics purposes.

Up until now there have been a few methods of solving canonical URL issues. Firstly, and most impractically, don’t create duplicate URLs in the first place! Simply ensure that you only have one URL that people can use to link to your content. For smaller sites this is the simple way round the issue – I don’t have any issues (at present!) with canonical URLs on jamesmorell.com as it’s simply not large enough or – bluntly – sophisticated enough to have the problem. However, at an enterprise level, which is where I spend the majority of my time, canonical URL issues are the bane of my life.

Secondly, use a redirect. It’s an elegant solution with some limitations. Wherever your site processes the HTTP GET request, you can issue a 301 redirect to the URL without the additional tracking information on it. A sample response looks like this:

HTTP/1.1 301 Moved Permanently
Location: http://jamesmorell.com/bristol-seo
Cache-Control: max-age=0

Setting the cache control header means that requests come through to your server and don’t get answered by a network-based cache. At the same time as issuing the redirect you can set a cookie to track your visitor’s session. Also bear in mind that many servers issue a temporary 302 redirect by default (ASP.NET’s Response.Redirect and Apache’s Redirect directive both do), so make sure you explicitly ask for a 301. I have seen this technique used to run an in-house affiliate scheme successfully, but it does mean that you need to keep your own database of all the redirects (the tracking URLs), which for the vast majority of people is overkill.
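As a rough sketch of the redirect approach, here is how it might look on a plain Node.js server. The parameter list, the cookie name and the placeholder host are assumptions made up for illustration, and the lookup against a database of tracking URLs is left out entirely.

```javascript
const http = require("http");

// Hypothetical tracking parameters; adjust to whatever your campaigns append.
const TRACKING_PARAMS = ["cid", "attr"];

// Decide whether a request needs redirecting. Returns null when the URL
// is already clean, otherwise the status code and headers to send.
function redirectFor(requestPath) {
  // The base host is only there so that URL can parse a relative path.
  const url = new URL(requestPath, "http://placeholder.invalid");
  const tracked = TRACKING_PARAMS.filter((name) => url.searchParams.has(name));
  if (tracked.length === 0) return null;

  // Remember the campaign id before it is stripped from the URL.
  const cid = url.searchParams.get("cid");
  for (const name of tracked) url.searchParams.delete(name);

  const headers = {
    Location: url.pathname + url.search,
    "Cache-Control": "max-age=0",
  };
  if (cid !== null) {
    // Assumed cookie name; a real scheme would probably store a session id.
    headers["Set-Cookie"] = "cid=" + encodeURIComponent(cid) + "; Path=/";
  }
  return { status: 301, headers: headers };
}

// Wire the decision into a server (not started here).
const server = http.createServer((req, res) => {
  const redirect = redirectFor(req.url);
  if (redirect) {
    res.writeHead(redirect.status, redirect.headers);
    res.end();
    return;
  }
  res.end("page content goes here");
});
// server.listen(8080);
```

Keeping the decision in its own function means you can check which URLs would be redirected without starting the server.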

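The other main problem with this technique is that many third-party analytics systems won’t work with it, including Omniture, Webtrends, Google and Microsoft Adcenter, since the tracking parameters are gone from the URL by the time the page and its tracking script load.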

Thirdly, use URL fragments. Search engines have long ignored anything after a # in a URL, as it denotes an anchor within a page. So, going back to our original example, change the URL that you are tracking to:

http://jamesmorell.com/bristol-seo#?cid=rss&attr=news

However, as with all of these things, there are ‘issues’. Google Analytics will by default ignore anything after the #, which pretty much negates using it in the first place. Fortunately there is a workaround. By adding the following JavaScript to your page you should be able to get Google to track your query strings:

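// Assumes the Google Analytics ga.js script has already been loaded on the page
var pageTracker = _gat._getTracker("UA-12345-1");
// Use only ONE of the two calls below, otherwise each visit is counted twice.
// Option 1: the tracking parameters live only in the fragment
pageTracker._trackPageview(document.location.pathname + "/" + document.location.hash);
// Option 2: the URL also carries a regular query string that you want recorded
// pageTracker._trackPageview(document.location.pathname + document.location.search +
//     "/" + document.location.hash);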

Be sure to change the “UA-12345-1” to your own Google Analytics account ID!
The other issue with this method is, of course, that if you are using something other than Google Analytics you may well have to write custom code to get it working.
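For other analytics packages, the custom code mostly comes down to reading the parameters back out of the fragment yourself. A minimal sketch, assuming the fragment has the #?name=value form used above (the function name is made up for illustration):

```javascript
// Parse tracking parameters out of a fragment such as "#?cid=rss&attr=news".
function paramsFromHash(hash) {
  // Drop the leading "#" and the optional "?" that follows it.
  const query = hash.replace(/^#\??/, "");
  const result = {};
  for (const [name, value] of new URLSearchParams(query)) {
    result[name] = value;
  }
  return result;
}
```

In a browser you would call paramsFromHash(document.location.hash) and hand the resulting values to whatever tracking call your analytics package provides.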

The next option is to use robots.txt to tell search engines to ignore URLs that carry tracking code. Taking our example above of http://jamesmorell.com/bristol-seo?cid=rss&attr=news again, you would simply modify your robots.txt file to include the following lines (the * wildcard is not part of the original robots.txt standard, but the major search engines support it):

User-agent: *
Disallow: /*?cid=

Here you are telling search engines not to crawl any URL on your site that contains the parameter ?cid. This seems like the simplest of the options presented thus far, and in many ways it is, but it really doesn’t help you when people link to your content using one of the tracking URLs – all the lovely link juice you would be getting from those links goes to waste. It is also very hard to make sure you’ve got all the robots.txt Disallows in the right format, and you may well find that you’ve disallowed a whole section of your site that you really do want indexed.
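One way to guard against a badly formatted Disallow is to test your patterns against real URLs before publishing them. The sketch below assumes the wildcard rules the major engines document (a pattern is matched from the start of the path, * matches any run of characters, a trailing $ anchors the end); it is an approximation for checking your own rules, not any search engine’s actual matcher.

```javascript
// Convert a Disallow pattern into a regular expression.
// "*" matches any run of characters and a trailing "$" anchors the end;
// everything else is matched literally from the start of the path.
function disallowToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// True when the given path (including its query string) would be blocked.
function isDisallowed(pattern, pathAndQuery) {
  return disallowToRegExp(pattern).test(pathAndQuery);
}
```

Under these rules a pattern without a wildcard, such as /?cid, only matches the home page with that parameter and leaves /bristol-seo?cid=rss&attr=news crawlable, whereas /*?cid= blocks it.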

So far, the best-looking option is the 301 redirect with an internal database of all your redirects. A pain in the backside to implement for sure, but once set up correctly it does work.

That is until today…

Today, Google, Yahoo! and Live all announced support for a new link attribute value (in the same vein as rel=nofollow) that webmasters can use to deal with canonical URL issues and duplicate content.

To break it down to its simplest level, to get around the issue of duplicate content all you need to do is add the following to your main page and to any other pages that are duplicate versions of it:

<link rel="canonical" href="http://www.example.com/main-article-you-want-to-rank-for" />

In the <head> section of your page.

So taking our first example, on my Bristol SEO page, I would simply add in:

<link rel="canonical" href="http://jamesmorell.com/bristol-seo" />

to the head tag, and anyone coming to the page or linking to the page from:

http://jamesmorell.com/bristol-seo?cid=rss&attr=news

Would have no negative impact on my page’s standing within the search results.
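If your pages are generated by code, the tag can be built from the requested URL rather than written by hand. A minimal sketch, assuming cid and attr are the only tracking parameters (the function names and parameter list are illustrative, not part of the search engines’ announcement):

```javascript
// Hypothetical list of parameters that only carry tracking data.
const TRACKING_PARAMS = ["cid", "attr"];

// Escape characters that would break out of an HTML attribute value.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Build the canonical link tag for whichever variant of the URL was requested.
function canonicalLinkTag(requestUrl) {
  const url = new URL(requestUrl);
  for (const name of TRACKING_PARAMS) url.searchParams.delete(name);
  url.hash = ""; // fragments never belong in the canonical URL
  return '<link rel="canonical" href="' + escapeAttribute(url.toString()) + '" />';
}
```

Every variant of the Bristol SEO URL then produces the same tag, while parameters that genuinely change the content are left in place.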

This implementation is a huge win for webmasters the world over, and shows the search engines’ commitment to providing the best quality results for searchers, while providing webmasters with simple solutions to problems.

Naturally, any SEO specialist worth their salt will recommend that you get around this issue by setting up your site well in the first place, but there are times when canonical URL issues and duplicate content are quite simply unavoidable, particularly in an ecommerce environment, and “rel=canonical” is a major step forwards, with plugins already available for WordPress, Drupal and Magento from Joost de Valk. Let’s hope that we see better quality search results appearing as a result of this!

Tags: canonical URLs, duplicate content, google, Live, redirects, SEO, webmasters, Wordpress, yahoo

2 comments

1 Cole February 13, 2009 at 1:27 am

great summary James, thats pretty much the silver bullet for unavoidable dupe content. How similar to the ‘wrong’ URL does the rel have to be i wonder? Bet it won’t work for a certain type of nav we all know and love ;o)

2 Justin March February 14, 2009 at 10:03 am

Good question Cole the best explanation that I have seen of this can be found here: http://www.seomoz.org/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps#jtc78473
