At Pubcon there was a session on duplicate content, with speakers from Ask.com, Google, and Yahoo. All of the search engines agree that sites do not get penalized for duplicate content; it is one of the largest SEO myths on the web. As you can imagine, almost every large site on the web has some duplicate content issues. But duplicate content can still hurt your site. The largest problem is that when the bots start finding duplicate content, they may stop crawling your site, so duplicate content can keep your site from being completely crawled and indexed. Also, when the search engines find duplicate content, they must decide which version is the original or the most important. That can go badly if, for example, they pick a print page that has no navigation. So it is still a large problem, and it must be addressed to achieve your best search rankings.
The most common reasons for duplicate content are:
- Having multiple URLs pointing to the same pages.
- Print pages
- Having the same version of content on different country sites
- Problems with dynamic sites like session IDs
- Syndication of content on other sites
- Mirrors
It also seems that many people have problems with duplicate content due to third parties scraping their sites. To check for this, simply search Google for a sentence of text from your page (without punctuation) and see if any other pages come up. If any do, the only thing you can do is ask the site owners nicely to take the content down, and then involve the lawyers if they refuse.
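The scraper check above can be scripted if you have many pages to test. This is only a minimal sketch: the function names are made up for illustration, and wrapping the sentence in quotes to force an exact-phrase match is my own assumption, not something the speakers recommended.

```python
import string
from urllib.parse import quote_plus


def scraper_check_query(sentence: str) -> str:
    """Strip ASCII punctuation from a sentence and quote it as an exact phrase."""
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    # Collapse any extra whitespace left behind once punctuation is removed.
    cleaned = " ".join(cleaned.split())
    return f'"{cleaned}"'


def google_search_url(sentence: str) -> str:
    """Build a search URL you can open in a browser to look for copies."""
    return "https://www.google.com/search?q=" + quote_plus(scraper_check_query(sentence))


if __name__ == "__main__":
    print(google_search_url("Duplicate content, it turns out, can hurt!"))
```

Pick a sentence that is distinctive to your page; a generic sentence will match plenty of unrelated sites.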
Here are some tips to remove duplicate content issues:
- Act on what you have control over! Use robots.txt and noindex tags. Remove session IDs, affiliate IDs, and tracking IDs from your URLs. Another problem is lists with multiple sort options, where each sort option often creates a new URL. One slick solution is to put the sort option in a cookie instead of the URL. This also provides a better user experience, because the site remembers the sort order the visitor last used.
- Use distinct TLDs (Top Level Domains) when localizing your site and make the content unique whenever you can.
- Make it hard for scrapers. First, use copyright or Creative Commons notices on your site and make it clear you don’t allow others to use your content. Absolute links often mess up scrapers, and hosting images locally also helps. Don’t be afraid to take legal action if you need to.
- If you have problems don’t hesitate to contact the search engines and let them know. All the search engine reps seemed very interested in helping solve duplicate content issues.
- Use the webmaster tools that the search engines provide. Google Webmaster Tools, for example, lets you easily remove duplicate content links in its interface. Yahoo also has a nocontent tag you can use on your divs to tell the search engines not to crawl parts of a page. They also have a great new feature called dynamic URL rewriting in their Site Explorer.
- It used to be thought that www.site.com and site.com would cause duplicate content issues. The search engines are smart enough to make this a non-issue nowadays. But everyone still said it was good practice to make the non-www version 301 redirect to the www version of your site.
- Make RSS versions uncrawlable.
- Regularly review URL requests at a server level to see what pages are getting crawled. This is a great way to find duplicate content issues.
- Know your URL parameters. Ideally you would not have parameters in your URLs. But if you do have them, make sure you know what they all do. Also make sure you put them in the same order each time. If the parameters appear in a different order, the result will look like a different URL.
- Get a regular crawl report from your analytics program or from your log files.
- Use 301 redirects whenever a page or content moves.
- If you use syndicated content on your site you need to set your expectations. Syndicated content will never rank as well as the original.
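Several of the tips above (stripping session and tracking IDs, keeping parameters in a consistent order, and reviewing crawled URLs for duplicates) come down to reducing each URL to one canonical form. Here is a rough sketch of that idea. The parameter names in `IGNORED_PARAMS` are hypothetical placeholders, so swap in whatever your own site actually uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names for session, affiliate, and tracking parameters.
IGNORED_PARAMS = {"sessionid", "sid", "affid", "trackid"}


def canonical_url(url: str) -> str:
    """Drop ignorable parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    params.sort()  # same parameters always end up in the same order
    # Host names are case-insensitive, so lowercase them; drop any #fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(params), "")
    )


def group_duplicates(urls):
    """Map each canonical URL to the distinct variants that collapse into it."""
    groups = defaultdict(set)
    for url in urls:
        groups[canonical_url(url)].add(url)
    # Only keep canonical URLs reached through more than one variant.
    return {canon: sorted(variants) for canon, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    crawled = [
        "http://www.example.com/list?size=9&sid=abc123&cat=shoes",
        "http://www.example.com/list?cat=shoes&size=9",
        "http://www.example.com/about",
    ]
    for canon, variants in group_duplicates(crawled).items():
        print(canon, "<-", variants)
```

Running `group_duplicates` over the URLs pulled from your server logs is one way to spot pages the bots are fetching under several addresses. The same `canonical_url` function could also be used when generating links, so your site only ever emits one form of each URL.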
There are many of these things I still need to do on my sites. I have removed many duplicate content issues in the last few months (since SES) but I still have a ton to do.







