I created my own duplicate content issue by accident and Google showed me straight away that they rule the search results.
No messing about, they decided what they wanted to show. Not me.
So let me tell you the story of how it happened before I show you some of the evidence.
I am currently in the process of building a super agency, a consortium of 15 complementary businesses who can deliver every marketing function and online service. Our goal is to help everyone, from new start-up businesses to established corporate organisations, achieve better results.
It is called YewBiz.
The first few pages of the YewBiz.com website had been created and I was in the process of uploading it to our live hosting server so that we could start attracting traffic. Now, I was doing this at 11pm, which is why I ran into problems.
The website was successfully uploaded, our marketing-focussed Umbraco CMS and database were in place, and the website could be accessed perfectly fine on http://www.yewbiz.com.
As I went to apply the SSL certificate to the domain to allow a secure https connection, I discovered our hosting partner had no available IP addresses to assign to us. I instantly raised a support ticket asking for an IP address.
Given that this was now close to midnight, needless to say, a quick response was not forthcoming.
As I headed to bed, I knew I would wake up to a response. Sure enough, in the morning a new IP had been allocated. Ten minutes later, the SSL certificate was imported and the website was live on https://www.yewbiz.com.
My learnings, and the moral of this part of the story: if you can, don’t upload a website outwith key hosting support hours.
Google Wasn’t Asleep
Whilst I rested my head during the night, Google had been busy crawling the beginnings of our new website. By the time the sun had risen, they had already indexed all seven new pages on the http version (http://www.yewbiz.com). A simple “site:www.yewbiz.com” command proved this.
At least at this point I knew our internal linking structure passed the Google test, as they could find all the pages.
Sometimes Google Need A Bit of Help
Google encourage webmasters to use their Search Console tool to keep them informed about your website(s). If you don’t use Google Search Console (GSC), then I highly recommend that you do. If you need help setting this up, or advice on best practice, then please get in touch.
So now that I knew the website was working OK on the https version, I logged into GSC to properly configure it for the non-www, www, http and https variants and, amongst other things, to set our preferred version and list our sitemap.xml file.
By the end of the day, Google had already started indexing the web pages on the https version.
Confession - Something You Should Know
I need to confess something at this point. When the website was first uploaded, it had self-referencing canonical tags on each page. So www.yewbiz.com/about-us/ had <link rel="canonical" href="http://www.yewbiz.com/about-us/" />
Then, when the website was changed to https, the canonical tag became <link rel="canonical" href="https://www.yewbiz.com/about-us/" />
The tags didn’t consistently point to a single preferred version, as Google’s guidelines recommend. Given that less than 12 hours had passed since the website was first uploaded, I hoped for some forgiveness.
If you are reading this and own a website, and don’t have any idea what a canonical tag is, please get in touch so I can help you.
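If you would like to check your own pages, here is a minimal sketch of how you might inspect a page’s canonical tag. This is my own illustrative script (not part of YewBiz or any Google tool), written in Python using only the standard library, and the preferred prefix shown is just an example:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag on a page."""
    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonicals.append(attrs.get("href"))

def check_canonical(page_html, preferred_prefix="https://www.yewbiz.com/"):
    """Return (ok, hrefs): ok is True only when exactly one canonical
    tag exists and it points at the preferred (https, www) variant."""
    finder = CanonicalFinder()
    finder.feed(page_html)
    hrefs = finder.canonicals
    ok = len(hrefs) == 1 and hrefs[0].startswith(preferred_prefix)
    return ok, hrefs

# Example: the http canonical that caused my duplicate content issue
html = '<head><link rel="canonical" href="http://www.yewbiz.com/about-us/"></head>'
print(check_canonical(html))  # the http version fails the check
```

Run against your own page source (with your own preferred prefix), this flags both missing canonical tags and ones pointing at the wrong protocol variant.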
Back To The Purpose Of This Post
During the course of my work, I still find the same content being served on multiple URLs within the same site. More worryingly, I find clients who are sharing the same content across multiple websites (with the only difference being colour schemes and logos). There is not a single canonical tag in sight.
They are harming themselves: in reality, they are unlikely to rank both websites (or pages) highly in Google, so they are diluting their marketing efforts.
Take the time to read the first two paragraphs in the Google content guideline below. Google clearly state that they will view these pages as duplicates and decide themselves which to use. They also state it “might lead to unwanted behaviour”.
Source: Google Guidelines
This is basically what happened to YewBiz when I could not set up the SSL at the time of going live and the non-SSL version was indexed.
As soon as the Google crawlers picked up the SSL (https) version, they chose which version to include in their index. This is shown below with the mixture of https and non-https variants.
Google are even kind enough to inform me that they have chosen to ignore what they have deemed duplicates (highlighted in the red box).
When I click on the link to view the omitted results, I see:
Now you can see the http and https variants of the duplicate pages.
For those wondering: yes, there are 301 redirects set up to redirect from the http version to https.
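For anyone who needs to set the same thing up, this is roughly what such a rule can look like in a web.config on an IIS host (which is how Umbraco sites are typically hosted). This is an illustrative sketch, not the exact YewBiz configuration, and it assumes the IIS URL Rewrite module is installed:

```xml
<!-- web.config fragment: permanently (301) redirect all http traffic to https -->
<system.webServer>
  <rewrite>
    <rules>
      <rule name="Redirect http to https" stopProcessing="true">
        <match url="(.*)" />
        <conditions>
          <add input="{HTTPS}" pattern="off" />
        </conditions>
        <action type="Redirect" url="https://{HTTP_HOST}/{R:1}" redirectType="Permanent" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>
```

The `redirectType="Permanent"` attribute is what makes this a 301 rather than a temporary 302, which matters because a 301 tells Google the move is intentional and lasting.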
To summarise this section: both variants had self-referencing canonical tags in place, the gap between one version going live and the other was approximately 8 hours, proper 301 redirects were then put in place, and the preferred domain was set in Google Search Console as soon as the https version was live.
And if we remember, Google prefer SSL-protected websites. It is a trust signal and a small ranking factor.
Yet Google opted (and still do, 5 days later) to show the non-https version of their own accord.
What Does This Mean
Hopefully, this demonstrates how Google select what does and does not get shown in their search result pages, even when there is no wilful plagiarism in play. In this case, it was a small technical issue which spanned a period of several hours – not days, weeks, months or years.
For those who steal, borrow or beg content from other websites without the proper referencing in place, you are not doing yourself any favours.
Whenever I conduct a “site:” search on Google and see a message similar to below, I know there is an issue, either technical or straight-up duplicated content, to be fixed.
It is important that you regularly audit your website(s) to check for issues.
Do a simple test now. Search in Google prefixing your website address with site:
If you see a message saying that Google is omitting results (you might have to go to the last page of the Google results to see the message), then you have a problem.
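To save a little typing, here is a tiny Python sketch (my own illustration, not an official Google tool) that builds the site: search URL for any domain, ready to paste into a browser:

```python
from urllib.parse import quote_plus

def site_search_url(domain):
    """Build a Google search URL for a site: query on the given domain."""
    return "https://www.google.com/search?q=" + quote_plus("site:" + domain)

print(site_search_url("www.yewbiz.com"))
# https://www.google.com/search?q=site%3Awww.yewbiz.com
```

Open the resulting URL, page to the end of the results, and look for the omitted-results message described above.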
Even The Big Companies Fall Foul
It tends to be that the bigger you are, the harder it is to keep track of things. When I use Costa Coffee as an example, I note that Google has indexed 636 pages from their site.
Yet, when I do the test on them, I see Google is only showing 330 results for their website. That means Google doesn’t like 306 pages on the Costa site.
Is it a technical issue, duplicate content issue, or perhaps something else? If and when Costa hire me, I will let them know.
The Sales Bit
It would not be business-like to finish off without offering you some form of sales hook.