John Mueller has shared brief details on a predictive method Google uses to detect duplicate content based on URL patterns, which can lead to pages being flagged as copied or duplicate web pages even when their content differs.
Google attempts to predict when web pages are likely to contain the same content based on their URLs alone, which reduces the crawling and indexing of unnecessary pages.
When Google crawls pages with a similar URL structure and finds the exact same content on them, it may assume that other pages following the same URL pattern are duplicates as well.
Unfair For Site Owners
This is unfortunate for site owners, because useful pages with unique content can be treated as duplicates simply because they share a similar URL structure, and those pages may be left out of Google's index entirely.
John Mueller On Forecasting Duplicate Content
In general, Google uses several levels of analysis to determine whether web pages contain duplicate content. Two of them are:
- When Google looks at the page content directly
- When Google treats pages as duplicates on the basis of their URLs
What tends to happen on our side is we have multiple levels of trying to understand when there is duplicate content on a site. And one is when we look at the page’s content directly and we kind of see, well, this page has this content, this page has a different content, we should treat them as separate pages.
The other thing is kind of a broader predictive approach that we have where we look at the URL structure of a website where we see, well, in the past, when we’ve looked at URLs that look like this, we’ve seen they have the same content as URLs like this. And then we’ll essentially learn that pattern and say, URLs that look like this are the same as URLs that look like this.
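The pattern-learning approach Mueller describes can be illustrated with a minimal sketch. Everything below is an assumption for illustration only (the URL pattern, the `/city/` path segment, and the `example.com` domain are hypothetical): it collapses the varying path segment into a placeholder, then checks whether all sampled URLs sharing that pattern served identical content.

```python
import re
from collections import defaultdict

def url_signature(url: str) -> str:
    """Collapse the varying path segment (here, a city slug) into a
    placeholder, so URLs differing only in that segment share a signature.
    The regex below is a hypothetical example, not Google's actual rule."""
    return re.sub(r"/city/[^/]+", "/city/{slug}", url)

def predict_duplicates(url_content_pairs):
    """Group URLs by signature; if every sampled URL in a group served the
    same content, predict that other URLs matching the pattern are
    duplicates too (so they need not all be crawled)."""
    groups = defaultdict(list)
    for url, content in url_content_pairs:
        groups[url_signature(url)].append(content)
    return {
        sig: len(set(contents)) == 1  # True -> pattern looks duplicative
        for sig, contents in groups.items()
    }

sample = [
    ("https://example.com/city/springfield/plumbers", "same listings"),
    ("https://example.com/city/shelbyville/plumbers", "same listings"),
    ("https://example.com/city/ogdenville/plumbers", "same listings"),
]
print(predict_duplicates(sample))
# {'https://example.com/city/{slug}/plumbers': True}
```

Once such a pattern is learned, a system could skip crawling new URLs that match it, which is exactly the resource-saving behavior Mueller goes on to describe.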
Later in the video, Mueller explains how Google saves crawling and indexing resources: when Google predicts that a page is likely a duplicate of another, it may not crawl that page at all.
Even without looking at the individual URLs we can sometimes say, well, we’ll save ourselves some crawling and indexing and just focus on these assumed or very likely duplication cases. And I have seen that happen with things like cities.
I have seen that happen with things like, I don’t know, automobiles is another one where we saw that happen, where essentially our systems recognize that what you specify as a city name is something that is not so relevant for the actual URLs. And usually we learn that kind of pattern when a site provides a lot of the same content with alternate names.
How Website Owners Can Solve This Problem
So what I would try to do in a case like this is to see if you have this kind of situations where you have strong overlaps of content and to try to find ways to limit that as much as possible.
And that could be by using something like a rel canonical on the page and saying, well, this small city that is right outside the big city, I’ll set the canonical to the big city because it shows exactly the same content.
So that really every URL that we crawl on your website and index, we can see, well, this URL and its content are unique and it’s important for us to keep all of these URLs indexed.
Or we see clear information that this URL you know is supposed to be the same as this other one, you have maybe set up a redirect or you have a rel canonical set up there, and we can just focus on those main URLs and still understand that the city aspect there is critical for your individual pages.
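The canonical setup Mueller suggests can be sketched as a markup fragment. The URLs below are hypothetical examples, not from the hangout: a small-city page whose content duplicates the big-city page declares the big-city URL as its canonical.

```html
<!-- On the near-duplicate page, e.g.
     https://example.com/city/small-town/plumbers,
     point the canonical at the page whose content it repeats: -->
<link rel="canonical" href="https://example.com/city/big-city/plumbers" />
```

A 301 redirect from the duplicate URL to the main URL is the alternative Mueller mentions; use the redirect when the duplicate page does not need to remain reachable, and rel=canonical when it does.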
Google Search Central SEO Hangout Video