What You Don't Know About Duplicate Content Can Kill You

How to deal with duplicate content right now

All of us create and duplicate our own content unintentionally. Content can be fully or partially scraped by others. Duplicate content can cause your pages to not rank well on search engines, be removed from search results and even lead to legal complications.

Here’s my advice on how to identify and deal with duplicate content.

What is duplicate content?

Google Webmasters (now Google Search Console) defines duplicate content as “substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”

Search engines do a great job of showing the best possible content in response to a search query and of identifying the original content source as well. Adhering to the Webmaster guidelines will make it easier for search engines to understand which page holds the original content and help that page rank accordingly. Having too much duplicate content on your site can lead to a loss in rankings and organic traffic. Having other sites duplicate your content could get your content eliminated from search results, especially if the other sites have greater authority, have a higher number of links pointing to them and have provided no link attribution indicating that yours is the original content.

Needless to say, when dealing with Hispanic SEO, it is quite easy to generate duplicate content in two different languages or by geo-targeting. I have covered some of these issues in my International SEO article.

The first thing to understand is the different types of duplicate content: deliberate or malicious, and non-malicious.

Non-malicious duplicate content could happen on discussion forums that may generate a separate desktop and mobile page, in online stores’ product definition that may be repeated on several different and distinct pages, in comment pagination, in lack of definition for the preferred domain, and even when offering printer-only versions of your website pages. This type of duplicate content is not penalized by Google, although I highly recommend avoiding it as much as possible. What you don’t consider malicious, search engines may.

Malicious or deliberate duplicated content occurs when a website owner attempts to manipulate search results to rank better or increase traffic. This is penalized by Google with removal from search results.

Google Panda and duplicate content

The most malicious duplicate content of them all: Content scraping

Let’s say you are researching a topic you want to write about and happen upon a wonderfully written piece of content. Who hasn’t? What to do? What to do? As long as you do not scrape the article and you do provide proper attribution, you may cite it. Providing proper attribution will allow you to avoid being accused of plagiarism. Plagiarism is passing someone else’s work off as your own. This is different from copyright infringement, which is using someone else’s work protected by copyright law without permission, and which exposes you to being taken to civil or criminal court. Is it really worth it?

Even better yet, ask the content owner for permission. This is very easy to do. Usually there’s an email, a contact form, or a Twitter handle where you can simply write: “Hey! I love your content. Can I properly cite it? What type of attribution would be acceptable to you?” What do you think can come out of this? A resounding yes and the likelihood that they will share your content on their own network. Pretty nifty, huh?

How can I properly cite somebody else’s content and why?

You may think that writing the source alone provides clear attribution, but look at your article again. Does it read as if it’s yours and at the very end, there’s a footnote with the source? Then the attribution is not clear.

Another benefit of proper attribution is the respect earned from your own readers who will see you as an honest content developer. A third benefit is the appreciation of the person you have cited. Who knows? They may invite you to guest blog one day.

Are you afraid that your readers will exit your site to read the source instead? Then make your content even more engaging. Provide more value. When was the last time you were reading a great article that cited somebody else with a link and you clicked on it? Probably not recently. The number of people who will leave your site because of a citation or attribution is minimal, and it is important to remember that people will leave your site eventually anyway.

My recommendation for proper attribution of content is to enclose the text in quotes, indicate within the paragraph who said it and add a link to the source. There is no need to link to the source every time you mention them, but you should mention them each time you are citing something they said. That way, it is clear you are not misappropriating content and passing it off as yours.

How to provide proper attribution

If you decide to reword content but still share someone else’s original concept, add the name of the source to the paragraph and link it to the source page from there. Here’s an example:

Example of proper attribution. How to deal with duplicate content

Another example of attribution

Still worried that people might leave your site or that you will lose Google juice if you adhere to these practices? Let me assure you, you will get the exact opposite.

Penalization for duplicate content may not be too severe right now, but wait a couple of years and try to save your site from another “Panda.” As for people leaving your site, think about yourself: do you click on every link a page offers or do you keep on reading the article? If you are still afraid, improve your content, provide greater value to your reader and get rid of your fears.

If you decide to copy the whole article (and I strongly advise that you do not, especially if it’s one of my articles), I highly recommend citing the source at the very beginning and at the end. And, for your own protection as well as respect to the person that wrote the article, canonicalize the URL by pointing to the original URL. Then you can offer the content to your readers while letting search engines know which is the original article. Don’t know what “canonicalize” means? Don’t worry. Keep reading.

If you have a WordPress site and have the Yoast SEO plugin installed, add the URL to the canonical field.

Has somebody scraped my articles?

There are several ways to find out if your content has been scraped. A Google search is usually my very first check, and you can create a Google Alert out of that search to be alerted when somebody infringes your copyright or scrapes your site. Another great tool is Copyscape. Their free version will allow you to identify if your content has been duplicated on other sites.
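Here is a minimal sketch of that first check, assuming you paste in a distinctive sentence from your article. It wraps the sentence in quotes for an exact-phrase search and uses the `-site:` operator to leave your own domain out of the results. The function names and the `domain.com` domain are placeholders for illustration only.

```python
from urllib.parse import urlencode


def scrape_check_query(sentence: str, own_domain: str) -> str:
    """Build an exact-phrase query that excludes results from your own site."""
    # Strip quotes from the sentence so they do not break the exact-phrase search.
    phrase = sentence.strip().replace('"', "")
    return f'"{phrase}" -site:{own_domain}'


def scrape_check_url(sentence: str, own_domain: str) -> str:
    """Turn the query into a Google search URL you can open in a browser."""
    query = scrape_check_query(sentence, own_domain)
    return "https://www.google.com/search?" + urlencode({"q": query})


print(scrape_check_query("Canonicalization is key", "domain.com"))
print(scrape_check_url("Canonicalization is key", "domain.com"))
```

The same query text can be pasted into Google Alerts so that you hear about new copies as they appear.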

Infringing content removed by Google

Here’s a result for one of my articles, copied in its entirety by the first site. Yes, I have requested they take it down. I sent them an email and copied their hosting company. Did they? No, and a month passed by. What are your rights then? Submit a request reporting them to Google thanks to the DMCA (Digital Millennium Copyright Act). Does it work? Check it out on your own. 😉 A short time after I reported them, the page does not show up on search anymore and there’s a very nice footnote from Google about it.

The notice also gets posted to the Chilling Effects database (since renamed Lumen). Chilling Effects is a project of the Berkman Center for Internet & Society at Harvard University that collects notices of copyright infringement from the web.

Have I given you a great reason not to infringe copyright and to quote, provide proper citations and use lots of “according to…” and loving links instead? Good. Nothing that is not yours should read as if it was yours.

Karma’s a Bitch

How does scraping somebody else’s content affect my life? Simple: infringing copyright is illegal and you may end up in court. Have you scraped content and would like to find out if Google has filtered you out? Add “&filter=0” to the end of the URL of the search results page.
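If you prefer not to edit the address bar by hand, the same URL can be built with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the only thing it relies on is the “filter=0” parameter mentioned above.

```python
from urllib.parse import urlencode


def unfiltered_search_url(query: str) -> str:
    """Build a Google search URL with the filter=0 parameter appended."""
    return "https://www.google.com/search?" + urlencode({"q": query, "filter": "0"})


print(unfiltered_search_url("duplicate content"))
```

Open the printed URL next to the regular search for the same query and compare which pages show up in each.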

Content scraping SEO penalties

Non-malicious duplicate content and how to address it

We all have duplicate content… duplicated by ourselves!!! Some of these types of duplicate content are very common. One of the most common is the printer-friendly page. You know, those pages that pop up without any page formatting so people can print the page? If not properly addressed, these are exact copies of the original content. Always make sure that your printer-friendly page does NOT get indexed on search engines.
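One way to keep printer-friendly pages out of the index is to send the `X-Robots-Tag: noindex` response header with them. The sketch below only decides which URLs should get that header. It assumes, purely for illustration, that print versions live under a `/print/` path or carry a `print=1` parameter, so adapt the check to however your site builds those pages.

```python
from urllib.parse import urlsplit, parse_qs


def is_print_version(url: str) -> bool:
    """Guess whether a URL is a printer-friendly page (hypothetical URL patterns)."""
    parts = urlsplit(url)
    return parts.path.startswith("/print/") or parse_qs(parts.query).get("print") == ["1"]


def robots_headers(url: str) -> dict:
    """Return the extra response headers to send for this URL."""
    if is_print_version(url):
        # Tells search engines not to index this copy of the content.
        return {"X-Robots-Tag": "noindex"}
    return {}


print(robots_headers("http://domain.com/print/articlename"))
print(robots_headers("http://domain.com/articlename"))
```

A robots meta tag with the value noindex in the head section of the print page is an alternative when you cannot control response headers.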

Do you understand URLs?

A rose by any other name is a rose. This is a great analogy for understanding URLs, or Uniform Resource Locators. The URL is the true address of a page and it’s where search engines can find the content you have so painstakingly developed.

In real life, every home has an address that’s unique to that home. We can append modifiers to it, like “the last one on the block” or “the one with the blue door” but people know how to resolve these directions and end up on the same address. The problem is that search engines need a bit more direction than that. Let me show you some examples of URLs that we know are the same but search engines understand them as different:

http://domain.com and http://domain.com/index.php

http://domain.com/articlename and http://domain.com/articlename?sessionid=1234

http://domain.com/product and http://domain.com/product/?ref=name

http://domain.com and http://www.domain.com

http://domain.com and http://domain.com/

http://domain.com and https://domain.com

http://domain.com/categorya/producta and http://domain.com/categoryb/producta

Session IDs, URL parameters, printer-friendly page versions and even a trailing slash at the end of an address are interpreted as a different URL by search engines if proper directions have not been given. To complicate matters more, think about those pages that can be displayed under a couple of categories when the category is part of the URL. For example, an article that can be found under both social media and SEO. This is not an exhaustive list; there are many more situations where this type of content duplication occurs.
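To see how these variants collapse into a single address, here is a minimal sketch that normalizes the example URLs above. The choices it makes (https, no www, no trailing slash, and `sessionid` and `ref` treated as parameters that do not change the content) are assumptions for illustration; pick whatever preferred form suits your own site and apply it consistently.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical choices for this sketch; adjust them to your own site.
IGNORED_PARAMS = {"sessionid", "ref"}
DEFAULT_DOCUMENTS = {"index.php", "index.html"}


def normalize_url(url: str) -> str:
    """Collapse common duplicate-URL variants into one preferred form."""
    parts = urlsplit(url)

    # Prefer the bare (non-www) host name.
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Drop default documents such as index.php and any trailing slash.
    segments = [s for s in parts.path.split("/") if s]
    if segments and segments[-1].lower() in DEFAULT_DOCUMENTS:
        segments.pop()
    path = "/" + "/".join(segments) if segments else ""

    # Remove parameters that do not change what the page displays.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]

    # Always emit https and discard any fragment.
    return urlunsplit(("https", host, path, urlencode(kept), ""))


for variant in (
    "http://domain.com/index.php",
    "http://www.domain.com/",
    "http://domain.com/articlename?sessionid=1234",
    "http://domain.com/product/?ref=name",
):
    print(variant, "->", normalize_url(variant))
```

Note that this does not help with the category case: two different category paths for the same product still need a decision from you about which one is the original.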

Canonicali…. what???

Here comes the concept of canonicalization. A tongue twister on its own (try to conjugate the verb really fast!), it ends up being much forgotten by developers and SEOs alike. It is not an easy concept unless you have some technical knowledge, but I’ll try my best.

Canonicalizing a URL means adding a signal for search engines that says: no matter what the address looks like, whenever this content is displayed, this is the address of the original version and the one that should be indexed.

The good news for those on large platforms like WordPress, Shopify, Drupal or Joomla is that there are plugins and apps that can help with it. Otherwise, you need to add the canonical tag to the head section of the page. This indicates to search engines which is the original version of the page.
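For those adding it by hand, the canonical tag is a single `link` element with `rel="canonical"` whose `href` is the preferred URL. A small sketch that generates it, with a made-up function name and the example domain used earlier:

```python
from html import escape


def canonical_link_tag(preferred_url: str) -> str:
    """Return the link element that belongs in the head section of the page."""
    # Escape the URL so characters such as & or quotes cannot break the markup.
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


print(canonical_link_tag("https://domain.com/articlename"))
# prints: <link rel="canonical" href="https://domain.com/articlename">
```

Every variant of the page, including the preferred one, carries the same tag pointing at the same preferred URL.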

Canonicalization is key to learning how to deal with duplicate content. Two words of caution: do not use a canonical tag when a re-direct is what you need, and keep in mind that search engines treat the tag as a hint and may choose to ignore it.

What are canonical url tags?

And the duplicate content saga continues

URLs are not the only way of unintentionally creating duplicate content. Repeating paragraphs all over your site to emphasize a concept is a great way to tell the search engine that you don’t know which page is the most relevant for it.
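If you can export the text of your pages, spotting repeated paragraphs is easy to automate. The sketch below assumes a hypothetical `pages` dictionary that maps each URL to its plain text, with paragraphs separated by blank lines. It ignores very short paragraphs and reports every paragraph that appears on more than one page.

```python
from collections import defaultdict


def repeated_paragraphs(pages: dict, min_words: int = 20) -> dict:
    """Map each repeated paragraph to the sorted list of URLs that contain it."""
    seen = defaultdict(set)
    for url, text in pages.items():
        for paragraph in text.split("\n\n"):
            # Lowercase and collapse whitespace so trivial differences do not hide a match.
            normalized = " ".join(paragraph.lower().split())
            if len(normalized.split()) >= min_words:
                seen[normalized].add(url)
    return {p: sorted(urls) for p, urls in seen.items() if len(urls) > 1}


pages = {
    "/seo": "Our agency has helped brands grow since day one.\n\nSEO tips go here.",
    "/social": "Social tips go here.\n\nOur agency has helped brands grow since day one.",
}
for paragraph, urls in repeated_paragraphs(pages, min_words=5).items():
    print(urls, "share:", paragraph)
```

This only catches exact repeats after normalization; reworded copies of the same paragraph need a fuzzier comparison.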

Another great way of generating duplicate content is adding your bio to all of your article footers, even though so many website owners feel proud to see their bio there. Do you think it’s a good idea to disseminate the bio that is on your site to everybody that requests your bio? Absolutely not. I make it a point to create a different bio for publishing on other sites. Some of them a bit more alike than others, but definitely different than the one I publish on my site. This is why it really upsets me when somebody scrapes my website bio to add to the sites where I collaborate. Yes, it’s a shortcut and you may think nothing of it. But if somebody is collaborating with you for free, shouldn’t you just ask for them to also provide you with the bio they want published?

Now, let’s tackle content syndication. We all want to see our content shared all over the web. Hey! Let’s plaster it everywhere, what do we care? NOT! When you syndicate your content you are creating copies of it, exact duplicates. Mmmm.. which one is the original one that should be indexed by search engines? I wonder. Back to canonicalization? But how can you control other people pointing at your URL? Do they even know how to do that? Maybe you can ask for their article to carry a no-index tag. And maybe you should only syndicate a particular, different version of your article. Add a slight spin to it and syndicate.

Duplicate titles, descriptions and snippets are another great way to generate duplicate content. Think about it for a minute. If I show you two articles with the same title and description, which one will you choose? They must be the same, correct? Search engines, however, weigh other factors like the URL. When the same title and description appear under different URLs, they may treat the pages as duplicates of each other.
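A quick way to find these on your own is to group your pages by title and description. This sketch assumes you already have a list of pages, each with hypothetical `url`, `title` and `description` fields, for example from a crawl export.

```python
from collections import defaultdict


def duplicate_snippets(pages: list) -> list:
    """Return groups of URLs that share the same title and description."""
    groups = defaultdict(list)
    for page in pages:
        # Compare case-insensitively and ignore surrounding whitespace.
        key = (page["title"].strip().lower(), page["description"].strip().lower())
        groups[key].append(page["url"])
    return [urls for urls in groups.values() if len(urls) > 1]


pages = [
    {"url": "/product", "title": "Blue Widget", "description": "Our best widget."},
    {"url": "/product?ref=name", "title": "Blue Widget", "description": "Our best widget."},
    {"url": "/about", "title": "About Us", "description": "Who we are."},
]
print(duplicate_snippets(pages))
```

Each group it returns is a candidate for a canonical tag, a re-direct or a rewrite of the title and description.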

If you have a Google Search Console account, you can identify most of these pieces of duplicate content under Search Appearance >> HTML improvements.

What Duplicate Content Boils Down To

Going back to the physical address analogy, there could be many ways for someone to indicate how to get to the same house, but search engines are not people and they will think each unique description is a different house altogether.

Generating confusion for search engines is not where you want to be. First, because search engines will display only one of your “many pages” as they are identical in content and the search engine has a hard time determining which one is the most relevant. Second, people may link to the different URLs and this reduces the authority of your page.

I suggest you begin by addressing a list of duplicate content with the implementation of canonical tags, using 301 re-directs when needed, linking back to the original content, utilizing Google’s URL parameters tool and Bing’s Ignore Parameters Tool, improving your URL structure and avoiding the creation of duplicate content whenever possible.
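As a rough illustration of the 301 re-direct part, here is a sketch of the decision a server makes when a request arrives on the wrong scheme or host. The preferred scheme and host are assumptions for the example, and in practice this rule usually lives in your web server or platform configuration rather than in Python.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this example only.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "domain.com"


def redirect_for(url: str):
    """Return (301, headers) when the URL should be re-directed, else None."""
    parts = urlsplit(url)
    if parts.scheme == PREFERRED_SCHEME and parts.netloc.lower() == PREFERRED_HOST:
        return None  # Already on the preferred address; serve the page.
    target = urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, parts.path, parts.query, ""))
    # 301 tells browsers and search engines the move is permanent.
    return 301, {"Location": target}


print(redirect_for("http://www.domain.com/product?x=1"))
print(redirect_for("https://domain.com/product"))
```

Use a re-direct like this when the duplicate address never needs to be shown to visitors, and a canonical tag when both addresses must keep working.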

I hope this has been a helpful little guide on how to address duplicate content. It is by no means exhaustive. There are many other amazing SEOs who have written about it in much more depth. But feel free to ask questions in the comments section below. I’ll do my best to address them.


Rebecca says:

09/16/2015 at 22:01

Great informative content about content.(no pun intended) . More and more people are turning to “Blogs” as another venue for Person recognition and Image awareness and there has to be a conscious effort about what is written rather than plagiarism.

Havi Goffan says:
09/17/2015 at 00:14

Agree 100% Rebecca. People should get recognition because of their own merits. If writing is not their thing, they should try something else. Duplicate content always backfires. Thank you for your comment!!
Havi

Jackie de Burca says:

10/30/2015 at 05:02

So very logically thought out, Havi 🙂 I put myself into the head space of someone who knows less about the subject than I do…you’ve explained it perfectly and punctuated with personality, and I adore the Pandas who hate duplicate content. I also entirely agree with Rebecca’s comment, particularly as someone who loves creating what I hope is good content, and making it visually appealing.

Havi Goffan says:
11/06/2015 at 17:50

Thank you so much for your comments, Jackie. Understanding duplicate content can be a bit overwhelming and dry for most people and I tried my best to make it clear and engaging. Happy to hear that you did.
Warmly,
Havi
