Duplicate content is content that appears on the Internet in more than one place, where a place is a location with a unique website address (URL).
According to Google: "Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin". About 25% of all web content is duplicated.
Search engines index websites across the Internet, and their goal is to provide web users with unique and relevant information, so they try to avoid showing duplicate content.
How does Google handle duplicate content? When a search engine finds duplicates, it groups them together and treats them as a single piece of content. It declares only one version as the original and shows that version on the search results page.
There are two types of duplicate content:
- Technical: URL variations, HTTP vs HTTPS, index pages, alternate page versions such as AMP or print pages, pagination, etc.
- Editorial: content scraped or copied by editors
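To illustrate how technical duplicates arise, the following minimal Python sketch collapses a few common URL variations (HTTP vs HTTPS, a "www." prefix, index pages, trailing slashes) into one form and groups the addresses that end up identical. The function names, the list of index page names, and the decision to treat "www." and non-"www." hosts as the same site are assumptions made for the example; this is not how the Pangea Team or any search engine actually detects duplicates.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Assumed index page names; a real site may use others.
INDEX_NAMES = ("index.html", "index.htm", "index.php")


def normalize_url(url: str) -> str:
    """Collapse common technical variations of a URL into one form."""
    parts = urlsplit(url)

    # Host names are case-insensitive; dropping "www." is an assumption.
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Treat ".../index.html" as the directory it lives in.
    path = parts.path or "/"
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break

    # Drop trailing slashes, but keep the root path as "/".
    if len(path) > 1:
        path = path.rstrip("/") or "/"

    # Force HTTPS, keep the query string, drop the fragment.
    return urlunsplit(("https", host, path, parts.query, ""))


def group_variations(urls):
    """Group URLs that normalize to the same address."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url)].append(url)
    return dict(groups)
```

Any group that contains more than one address points to pages that a search engine may see as separate locations with the same content.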
For technical duplicates, the Pangea Team monitors the websites regularly. When a technical duplicate is detected, the issue is analyzed and, if it was caused by our code, fixed.
Editorial duplicates appear when text is copied from another source in an inappropriate way, or when a show is recorded automatically without changing the radio show name.
To prevent duplicates from competing with the original piece of content, we have developed two tools: Copy and Edit, for content from another site (Pangea CMS > Search > Search place), and the Syndication tool for media content (Pangea CMS > Settings > Syndication tool). Both allow sharing content from one site to another with the proper tags, so that search engines can identify the original source.
For services that have regular shows and have set up automatic recording, we strongly recommend recording the show into draft status and then customizing its metadata (title and introduction) for the specific episode before publishing it. A difference in date and time alone is not enough to stop recordings from creating duplicate content; each episode's title and introduction has to be updated. It is also possible to keep the show name and add a unique identifier of the episode to the metadata, for example through Tags. See NPR for an example.
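As a rough illustration of the check described above, the following Python sketch finds recorded episodes that share the same title and introduction. The episode structure (dictionaries with "id", "title" and "introduction" keys) is hypothetical and does not reflect the Pangea CMS data model.

```python
from collections import defaultdict


def find_repeated_metadata(episodes):
    """Return lists of episode ids that share the same title and introduction.

    `episodes` is a list of dicts with hypothetical keys
    "id", "title" and "introduction".
    """
    groups = defaultdict(list)
    for episode in episodes:
        # Compare case-insensitively and ignore surrounding whitespace.
        key = (
            episode["title"].strip().lower(),
            episode["introduction"].strip().lower(),
        )
        groups[key].append(episode["id"])
    # Only groups with more than one episode are potential duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Episodes that differ only in their recording date and time would end up in the same group, which is the situation the recommendation is meant to avoid.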
According to Google's announcement, there is currently no penalty for duplicate content. However, because Google hides duplicates and shows only the original piece of content, content that is seen as a duplicate will be harder for the end user to find.
Example of duplicate content found on Pangea sites:
Copying content from another site
If a story or media item has already been created and published by one Service and another Service wants to publish it on their site, the content should be copied properly in Pangea CMS.
To copy the content, go to the main search, select the Service that you know has the content in the Search place field, and click the Copy & Edit from <site> button:
A new article edit page opens with all metadata from the original site except zones, authors, and slug.
After you save the copied item to your site, it will have a canonical link to the original, so that search engines can determine which version is the original content.
When a web user searches for the content, search engines use the canonical link to identify the original.
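In HTML, a canonical link is a `<link rel="canonical" href="...">` element in the page head. The following Python sketch, which uses only the standard library, shows one way to read that link from a page in order to verify where a copied item points. The class and function names and the sample URL are invented for the example; the exact markup produced by Pangea CMS is not shown here.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page head of a copied article.
sample = """
<html><head>
  <title>Copied story</title>
  <link rel="canonical" href="https://original.example.org/a/story.html">
</head><body></body></html>
"""
print(find_canonical(sample))
```

If the function returns the address of the original article, the copy is correctly marked; if it returns None, the page declares no canonical link.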