Scraping old content and keeping Google in the dark.
Google’s duplicate content filters have proved the undoing of many a great quick-profit scheme. Here, we’re going to take a look at how we can scrape content and stay off the Google radar.
For those of you who are unfamiliar with the concept of scraping, it basically involves duplicating content and rehashing it for a new website. There are many scripts in the Black Hat world which can scan through large sites like Amazon and Wikipedia, grabbing the content they need along the way.
Of course, the great dilemma is how we can use this data without taking the duplicate content penalty from Google.
Well, there are suggestions galore. We can mash content together from a mixture of websites in the hope that it is scrambled enough to pass without warning, or we can apply one of the many keyword re-writing tools that attempt to render an article unrecognizable.
Both of these methods fall short.
Merging different content sources may work to offset Google’s suspicion, but it will also make a royal mess of your content. Expect articles and pages that read with the fluency of a foreign language - one which will make no sense to the reader.
Keyword re-writing tools will remain burdened by limitations until a machine reader is invented that can understand human context. If I chuck a 500-word article into a re-writing script, the chances are it will pass the duplicate content acid test. But once again, it will read terribly, with robotic word-for-word replacements. And these alone can act as a giveaway to Google - especially if a nark sends a tell-tale report Matt Cutts’ way.
So what can we do? How can we scrape content and stay both readable to humans and original to search engines?
The answer is to grab content that no longer exists. Content that no longer exists in the index of Google.
Think about it for a second. Google may be the one-stop shop for information on the web in 2008, but what it can’t let you do is search like it’s 1999 all over again. Well, it can, if you dip into the Google cache.
Every day, gigabytes upon gigabytes of web content are dropping out of Google’s index. It could be for one of many reasons. The site in question may have been banned, it may have started using inaccessible code, or it may have fallen by the wayside with an expired domain.
Either way, that web content is there for the taking and we simply need to get to it to claim it as ours.
Archive.org is your new best friend. It acts as a library of web pages, capturing them as they existed at a point in time, so we can target de-indexed websites and recycle their content for our own needs.
Now, there are many ways in which you can go about doing this. I’m going to suggest two of my personal favorites, but if you put your imagination to use, there are some fiendishly tight plans that can be rolled out to make a quick and substantial profit.
First of all, expired domain lists. What about them? Well, they give us a great starting point for finding content that’s about to fall out of the Google index.
Now, you’ll find many expired domain lists on the Net. My personal favorite is ExpiredDomains.com, which offers both a free search and a full paid service. You can use the free search to find expiring domains, but it’s worth the upgrade to get ahead of the game and to save time otherwise wasted analyzing redundant domains.
Run a simple keyword search for sites related to the niche that you’re marketing. I might search “Spanish” and receive a list of Spain-related expiring domains. From here, I’ll use the member tools to scrutinize which domains have PR and backlinks attached - usually a giveaway sign that there’s content parked on the site.
Happy that I’ve found a soon-to-be delisted website, I’ll run the URL through Archive.org and use the captured archive pages to scrape my content. You may wish to use a simple PHP script for this purpose and set it to grab all content from within a set region. Remember, the content you’re scraping could be from last week, last month or even last century. The crux is that it’s no longer going to get a duplicate content penalty from Google.
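To make the Archive.org step a little more concrete, here is a minimal sketch of checking whether a capture of a domain exists at all. It is written in Python rather than the PHP mentioned above, and it assumes the Wayback Machine's public availability endpoint and its JSON response shape; the function names are my own and purely illustrative. It only looks up the closest snapshot URL and does not fetch or parse any page content.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine availability API.
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(domain, timestamp=None):
    """Build the lookup URL for a domain; timestamp is an optional YYYYMMDD string."""
    params = {"url": domain}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the closest snapshot URL from a decoded response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(domain, timestamp=None):
    """Network call: ask the archive whether it holds a capture of the domain."""
    with urlopen(build_query(domain, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("example.com", "20060101")` would return the URL of the capture nearest to that date, or `None` if the archive holds nothing for the domain.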
I now have 15 or 20 highly targeted pages for my niche market which I can recycle and implement on my money site.
Next up, we want to be sure that the content we’re scraping hasn’t been moved or duplicated elsewhere. I’ll usually run my pages through a simple Copyscape test and if the results return unique, we’re good to go. Google will soon find the recycled content and treat it as unique - or at worst, assume that the pages have been moved given that there are no other copies of them.
We sit back and let our scraped content do wonders for our long tail search terms and just like that, we’ve saved ourselves potentially hundreds of dollars that would have been spent outsourcing articles to the Far East!
I mentioned that I’d share my two favorite methods of “desert scraping” as it’s called, and you’ve just heard the first.
Well, the second has worked a real treat for me and I’m not even sure I should give it away, but I will regardless and hopefully I can draw some attention with this first post!
We’ve touched on expiring domains, but where this form of scraping really excels is on directories and old article websites. Need I explain the potential riches in scraping a de-indexed article directory from, say, 2001? I have found hundreds and hundreds of niche article directories that were big five years ago.
These days they sit derelict in web archives having been de-indexed from Google and rendered invisible by long-expired domains. By grabbing the contents, I have been able to reproduce highly targeted niche websites which roll in the money handsomely with a little affiliate marketing on top.
Here’s the trick.
Find a nice big PR directory (if you’ve been around the SEO block, you’ll know thousands). Take the URL of the directory and chuck it into a dead link checking tool. Let the script run. This will probably take a few minutes, but you’ll soon be given a report of the dead links on the website.
Scan through the report and look out for 404 errors where the directory has linked to external third party sites. In no time at all, we have a huge collection of potentially derelict websites which are just waiting to be desert scraped.
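Scanning the report by hand gets tedious on a big directory. As a rough sketch only, the filtering step might look like the Python below. It assumes your link checker can export its results as (URL, HTTP status) pairs, which is a hypothetical format, and the function name is made up for illustration. It keeps the hosts of external links that returned a 404 and drops the directory's own pages.

```python
from urllib.parse import urlparse


def external_dead_hosts(directory_url, report):
    """Return unique external hostnames whose links came back 404.

    report is an iterable of (link_url, http_status) pairs, an assumed
    export format from whatever dead link checker you use.
    """
    home = urlparse(directory_url).hostname
    hosts = []
    for link, status in report:
        host = urlparse(link).hostname
        # Keep only 404s that point away from the directory itself.
        if status == 404 and host and host != home and host not in hosts:
            hosts.append(host)
    return hosts


report = [
    ("http://dir.example/about", 404),         # internal dead link, ignored
    ("http://old-site.example/page", 404),     # external and dead
    ("http://old-site.example/other", 404),    # same host, not repeated
    ("http://alive.example/", 200),            # still live, ignored
]
print(external_dead_hosts("http://dir.example/", report))
```

The host comparison is deliberately naive: `www.dir.example` and `dir.example` would count as different hosts, so adjust it to suit the directory you are checking.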
If you’re really smart, you can automate this entire process using a simple PHP script. That’s one technique I WON’T be giving out here, but one that’s well worth the time looking into.
Good luck!
Tags: archive.org, black hat scraping, desert scraping, duplicate content, scraping content




June 29th, 2008 at 9:28 am
You are suggesting that we duplicate content from last century. From my point of view, we need to provide content that focuses on our latest and future technology. I don’t think posting content from the last decade will get a good response.
June 29th, 2008 at 9:47 am
Thanks for this. Very useful info, will have a play with it
June 29th, 2008 at 9:55 am
Suresh,
It depends entirely on the needs of your website. Many static articles are simply hired rewrites by freelancers who use old content as research for fresh material.
If you’re maintaining a blog on 2008 technology trends, this is undoubtedly a very bad technique to use. However, if you’re creating websites of static content on topics that are unlikely to be affected by time, it’s workable.
Also, please note, the content does not have to be a decade old. It could feasibly be six months old - just as long as Google has dropped it from the index.
July 5th, 2008 at 8:41 am
Yah, Kumar…. it doesn’t have to be a century old. But if your niche is sports, not much has changed to the basic rules and outlines of football, so if you were to scrape a football site that’s been deindexed, then you’d be golden. (not that the niche would do much for you, just an example.)
February 11th, 2009 at 4:44 pm
Interesting article, but if the purpose of selecting target sites for scraping (or aggregation) is based on them no longer being in the active Google index… in order to circumvent the duplicate content rule, well… we don’t know if the Google dupe algorithms use data that is no longer visible on the public index.
That deserted data could actually be used as part of the Google algorithm to check for dupe content - even as far back as the 1990s, and even if it is no longer indexed (for public use)…
Keep in mind that Google indexes a fair amount of information that is not viewable through public search (for example, data from Gmail, Google Enterprise Search, etc.), so why not keep indexes of old, stale, expired content specifically for their war against content scrapers / aggregators / repurposers?
Rob
February 20th, 2009 at 8:03 pm
@Robin Majumdar: If you duplicate content that doesn’t exist anymore on publicly accessible websites then you’re the only one that has this content. How could Google consider that as duplicate content? Where is your logic Rob?
May 4th, 2009 at 8:22 pm
@silvermario
The logic lies in that content that isn’t necessarily on “publicly accessible” sites can still be penalized. For example, Google may ban and strip sites from their public index, yet keep them in their backend as part of their ranking algo… much like law enforcement agencies maintain historical information on criminal convictions beyond pardon periods.
In fact, LEO types have a legal obligation to disregard (and possibly destroy) historical records of “bad behaviour” but that doesn’t apply to private entities (such as search engines, for example) that may wish to keep a breadcrumb pattern of sites they have penalized in the past… to use as they see fit in the future.