Scraping old content and keeping Google in the dark.


Google’s duplicate content filters have proved the undoing of many a great quick profit scheme. Here, we’re going to take a look at how we can scrape content and stay off the Google radar.

For those of you who are unfamiliar with the concept of scraping, it basically involves duplicating content and rehashing it for a new website. There are many scripts in the Black Hat world which can scan through large sites like Amazon and Wikipedia, grabbing the content they need along the way.

Of course, the great dilemma is how we can use this data without taking the duplicate content penalty from Google.

Well, there are suggestions galore. We can mash content together from a mixture of websites in the hope that it is disguised enough to pass without warning, or we can apply one of the many keyword rewriting tools that attempt to render an article unrecognizable.

Both of these methods fall short.

Merging different content sources may work to offset Google’s suspicion, but it will also make a royal mess of your content. Expect articles and pages that read with the fluency of a foreign language - one which will make no sense to the reader.

Keyword rewriting tools will remain burdened by limitations until a machine reader is invented that can understand human context. If I chuck a 500-word article into a rewriting script, the chances are it will pass the duplicate content acid test. But once again, it will read terribly, with robotic word-for-word replacements. And these alone can act as a giveaway to Google - especially if a nark sends a telltale report Matt Cutts’ way.

So what can we do? How can we scrape content and stay both readable to humans and original to search engines?

The answer is to grab content that no longer exists. Content that no longer exists in the index of Google.

Think about it for a second. Google may be the one-stop shop for information on the web in 2008, but what it can’t let you do is search like it’s 1999 all over again. Well, it can, if you dip into the Google cache.

Every day, gigabytes upon gigabytes of web content are dropping out of Google’s index. It could be for one of many reasons. The site in question may have been banned, it may have started using inaccessible code, or it may have fallen by the wayside with an expired domain.

Either way, that web content is there for the taking and we simply need to get to it to claim it as ours.

Archive.org is your new best friend. It acts as a library of web pages, capturing them as they existed at a point in time, so we can target de-indexed websites and recycle their content for our own needs.

Now, there are many ways in which you can go about doing this. I’m going to suggest two of my personal favorites, but if you put your imagination to use, there are some fiendishly tight plans that can be rolled out to make a quick and substantial profit.

First of all, expired domain lists. What about them? Well, they give us a great starting point for finding content that’s about to fall out of the Google index.

Now, you’ll find many expired domain lists on the Net. My personal favorite is ExpiredDomains.com, which offers both a free search and a full paid service. You can use the free search to find expiring domains, but it’s worth the upgrade to get ahead of the game and to save time otherwise wasted analyzing redundant domains. 

Run a simple keyword search for sites related to the niche that you’re marketing. I might search “Spanish” and receive a list of Spain-related expiring domains. From here, I’ll use the member tools to scrutinize which domains have PR and backlinks attached - usually a giveaway sign that there’s content parked on the site.
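
The shortlisting step can be sketched in a few lines of Python. This is only an illustration: the `ExpiringDomain` record, the field names, the thresholds and the sample domains are all made up for the example, and a real export from a domain list service will look different.

```python
from dataclasses import dataclass


@dataclass
class ExpiringDomain:
    # Hypothetical record; real exports will use their own column names.
    name: str
    pagerank: int
    backlinks: int


def likely_has_content(domains, min_pagerank=1, min_backlinks=10):
    """Keep domains whose PR and backlink counts hint at real content."""
    keep = [
        d for d in domains
        if d.pagerank >= min_pagerank and d.backlinks >= min_backlinks
    ]
    # Strongest candidates first: highest PR, then most backlinks.
    return sorted(keep, key=lambda d: (d.pagerank, d.backlinks), reverse=True)


# Made-up sample data standing in for a keyword search on "Spanish".
sample = [
    ExpiringDomain("learn-spanish-example.com", 3, 120),
    ExpiringDomain("spanish-parked-example.com", 0, 2),
    ExpiringDomain("spain-travel-example.com", 2, 45),
]

shortlist = likely_has_content(sample)
print([d.name for d in shortlist])
```

The thresholds are arbitrary starting points, so expect to tune them for your niche.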

Happy that I’ve found a soon-to-be delisted website, I’ll run the URL through Archive.org and use the captured archive pages to scrape my content. You may wish to use a simple PHP script for this purpose and set it to grab all content from within a set region. Remember, the content you’re scraping could be from last week, last month or even last century. The crux is that it’s no longer going to get a duplicate content penalty from Google.
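
Before writing any scraper, it helps to check whether Archive.org holds a capture at all. Here is a rough sketch in Python rather than PHP, using the Wayback Machine availability endpoint. The helper names and the sample response are my own illustration, and the response is hand-written in the shape the endpoint returns rather than fetched live.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": site}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the URL of the closest capture, or None if there is none."""
    data = json.loads(payload)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


# Hand-written sample response for illustration only, not real output.
sample_payload = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101000000/http://example.com/",
            "timestamp": "20060101000000",
            "status": "200",
        }
    }
})

print(build_availability_url("example.com", "20060101"))
print(closest_snapshot(sample_payload))
```

To run it against the live service, fetch the built URL with `urllib.request.urlopen` and pass the decoded body to `closest_snapshot`.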

I now have 15 or 20 highly targeted pages for my niche market which I can recycle and implement on my money site.

Next up, we want to be sure that the content we’re scraping hasn’t been moved or duplicated elsewhere. I’ll usually run my pages through a simple Copyscape test and if the results return unique, we’re good to go. Google will soon find the recycled content and treat it as unique - or at worst, assume that the pages have been moved given that there are no other copies of them.
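
Copyscape checks against the live web, which a local script can’t do. As a rough stand-in, here is a sketch that compares a page against other text you already have on disk, using overlapping five-word “shingles”. This is not how Copyscape works internally (I don’t know its method); the function names and sample sentences are mine.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word runs of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Short texts collapse to a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(a, b, size=5):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


page = "the quick brown fox jumps over the lazy dog"
other = "spanish verbs change their endings to match the subject"

print(overlap(page, page))   # identical text
print(overlap(page, other))  # unrelated text
```

A score near 1.0 means the two texts are close copies, and a score near 0.0 means they share almost no five-word runs.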

We sit back and let our scraped content do wonders for our long tail search terms and just like that, we’ve saved ourselves potentially hundreds of dollars that would have been spent outsourcing articles to the Far East!

I mentioned that I’d share my two favorite methods of “desert scraping” as it’s called, and you’ve just heard the first. 

Well, the second has worked a real treat for me and I’m not even sure I should give it away, but I will regardless and hopefully I can draw some attention with this first post!

We’ve touched on expiring domains, but where this form of scraping really excels is on directories and old article websites. Need I explain the potential riches in scraping a de-indexed article directory from, say, 2001? I have found hundreds and hundreds of niche article directories that were big five years ago.

These days they sit derelict in web archives having been de-indexed from Google and rendered invisible by long-expired domains. By grabbing the contents, I have been able to reproduce highly targeted niche websites which roll in the money handsomely with a little affiliate marketing on top.

Here’s the trick.

Find a nice big PR directory (if you’ve been around the SEO block, you’ll know thousands). Take the URL of the directory and chuck it into a dead link checking tool. Let the script run. This will probably take a few minutes, but you’ll soon be given a report of the dead links on the website.

Scan through the report and look out for 404 errors where the directory has linked to external third party sites. In no time at all, we have a huge collection of potentially derelict websites which are just waiting to be desert scraped.
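
Scanning the report by eye gets old quickly, so here is a small sketch of the filtering step. It assumes the link checker can export its results as (URL, status code) pairs, which is my assumption rather than the format of any particular tool, and the hostnames are made up.

```python
from urllib.parse import urlparse


def derelict_hosts(report, directory_host):
    """Collect external hosts that returned 404 in a dead link report.

    report is an iterable of (url, status_code) pairs.
    """
    hosts = set()
    for url, status in report:
        host = urlparse(url).hostname
        if status != 404 or not host:
            continue
        # Skip the directory's own pages and subdomains.
        if host == directory_host or host.endswith("." + directory_host):
            continue
        hosts.add(host)
    return sorted(hosts)


# Made-up report rows for illustration.
report = [
    ("http://old-articles-example.com/page1.html", 404),
    ("http://old-articles-example.com/page2.html", 404),
    ("http://directory-example.com/missing.html", 404),
    ("http://live-site-example.com/", 200),
    ("http://another-dead-example.org/index.html", 404),
]

print(derelict_hosts(report, "directory-example.com"))
```

Bear in mind that a fully expired domain often fails to resolve at all instead of returning a 404, so it may be worth collecting the checker’s connection errors as well.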

If you’re really smart, you can automate this entire process using a simple PHP script. That’s one technique I WON’T be giving out here, but one that’s well worth the time looking into.

Good luck!


Posted in Content Scraping

4 Responses to “Scraping old content and keeping Google in the dark.”

  1. A Suresh Kumar Says:

    You are suggesting that we duplicate content from last century. From my point of view, we need to provide content that will focus on our latest and future technology. I don’t think posting last decade’s content will get any good response.

  2. Andrew Stone Says:

    Thanks for this. Very useful info, will have a play with it

  3. admin Says:

    Suresh,

    It depends entirely on the needs of your website. Many static articles are simply hired rewrites by freelancers who use old content as research for fresh material.

    If you’re maintaining a blog on 2008 technology trends, this is undoubtedly a very bad technique to use. However, if you’re creating websites of static content on topics that are unlikely to be affected by time, it’s workable.

    Also, please note, the content does not have to be a decade old. It could feasibly be six months old - just as long as Google has dropped it from the index.

  4. Link Building Bible Says:

    Yah, Kumar…. it doesn’t have to be a century old. But if your niche is sports, not much has changed to the basic rules and outlines of football, so if you were to scrape a football site that’s been deindexed, then you’d be golden. (not that the niche would do much for you, just an example.)
