How a list of 100 AA blogs BECOMES a list of 100,000 sites!

Started by quocbinh, 06-10-2014, 10:05:07



Basically, with this trick you will increase your site lists EXPONENTIALLY.

I'll show you how to use a seed list of as little as 100 AA blogs to create a list of 100,000 sites (after deduping both URLs and domains) that you can import into GSA SER and get a generous percentage of them into your successful and verified lists. This process shouldn't take more than a few hours.

This list should then be imported into a dedicated project in GSA (I'll explain how to set that up later in the tutorial).

Note that you will need Scrapebox for this.


Last time I did this (which was a few days ago) I started with 2k AA blogs taken from one of my projects and ended up with 2mil+ DIFFERENT DOMAINS to import into SER. This resulted in 30k verified links. And that's with an out-of-the-box Captcha Breaker (see why I prefer CB to SC) and highly spun content. If I were to use more readable content and a more optimized version of CB, the submitted / verified numbers would be MUCH higher.

The idea behind this is the following:

If you managed to post a comment to a particular blog using GSA SER, there's a good chance that a lot of other GSA SER users have managed to post there as well. This means that a lot of other people are building their Tier 2 / 3 links on those same blogs.

By scraping all internal links on those blogs, we get a huge list of blog posts that have been violated by SER.

Then by extracting all external links from the huge list of blog posts, we get a massive list of potential targets for SER.

Detailed Version:

Step 1 - Getting a seed list

First you need a list to start with. When you're doing this for the first time, you can go to your verified folder and copy all URLs from blog comments. Don't worry if you get a very low number of URLs this way (I got around 500); you'll see it's still just as effective.

Every time after that, you can just use the list of verified blog comments you created the last time you ran this whole process. Alternatively, you can simply harvest a list of blogs and check which ones are auto-approve by posting to them with SB / SER.

Once you've found a list you can use, import it into SB, trim to root, dedupe the URLs, save the list as "Step 1 - seed list" and proceed to step 2.
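
If you'd rather script the trim-to-root and dedupe part instead of doing it in Scrapebox, a rough Python sketch of the same idea could look like this (the file names are just examples, not anything SB produces):

```python
# Rough equivalent of Scrapebox's "Trim to Root" + "Remove Duplicate URLs".
# Input/output file names are examples only.
from urllib.parse import urlparse

def trim_to_root(url):
    """Reduce a URL to scheme + domain, e.g. http://blog.example.com/post/1 -> http://blog.example.com/"""
    parts = urlparse(url.strip())
    if not parts.scheme or not parts.netloc:
        return None  # skip lines that aren't valid absolute URLs
    return f"{parts.scheme}://{parts.netloc}/"

with open("verified_blog_comments.txt", encoding="utf-8", errors="ignore") as f:
    roots = {trim_to_root(line) for line in f if line.strip()}
roots.discard(None)

with open("Step 1 - seed list.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(roots)))
```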

Step 2 - Extracting internal links

Fire up the "Link Extractor" plugin in Scrapebox and load the "Step 1 - seed list" file. Set the mode to "internal", use as many connections as your box can handle, and start.
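
If you're curious what the plugin is actually doing, here's a minimal single-threaded Python sketch of the same idea. It assumes the third-party requests library and plain HTML pages; the names extract_links and LinkParser are mine, not anything from Scrapebox. The same function with mode="external" is basically what Step 3 boils down to.

```python
# Minimal sketch of link extraction: fetch a page, collect <a href> links,
# keep same-host links for "internal" mode or off-host links for "external" mode.
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def extract_links(page_url, mode="internal", timeout=15):
    try:
        html = requests.get(page_url, timeout=timeout).text
    except requests.RequestException:
        return []  # dead or unreachable blog, just skip it
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    results = set()
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        same_host = urlparse(absolute).netloc == host
        if (mode == "internal") == same_host:
            results.add(absolute)
    return sorted(results)
```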

After it's done, import that list into Scrapebox and close the link extractor. Scroll through the list and see if there are many comment links (usually ending with /#comment), category links, tag links, etc. Use "Remove URLs containing" to get rid of as many of those as possible. Ideally you want a list consisting of nothing but blog posts.

This filtering isn't strictly necessary, but depending on the size of your initial list, the next step could take considerably longer if you skip it.
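
If you prefer to do this cleanup outside of Scrapebox, a quick Python pass along the same lines might look like this (the junk patterns and file names are just my guesses; adjust them to whatever you actually see in your list):

```python
# Drop URLs containing typical junk patterns (comments, tags, categories, feeds),
# then dedupe what's left. Patterns and file names are examples only.
JUNK = ("#comment", "/tag/", "/category/", "/author/", "/feed", "?replytocom=", "/page/")

with open("step2_raw_internal_links.txt", encoding="utf-8", errors="ignore") as f:
    urls = [u.strip() for u in f if u.strip()]

kept = [u for u in urls if not any(p in u.lower() for p in JUNK)]

with open("Step 2 - all internal links.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(dict.fromkeys(kept)))  # dict.fromkeys() dedupes while keeping order
```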

After you're finished, save the list and name it "Step 2 - all internal links".

OPTIONALLY
Set up a dedicated project in SER and feed it those lists to filter out the junk from the actual blog posts.
As a bonus, this is also a good way to build your AA list.

Step 3 - Extracting external links

Before starting, you should split the "Step 2 - all internal links" file into smaller ones, no more than 10k URLs per file, and then process them one at a time. The reason is that Step 3 usually produces up to 200x more URLs than the number of links you start the step with. (For example, I usually use batches of 5k links, which result in lists of 400-900k de-duped URLs.)

If you use large lists in this step, you will end up with a couple of million URLs, and in my experience Scrapebox doesn't handle more than 1mil URLs all that well.

You can use the "Dup Remover" Scrapebox plugin for splitting the files.
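
If you'd rather script the splitting, a few lines of Python will do the same job (batch size and file names are examples):

```python
# Split the big internal-links file into batches of at most 10k URLs each.
BATCH = 10_000

with open("Step 2 - all internal links.txt", encoding="utf-8", errors="ignore") as f:
    urls = [u.strip() for u in f if u.strip()]

for i in range(0, len(urls), BATCH):
    with open(f"step2_batch_{i // BATCH + 1:03d}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(urls[i:i + BATCH]))
```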

So open the link extractor again and load the first batch of the "Step 2 - all internal links" file. Set the mode to "external" and hit start. Go make yourself a coffee, and once the link extractor has finished, transfer the list to Scrapebox, dedupe if needed, and save the file as "Step 3 - Needs sorting".

Repeat the process until you have gone through all the smaller batches of the "Step 2 - all internal links" file.

If you want to speed this up, you can run multiple instances of the link extractor simultaneously, as long as your box can handle it (though I wouldn't recommend more than 2-3 per SB instance since it will most likely crash).
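
To make the step concrete, here's roughly what one batch boils down to, reusing the extract_links() sketch from Step 2 in "external" mode. It's single-threaded and only meant to show the logic; the real Link Extractor run is heavily multithreaded.

```python
# Assumes extract_links() from the Step 2 sketch is defined in the same file.
# Pull external links from every blog post in one batch, dedupe, and save.
with open("step2_batch_001.txt", encoding="utf-8") as f:
    batch = [u.strip() for u in f if u.strip()]

targets = set()
for post_url in batch:
    targets.update(extract_links(post_url, mode="external"))

with open("Step 3 - Needs sorting.txt", "a", encoding="utf-8") as out:
    out.write("\n".join(sorted(targets)) + "\n")
```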

Step 4 - Sorting the list

Now you should have some quite large lists of sites that need to be sorted somehow. Luckily, GSA SER will do this automatically without too much hassle.

Set up a new project with the following settings:

Note that this setting will filter out all unindexed sites. If you don't care about PR and just want as many links as possible, untick all PR filters; SER should then skip PR checking for sites, which will probably make your project faster.

Also make sure to untick all search engines and site lists; you want this project to ONLY post to the target URL lists you import.

Double check that you're only using Captcha Sniper / Captcha Breaker for this project, as anything else will very quickly deplete your balance.

Now just generate some relevant spun content (the type you would be using in all your other projects) and your new "Sitecheck" project is good to go.

Once you've got it set up, import "Step 3 - Needs sorting" as target URLs for that project. (I don't know the limitations of GSA SER, but I split everything above 1mil URLs into smaller files and then let GSA go through the files one at a time.)

This project will now go through the list and fill up your identified / successful / verified lists!

Double check that you're saving Identified, Submitted and Verified sites in SER.

Now wait till it's finished, grab the list of verified blog comments / image comments you just created, and repeat the process!

Obviously GSA SER won't be able to post to all of these sites, but you'll be surprised at how many it will!

Short Version

1. Get a list of verified blog comments from GSA
2. Extract all internal links
3. Extract all external links (from all internal)
4. Import the list to a project in GSA and let it sort it for you
5. Repeat from 1. with the verified links you just created

Problems:

Ideally you would keep a database of all sites you've already imported into GSA SER (a master list), and every time you complete Step 3, you would filter your new list against that master list of everything already found.
While GSA SER will handle this during sorting and display "already parsed" for all sites you imported before (assuming you're using just one sitecheck project), it still takes MUCH LONGER for GSA SER to process the imported lists than it takes you to compile them with this process. And that's at 200-300 LPM.

Right now it takes me a couple of hours to do this process and then around 5 days for GSA to process all found sites.

I tried keeping a deduped list of ALL sites found this way and then using the "Remove URLs containing entries from..." option in Scrapebox every time I get a new list, but with little luck. Even after a week of doing this, the master list got so large that SB crashes every time I try it.
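
If SB keeps choking on the master list, a plain Python script can do the same filtering in memory. The sketch below is just one way to do it (file names are examples); it compares on domains rather than full URLs to keep the master list small, since even a few million domains fit comfortably in a set:

```python
# Keep only URLs whose domain hasn't been seen before (one URL per new domain,
# matching the "dedupe both URLs and domains" idea), then grow the master list.
from urllib.parse import urlparse

def domain(url):
    host = urlparse(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# master_domains.txt holds one domain per line from all previous runs
# (create an empty master_domains.txt before the first run)
with open("master_domains.txt", encoding="utf-8", errors="ignore") as f:
    seen = {line.strip().lower() for line in f if line.strip()}

fresh, new_domains = [], set()
with open("Step 3 - Needs sorting.txt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        d = domain(line)
        if d and d not in seen and d not in new_domains:
            fresh.append(line.strip())
            new_domains.add(d)

with open("Step 3 - filtered.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(fresh))

with open("master_domains.txt", "a", encoding="utf-8") as out:
    out.write("\n".join(sorted(new_domains)) + "\n")
```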

That's it, good luck!
