Scrape Bing using Perl

In my last article I discussed the different techniques that can be used to scrape Google’s search engine without being detected. This week we’ll be creating a script to scrape another popular search engine, Bing, using the API that Microsoft provide. It should be noted that Google also provide an API, which allows 100 queries a day (approximately 3,000 a month), compared to 5,000 queries a month with Microsoft Bing.

Using the API is certainly the recommended method for fetching results from either search engine, and the previous article should only be used for educational purposes. Wink Wink.

The Good

There’s clearly a number of positives to using the Microsoft (or Google) provided API over scraping results with wget (or whichever alternative tool you may use). Firstly, you’re not pissing anyone off. As we discussed last week, most search engines don’t like you scraping their content, and their terms of service reflect this. If you get caught, you run the risk of being banned either temporarily or permanently, but not when using their API.

And because you don’t have to worry about upsetting anyone, you don’t have to limit your script to running once every 20 or 30 seconds. When using the API you can send requests as quickly as you like, within your monthly allowance, and they won’t mind.

Finally, you don’t have to worry about how the results page is constructed, and you won’t need to change your script every time Microsoft or Google make some minute change. They handle the difficult stuff.

The Bad

There’s only really one negative to using the API, and that’s the limit that is imposed upon us. Whereas a scraper can theoretically keep going forever, using the API we’ll be limited to 5000 queries / month (per Microsoft account). Should this not be enough you can always pay for additional queries, but this can quickly get expensive.
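To get a feel for how far that allowance stretches, here is a rough sketch of the arithmetic. The helper name requests_needed and the page size of 50 results per call are assumptions made for illustration, so check your own plan for the real figures.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: number of API calls needed to collect $wanted results
# when each call returns at most $page_size results (50 is an assumption).
sub requests_needed {
    my ($wanted, $page_size) = @_;
    $page_size ||= 50;
    return ceil($wanted / $page_size);
}

my $quota = 5000;                    # monthly allowance quoted above
my $calls = requests_needed(120);    # calls needed for one run of 120 results
printf "%d calls per run, about %d runs per month\n",
    $calls, int($quota / $calls);
```

Each call counts against the monthly limit, so large result counts eat into the allowance faster than the number of searches alone suggests.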

The Ugly (Writing our script)

We’ve been through the pros and cons, and now we’re almost ready to jump in and start writing our script. There are a few things we need to cover first though: installing Perl onto our machine, creating a Microsoft account and getting our account key. So let’s begin.

Installing Perl

We’ll begin by installing Perl and the required modules onto our machine. It should be noted that I’m using Ubuntu and that the instructions below are specific to Debian based Linux distributions; however I’ve included instructions for alternative Operating Systems and Linux distributions where applicable.

To install Perl, you’ll need to open a new terminal window (CTRL + ALT + T) and type in the following command:

sudo apt-get install perl

For anyone using an alternative Operating System, you can download Perl from here.

Now we need to install some Perl modules; specifically we need WWW::Mechanize, MIME::Base64 and JSON. You can find some instructions on installing these modules from here. On Ubuntu, enter the following into your terminal:

sudo apt-get install cpanminus
sudo cpanm WWW::Mechanize MIME::Base64 JSON

* The first line installs cpanminus, a program designed to download and install Perl modules easily.
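Before moving on it can be worth confirming that the modules actually load. The snippet below is a small sketch of one way to check; the helper name module_available is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the named module can be loaded, else 0.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # WWW::Mechanize -> WWW/Mechanize.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Report on each module the script depends on.
for my $module (qw(WWW::Mechanize MIME::Base64 JSON)) {
    printf "%-16s %s\n", $module, module_available($module) ? "ok" : "MISSING";
}
```

Anything reported as MISSING can be installed by running the cpanm command again for that module.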

Creating a Microsoft account

With Perl installed, we now need to grab a Microsoft account key, and for this we’ll need to create a Microsoft account, which you can do from here. If you already have a Microsoft account then you can skip this step.

Once you’ve set up an account, make sure you are signed in to it, then head over to the Bing API page. Scroll to the very bottom of the page and select the ‘sign up’ button on the left, under the heading ‘5000 Transactions / Month’. This will take you through to Microsoft’s sign up page.

Once you’re set up you just need to grab your account key. Head over to the My Account section of the website, where your Primary Account Key will be displayed towards the bottom of the page. You’ll need this later.

Get Programming

Finally we can begin writing our script. Open up your favourite text editor to begin. If you’re using Windows I highly recommend Notepad++.

Copy and paste the code below into your text editor. If you’re unsure what it does then you might want to head over here for some tutorials on Perl.

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use MIME::Base64;
use JSON;

The above code simply sets everything up. The first line should point to the Perl interpreter (usually /usr/bin/perl on Linux), lines 3 and 4 turn on strict mode and warnings, and lines 5 through to 7 load the modules needed for this script.

Now we need to set up some variables to hold key information, such as the account key from Microsoft, and any parameters that the user passes to the script from the command line. Lines 1 through to 5 set up the WWW::Mechanize object, line 7 should hold your Microsoft account key, and lines 9 through to 13 fetch any parameters from the command line using @ARGV. Should a parameter not be provided, a fallback will be used; so if the user doesn’t specify a query or how many results to fetch, we’ll fetch 50 results for the query “Test Query”. The source and market fall back to “Web” and “en-GB” in the same way.

# Create new WWW::Mechanize object
my $mech = WWW::Mechanize->new();

# Set UserAgent
$mech->agent_alias('Linux Mozilla');

my $accKey = "Enter Account Key Here!";

# Variables
my $query   = $ARGV[0] || "Test Query";
my $results = $ARGV[1] || 50;
my $source  = $ARGV[2] || "Web";
my $market  = $ARGV[3] || "en-GB";
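One thing to be aware of is that the || fallback treats any false value (such as 0 or an empty string) as if it had not been supplied. The sketch below shows an alternative using the defined-or operator // (Perl 5.10 or later) together with a basic check on the result count. The helper name parse_args and the validation rule are my own additions rather than part of the script.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the defaults used in the script.
sub parse_args {
    my @argv = @_;
    my %opt = (
        query   => $argv[0] // "Test Query",
        results => $argv[1] // 50,
        source  => $argv[2] // "Web",
        market  => $argv[3] // "en-GB",
    );
    # Reject anything that is not a positive whole number.
    die "results must be a positive integer\n"
        unless $opt{results} =~ /^[1-9][0-9]*$/;
    return %opt;
}

# Example call with a query and a result count; source and market use defaults.
my %opt = parse_args("perl tutorials", 20);
print "Fetching $opt{results} '$opt{source}' results for '$opt{query}' ($opt{market})\n";
```

In the real script you would pass @ARGV to the helper instead of the fixed example values.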

Everything above simply sets the script up to run, but doesn’t yet do anything. The part of the script that fetches results from Bing will be executed within a while loop. The counter $x starts at 0 and increases by one for every result we print, and the loop keeps going until $x reaches the $results variable (which will be 50 by default). Set up the while loop as below:

my $x = 0;

while($x < $results)
{
    # The important stuff will go here
}
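To see how this kind of loop behaves without touching the API, here is a self-contained sketch that pages through a pretend result set. Both fake_fetch and collect are hypothetical names invented for this example; fake_fetch simply stands in for the request to Bing.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the API call: returns up to $top items from a
# pretend result set of 7 items, starting at offset $skip.
sub fake_fetch {
    my ($skip, $top) = @_;
    my @all = map { "http://example.com/$_" } 1 .. 7;
    return [] if $skip >= @all;
    my $end = $skip + $top - 1;
    $end = $#all if $end > $#all;
    return [ @all[$skip .. $end] ];
}

# Collect up to $wanted items, asking for at most $page_size at a time.
sub collect {
    my ($wanted, $page_size, $fetch) = @_;
    my @found;
    while (@found < $wanted) {
        my $remaining = $wanted - @found;
        my $top  = $remaining < $page_size ? $remaining : $page_size;
        my $page = $fetch->(scalar @found, $top);
        last unless @$page;    # nothing left: stop instead of looping forever
        push @found, @$page;
    }
    return @found;
}

my @urls = collect(5, 3, \&fake_fetch);
print "$_\n" for @urls;
```

The early exit matters: if a search has fewer results than were asked for, a loop that only stops when the counter reaches the target would never finish.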

Now we can start on the cool stuff; everything from now on should be placed inside the while loop we created above unless otherwise stated.

The first thing we need to do is construct the URL that will be used to fetch the results. Each time the loop executes we’ll want to fetch the next set of results which will require the parameters of the URL to change slightly.

my $server = "api.datamarket.azure.com";
my $top    = $results - $x;    # only ask for the results we still need
my $url    = "https://$server/Bing/Search/v1/$source?Query=%27$query%27&Market=%27$market%27&Adult=%27Off%27&\$top=$top&\$skip=$x&\$format=JSON";
my $loc    = "$server:443";

The only parts of $url that change between iterations are the $top and $skip parameters. $top asks for the results we still need, and $skip tells Bing how many results to skip, so as $x grows inside the loop each request picks up where the previous one left off.
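If the query may contain characters such as & or #, it is safer to percent-encode it before placing it in the URL. Below is a minimal sketch of building the URL that way. The helpers escape_param and build_url are hypothetical, and the encoder is written by hand only to keep the example free of extra modules.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode everything except the characters
# RFC 3986 calls "unreserved" (letters, digits, - . _ ~).
sub escape_param {
    my ($text) = @_;
    utf8::encode($text) if utf8::is_utf8($text);   # work on bytes, not characters
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Hypothetical helper: assemble the request URL in the same shape as above.
sub build_url {
    my (%p) = @_;
    return sprintf(
        "https://api.datamarket.azure.com/Bing/Search/v1/%s"
        . "?Query=%%27%s%%27&Market=%%27%s%%27&Adult=%%27Off%%27"
        . "&\$top=%d&\$skip=%d&\$format=JSON",
        $p{source}, escape_param($p{query}), escape_param($p{market}),
        $p{top}, $p{skip},
    );
}

print build_url(
    query => "Test Query", source => "Web", market => "en-GB",
    top   => 50,           skip   => 0,
), "\n";
```

The %27 either side of each value is an encoded single quote, which is how the string parameters are wrapped in the URL used by the script.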

With the URL constructed, we now need to retrieve the data from Microsoft, which requires some authentication on our part. The following code sets up our authorization using the Microsoft account key we created earlier as the username and password.

# The empty second argument stops the encoder adding newlines, which would corrupt the header
my @args = (
    Authorization => "Basic " .
        MIME::Base64::encode($accKey . ':' . $accKey, '')
);
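It is worth knowing that encode_base64 wraps its output at 76 characters and appends a newline unless you pass an empty string as the second argument, and a newline in the middle of a header value is not what we want. The sketch below demonstrates the difference using a placeholder key; basic_auth_header is a hypothetical helper name.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: build a Basic auth header value using the key as both
# username and password, with no line breaks in the encoded text.
sub basic_auth_header {
    my ($key) = @_;
    return "Basic " . encode_base64(join(':', $key, $key), "");
}

my $demo_key = "x" x 45;    # placeholder, not a real account key
my $wrapped  = encode_base64(join(':', $demo_key, $demo_key));
my $single   = encode_base64(join(':', $demo_key, $demo_key), "");

printf "default encoding contains %d newline(s)\n", ($wrapped =~ tr/\n//);
printf "with an empty line ending: %d newline(s)\n", ($single =~ tr/\n//);
print basic_auth_header($demo_key), "\n";
```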

Time to fetch the results using the WWW::Mechanize module, and the authorization we set up above. Then we grab the content and decode it, before looping through an array of results and printing each URL to the screen.

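# Authenticate and fetch this page of results
$mech->credentials($loc, '', '', $accKey);
$mech->get($url, @args);

# Get content
my $content = $mech->content();

# Decode the JSON response
my $res   = from_json($content);
my $arref = $res->{'d'}->{'results'};

# Stop if there are no more results, otherwise $x never grows and we loop forever
last unless $arref && @{$arref};

foreach (@{$arref})
{
    print $_->{'Url'} . "\n";
    $x++;
}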
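You can try the decoding step on its own without making a request. The sketch below runs against a made-up response with the same d / results / Url layout that the script reads. It uses JSON::PP, which ships with Perl, purely so it runs without extra installs; extract_urls is a hypothetical helper name.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: pull every Url out of a response shaped like
# {"d":{"results":[{"Url":"..."}, ...]}}.
sub extract_urls {
    my ($json_text) = @_;
    my $data    = decode_json($json_text);
    my $results = $data->{d}{results};
    return () unless ref $results eq 'ARRAY';   # missing or malformed results
    return map { $_->{Url} } grep { defined $_->{Url} } @$results;
}

# Invented sample data, only here to exercise the helper.
my $sample = '{"d":{"results":['
           . '{"Title":"One","Url":"http://example.com/1"},'
           . '{"Title":"Two","Url":"http://example.com/2"}]}}';

print "$_\n" for extract_urls($sample);
```

Returning an empty list when the results key is missing means the caller can treat "no results" and "unexpected response" the same way and simply stop.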

And that’s it, done. Save the script as scrapeBing.pl, then (on Ubuntu) run it by entering the following commands in your terminal:

chmod +x scrapeBing.pl
./scrapeBing.pl "Test Query" 100

As it executes, each URL will be printed to the screen. You can redirect the output to a file if you wish by executing the following command:

./scrapeBing.pl "Test Query" 100 > output.txt

Can’t be bothered to write it out yourself? No problem; Download!
