Monday, December 20, 2010

Web Crawlers

How do search engines know which websites are linked to one another? How do spammers get hold of our email addresses? The simple answer: web crawlers! These tiny programs do a wonderful job of skimming through websites and picking up web page or email addresses, which can then be used to build a network of web pages. This network is the basis for search engines. Web crawlers, spiders and bots (robots) are similarly functioning software agents that differ in the application they serve:
A web crawler follows the hyperlinks on a page, and the collected data is used by search engines for indexing and ranking.
A web spider downloads the web pages themselves by traversing those links.
A bot is a computer program that visits websites and performs a predefined task.
All of them parse the HTML content of a web page to pick up the referenced hyperlinks.
Let us now build a small web crawler that takes a URI as input and returns a list of all the websites linked from the given page. It can be used recursively to create a smarter program that crawls bigger websites.
// imports needed (at the top of the source file):
// import java.io.IOException;
// import java.io.InputStream;
// import java.net.URL;

String content = "";   // holds the downloaded HTML of the page

webCrawler(String strURL) {
    try {
        // create the URL object from the string
        URL url = new URL(strURL);
        // try opening a stream to the URL
        InputStream urlStream = url.openStream();
        System.out.println("successfully opened the URL " + strURL);
        // read the entire content from the URL in 1000-byte chunks
        byte[] b = new byte[1000];
        StringBuilder builder = new StringBuilder();
        int numRead;
        while ((numRead = urlStream.read(b)) != -1) {
            builder.append(new String(b, 0, numRead));
        }
        content = builder.toString();
        urlStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Here we take the URI in the form of a string and convert it into a URL using Java's URL class. If we succeed in opening this link, we read the entire content into a variable. This variable will be used further on to look for the hyperlinks.
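Before we move on to extracting the links, here is a quick sketch of how this constructor might be invoked. It assumes the fragment above sits inside a class named webCrawler (the class declaration is not shown in the snippet), and example.com simply stands in for whatever page you want to fetch.
// hypothetical usage: fetch a page and keep its HTML in 'content'
public static void main(String[] args) {
    webCrawler crawler = new webCrawler("http://www.example.com");
    // crawler.content now holds the raw HTML (empty if the fetch failed)
}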
// imports needed (at the top of the source file):
// import java.util.ArrayList;
// import java.util.StringTokenizer;

public ArrayList<String> getListOfPlatforms() {
    ArrayList<String> platformList = new ArrayList<String>();
    // convert all content to lower case so the search is case-insensitive
    String lowerCaseContent = content.toLowerCase();
    int index = 0;
    // search for links referred to in the html source
    while ((index = lowerCaseContent.indexOf("<a", index)) != -1) {
        if ((index = lowerCaseContent.indexOf("href", index)) == -1)
            break;
        if ((index = lowerCaseContent.indexOf("=", index)) == -1)
            break;
        index++;
        // take the original-case text starting just after the '='
        String moreContent = content.substring(index);
        // the link is the first token between the surrounding quotes
        StringTokenizer str = new StringTokenizer(moreContent, " \t\n\"");
        if (str.hasMoreTokens()) {
            String strLink = str.nextToken();
            // add it to the ArrayList of strings
            platformList.add(strLink);
        }
    }
    // return the list of extracted links
    return platformList;
}
Here we parse the content and search for all the href values. The links are added to an ArrayList object and returned. This list can be used for further crawling, so we can recursively crawl many web pages, as sketched below.
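As a rough illustration of that recursive idea, a depth-limited crawl over the returned links might look like the sketch below. The crawl method, the depth limit and the visited set are additions of this sketch rather than part of the code above; they simply keep the recursion from revisiting pages or running forever. It assumes the constructor and getListOfPlatforms() are members of the same webCrawler class and that java.util.HashSet and java.util.Set are imported.
// hypothetical recursive driver built on top of the class above
static void crawl(String startURL, int depth, Set<String> visited) {
    // stop at the depth limit or if this page was already crawled
    if (depth == 0 || !visited.add(startURL)) {
        return;
    }
    System.out.println("crawling " + startURL);
    webCrawler crawler = new webCrawler(startURL);
    for (String link : crawler.getListOfPlatforms()) {
        // only follow absolute links in this simple sketch
        if (link.startsWith("http")) {
            crawl(link, depth - 1, visited);
        }
    }
}

// example call:
// crawl("http://www.example.com", 2, new HashSet<String>());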
