Monday, December 20, 2010

Web Crawlers

How do search engines know which websites link to one another? How do spammers get hold of our email addresses? Simple answer: web crawlers! These small programs skim through websites and pick up the web or email addresses they find. One can then build a network of linked web pages, which is the basis for search engines. Web crawlers, spiders and bots (robots) are software agents that work in similar ways but differ in what the application uses them for.
A web crawler follows the hyperlinks on a page, and this data is used for search engine indexing and ranking.
Web spiders also traverse the links, downloading the web pages as they go.
Bots are computer programs which visit websites and perform a predefined task.
Thus all of them parse the HTML content of a web page to pick up the hyperlinks it references.
Let us now build a small web crawler which takes a URI as input and returns a list of all the links found in the given page. This can be applied recursively to create a smarter program that crawls bigger websites.
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// field of the enclosing webCrawler class: holds the HTML of the page
private String content = "";

webCrawler(String strURL) {
    // create the URL from the string and try opening it;
    // try-with-resources closes the stream even if reading fails
    try (InputStream urlStream = new URL(strURL).openStream()) {
        System.out.println("Successfully opened the URL " + strURL);
        // read the entire content from the URL, up to 1000 bytes at a time
        StringBuilder sb = new StringBuilder();
        byte[] b = new byte[1000];
        int numRead;
        // read() returns -1 at the end of the stream (also for an empty page)
        while ((numRead = urlStream.read(b)) != -1) {
            sb.append(new String(b, 0, numRead));
        }
        content = sb.toString();
    } catch (IOException e) {
        // a malformed URL also ends up here (MalformedURLException)
        e.printStackTrace();
    }
}
Here we take the URI in the form of a string. We convert it into a URL using the java.net.URL class. If we are successful in opening this link, we read the entire content into a variable. This variable will be used later to look for the hyperlinks.
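
One thing to watch for when reading a page in fixed-size byte chunks: a character that takes more than one byte can be split across two chunks and decoded wrongly. A minimal sketch of one way around this is to let a Reader do the decoding. The names PageReader and readAll are made up for this illustration, and UTF-8 is an assumption (a real page may declare a different charset).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class PageReader {
    // Reads the whole stream as text and closes it afterwards.
    // Assumption: the bytes are UTF-8 encoded.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[1000];
            int numRead;
            // the Reader hands back whole characters, never half of one
            while ((numRead = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, numRead);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for url.openStream() here.
        byte[] page = "<a href=\"http://example.com/\">caf\u00e9</a>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page)));
    }
}
```

In the crawler, the same method could be handed the stream returned by url.openStream() instead of the in-memory one.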
// needs at the top of the file: import java.util.ArrayList; import java.util.StringTokenizer;
public ArrayList<String> getListOfPlatforms() {
    ArrayList<String> platformList = new ArrayList<String>();
    // convert all content to lower case so the search is case-insensitive
    String lowerCaseContent = content.toLowerCase();
    int index = 0;
    // search for links referred to in the HTML source
    while ((index = lowerCaseContent.indexOf("<a", index)) != -1) {
        if ((index = lowerCaseContent.indexOf("href", index)) == -1)
            break;
        if ((index = lowerCaseContent.indexOf("=", index)) == -1)
            break;
        index++;
        // cut the link out of the original content to preserve its case
        String moreContent = content.substring(index);
        StringTokenizer str = new StringTokenizer(moreContent, "\t\n\"");
        // guard against a page that ends right after "href="
        if (str.hasMoreTokens()) {
            // add it to the ArrayList of strings
            platformList.add(str.nextToken());
        }
    }
    // return the list of links
    return platformList;
}
Here we parse the content and search for all the href values. The links are added to an ArrayList object and returned. This list can be used for further crawling, so we can recursively crawl many web pages.
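
When crawling further, pages often link back to each other, so it helps to remember what has already been visited and to cap the number of pages. Below is a rough sketch of that idea. It uses a queue (breadth-first) rather than plain recursion, and the names CrawlSketch, crawl, maxPages and linksOf are invented for this example. linksOf stands in for "download the page and return its links", which in our crawler would be the constructor plus getListOfPlatforms().

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

public class CrawlSketch {
    // Visits pages breadth-first from startUrl and stops after maxPages pages.
    // linksOf is a hypothetical hook that returns the links found on a page.
    public static Set<String> crawl(String startUrl, int maxPages,
                                    Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> pending = new ArrayDeque<>();
        pending.add(startUrl);
        while (!pending.isEmpty() && visited.size() < maxPages) {
            String url = pending.remove();
            // add() returns false for a page we have already seen
            if (!visited.add(url)) {
                continue;
            }
            for (String link : linksOf.apply(url)) {
                if (!visited.contains(link)) {
                    pending.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" where pages a and b link to each other.
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a"),
                "c", List.of());
        // prints [a, b, c]
        System.out.println(crawl("a", 10, url -> web.getOrDefault(url, List.of())));
    }
}
```

The in-memory map keeps the example runnable without a network connection. The visited set is what stops the a/b cycle from looping forever.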

Thursday, December 16, 2010

Common Language Runtime - CLR

What is a CLR?

This is something you need to know if you say you are working with the .NET Framework.

The Common Language Runtime (CLR) is an application virtual machine, so programmers need not consider the capabilities of the specific CPU that will execute the program. It is the run-time environment of the .NET Framework: it runs the code and provides services that make the development process easier. The class library and the CLR together constitute the .NET Framework.

                    

Developers using the CLR write code in a language such as C# or VB.NET. At compile time, a .NET compiler converts such code into Common Intermediate Language (CIL) code. At runtime, the CLR's just-in-time compiler converts the CIL code into native code for the machine it is running on.

The common language runtime makes it easy to design components and applications whose objects interact across languages. Objects written in different languages can communicate with each other, and their behaviors can be tightly integrated. For example, you can define a class and then use a different language to derive a class from your original class or call a method on the original class. You can also pass an instance of a class to a method of a class written in a different language. This cross-language integration is possible because language compilers and tools that target the runtime use a common type system defined by the runtime, and they follow the runtime's rules for defining new types, as well as for creating, using, persisting, and binding to types.

The CLR also provides other services, namely:
  • Memory management
  • Thread management
  • Exception handling
  • Garbage collection
  • Security

I will talk about these in detail in upcoming posts.