Monday, December 20, 2010

Web Crawlers

How do search engines know which websites link to each other? How do spammers get hold of our email addresses? Simple answer: web crawlers! These small programs skim through websites and pick up the web addresses or email addresses they find. From the collected links one can build a network of web pages, which is the basis for search engines. Web crawlers, spiders and bots (robots) are software agents that work in similar ways and differ mainly in what they are used for:
A web crawler follows the hyperlinks on each page, and the collected data is used by search engines for indexing and ranking.
Web spiders download the web pages they reach while traversing the links.
Bots are computer programs that visit websites and perform a predefined task.
Thus all of them parse the HTML content of a web page to pick up the hyperlinks it references.
Let us now build a small web crawler that takes a URI as input and returns a list of all the links found in the given page. It can then be applied recursively to create a smarter program that crawls bigger websites.
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// field of the enclosing webCrawler class; holds the downloaded page source
private String content = "";

webCrawler(String strURL) {
    try {
        // create the URL from the string
        URL url = new URL(strURL);
        // try opening the URL
        InputStream urlStream = url.openStream();
        System.out.println("Successfully opened the URL " + strURL);
        // read the entire content from the URL, up to 1000 bytes at a time
        StringBuilder builder = new StringBuilder();
        byte[] b = new byte[1000];
        int numRead;
        // read() returns -1 once the end of the stream is reached
        while ((numRead = urlStream.read(b)) != -1) {
            builder.append(new String(b, 0, numRead));
        }
        content = builder.toString();
        urlStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Here we take the URI in the form of a string and convert it into a URL object using Java's java.net.URL class. If the link opens successfully, we read the entire content into the content variable. This variable is used later to look for the hyperlinks.
// needs java.util.ArrayList and java.util.StringTokenizer
public ArrayList<String> getListOfPlatforms() {
    ArrayList<String> platformList = new ArrayList<String>();
    // convert all content to lower case so the search is case-insensitive
    String lowerCaseContent = content.toLowerCase();
    int index = 0;
    // search for links referred to in the html source
    while ((index = lowerCaseContent.indexOf("<a", index)) != -1) {
        if ((index = lowerCaseContent.indexOf("href", index)) == -1)
            break;
        if ((index = lowerCaseContent.indexOf("=", index)) == -1)
            break;
        index++;
        // the link is the first token after "href=", ended by a quote, tab or newline
        String moreContent = content.substring(index);
        StringTokenizer str = new StringTokenizer(moreContent, "\t\n\"");
        if (str.hasMoreTokens()) {
            // add it to the ArrayList of strings
            platformList.add(str.nextToken());
        }
    }
    // return the list of links
    return platformList;
}
Here we parse the content and search for all the href values. The links are added to an ArrayList object and returned. This list can be used for further crawling, so we can recursively crawl many web pages.
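To give an idea of how the returned list could drive further crawling, here is a minimal sketch. It is not part of the crawler above: the class CrawlSketch, the crawl method, the linkExtractor parameter and the maxPages limit are all names I made up for illustration. The linkExtractor stands in for "download the page and collect its hrefs", and the example uses a small in-memory map instead of real pages so that it runs without a network connection. It uses a queue instead of recursion, which visits the same pages but avoids deep call stacks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class CrawlSketch {

    // Visits pages breadth-first, starting at startUrl, and returns every
    // page visited in the order it was first reached. linkExtractor is a
    // hypothetical stand-in for "download the page and collect its hrefs".
    static List<String> crawl(String startUrl,
                              Function<String, List<String>> linkExtractor,
                              int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already crawled, skip it to avoid endless loops
            }
            for (String link : linkExtractor.apply(url)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, List<String>> fakeWeb = Map.of(
            "a.html", List.of("b.html", "c.html"),
            "b.html", List.of("a.html", "c.html"),
            "c.html", List.of("d.html"),
            "d.html", List.of());
        List<String> pages =
            crawl("a.html", url -> fakeWeb.getOrDefault(url, List.of()), 10);
        System.out.println(pages);
    }
}
```

The visited set is what keeps the crawl from going around in circles when two pages link to each other, and maxPages is a simple safety limit so the crawl stops on big websites.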

Thursday, December 16, 2010

Common Language Runtime - CLR

What is a CLR?

This is something you need to know if you say you are working on the .NET Framework.

The Common Language Runtime (CLR) is an application virtual machine, so programmers need not consider the capabilities of the specific CPU that will execute the program. The .NET Framework provides a run-time environment, which runs the code and provides services that make the development process easier. The class library and the CLR together constitute the .NET Framework.

                    

Developers using the CLR write code in a language such as C# or VB.NET. At compile time, a .NET compiler converts such code into Common Intermediate Language (CIL) code. At runtime, the CLR's just-in-time compiler converts the CIL code into native code for the machine it is running on.

The common language runtime makes it easy to design components and applications whose objects interact across languages. Objects written in different languages can communicate with each other, and their behaviors can be tightly integrated. For example, you can define a class and then use a different language to derive a class from your original class or call a method on the original class. You can also pass an instance of a class to a method of a class written in a different language. This cross-language integration is possible because language compilers and tools that target the runtime use a common type system defined by the runtime, and they follow the runtime's rules for defining new types, as well as for creating, using, persisting, and binding to types.

The CLR also provides other services, namely:
  • Memory management
  • Thread management
  • Exception handling
  • Garbage collection
  • Security

I will talk about these in detail in upcoming posts.

Monday, November 22, 2010

File Transfer using Socket Programming in C# with File Split

Here we will see how to transfer files between a client and a server using socket programming. You can find plenty of write-ups covering a simple file transfer, but this one is different. If the file is large, say larger than 1 MB, the client splits it into parts of 1 MB each and transfers each part to the server as a separate file. On the server side, the split files are stored as they arrive and merged back into a single file once they have been received.
Right now this code can handle files of up to 14 MB with a split size of 1 MB. You can increase the split size and, proportionally, the size of the file to be sent.
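Before going into the C# code, the split and merge idea can be tried out on a byte array in memory. The following is only a rough sketch in Java, not the code used in this post: the class SplitMergeSketch and the methods split and merge are names I made up, and the real client and server work on files on disk rather than on arrays.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitMergeSketch {

    // Cuts data into consecutive chunks of at most chunkSize bytes.
    // chunkSize is assumed to be greater than zero.
    static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> parts = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            parts.add(Arrays.copyOfRange(data, start, end));
        }
        return parts;
    }

    // Concatenates the chunks, in list order, back into one array.
    static byte[] merge(List<byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = new byte[2_500_000]; // about 2.4 MB of sample data
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i;
        }
        List<byte[]> parts = split(original, 1024 * 1024); // 1 MB split size
        System.out.println("parts: " + parts.size());
        System.out.println("round trip ok: " + Arrays.equals(original, merge(parts)));
    }
}
```

The merge only gives back the original data if the parts are put together in the same order in which they were cut, which is why each split file carries its part number in its name.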

Server Side code:
Please remember to add the timer and the background worker to your form.

using System;
using System.ComponentModel;   // needed for DoWorkEventArgs
using System.Data;
using System.Text;
using System.Windows.Forms;
using System.Net;
using System.Net.Sockets;
using System.IO;
using System.Threading;

// File transfer protocol Server function by Vinay

namespace Server
{
    public partial class Server : Form
    {
        public Server()
        {
            InitializeComponent();
            FTServer.receivedPath = "";
        }

        // Called when the Server start button is clicked
        private void btnStart_Click(object sender, EventArgs e)
        {
            //Checks if the destination path is set
            if (FTServer.receivedPath.Length > 0)
                backgroundWorker1.RunWorkerAsync();
            else
                MessageBox.Show("Please select file receiving path");
        }
        // Refreshes the path and status labels on every timer tick
        private void timer1_Tick(object sender, EventArgs e)
        {
            label5.Text = FTServer.receivedPath;
            label3.Text = FTServer.curMsg;
        }

        FTServer obj = new FTServer();

        // Runs the server on the background worker thread
        private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
        {
            obj.StartServer();
        }

        // Sets the destination path
        private void btnReceive_Click(object sender, EventArgs e)
        {
            FolderBrowserDialog fd = new FolderBrowserDialog();
            if (fd.ShowDialog() == DialogResult.OK)
            {
                FTServer.receivedPath = fd.SelectedPath;
            }
        }

        // Server is stopped and all the temporary files are deleted
        private void btnStop_Click(object sender, EventArgs e)
        {
            FTServer.sock.Close();
            FTServer.curMsg = "Server Stopped";
            FTServer.deleteFiles();
        }
    }
    //FILE TRANSFER USING C#.NET SOCKET - SERVER
    class FTServer
    {
        IPEndPoint ipEnd;
        public static Socket sock;
        public FTServer()
        {
            //To accept any IP address with port number 5656
            ipEnd = new IPEndPoint(IPAddress.Any, 5656);
            // Creating a new socket with TCP protocol
            sock = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            //Bind the socket with the Ip
            sock.Bind(ipEnd);
           
        }
        public static string receivedPath;
        public static string curMsg = "Stopped";

        // Method to receive the data sent from the client.
        public void StartServer()
        {
            try
            {
                curMsg = "Starting...";
                //The socket starts listening for client connections. Up to 100
                //pending connections can wait in the queue at a time.
                sock.Listen(100);
                //Keep serving clients in a loop, one file per connection
                while (true)
                {
                    curMsg = "Running and waiting to receive file.";
                    //Blocks until a request comes from a client
                    Socket clientSock = sock.Accept();
                    //About 5MB of buffer space is allocated for the data transfer
                    byte[] clientData = new byte[1024 * 5000];
                    //A single Receive call may return only part of what the client sent,
                    //so keep reading until the client closes the connection.
                    int receivedBytesLen = 0;
                    int bytesRead;
                    while (receivedBytesLen < clientData.Length &&
                        (bytesRead = clientSock.Receive(clientData, receivedBytesLen,
                            clientData.Length - receivedBytesLen, SocketFlags.None)) > 0)
                    {
                        receivedBytesLen += bytesRead;
                    }
                    //The data starts with the file name length (4 bytes) followed by the file name
                    int fileNameLen = BitConverter.ToInt32(clientData, 0);
                    string fileName = Encoding.ASCII.GetString(clientData, 4, fileNameLen);
                    string filePath = receivedPath + "\\" + fileName;
                    //If the file exists in the path, the existing file is deleted
                    if (File.Exists(filePath))
                    {
                        File.Delete(filePath);
                    }
                    curMsg = "Receiving data..." + fileName;
                    //Binary stream writer to save the data received from the client.
                    BinaryWriter bWrite = new BinaryWriter(File.Open(filePath, FileMode.Append));
                    bWrite.Write(clientData, 4 + fileNameLen, receivedBytesLen - 4 - fileNameLen);

                    Thread.Sleep(1000);
                    curMsg = "Saving file...";

                    //Closing the binary writer
                    bWrite.Close();
                    //Closing the socket connection
                    clientSock.Close();

                    //Calling the merger method to combine the split files into a single readable file.
                    mergeFiles(fileName);
                }
            }
            catch (Exception ex)
            {
                curMsg = "File receiving error. " + ex.Message;
            }
        }

        // Merge method to combine the split files into a single readable file.
        public void mergeFiles(string recvdfileName)
        {
            //Only split parts, named <file name>.<part number>.vin, need merging
            if (!recvdfileName.EndsWith(".vin"))
                return;
            //Name of the original file, e.g. "photo.jpg" for "photo.jpg.3.vin"
            string fileName = getFilename(0, recvdfileName);
            //Get all the parts of this file received so far
            string[] filePaths = Directory.GetFiles(receivedPath, fileName + ".*.vin");
            //GetFiles does not guarantee any order, so sort the parts by part number
            Func<string, int> partNumber = p =>
            {
                string[] pieces = p.Split('.');
                return int.Parse(pieces[pieces.Length - 2]);
            };
            Array.Sort(filePaths, (a, b) => partNumber(a).CompareTo(partNumber(b)));

            //Creating the merged file, replacing it if it is already present
            string path = receivedPath + "\\" + fileName;
            FileStream outFile = new FileStream(path, FileMode.Create, FileAccess.Write);
            byte[] buffer = new byte[1024 * 1024];
            for (int i = 0; i < filePaths.Length; i++)
            {
                int data = 0;
                FileStream inFile = new FileStream(filePaths[i], FileMode.Open, FileAccess.Read);
                //reading the data from the split file and putting it into a single file.
                while ((data = inFile.Read(buffer, 0, buffer.Length)) > 0)
                {
                    outFile.Write(buffer, 0, data);
                }

                inFile.Close();
            }
            outFile.Close();
        }

        // Method to get the filenames with the extension
        public string getFilename(int i, string fileName)
        {
            string[] array = fileName.Split('\\');
            int size = array.Length;
            string[] tempname = array[size - 1].Split('.');
            return tempname[i] + "." + tempname[i+1];
        }

        // Method to delete all the files when the server stop button is clicked.
        public static void deleteFiles()
        {
            string[] filePaths = Directory.GetFiles(receivedPath, "*.vin");
            int parts = filePaths.Length;
            for (int i = 0; i < parts; i++)
            {
                File.Delete(filePaths[i]);
            }
        }
    }
}

Client Side:

using System;
using System.Data;
using System.Text;   // needed for Encoding
using System.Windows.Forms;
using System.Net;
using System.Net.Sockets;
using System.IO;
using System.Threading;

// File transfer protocol Client function by Vinay
namespace Client
{
    public partial class Client : Form
    {
        public Client()
        {
            InitializeComponent();
        }

        // Called when the button is clicked.
        private void btnSend_Click(object sender, EventArgs e)
        {
            string filePath = "";
            FTClientCode.curMsg = "Idle";
            //Loads the windows explorer to select a file.
            FileDialog loadFile = new OpenFileDialog();
            if (loadFile.ShowDialog() == DialogResult.OK)
            {
                FileInfo file = new FileInfo(loadFile.FileName);
                long fileLength = file.Length;    //the length of the file in bytes
                //extract the name of the file to be transferred
                string fileName = getFilename(1, loadFile.FileName);
                //size of the file in MB
                double fileinMB = converToMB(fileLength);
                string changedFileName = "";
                int parts = 1;
                //Checks if the size of the file chosen is greater than 1MB
                //If yes then splits the file into parts named with the extension .vin
                if (fileinMB > 1)
                {
                    parts = splitFile(file, fileinMB, fileName);
                }
                for (int i = 0; i < parts; i++)
                {
                    if (parts == 1)
                    {
                        changedFileName = loadFile.FileName;
                    }
                    else
                    {
                        changedFileName = "C:\\Test\\" + fileName + "." + i + ".vin";
                    }

                    changedFileName = changedFileName.Replace("\\", "/");
                    while (changedFileName.IndexOf("/") > -1)
                    {
                        filePath += changedFileName.Substring(0, changedFileName.IndexOf("/") + 1);
                        changedFileName = changedFileName.Substring(changedFileName.IndexOf("/") + 1);
                    }

                    byte[] fileNameByte = Encoding.ASCII.GetBytes(changedFileName);
                    //Reads all the data in the file
                    byte[] fileData = File.ReadAllBytes(filePath + changedFileName);
                    byte[] clientData = new byte[4 + fileNameByte.Length + fileData.Length];
                    byte[] fileNameLen = BitConverter.GetBytes(fileNameByte.Length);
                    fileNameLen.CopyTo(clientData, 0);
                    fileNameByte.CopyTo(clientData, 4);
                    fileData.CopyTo(clientData, 4 + fileNameByte.Length);

                    FTClientCode.SendFile(clientData, changedFileName);
                    //reset the path before the next part is processed
                    filePath = "";
                }
                //deletes the split files after they have been transferred
                deleteFiles(loadFile.FileName);
            }
        }


        // splits the big file into smaller files of size 1MB or less
        public static int splitFile(FileInfo f, double fileSize, string fileName)
        {
            int parts = (int)Math.Ceiling(fileSize / 1);
            int eachSize = (int)Math.Ceiling((double)fileSize / parts) * 1024 * 1024;
            FileStream inFile = new FileStream(f.FullName, FileMode.Open, FileAccess.Read);
            for (int i = 0; i < parts; i++)
            {
                string path = "C:\\Test\\" + fileName + "." + i + ".vin";
                if (File.Exists(path))
                    File.Delete(path);

                FileStream outFile = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write);
                int data = 0;
                byte[] buffer = new byte[eachSize];
                if ((data = inFile.Read(buffer, 0, eachSize)) > 0)
                {
                    outFile.Write(buffer, 0, data);
                }
                outFile.Close();
            }
            inFile.Close();
            return parts;
        }
                               
        // converts file length into megabytes
        public static double converToMB(long fileLength)
        {
            return (fileLength / 1024f) / 1024f;
        }
        public static string getFilename(int i, string fileName)
        {
            string[] array = fileName.Split('\\');
            int length = array.Length;
            return array[length - 1];
        }
        public void deleteFiles(string recvdfileName)
        {
            //must be the same folder that splitFile writes the parts to
            string[] filePaths = Directory.GetFiles("C:\\Test\\", "*.vin");
            string fileName = " ";
            int parts = filePaths.Length;
            for (int i = 0; i < parts; i++)
            {
                fileName = getFilename(0, recvdfileName);
                if (filePaths[i].IndexOf(fileName) > -1)
                {
                    File.Delete(filePaths[i]);
                }
            }
        }
        private void timer1_Tick(object sender, EventArgs e)
        {
            label3.Text = FTClientCode.curMsg;
        }
    }

    //FILE TRANSFER USING C#.NET SOCKET - CLIENT
    class FTClientCode
    {
        public static string curMsg = "";
        public static void SendFile( byte[] clientData, string fileName)
        {
            try
            {
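                //create a new client socket
                Socket clientSock = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
                // get the IP addresses of the server and pick the IPv4 one,
                // because the socket above is an IPv4 (InterNetwork) socket
                IPAddress[] ipAddress = Dns.GetHostAddresses("localhost");
                IPAddress serverAddress = Array.Find(ipAddress, a => a.AddressFamily == AddressFamily.InterNetwork);
                // Make IP end point same as Server.
                IPEndPoint ipEnd = new IPEndPoint(serverAddress, 5656);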

                curMsg = "Connection to server ...";
                clientSock.Connect(ipEnd);
                curMsg = "Buffering ...";
                curMsg = "File sending..." + fileName;
                //add some delay to let the process finish
                Thread.Sleep(3000);
                //send the data
                clientSock.Send(clientData);
                //close the client
                clientSock.Close();
               
                curMsg = "Disconnecting...";

                curMsg = "Files are transferred.";
            }
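            catch (SocketException ex)
            {
                //checking the error code is more reliable than comparing the message text
                if (ex.SocketErrorCode == SocketError.ConnectionRefused)
                    curMsg = "File sending failed because the server is not running.";
                else
                    curMsg = "File sending failed. " + ex.Message;
            }
            catch (Exception ex)
            {
                curMsg = "File sending failed. " + ex.Message;
            }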
        }      
    }
}
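
The client and server above agree on a simple layout for the data sent over the socket: 4 bytes with the length of the file name, then the file name in ASCII, then the file contents. To make that layout easier to see, here is a small sketch of it in Java. It is not part of the C# project: the class FrameSketch and the methods encode, decodeName and decodeData are names I made up, and the little-endian byte order is an assumption that matches what BitConverter produces on a typical x86/x64 Windows machine.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FrameSketch {

    // Builds [4-byte name length][name bytes][file bytes].
    // Little-endian is an assumption, chosen to match BitConverter
    // on typical x86/x64 machines.
    static byte[] encode(String fileName, byte[] fileData) {
        byte[] nameBytes = fileName.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buffer = ByteBuffer.allocate(4 + nameBytes.length + fileData.length);
        buffer.order(ByteOrder.LITTLE_ENDIAN);
        buffer.putInt(nameBytes.length);
        buffer.put(nameBytes);
        buffer.put(fileData);
        return buffer.array();
    }

    // Reads the name length stored in the first 4 bytes.
    static int nameLength(byte[] message) {
        return ByteBuffer.wrap(message).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
    }

    // The file name starts right after the 4-byte length.
    static String decodeName(byte[] message) {
        return new String(message, 4, nameLength(message), StandardCharsets.US_ASCII);
    }

    // Everything after the file name is the file content.
    static byte[] decodeData(byte[] message) {
        return Arrays.copyOfRange(message, 4 + nameLength(message), message.length);
    }

    public static void main(String[] args) {
        byte[] message = encode("photo.jpg.0.vin", new byte[] {10, 20, 30});
        System.out.println("name: " + decodeName(message));
        System.out.println("data: " + Arrays.toString(decodeData(message)));
    }
}
```

Sending the name length first is what lets the receiver know where the file name ends and the file data begins, since a TCP socket only delivers a plain stream of bytes.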