This code snippet demonstrates how to grab the content of a webpage and put it in a string variable. It can be used in several kinds of applications, such as a web crawler or spider.
Since bandwidth is usually an issue, I have also added code that requests and handles GZIP/DEFLATE compressed content. Compression can save up to 80% of the needed bandwidth, since most of the content is text based. This feature can easily be disabled by removing the 'Accept-Encoding' line in the first snippet.
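To get a feel for the savings, here is a minimal sketch that gzips a repetitive HTML-like string in memory and compares the sizes. The class name `CompressionDemo` and the sample markup are made up for illustration; real pages will compress by different amounts.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class CompressionDemo
{
    // Returns the GZIP-compressed size, in bytes, of the UTF-8 encoded text.
    public static int GzipSize(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (MemoryStream output = new MemoryStream())
        {
            // The GZipStream must be closed before the size is read,
            // otherwise the trailing bytes have not been flushed yet.
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return output.ToArray().Length;
        }
    }

    public static void Main()
    {
        // Hypothetical sample: 200 identical table rows.
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < 200; i++)
        {
            html.Append("<tr><td class=\"cell\">row</td></tr>\n");
        }
        string text = html.ToString();
        int rawSize = Encoding.UTF8.GetByteCount(text);
        int zipSize = GzipSize(text);
        Console.WriteLine("raw: {0} bytes, gzip: {1} bytes", rawSize, zipSize);
    }
}
```

Highly repetitive markup like this compresses far better than a typical page, so treat the printed numbers as an upper bound rather than a benchmark.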
Now let's start!
First we need a routine that grabs content from a valid URL:
// Requires: using System.IO; using System.Net;
public string GrabURL(string in_URL)
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(in_URL);
        webRequest.Timeout = 6000;          // milliseconds
        webRequest.ReadWriteTimeout = 8000; // milliseconds
        // Accept GZIP and DEFLATE compressed content.
        // You can decide to remove this line. The decompression functions
        // are then not needed.
        webRequest.Headers.Add("Accept-Encoding: deflate, gzip");
        // Define the user agent.
        webRequest.UserAgent = "MyUserAgent";
        // The using blocks release the connection when we are done.
        using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
        using (Stream responseStream = webResponse.GetResponseStream())
        {
            // ContentEncoding names the compression method, not the character set.
            string contentEncoding = (webResponse.ContentEncoding ?? "").ToLowerInvariant();
            // Decompress the content when GZIP compression is used.
            if (contentEncoding.IndexOf("gzip") > -1)
            {
                return DecompressGzip(responseStream);
            }
            // Decompress the content when DEFLATE compression is used.
            if (contentEncoding.IndexOf("deflate") > -1)
            {
                return DecompressDeflate(responseStream);
            }
            // Uncompressed content: read it directly (StreamReader defaults to UTF-8).
            using (StreamReader responseReader = new StreamReader(responseStream))
            {
                return responseReader.ReadToEnd();
            }
        }
    }
    catch
    {
        // Any failure (bad URL, timeout, HTTP error) ends up here.
        return "error";
    }
}
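Note that the Content-Encoding header only describes compression. The character set of the page is reported separately, and on `HttpWebResponse` it is exposed through the `CharacterSet` property. If you want to honour it instead of always reading UTF-8, something along these lines could work. This is only a sketch: the class `CharsetHelper` and the method `ResolveEncoding` are names I made up for this example, and falling back to UTF-8 is an assumption, not a rule.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Maps a charset name (for example the value of HttpWebResponse.CharacterSet)
    // to an Encoding, falling back to UTF-8 when the name is missing or unknown.
    public static Encoding ResolveEncoding(string in_Charset)
    {
        if (string.IsNullOrEmpty(in_Charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(in_Charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws for names it does not recognise.
            return Encoding.UTF8;
        }
    }
}
```

Inside GrabURL the reader could then be created as `new StreamReader(responseStream, CharsetHelper.ResolveEncoding(webResponse.CharacterSet))`.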
In case you want to remove the HTML formatting from the webpage you might be interested in this snippet. In case you want to extract all URL/Anchor combinations from a webpage you might be interested in this snippet.
Of course, when you have enabled GZIP and/or DEFLATE encoding, the decompression algorithms for GZIP and DEFLATE are needed. Both are available in the System.IO.Compression namespace:
using System.IO.Compression;
First we need a function that handles GZIP-encoded content:
private string DecompressGzip(Stream in_InputStream)
{
    Stream lv_OutputStream = new MemoryStream();
    try
    {
        byte[] lv_Buffer = new byte[4096];
        using (GZipStream lv_gzip = new GZipStream(
            in_InputStream, CompressionMode.Decompress))
        {
            // Copy the decompressed bytes in 4 KB chunks until the stream ends.
            int i;
            while ((i = lv_gzip.Read(lv_Buffer, 0, lv_Buffer.Length)) != 0)
            {
                lv_OutputStream.Write(lv_Buffer, 0, i);
            }
        }
    }
    catch (Exception ex)
    {
        WriteEventLog(ex.Message);
    }
    return Stream2String(lv_OutputStream);
}
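Then we need the routine to decompress DEFLATE-encoded content. Note that DeflateStream expects a raw DEFLATE stream, so a server that sends the zlib-wrapped form will end up in the catch block: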
private string DecompressDeflate(Stream in_InputStream)
{
    Stream lv_OutputStream = new MemoryStream();
    try
    {
        byte[] lv_Buffer = new byte[4096];
        using (DeflateStream lv_Deflate = new DeflateStream(
            in_InputStream, CompressionMode.Decompress))
        {
            // Copy the decompressed bytes in 4 KB chunks until the stream ends.
            int i;
            while ((i = lv_Deflate.Read(lv_Buffer, 0, lv_Buffer.Length)) != 0)
            {
                lv_OutputStream.Write(lv_Buffer, 0, i);
            }
        }
    }
    catch (Exception ex)
    {
        WriteEventLog(ex.Message);
    }
    return Stream2String(lv_OutputStream);
}
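The two decompression routines call Stream2String and WriteEventLog, which are not shown in this post. Below is one possible way to fill them in. Treat it as a sketch under assumptions: the wrapper class `GrabHelpers` is made up, the content is assumed to be UTF-8, and Trace.WriteLine is only a stand-in for whatever logging you actually use.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

public static class GrabHelpers
{
    // Hypothetical helper: reads a whole stream into a string, assuming UTF-8.
    public static string Stream2String(Stream in_Stream)
    {
        // The decompression routines leave the stream positioned at its end,
        // so rewind first when the stream supports seeking.
        if (in_Stream.CanSeek)
        {
            in_Stream.Position = 0;
        }
        using (StreamReader lv_Reader = new StreamReader(in_Stream, Encoding.UTF8))
        {
            return lv_Reader.ReadToEnd();
        }
    }

    // Hypothetical helper: writes the message to the trace listeners.
    // Replace this with your own logging (event log, file, ...).
    public static void WriteEventLog(string in_Message)
    {
        Trace.WriteLine("GrabURL: " + in_Message);
    }
}
```

If you paste both methods into the same class as GrabURL, the calls in the decompression routines work without the `GrabHelpers.` prefix.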
After implementing the code above, the function GrabURL can be used as follows:
string lv_HTML;
lv_HTML = GrabURL("http://www.dotnet4all.com");
Enjoy!
Posted by Xander Zelders
