How to extract URL and Anchor from HTML content
(16 April 2007)
This C# snippet shows how to extract URL/anchor combinations from HTML content and store the URLs in an ArrayList.
To achieve this, two regular expressions are used:
- One regular expression extracts all URL/anchor combinations from the HTML content.
- One regular expression checks whether a string is a fully qualified URL (for example, not a relative one).
Snippet:
using System.Collections;
using System.Text.RegularExpressions;
...
//Define the ArrayList in which all URLs will be stored
ArrayList lv_URLs = new ArrayList();
//This regular expression finds URL/anchor combinations
Regex lv_FindAllURLs = new Regex(
    @"<a[^>]*href\s*=\s*[\""\']?(?<URI>[^""'>\s]*)[\""\']?[^>]*>(?<Name>[^<]+|.*?)?</a\s*>");
//This regular expression checks if a string is an absolute http/https URL
Regex lv_IsURL = new Regex(@"https?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");
//Get all the matches; lv_MyHTML holds the HTML content to scan
MatchCollection mMatchCollection = lv_FindAllURLs.Matches(lv_MyHTML);
foreach (Match mMatch in mMatchCollection)
{
    string lv_URL = mMatch.Groups["URI"].Value;
    string lv_Anchor = mMatch.Groups["Name"].Value;
    //Check if the found URL is a valid URL (it can also be a relative URL)
    if (lv_IsURL.IsMatch(lv_URL))
    {
        lv_URLs.Add(lv_URL);
    }
}
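To try the approach end to end, here is a minimal, self-contained sketch. The class name LinkExtractorDemo, the method ExtractAbsoluteLinks, and the sample HTML are assumptions made for illustration and are not part of the original post. The anchor pattern uses the URI and Name groups that the loop above reads; the URL check is anchored at the start of the string and also accepts https, which is a deliberate variation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractorDemo
{
    // Anchor pattern with the named groups URI (href value) and Name (link text).
    private static readonly Regex FindAllUrls = new Regex(
        @"<a[^>]*href\s*=\s*[\""\']?(?<URI>[^""'>\s]*)[\""\']?[^>]*>(?<Name>[^<]+|.*?)?</a\s*>");

    // Accepts only absolute http/https URLs (assumption: relative links are skipped).
    private static readonly Regex IsUrl = new Regex(
        @"^https?://([\w\-]+\.)+[\w\-]+(/[\w\-. /?%&=]*)?");

    // Returns (URL, anchor text) pairs for every absolute link found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractAbsoluteLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match match in FindAllUrls.Matches(html))
        {
            string url = match.Groups["URI"].Value;
            string anchor = match.Groups["Name"].Value;
            if (IsUrl.IsMatch(url))
            {
                links.Add(new KeyValuePair<string, string>(url, anchor));
            }
        }
        return links;
    }

    public static void Main()
    {
        // Hypothetical sample input: one absolute link and one relative link.
        string html = "<a href=\"http://www.example.com/\">Home</a> <a href=\"/about\">About</a>";
        foreach (var link in ExtractAbsoluteLinks(html))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

Returning the anchor text together with the URL keeps both halves of the combination available; the original snippet stores only the URL.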
Posted by Xander Zelders
Grab the content of a (GZIP) webpage using C#
This code snippet demonstrates how to grab the content of a webpage and put it in a string variable. It can be used in several kinds of applications, such as a web crawler or spider.
Since bandwidth is (most of the time) an issue, I have also added code that enables and handles GZIP/DEFLATE compressed content. Compressed content can save up to 80% of the needed bandwidth, since most of the content is text based. This feature can easily be disabled by removing the 'Accept-Encoding' line in the first snippet.
Now let's start!
First we need a routine that grabs content from a valid URL:
using System.IO;
using System.Net;
...
public string GrabURL(string in_URL)
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(in_URL);
        webRequest.Timeout = 6000;
        webRequest.ReadWriteTimeout = 8000;
        //Accept GZIP and DEFLATE compressed content.
        //You can remove this line; the decompression functions
        //are then not needed.
        webRequest.Headers.Add("Accept-Encoding: deflate, gzip");
        //Define the user agent
        webRequest.UserAgent = "MyUserAgent";
        using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
        using (Stream responseStream = webResponse.GetResponseStream())
        {
            //Content-Encoding names the compression used, if any
            string lv_Encoding = (webResponse.ContentEncoding ?? "").ToLower();
            //Decompress the content when GZIP compression is used
            if (lv_Encoding.IndexOf("gzip") > -1)
            {
                return DecompressGzip(responseStream);
            }
            //Decompress the content when DEFLATE compression is used
            if (lv_Encoding.IndexOf("deflate") > -1)
            {
                return DecompressDeflate(responseStream);
            }
            //Uncompressed content: read it as text (UTF-8 by default)
            using (StreamReader responseReader = new StreamReader(responseStream))
            {
                return responseReader.ReadToEnd();
            }
        }
    }
    catch
    {
        //Any failure (bad URL, timeout, HTTP error) is reported as "error"
        return "error";
    }
}
Of course, when you have enabled GZIP and/or DEFLATE encoding, the decompression routines for GZIP and DEFLATE are needed. Both use the System.IO.Compression namespace:
using System.IO.Compression;
First we need a function that handles GZIP-encoded content:
private string DecompressGzip(Stream in_InputStream)
{
    Stream lv_OutputStream = new MemoryStream();
    try
    {
        byte[] lv_Buffer = new byte[4096];
        using (GZipStream lv_gzip = new GZipStream(
            in_InputStream, CompressionMode.Decompress))
        {
            int i;
            //Copy the decompressed bytes until the stream is exhausted
            while ((i = lv_gzip.Read(lv_Buffer, 0, lv_Buffer.Length)) != 0)
            {
                lv_OutputStream.Write(lv_Buffer, 0, i);
            }
        }
    }
    catch (Exception ex)
    {
        WriteEventLog(ex.Message);
    }
    return Stream2String(lv_OutputStream);
}
Then we need the routine to decompress DEFLATE encoded content:
private string DecompressDeflate(Stream in_InputStream)
{
    Stream lv_OutputStream = new MemoryStream();
    try
    {
        byte[] lv_Buffer = new byte[4096];
        using (DeflateStream lv_Deflate = new DeflateStream(
            in_InputStream, CompressionMode.Decompress))
        {
            int i;
            //Copy the decompressed bytes until the stream is exhausted
            while ((i = lv_Deflate.Read(lv_Buffer, 0, lv_Buffer.Length)) != 0)
            {
                lv_OutputStream.Write(lv_Buffer, 0, i);
            }
        }
    }
    catch (Exception ex)
    {
        WriteEventLog(ex.Message);
    }
    return Stream2String(lv_OutputStream);
}
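The two decompression routines call Stream2String and WriteEventLog, which are not shown in the post. The sketch below is one possible stand-in, not the author's implementation: it assumes the page text is UTF-8 and simply writes log messages to standard error. The class name StreamHelpers and the GzipText/GunzipToString round-trip helpers are hypothetical and exist only to exercise the code without a network connection.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class StreamHelpers
{
    // Hypothetical stand-in for the post's Stream2String helper (assumes UTF-8 text).
    public static string Stream2String(Stream in_Stream)
    {
        // Rewind first: the copy loop leaves the position at the end of the stream.
        if (in_Stream.CanSeek)
        {
            in_Stream.Position = 0;
        }
        using (StreamReader reader = new StreamReader(in_Stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical stand-in for the post's WriteEventLog helper.
    public static void WriteEventLog(string in_Message)
    {
        Console.Error.WriteLine(in_Message);
    }

    // Demo helper: GZIP-compresses a string into a byte array.
    public static byte[] GzipText(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (MemoryStream output = new MemoryStream())
        {
            // leaveOpen = true so the MemoryStream can still be read afterwards.
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Demo helper: decompresses with the same read loop as DecompressGzip.
    public static string GunzipToString(byte[] data)
    {
        MemoryStream output = new MemoryStream();
        byte[] buffer = new byte[4096];
        using (GZipStream gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        {
            int read;
            while ((read = gzip.Read(buffer, 0, buffer.Length)) != 0)
            {
                output.Write(buffer, 0, read);
            }
        }
        return Stream2String(output);
    }

    public static void Main()
    {
        Console.WriteLine(GunzipToString(GzipText("Hello, compressed world")));
    }
}
```

Rewinding the stream matters here: without it, reading the MemoryStream after the write loop would return an empty string.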
After implementing the code above, the GrabURL function can be used as follows:
string lv_HTML;
lv_HTML = GrabURL("http://www.dotnet4all.com");
Enjoy!
Posted by Xander Zelders
How to extract the host name from a URL (C#)
This snippet demonstrates how to extract the host/domain name from a valid URL using regular expressions.
using System.Text.RegularExpressions;
...
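public static string ExtractDomainFromURL(string in_URL)
{
    //The pattern splits a URL into scheme, authority, path, query and fragment.
    //Only the groups s1 (scheme plus ':') and a1 ('//' plus host) are named;
    //with ExplicitCapture the unnamed groups capture nothing.
    string regexPattern = @"^(?<s1>([^:/\?#]+):)?(?<a1>"
        + @"//([^/\?#]*))?([^\?#]*)"
        + @"(\?([^#]*))?"
        + @"(#(.*))?";
    Regex re = new Regex(regexPattern, RegexOptions.ExplicitCapture);
    Match m = re.Match(in_URL);
    //For "http://www.dotnet4all.com/page" this yields "http://www.dotnet4all.com"
    return m.Groups["s1"].Value + m.Groups["a1"].Value;
}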
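As an alternative to a hand-written pattern, the framework's System.Uri class can do the same split. This is a different technique from the one in the post, shown as a sketch only; the class name UriHostDemo, its method names, and the sample URL are assumptions for illustration.

```csharp
using System;

public static class UriHostDemo
{
    // Returns only the host name, for example "www.example.com".
    public static string HostOf(string url)
    {
        return new Uri(url).Host;
    }

    // Returns scheme plus authority, for example "http://www.example.com".
    public static string SchemeAndAuthorityOf(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    public static void Main()
    {
        // Hypothetical sample URL.
        string url = "http://www.example.com/path/page.html?x=1#top";
        Console.WriteLine(HostOf(url));
        Console.WriteLine(SchemeAndAuthorityOf(url));
    }
}
```

Note that the Uri constructor throws a UriFormatException for strings that are not absolute URLs, whereas the regular expression simply returns an empty or partial result.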
Labels: Domain, Host, Regular Expression, URL
Posted by Xander Zelders
How to Send an email using SMTP (C#)
Sending email is easy. The only condition is that you have access to an SMTP mail server.
using System.Web.Mail;
...
public static void SendF4Mail(string in_Body, string in_Subject,
    string in_From, string in_To, string in_Bcc)
{
    MailMessage lv_Mail = new MailMessage();
    string lv_MySMTPServer = "smtp.xxxxx.xxxx"; //Your SMTP server
    lv_Mail.To = in_To; //Valid email address
    lv_Mail.Bcc = in_Bcc; //Valid email address
    lv_Mail.Subject = in_Subject;
    lv_Mail.From = in_From; //Valid email address (that is allowed to relay)
    lv_Mail.Body = in_Body;
    lv_Mail.Priority = MailPriority.Normal;
    lv_Mail.BodyFormat = MailFormat.Html; //or MailFormat.Text
    //Tell SmtpMail which server to use
    SmtpMail.SmtpServer = lv_MySMTPServer;
    SmtpMail.Send(lv_Mail); //Send the email
}
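System.Web.Mail has been marked obsolete since .NET 2.0 in favor of System.Net.Mail. The sketch below shows roughly how the same message could be built and sent with the newer namespace; it is an illustration, not a drop-in replacement for the snippet above. The class name NetMailDemo, its method names, and the example addresses are assumptions, and the SMTP host is still a placeholder you must supply.

```csharp
using System;
using System.Net.Mail;

public static class NetMailDemo
{
    // Builds the message; kept separate from sending so it can be inspected first.
    public static MailMessage BuildMessage(string body, string subject,
        string from, string to, string bcc)
    {
        MailMessage mail = new MailMessage(from, to, subject, body);
        if (!string.IsNullOrEmpty(bcc))
        {
            mail.Bcc.Add(bcc);
        }
        mail.IsBodyHtml = true;               // set to false for plain text
        mail.Priority = MailPriority.Normal;
        return mail;
    }

    // Sends the message through the given SMTP host (placeholder, not a real server).
    public static void Send(MailMessage mail, string smtpHost)
    {
        using (SmtpClient client = new SmtpClient(smtpHost))
        {
            client.Send(mail);
        }
    }

    public static void Main()
    {
        using (MailMessage mail = BuildMessage("<p>Hello</p>", "Test",
            "from@example.com", "to@example.com", "bcc@example.com"))
        {
            // Only prints the subject; call Send(mail, "your.smtp.host") to deliver it.
            Console.WriteLine(mail.Subject);
        }
    }
}
```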
Posted by Xander Zelders
How to remove HTML-tags from web content (C#)
Using web content in applications can be very annoying, since web content usually contains lots of HTML elements. With one simple action, using a regular expression, all of these HTML elements can be removed from the content. What's left is a clean string, without HTML formatting.
Snippet:
using System.Text.RegularExpressions;
...
public static string RemoveHTML(string in_HTML)
{
    //Remove everything between '<' and the next '>', including line breaks
    return Regex.Replace(in_HTML, "<(.|\n)*?>", "");
}
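A short usage sketch of the function above, with hypothetical sample input. The class name RemoveHtmlDemo is an assumption for illustration. Note that the pattern only strips tags: character entities such as &amp; are left in the result unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RemoveHtmlDemo
{
    // Same one-line approach as in the post: drop anything that looks like a tag.
    public static string RemoveHTML(string in_HTML)
    {
        return Regex.Replace(in_HTML, "<(.|\n)*?>", "");
    }

    public static void Main()
    {
        // Hypothetical sample input.
        Console.WriteLine(RemoveHTML("<p>Hello <b>world</b>!</p>")); // prints "Hello world!"
    }
}
```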
Labels: HTML, Regular Expression
Posted by Xander Zelders
How to convert DateTime to SQL valid string
Using DateTime fields in an SQL string is not difficult, as long as the DateTime value is formatted as 'YYYY-MM-DD HH:mm'.
For example:
"Select * from MyTable Where MyDateTimeField > '2007-01-01 12:00'"
For converting a .NET DateTime object to a valid SQL-compliant string using C# the following code snippet can be used:
using System.Globalization;
...
public static string ConverteerDatum(DateTime in_Datum)
{
    //Fixed, zero-padded format that does not depend on the current culture
    return in_Datum.ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
}
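The following sketch shows the kind of output such a conversion should produce. It uses a custom format string with the invariant culture, so months, days and hours are zero-padded and the result does not depend on the machine's regional settings. The class name SqlDateDemo and the method name ToSqlDateTime are assumptions for illustration.

```csharp
using System;
using System.Globalization;

public static class SqlDateDemo
{
    // Formats a DateTime as 'yyyy-MM-dd HH:mm:ss' (24-hour clock, zero-padded).
    public static string ToSqlDateTime(DateTime value)
    {
        return value.ToString("yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        DateTime sample = new DateTime(2007, 1, 1, 12, 0, 0);
        // Builds the example query from the post with a formatted date.
        string sql = "Select * from MyTable Where MyDateTimeField > '"
            + ToSqlDateTime(sample) + "'";
        Console.WriteLine(sql);
    }
}
```

For the sample date the formatted value is '2007-01-01 12:00:00'.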
Labels: Convert, DateTime, SQL
Posted by Xander Zelders