How to extract URL and Anchor from HTML content (16 April 2007)


This C# snippet shows how to extract URL/anchor combinations from HTML content and store the URLs in an ArrayList.
To achieve this, two regular expressions are used:
- One regular expression extracts all URL/anchor combinations from the HTML content.
- One regular expression checks whether a string is a fully qualified URL (for example, not a relative one).
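
To make the second check concrete, here is a minimal sketch that runs the URL pattern from this post against a few sample strings. The class name UrlCheckDemo, the method name IsQualified and the sample strings are made up for illustration. Note that the pattern only looks for http://, so https:// links do not pass it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlCheckDemo
{
    // Same pattern as the second regular expression in this post.
    private static readonly Regex IsUrl =
        new Regex(@"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

    // Hypothetical helper: true when the candidate contains a qualified http:// URL.
    public static bool IsQualified(string candidate)
    {
        return IsUrl.IsMatch(candidate);
    }

    public static void Main()
    {
        // Illustrative inputs: one absolute URL and two relative ones.
        string[] samples =
        {
            "http://www.example.com/page.html",
            "/about/contact.html",
            "images/logo.png"
        };

        foreach (string sample in samples)
        {
            Console.WriteLine("{0} -> {1}", sample, IsQualified(sample));
        }
    }
}
```

Only the first sample passes the check; the two relative paths are rejected because they do not contain http:// followed by a host name.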

Snippet:
using System.Collections;
using System.Text.RegularExpressions;

// lv_MyHTML is assumed to already hold the HTML content to scan

// Define the ArrayList in which all URLs will be stored
ArrayList lv_URLs = new ArrayList();

// This regular expression finds URL/anchor combinations (<a href="...">text</a>);
// the named groups URI and Name capture the href value and the anchor text
Regex lv_FindAllURLs = new Regex(
    @"<a[^>]*href\s*=\s*[\""\']?(?<URI>[^""'>\s]*)[\""\']?[^>]*>(?<Name>[^<]+|.*?)?</a\s*>");

// This regular expression checks if a string is a qualified (absolute) URL;
// it is created once, outside the loop
Regex lv_IsURL = new Regex(@"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

// Get all the matches for the URL/anchor expression
MatchCollection mMatchCollection = lv_FindAllURLs.Matches(lv_MyHTML);

foreach (Match mMatch in mMatchCollection)
{
    string lv_URL = mMatch.Groups["URI"].Value;
    string lv_Anchor = mMatch.Groups["Name"].Value;

    // Keep the found URL only if it is a qualified URL; relative URLs are skipped
    if (lv_IsURL.IsMatch(lv_URL))
    {
        lv_URLs.Add(lv_URL);
    }
}
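
The snippet reads the anchor text into lv_Anchor but only stores the URL. As a rough sketch of how both could be kept, the program below wraps the same two regular expressions in a method and returns anchor/URL pairs. The class name LinkExtractorDemo, the method name Extract, the sample HTML and the use of a generic List instead of an ArrayList are all assumptions made for illustration. The `<a` opening and `</a>` closing parts of the anchor pattern are an assumption of this sketch as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractorDemo
{
    // Anchor pattern; the "<a" opening and "</a>" closing parts are assumed here.
    private static readonly Regex FindLinks = new Regex(
        @"<a[^>]*href\s*=\s*[\""\']?(?<URI>[^""'>\s]*)[\""\']?[^>]*>(?<Name>[^<]+|.*?)?</a\s*>");

    // URL check pattern as used in the snippet.
    private static readonly Regex IsUrl =
        new Regex(@"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

    // Hypothetical helper: returns (anchor text, URL) pairs for every link
    // whose href passes the URL check; relative links are skipped.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();

        foreach (Match match in FindLinks.Matches(html))
        {
            string url = match.Groups["URI"].Value;
            string anchor = match.Groups["Name"].Value;

            if (IsUrl.IsMatch(url))
            {
                links.Add(new KeyValuePair<string, string>(anchor, url));
            }
        }

        return links;
    }

    public static void Main()
    {
        // Illustrative input: two absolute links and one relative link.
        string html =
            "<p><a href=\"http://www.example.com/docs\">Docs</a> " +
            "<a href='/local/page.html'>Local</a> " +
            "<a class=\"x\" href=\"http://blog.example.org/\">Blog</a></p>";

        foreach (KeyValuePair<string, string> link in Extract(html))
        {
            Console.WriteLine("{0} -> {1}", link.Key, link.Value);
        }
    }
}
```

With this sample input the relative link is dropped and the two absolute links are printed with their anchor text. The patterns are case-sensitive as written, so upper-case tags such as `<A HREF=...>` would need RegexOptions.IgnoreCase.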

Posted by Xander Zelders
 


Disclaimer & Terms of Use | DotNet4All.Com concept & © 2004 - 2007 by Zelders² - Holland