How to extract URL and Anchor from HTML content (16 April 2007)


 
This C# snippet shows how to extract URL/anchor combinations from HTML content and store the URLs in an ArrayList. To achieve this, two regular expressions are used:
- One regular expression extracts all URL/anchor combinations from the HTML content.
- One regular expression checks whether a string is a fully qualified URL (for example, not a relative one).

Snippet:
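//Requires: using System.Collections; using System.Text.RegularExpressions;
//lv_MyHTML is a string that holds the HTML content to scan

//Define ArrayList in which all URLs will be stored
ArrayList lv_URLs = new ArrayList();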

//This Regular Expression finds URL/anchor combinations
Regex lv_FindAllURLs = new Regex(
      @"<a[^>]*href\s*=\s*[\""\']?(?<URI>[^""'>\s]*)[\""\']?[^>]*>(?<Name>[^<]+|.*?)?</a>");

//This Regular Expression checks if a string is an absolute http or https URL.
//It is created once, outside the loop, so it is not rebuilt for every match.
Regex lv_IsURL = new Regex(@"^https?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

//Get all the matches for the URL/anchor regular expression
MatchCollection mMatchCollection = lv_FindAllURLs.Matches(lv_MyHTML);

foreach(Match mMatch in mMatchCollection)
{
  string lv_URL = mMatch.Groups["URI"].Value;
  string lv_Anchor = mMatch.Groups["Name"].Value;

  //Keep the found URL only if it is absolute; relative URLs are skipped
  if (lv_IsURL.IsMatch(lv_URL))
  {
    lv_URLs.Add(lv_URL);  
  }
}
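
For readers who want to try this end to end, here is a minimal, self-contained sketch that wraps the same two-step idea (find URL/anchor pairs, then keep only absolute URLs) in a small console program. Several things in it are assumptions rather than part of the original post: the class name LinkExtractor, the method name Extract, the sample HTML, the use of RegexOptions.IgnoreCase, and the "^https?://" prefix on the URL check. The angle-bracketed parts of the first pattern are written as they would need to be for the "URI" and "Name" groups used in the snippet to exist.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Finds <a ... href=...>text</a> pairs; the group names "URI" and "Name"
    // match the ones read in the snippet. IgnoreCase is an added assumption
    // so that <A HREF=...> is also found.
    private static readonly Regex FindAllUrls = new Regex(
        @"<a[^>]*href\s*=\s*[\""\']?(?<URI>[^""'>\s]*)[\""\']?[^>]*>(?<Name>[^<]+|.*?)?</a>",
        RegexOptions.IgnoreCase);

    // Accepts only strings that start with http:// or https:// (assumption:
    // the anchoring and the optional "s" are additions for this sketch).
    private static readonly Regex IsUrl = new Regex(
        @"^https?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

    // Returns (URL, anchor text) pairs for every absolute link found in html.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match match in FindAllUrls.Matches(html))
        {
            string url = match.Groups["URI"].Value;
            string anchor = match.Groups["Name"].Value;
            if (IsUrl.IsMatch(url))
            {
                links.Add(new KeyValuePair<string, string>(url, anchor));
            }
        }
        return links;
    }

    public static void Main()
    {
        // Made-up sample input: one absolute link and one relative link.
        string html =
            "<p><a href=\"http://www.example.com/page\">Example</a> and " +
            "<a href='/local/page.html'>Local</a></p>";

        foreach (var link in Extract(html))
        {
            // prints: http://www.example.com/page -> Example
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

The relative link is matched by the first pattern but rejected by the second, so only the absolute URL is printed. Note that a regular expression is a pragmatic shortcut here, not a full HTML parser: the first pattern also matches other tags whose names begin with "a" and carry an href, and it will not cope with every form of malformed markup.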

Posted by Xander Zelders



DotNet4All.Com concept & © 2004 - 2007 by Zelders² - Holland