This C# snippet shows how to extract URL/anchor combinations from HTML content and store the URLs in an ArrayList.
To achieve this, two regular expressions are used:
- One regular expression extracts all URL/anchor combinations from the HTML content.
- One regular expression checks whether a string is a fully qualified URL (for example, not a relative one).
Snippet:
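//Requires: using System.Collections; using System.Text.RegularExpressions;
//lv_MyHTML is assumed to be a string that holds the HTML content
//Define ArrayList in which all URLs will be stored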
ArrayList lv_URLs = new ArrayList();
//This Regular Expression finds URL/anchor combinations
Regex lv_FindAllURLs = new Regex(
    @"<a[^>]*href\s*=\s*[\""\']?(?<URI>[^""'>\s]*)[\""\']?[^>]*>(?<Name>[^<]+|.*?)?</a\s*>");
//Get all matches of the regular expression in the HTML content
MatchCollection mMatchCollection = lv_FindAllURLs.Matches(lv_MyHTML);
//This Regular Expression checks if a string contains an absolute http:// URL
Regex lv_IsURL = new Regex(@"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");
foreach (Match mMatch in mMatchCollection)
{
    string lv_URL = mMatch.Groups["URI"].Value;
    string lv_Anchor = mMatch.Groups["Name"].Value;
    //Check if the found URL is an absolute URL. (It can also be a relative URL, which is skipped)
    if (lv_IsURL.IsMatch(lv_URL))
    {
        lv_URLs.Add(lv_URL);
    }
}
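The snippet above skips relative URLs. If you would rather keep them, one possible variation is to resolve each href against the address of the page it came from. The sketch below is only an illustration and not part of the original snippet: the class name LinkExtractor, the method ExtractAbsoluteUrls and the baseUri parameter are made up for this example, and it swaps the second regular expression for the System.Uri class and the ArrayList for a List<string>.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same anchor pattern as in the snippet: "URI" captures the href value,
    // "Name" captures the link text. IgnoreCase is an addition of this sketch.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a[^>]*href\s*=\s*[""']?(?<URI>[^""'>\s]*)[""']?[^>]*>(?<Name>[^<]+|.*?)?</a\s*>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns absolute http/https URLs found in html.
    // baseUri is assumed to be the address of the page the HTML came from.
    public static List<string> ExtractAbsoluteUrls(string html, Uri baseUri)
    {
        var urls = new List<string>();
        foreach (Match match in AnchorPattern.Matches(html))
        {
            string href = match.Groups["URI"].Value;
            if (href.Length == 0)
            {
                continue; // skip empty href values
            }
            // Uri.TryCreate keeps absolute hrefs as they are and resolves
            // relative ones against baseUri.
            if (Uri.TryCreate(baseUri, href, out Uri resolved)
                && (resolved.Scheme == Uri.UriSchemeHttp
                    || resolved.Scheme == Uri.UriSchemeHttps))
            {
                urls.Add(resolved.AbsoluteUri);
            }
        }
        return urls;
    }
}
```

Calling LinkExtractor.ExtractAbsoluteUrls(lv_MyHTML, new Uri("http://example.com/docs/")) would then return a relative link such as about.html as a full address under that base, while links with other schemes (mailto: for example) are left out.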
Posted by Xander Zelders
