Strip HTML tags, HEAD content and SCRIPT tags from content using Regex and C#
(08 December 2008)
This code snippet shows how to use C# and Regex to remove HTML tags, along with all content inside HEAD and SCRIPT elements.
Pass an HTML-formatted web page in the string 'in_HTML'. The function returns the content of that page as a string, excluding the header information and any JavaScript.
// Requires: using System.Text.RegularExpressions;
private string RemoveHTML(string in_HTML)
{
    string lv_HTML = in_HTML;

    // Exclude all content between HEAD tags
    lv_HTML = Regex.Replace(lv_HTML, "<head.*?</head>", "",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Exclude all content between SCRIPT tags
    lv_HTML = Regex.Replace(lv_HTML, "<script.*?</script>", "",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Strip all remaining HTML tags; Singleline lets '.' match newlines inside a tag
    lv_HTML = Regex.Replace(lv_HTML, "<.*?>", "", RegexOptions.Singleline);

    return lv_HTML;
}
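As a rough usage sketch, the program below runs the same three steps on a small hard-coded page. The class names HtmlStripper and Program and the sample markup are made up for illustration; the method is declared public static here only so it can be called without an instance.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, not part of the original snippet.
public static class HtmlStripper
{
    public static string RemoveHTML(string in_HTML)
    {
        string lv_HTML = in_HTML;

        // Exclude all content between HEAD tags
        lv_HTML = Regex.Replace(lv_HTML, "<head.*?</head>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Exclude all content between SCRIPT tags
        lv_HTML = Regex.Replace(lv_HTML, "<script.*?</script>", "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Strip all remaining HTML tags, including tags that span several lines
        lv_HTML = Regex.Replace(lv_HTML, "<.*?>", "", RegexOptions.Singleline);

        return lv_HTML;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample input, invented for this example
        string page =
            "<html><head><title>Demo</title></head>" +
            "<body><script>alert('hi');</script>" +
            "<p>Hello <b>world</b></p></body></html>";

        Console.WriteLine(HtmlStripper.RemoveHTML(page)); // prints "Hello world"
    }
}
```

Note that only the tags are removed. Whitespace between elements and HTML entities such as &amp; are returned unchanged, so the result may still need trimming or decoding.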
Posted by Xander Zelders
