Strip HTML tags, HEAD content and SCRIPT tags from content using Regex and C# (08 December 2008)


 
This code snippet shows how to remove HTML tags, along with everything inside HEAD and SCRIPT elements, using C# and Regex. Pass an HTML-formatted web page in the string 'in_HTML'. The function returns the text content of that page, excluding the header information and JavaScript.
        // Requires: using System.Text.RegularExpressions;
        private string RemoveHTML(string in_HTML)
        {
            string lv_HTML = in_HTML;

            // Remove everything between the HEAD tags, including the tags themselves.
            // Singleline lets '.' match newlines, so multi-line blocks are covered.
            lv_HTML = Regex.Replace(lv_HTML, "<head.*?</head>", "",
                      RegexOptions.Singleline | RegexOptions.IgnoreCase);

            // Remove every SCRIPT element together with its content.
            lv_HTML = Regex.Replace(lv_HTML, "<script.*?</script>", "",
                      RegexOptions.Singleline | RegexOptions.IgnoreCase);

            // Strip all remaining HTML tags; the text between them is kept.
            lv_HTML = Regex.Replace(lv_HTML, "<(.|\n)*?>", "");

            return lv_HTML;
        }
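
To see the three replacements in action, here is a minimal console sketch. The class name HtmlStripDemo, the static RemoveHtml wrapper and the sample page are illustrative assumptions and not part of the original snippet; the regular expressions are the same ones used above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the three Regex.Replace calls come from the snippet.
public static class HtmlStripDemo
{
    public static string RemoveHtml(string html)
    {
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // 1. Drop the HEAD element and its content.
        html = Regex.Replace(html, "<head.*?</head>", "", options);

        // 2. Drop every SCRIPT element and its content.
        html = Regex.Replace(html, "<script.*?</script>", "", options);

        // 3. Strip the remaining tags, keeping the text between them.
        return Regex.Replace(html, "<(.|\n)*?>", "");
    }

    public static void Main()
    {
        // Made-up sample page for illustration.
        string page =
            "<html><HEAD><title>Demo</title></HEAD>" +
            "<body><h1>Hello</h1>" +
            "<script type=\"text/javascript\">alert('hi');</script>" +
            "<p>World &amp; more</p></body></html>";

        Console.WriteLine(RemoveHtml(page)); // prints: HelloWorld &amp; more
    }
}
```

Note that the result keeps HTML entities such as &amp; and does not insert whitespace where tags were removed, so text from adjacent elements runs together. If decoded text is needed, passing the result through System.Net.WebUtility.HtmlDecode is one possible follow-up step.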

Posted by Xander Zelders






DotNet4All.Com concept & © 2004 - 2007 by Zelders² - Holland