Benjamin Tinker Team : Web Development

ASCII Character Replacement for Imported Documents

When importing content from 3rd party documents into HTML, some styles and text formatting can be carried across. One example is single and double quotation marks: most document editing tools produce distinct opening and closing versions of each. Hand-written HTML normally uses the same straight double quotation mark for both opening and closing, and a single non-slanted version of the single quotation mark.

While browsers are fully capable of rendering non-ASCII characters on screen by using UTF-8, you may want consistency across all content by sticking to the HTML defaults. This can be helpful when exporting content from your site to CSV format for use by 3rd party software. Content coming from the site will then contain no special characters that could cause the import to fail or be misread.

The Code

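using System.Text.RegularExpressions; // Regex
using System.Web;                     // HttpUtility

private string FilterSpecialCharacters(string input)
{
    if (!string.IsNullOrEmpty(input))
    {
        // Each row holds a regex pattern and its plain ASCII replacement.
        // "\\u2018" reaches the regex engine as \u2018, which it reads as a Unicode escape.
        string[,] filterArray = new string[,] {
            { "\\u2018", "'" },  { "\\u2019", "'" },  // single quotation marks
            { "\\u201C", "\"" }, { "\\u201D", "\"" }  // double quotation marks
        };

        // Apply every pattern/replacement pair in turn.
        for (int i = 0; i < filterArray.GetLength(0); i++)
        {
            input = Regex.Replace(input, filterArray[i, 0], filterArray[i, 1]);
        }

        // HTML encode whatever remains (for example &, < and >).
        input = HttpUtility.HtmlEncode(input);
    }
    return input;
}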

A two-dimensional array holds each Unicode pattern to scan for alongside its plain text replacement. It can be expanded by adding extra rows, each pairing a pattern with what it should be replaced with. Because the loop uses Regex.Replace, the pattern column can contain any valid regular expression. The code above checks for the non-ASCII single and double quotation marks and replaces them with the straight versions normally used in HTML.

As this code was also being used for exporting to XML, the content is then HTML encoded to handle any other characters. If the content only needs to be displayed within a browser this technique is not necessary, as a simple HttpUtility.HtmlEncode call will do the trick. The idea here is to ensure content is consistently formatted regardless of its original source.
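To illustrate how the array might be expanded, here is a minimal sketch as a standalone console program. The class name QuoteFilterSketch, the method name ReplaceSpecialCharacters, and the extra rows (dashes, ellipsis, non-breaking space) are assumptions made for this example and were not part of the original code. The HTML encoding step is left out to keep the focus on the replacement table.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteFilterSketch
{
    // Hypothetical extended table: each row is a regex pattern and its ASCII replacement.
    private static readonly string[,] FilterArray = new string[,] {
        { "[\\u2018\\u2019]", "'" },   // single quotation marks, matched as one character class
        { "[\\u201C\\u201D]", "\"" },  // double quotation marks
        { "[\\u2013\\u2014]", "-" },   // en dash and em dash
        { "\\u2026", "..." },          // horizontal ellipsis
        { "\\u00A0", " " }             // non-breaking space
    };

    public static string ReplaceSpecialCharacters(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        // Apply every pattern/replacement pair in turn.
        for (int i = 0; i < FilterArray.GetLength(0); i++)
        {
            input = Regex.Replace(input, FilterArray[i, 0], FilterArray[i, 1]);
        }
        return input;
    }

    public static void Main()
    {
        // \u escapes keep this source file ASCII-only.
        string imported = "\u201CIt\u2019s done\u201D \u2013 mostly\u2026";
        Console.WriteLine(ReplaceSpecialCharacters(imported)); // prints: "It's done" - mostly...
    }
}
```

Grouping the opening and closing marks into a character class halves the number of rows, which is one way to take advantage of the pattern column accepting full regular expressions.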

Useful Resources

Regular-Expressions.info

The go-to place for learning the basics and intricacies of regular expressions. Bookmark it today.

Regex Hero

A good testing utility for regular expressions in the .NET Framework. It is free and works well for testing your regular expressions before committing them to code. They'll keep asking you to register, but you can close that prompt and keep using it for free.

Wikipedia - Windows 1252 Character Codes

An article that contains a thorough listing of character codes and their Unicode equivalents.

Stack Overflow Article - How can you strip non-ASCII characters from a string?

A useful article and code sample used as inspiration for this blog post. The article is about removing all non-ASCII characters from content by replacing them with an empty string.
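For comparison, the strip-everything approach can be sketched as follows. This is one common way to do it with Regex.Replace in .NET and is not necessarily the exact code from the linked article. The class name AsciiStripSketch and the method name StripNonAscii are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiStripSketch
{
    // Removes every character outside the 7-bit ASCII range (U+0000 to U+007F).
    public static string StripNonAscii(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }
        return Regex.Replace(input, "[^\\u0000-\\u007F]+", string.Empty);
    }

    public static void Main()
    {
        // Curly quotes and the accented e are all dropped, not replaced.
        string imported = "\u201CQuoted\u201D caf\u00E9";
        Console.WriteLine(StripNonAscii(imported)); // prints: Quoted caf
    }
}
```

Note that stripping loses the quotation marks altogether, which is why the code in this post replaces them with ASCII equivalents instead.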