Character Encoding at The Conversations Network

Here’s a post that will be of interest only to that small percentage of Blogarithms readers who are struggling with character-encoding issues in the world of PHP/MySQL. But to those of you who fall into that category, this may be helpful. In the process of preparing for the launch of our first series in French, I’ve been working my way through this issue. Although we haven’t tested everything and I expect there will be bugs, here’s what I’ve learned/decided so far.

  • We store everything in the CMS database as utf-8.
  • Nearly all CMS strings are stored in the database as simple utf-8 without HTML entity encoding. HTML is not allowed in most CMS fields.
  • One can check MySQL’s charset at its phpMyAdmin main page: /main.php
  • Our Long Description field (shows.description) is the one CMS field that has special handling:
    • Many HTML elements are allowed in this field.
    • The TinyMCE editor enforces the HTML elements rules, eliminating those that are not allowed.
    • The TinyMCE editor is configured to HTML-encode <, > and & on output. The encoding is invisible unless the user activates HTML mode.
    • All characters are still utf-8 encoded as elsewhere.
  • All HTTP responses include a Content-Type header specifying utf-8 character encoding. (“AddDefaultCharset utf-8” in httpd.conf.)
       Content-Type: text/html; charset=utf-8
  • In generated RSS feeds all strings are HTML entity encoded (<>& only) during feed generation.
  • We also convert typographic characters such as those that may have been copied and pasted from a Microsoft Word document. Current transformations include:
    • slanted single and double quotes
    • various em and long dashes
  • All generated RSS feeds use utf-8 character encoding:
       <?xml version='1.0' encoding='utf-8' ?>
  • All HTML pages use utf-8 character encoding:
       <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  • Immediately after opening a database connection and before any other query is performed, the following query is run to ensure that MySQL performs no character-set transformations:
       SET character_set_results = 'utf8',
           character_set_client = 'utf8',
           character_set_connection = 'utf8',
           character_set_database = 'utf8',
           character_set_server = 'utf8';
  • All form elements are (or at least should be) written as follows. It’s not clear what a browser will do with this, for example if the user pastes text copied from a Microsoft Word document with non-utf-8 encodings into an <input> field.
       <form accept-charset='utf-8'...
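
The RSS handling described in the list above (escaping only <, > and &, plus converting Word-style quotes and dashes) might look roughly like the following. This is a minimal sketch, not the CMS's actual code: the function names are hypothetical, and replacing the typographic characters with plain ASCII is an assumption, since the post does not say what they are converted to. It needs PHP 7 or later for the "\u{...}" escapes.

```php
<?php
// Hypothetical helpers for illustration; not taken from the CMS itself.

/**
 * Replace common "smart" punctuation pasted from Word with plain ASCII.
 * Assumes $text is already valid UTF-8. The ASCII targets are an assumption.
 */
function normalize_word_punctuation(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
    ];
    // strtr() with an array replaces every key with its value in one pass.
    return strtr($text, $map);
}

/**
 * Normalize punctuation, then escape only <, > and & for an RSS text node.
 * ENT_NOQUOTES leaves single and double quotes untouched.
 */
function rss_escape(string $text): string
{
    return htmlspecialchars(normalize_word_punctuation($text), ENT_NOQUOTES, 'UTF-8');
}
```

For example, rss_escape("\u{201C}A & B\u{201D} <b>") returns the string "A &amp; B" &lt;b&gt; with straight double quotes. Accented characters such as é pass through unchanged, since they are valid utf-8 and need no entity encoding.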

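Since it isn’t clear what a browser will submit for a form like the one above, one option is to check each submitted value on the server before storing it. The following is only a sketch under stated assumptions: the helper name ensure_utf8 is hypothetical, it requires PHP’s mbstring extension, and treating invalid input as Windows-1252 is a guess about what pasted Word text is likely to be, not something the CMS is known to do.

```php
<?php
// Hypothetical helper for illustration; the fallback encoding is an assumption.

/**
 * Return $input unchanged if it is valid UTF-8; otherwise reinterpret it
 * as $fallback (Windows-1252 by default) and convert it to UTF-8.
 */
function ensure_utf8(string $input, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}
```

A value that is already valid utf-8 is returned untouched, so running the check on every field is harmless. Note that this cannot truly detect an unknown encoding; it only applies one fallback guess when the bytes are not valid utf-8.
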
2 thoughts on “Character Encoding at The Conversations Network”

  1. What if I have mixed-content input from users, perhaps copied and pasted from MS Word? How can I detect the character encoding in order to convert it to the appropriate form? If I encode all the data in UTF-8, it may lose some data that was encoded in Hebrew/Russian/Greek...
    How do I handle the situation where I have to show mixed content properly, as given by the user?
