Character Encoding at The Conversations Network

Here’s a post that will be of interest only to that small percentage of Blogarithms readers who are struggling with character-encoding issues in the world of PHP/MySQL. But to those of you who fall into that category, this may be helpful. In the process of preparing for the launch of our first series in French, I’ve been working my way through this issue. Although we haven’t tested everything and I expect there will be bugs, here’s what I’ve learned/decided so far.

We store everything in the CMS database as utf-8.
Nearly all CMS strings are stored in the database as simple utf-8 without HTML entity encoding. HTML is not allowed in most CMS fields.
One can check MySQL’s charset at its phpMyAdmin main page: /main.php
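To make the storage rule concrete, here is a minimal sketch in Python (used purely for illustration, since the site itself runs PHP; the sample title is made up). It shows an accented character held as raw utf-8 bytes rather than as an entity such as &eacute;.

```python
# Illustrative only: the sample title is hypothetical, not taken from the CMS.
title = "Été à Paris"

# What a utf-8 database column would hold: raw bytes, no HTML entities.
stored = title.encode("utf-8")

# "É" (U+00C9) becomes the two bytes C3 89 in utf-8.
first_char_bytes = "É".encode("utf-8")

# Decoding the stored bytes as utf-8 restores the original string exactly.
round_tripped = stored.decode("utf-8")
```

Because nothing is entity-encoded in storage, the same bytes can be reused for HTML, RSS or ID3 output, with each output applying its own escaping.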
Our Long Description field (shows.description) is the one CMS field that has special handling:
- Many HTML elements are allowed in this field.
- The TinyMCE editor enforces the HTML element rules, stripping any elements that are not allowed.
- The TinyMCE editor is configured to HTML-encode <, > and & on output. The encoding is invisible unless the user switches to HTML mode.
- All characters are still utf-8 encoded as elsewhere.
All HTTP responses include a Content-Type header specifying utf-8 character encoding. (“AddDefaultCharset utf-8” in httpd.conf.)

       Content-Type: text/html; charset=utf-8
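One way to check that a response really declares utf-8 is to parse the header value. The sketch below uses Python's standard email.message module; the helper name charset_of is hypothetical and not part of any code described here.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


declared = charset_of("text/html; charset=utf-8")
missing = charset_of("text/html")
```

A missing charset is the case AddDefaultCharset is meant to prevent, so a None result here would suggest the directive is not being applied.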

In generated RSS feeds, all strings are HTML-entity encoded (<, > and & only) during feed generation.
We also convert various unusual characters, such as those that may have been copied and pasted from a Microsoft Word document. The current transformations cover:
- slanted single and double quotes
- various em and long dashes
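The two feed steps above can be sketched as follows, in Python for illustration. The post does not say what the Word characters are converted to, so the mapping to plain ASCII quotes and hyphens is an assumption, and clean_for_feed is a hypothetical name.

```python
from xml.sax.saxutils import escape

# Assumed targets: the original does not specify the replacement characters.
WORD_CHARS = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2015": "-",   # horizontal bar ("long dash")
})


def clean_for_feed(text: str) -> str:
    """Replace Word-style punctuation, then escape only &, < and >."""
    return escape(text.translate(WORD_CHARS))


sample = clean_for_feed("\u201cTom & Jerry\u201d \u2014 <live>")
```

Accented letters such as é pass through untouched, because only the three markup characters are escaped and the feed itself is utf-8.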
All generated RSS feeds use utf-8 character encoding:

       <?xml version='1.0' encoding='utf-8' ?>
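For comparison, here is a small Python sketch that produces a feed with the same declaration using the standard xml.etree.ElementTree module (Python 3.8 or later). It is not the site's feed generator; build_feed and the channel title are made up for the example.

```python
import xml.etree.ElementTree as ET


def build_feed(title: str) -> bytes:
    """Serialize a minimal RSS document as utf-8 bytes with an XML declaration."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    # encoding="utf-8" writes non-ASCII characters as raw utf-8 bytes.
    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)


feed = build_feed("Café & Conversations")
parsed_title = ET.fromstring(feed).find("channel/title").text
```

The serializer escapes the ampersand on its own, so text passed to it should not be entity-encoded beforehand.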

All HTML pages use utf-8 character encoding:

       <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Immediately after opening a database connection and before any other query is performed, the following query is run to ensure that MySQL performs no character-set transformations:

       SET character_set_results = 'utf8',
           character_set_client = 'utf8',
           character_set_connection = 'utf8',
           character_set_database = 'utf8',
           character_set_server = 'utf8';
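To see why an unwanted transformation matters, the Python sketch below simulates a connection that treats stored utf-8 bytes as latin1 and then re-encodes them. Python's latin-1 codec stands in for a misconfigured connection charset here; this is a simulation, not a trace of MySQL's actual behavior.

```python
original = "café"
utf8_bytes = original.encode("utf-8")

# A layer that wrongly assumes the bytes are latin1 sees two characters for "é".
misread = utf8_bytes.decode("latin-1")

# Converting that misreading back to utf-8 yields "double-encoded" data.
double_encoded = misread.encode("utf-8")

# The damage is reversible only while every step was lossless.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```

Here misread is "cafÃ©", the familiar symptom of a charset mismatch somewhere between the application and the database.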

All form elements are (or at least should be) written as follows. It’s not clear how a browser handles every case, for example when the user pastes text copied from a Microsoft Word document, which may use a non-utf-8 encoding, into an <input> field.

       <form accept-charset='utf-8'...
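Since browser behavior is uncertain, one defensive option is to validate submitted bytes on the server. The Python sketch below is an assumption about how that could look, not something the site is known to do: it accepts valid utf-8 and otherwise guesses Windows-1252, the encoding Word-era text most often arrives in. The name decode_submitted is hypothetical.

```python
def decode_submitted(raw: bytes) -> str:
    """Decode form input as utf-8, falling back to Windows-1252 (an assumed guess)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes undefined in cp1252 become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


from_utf8 = decode_submitted("“Bonjour”".encode("utf-8"))
from_cp1252 = decode_submitted(b"\x93Bonjour\x94")
```

The fallback is only a guess: bytes in other legacy encodings, such as Cyrillic or Greek code pages, would decode without error but produce the wrong characters.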

All strings in ID3 frames use ISO/IEC 8859-1 character encoding and hence are limited to that character set. (See http://en.wikipedia.org/wiki/ISO/IEC_8859-1.)
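The practical effect of that limit can be sketched in Python. The helper names are hypothetical, and replacing unsupported characters with "?" is an assumption; the post does not say how such characters are handled.

```python
def fits_latin1(text: str) -> bool:
    """Report whether every character exists in ISO/IEC 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def to_id3_latin1(text: str) -> bytes:
    """Encode for an ID3 frame, substituting "?" for unsupported characters."""
    return text.encode("iso-8859-1", errors="replace")


accented_ok = fits_latin1("Café")
ligature_ok = fits_latin1("Œuvre")
encoded = to_id3_latin1("Œuvre")
```

This matters for French: accented letters such as é and à are covered, but the œ ligature is not part of ISO/IEC 8859-1.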

2 thoughts on “Character Encoding at The Conversations Network”

What if I have mixed-content input from users, perhaps copied and pasted from MS Word? How can I detect the character encoding so I can convert it to the appropriate form? If I encode all the data in utf-8, I may lose some data that was encoded in Hebrew, Russian or Greek.
How do I handle the situation where I have to show mixed content properly, as given by the user?


The same problem is asked about here as well:
http://stackoverflow.com/questions/2669444/how-to-convert-non-latin-based-encoded-text-into-utf-8-or-make-them-coexist-on-s

No fixed solution, I guess.



Blogarithms

Doug Kaye's Blog
