Handling data in ColdFusion

Many of the issues involved with globalizing applications deal with processing data from the various sources supported by ColdFusion, including the following:

URL strings
Forms
Files
Databases
E-mail
HTTP
LDAP
WDDX
COM
CORBA

This section describes how to handle data from each of these sources.

Input data from URLs and HTML forms

A web application server receives character data from request URL parameters or as form data.

The HTTP 1.1 standard only allows US-ASCII characters (0-127) for the URL specification and for message headers. This requires a browser to encode the non-ASCII characters in the URL, both address and parameters, by escaping (URL encoding) the characters using the "%xx" hexadecimal format. URL encoding, however, does not determine how the URL is used in a web document. It only specifies how to encode the URL.
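
The following sketch illustrates the "%xx" escaping described above. It is written in Python rather than CFML, purely as a language-neutral illustration, and the helper name escape_param is an assumption made for this example. It shows that the same text yields different escape sequences depending on the character set used to convert it to bytes, and that the escaped string does not record which character set was used.

```python
from urllib.parse import quote, unquote

def escape_param(text, charset):
    """Percent-encode text after converting it to bytes in the given charset."""
    # safe="" escapes every byte that is not an unreserved ASCII character.
    return quote(text, safe="", encoding=charset)

# The same word produces different escapes under different character sets.
latin1_form = escape_param("café", "latin-1")   # "caf%E9"
utf8_form = escape_param("café", "utf-8")       # "caf%C3%A9"

# Decoding only recovers the text if the receiver assumes the same charset.
recovered = unquote(latin1_form, encoding="latin-1")
```

Because both escaped forms represent the same text, the receiving page has to learn the character set some other way.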

Form data uses the message headers to specify the encoding used by the request (Content headers) and the encoding used in the response (Accept headers). Content negotiation between the client and server uses this information.
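
As a rough illustration of how a character set can be read from a Content header, the following Python sketch (not CFML) extracts the charset parameter from a Content-Type value. The function name charset_from_content_type and the fallback value are assumptions chosen for this example. It does not describe how ColdFusion itself processes headers.

```python
from email.message import Message

def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or a default."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns the
    # fallback when the header carries no charset parameter.
    return message.get_content_charset(default)

declared = charset_from_content_type(
    "application/x-www-form-urlencoded; charset=EUC-KR"
)
missing = charset_from_content_type("application/x-www-form-urlencoded")
```

Many browsers omit the charset parameter on form posts, which is one reason the server-side techniques in this section are needed.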

This section contains suggestions on how you can handle both URL and form data entered in different character sets.

Handling URL strings

URL requests to a server often contain name/value pairs as part of the request. For example, the following URL contains name/value pairs as part of the URL:

http://company.com/prod_page.cfm?name=Stephen&ID=7645

As noted in the previous section, URL characters entered using any character set other than US-ASCII are URL encoded in a hexadecimal format. However, by default, a web server assumes that the characters of a URL string are single-byte characters.

One common method used to support different character sets within a URL is to include a name/value pair within the URL that defines the character set of the URL. For example, the following URL uses a parameter called "encoding" to define the character set of the URL parameters:

http://company.com/prod_page.cfm?name=Stephen&ID=7645&encoding=Latin-1

Within the prod_page.cfm page, you can check the value of the encoding parameter before processing any of the other name/value pairs. This lets you decode the remaining parameters with the correct character set.
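
The check-the-encoding-first approach can be sketched as follows. This is a language-neutral Python illustration, not CFML. The function name decode_query and the Latin-1 default are assumptions made for the example. The sketch reads the query string once to find the encoding parameter, then decodes every value with the character set that parameter names.

```python
from urllib.parse import parse_qs

def decode_query(raw_query, default_charset="latin-1"):
    """Decode URL parameters using the charset named by an 'encoding' parameter."""
    # First pass: Latin-1 maps every byte to exactly one character, so the
    # ASCII-only "encoding" value can be read without knowing the real charset.
    first_pass = parse_qs(raw_query, encoding="latin-1")
    charset = first_pass.get("encoding", [default_charset])[0]
    # Second pass: decode all values with the declared charset.
    decoded = parse_qs(raw_query, encoding=charset)
    return {name: values[0] for name, values in decoded.items()}

latin1_params = decode_query("name=caf%E9&ID=7645&encoding=Latin-1")
utf8_params = decode_query("name=caf%C3%A9&ID=7645&encoding=UTF-8")
```

Both calls return the same name value even though the escaped bytes differ. An unrecognized encoding value raises a LookupError in this sketch, so a real page would validate it against a list of charsets it accepts.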

You can also use the setEncoding() function to specify the character set of URL parameters. The setEncoding() function takes two parameters; the first specifies a variable scope and the second specifies the character set used by the scope. Since ColdFusion writes URL parameters to the URL scope, you specify "URL" as the scope parameter to the function.

For example, if the URL parameters were passed using Shift-JIS, you could access them as follows:

<cfscript>
  // Declare the character set of the URL scope before reading any URL variables.
  setEncoding("URL", "Shift_JIS");
  writeOutput(URL.name);
  writeOutput(URL.ID);
</cfscript>

Handling form data

The HTML form tag and the ColdFusion cfform tag allow users to enter text on a page and then submit that text to the server. However, the form tags are designed to work only with single-byte character data. Because ColdFusion uses two bytes per character when it stores strings, ColdFusion converts each byte of the form input into a two-byte representation.

However, if a user enters double-byte text into the form, the form interprets each byte as a single character rather than recognizing that each character is two bytes. This corrupts the input text, as the following example shows:

A customer enters three double-byte characters in a form, represented by 6 bytes.
The form returns the 6 bytes to ColdFusion as six characters. ColdFusion converts them to a representation using 2 bytes per input byte for a total of 12 bytes.
Outputting these characters results in corrupt information displayed to the user.
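
The byte counts in this example can be reproduced with a short Python sketch (a language-neutral illustration, not CFML). The function names misread_as_single_byte and repair are assumptions made for this example, and Latin-1 stands in for the single-byte interpretation.

```python
def misread_as_single_byte(text, charset):
    """Encode text in charset, then wrongly decode each byte as one character."""
    raw = text.encode(charset)
    return raw.decode("latin-1")

def repair(garbled, charset):
    """Undo the misreading by recovering the bytes and decoding them properly."""
    return garbled.encode("latin-1").decode(charset)

original = "한국어"  # three characters, two bytes each in EUC-KR
garbled = misread_as_single_byte(original, "euc-kr")

bytes_submitted = len(original.encode("euc-kr"))     # 6 bytes from the form
characters_seen = len(garbled)                       # 6 characters after misreading
bytes_stored = len(garbled.encode("utf-16-le"))      # 12 bytes at 2 bytes per character
restored = repair(garbled, "euc-kr")                 # the original three characters
```

The repair step works only because Latin-1 preserves every byte value. Declaring the correct character set before the data is read avoids the corruption in the first place.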

To work around this issue, ColdFusion supplies the setEncoding() function that you use when working with forms. You use this function to specify the character set of input form text. The setEncoding() function takes two parameters; the first specifies the variable scope and the second specifies the character set used by the scope. Since ColdFusion writes form parameters to the Form scope, you specify "Form" as the scope parameter to the function. If the input text is double byte, ColdFusion preserves the two-byte representation of the text.

For example, the following code specifies that the form data contains Korean characters:

<cfscript>
  // Declare the character set of the Form scope before reading any form variables.
  setEncoding("FORM", "EUC-KR");
</cfscript>
<h1> Form Test Result </h1>
<strong>Form Values :</strong>

<cfset text = "String = #form.input1# , Length = #len(Trim(form.input1))#">
<cfoutput>#text#</cfoutput>

Reading and writing file data

You use the cffile tag to write to and read from text files. By default, the cffile tag assumes that you are reading single-byte character data. This causes a problem if you read a file that contains double-byte characters, whether you wrote the file using cffile or obtained it from another source.

On a read, the cffile tag converts each byte into a two-byte representation. If the data is a single-byte representation, the read works fine. If the file contains double-byte characters, the read interprets each byte as a single character and corrupts the data.

To enable the cffile tag to correctly read and write double-byte characters, pass it the charset attribute. Set the attribute value to the character encoding of the data to read or write, as the following example shows:

<cffile action="read"
  charset="EUC-KR"
  file="c:\web\message.txt"
  variable="Message">
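
To show why the declared character set matters on a read, the following Python sketch (a language-neutral illustration, not CFML) writes a short EUC-KR file and reads it back twice. The function name write_then_read is an assumption made for this example, and the file is created in a temporary directory rather than at the path used above.

```python
import os
import tempfile

def write_then_read(text, write_charset, read_charset):
    """Write text to a temporary file in one charset, then read it back in another."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "message.txt")
        with open(path, "w", encoding=write_charset) as handle:
            handle.write(text)
        with open(path, "r", encoding=read_charset) as handle:
            return handle.read()

# Reading with the charset the file was written in returns the original text.
matched = write_then_read("한국어", "euc-kr", "euc-kr")
# Reading the same bytes as a single-byte charset yields six wrong characters.
mismatched = write_then_read("한국어", "euc-kr", "latin-1")
```

The mismatched read does not raise an error. It silently returns the wrong characters, which is why this kind of corruption often goes unnoticed until the text is displayed.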

Databases

ColdFusion applications access databases using drivers for each of the supported database types. The conversion of client native language data types to SQL data types is transparent and is done by the driver managers, database client, or server. For example, the character data (SQL CHAR, VARCHAR) you use with the JDBC API is represented as Unicode-encoded strings.

Database administrators configure data sources and usually are required to specify the character encodings for character column data. Many of the major vendors, such as Oracle, Sybase, and Informix, support storing character data in many character encodings, including the Unicode encodings UTF-8 and UTF-16 (UCS-2).

The database drivers supplied with ColdFusion correctly handle data conversions from the database native format to the ColdFusion Unicode format. You should not have to perform any additional processing to access databases. However, you should always check with your database administrator to determine how your database supports different character encodings.

E-mail

ColdFusion supports e-mail through the cfmail and cfmailparam tags. Because ColdFusion uses the JavaMail package, which supports Unicode, you do not have to perform any special processing to handle e-mail.

HTTP

ColdFusion supports HTTP communication using the cfhttp and cfhttpparam tags and the GetHttpRequestData function.

The cfhttp tag supports making HTTP requests using GET and POST. By default, the cfhttp tag uses the Unicode UTF-8 encoding for passing data. You can also use the cfhttpparam tag to specify a MIME type.

LDAP

ColdFusion supports LDAP through the cfldap tag. LDAP uses the UTF-8 encoding format, so you can mix all retrieved data with other data and safely manipulate it. No extra processing is required to support LDAP.

WDDX

ColdFusion supports WDDX through the cfwddx tag. ColdFusion stores WDDX data using UTF-8 encoding, so it automatically supports double-byte character sets. You do not have to perform any special processing to handle double-byte characters with WDDX.

COM

ColdFusion supports COM through the cfobject type="com" tag. All string data used in COM interfaces is constructed using wide characters (wchars), which support double-byte characters. You do not have to perform any special processing for interfacing with COM objects.

CORBA

ColdFusion supports CORBA through the cfobject type="corba" tag. In CORBA 2.0, the interface definition language (IDL) basic type "string" uses the Latin-1 character set, which uses all 8 bits (256 values) to represent characters.

As long as you are using a CORBA version later than 2.0, which includes support for the IDL types wchar and wstring (these map to the Java types char and String, respectively), you do not have to do anything to support double-byte characters.

However, if you are using a version of CORBA that does not support wchar and wstring, the server uses the char and string data types, which assume a single-byte representation of text.
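
The single-byte limit can be illustrated with a short Python sketch. This is not CORBA or CFML code, and the function name fits_in_latin1 is an assumption made for this example. It only checks whether a piece of text could be carried in a Latin-1 string, where each character must fit in one byte.

```python
def fits_in_latin1(text):
    """Return True if every character of text can be stored in one Latin-1 byte."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        # At least one character lies outside the 256 Latin-1 values.
        return False

western_ok = fits_in_latin1("café")     # accented Western European text fits
korean_ok = fits_in_latin1("한국어")     # double-byte text does not fit
```

Text that fails this check needs the wide types (wchar and wstring) to pass through the interface intact.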

Searching and indexing

ColdFusion supports Verity search through the cfindex, cfcollection, and cfsearch tags. To support multilingual searching, the ColdFusion MX product CD-ROM includes the Verity language packs that you install to support different languages.
