Using the Easysoft Oracle ODBC driver with Unicode data
Contents
- What is Unicode?
- Types of Unicode
- Unicode and ODBC
- Unicode and UNIX
- Unicode and Oracle
- Oracle, ODBC, and Unicode
- Summary
What is Unicode?
A good definition can be found at http://www.unicode.org:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
Types of Unicode
The previous definition appears to be saying that Unicode is a single way of encoding different character sets. And that is the hope of the Unicode Consortium that produced the above definition. However, the current reality is not as perfect as we would hope.
The Unicode Consortium has assigned a unique number to every character in use worldwide. So, for example, U+0041 is the code for Latin capital letter A, and U+1033A is the code for Gothic letter KUSMA.
This all looks good, and work is underway to ensure that all characters are defined in such a way. What is not obvious from this, however, is the method used to encode the Unicode character. This is where many of the practical problems with the use of Unicode arise.
The two main methods of encoding Unicode characters are Unicode Transformation Format (UTF) and Universal Character Set (UCS). Examples of these encoding forms include:
- UTF-7, a relatively unpopular 7-bit encoding, often considered obsolete.
- UTF-8, an 8-bit, variable-width encoding, which maximises compatibility with ASCII.
- UTF-EBCDIC, an 8-bit variable-width encoding, which maximises compatibility with Extended Binary-Coded Decimal Interchange Code (EBCDIC).
- UCS-2, a 16-bit, fixed-width encoding that only supports the Basic Multilingual Plane (BMP).
- UTF-16, a 16-bit, variable-width encoding.
- UCS-4 and UTF-32, functionally identical 32-bit fixed-width encodings.
So now we can see that from the original ideal of one standard number (Unicode code point) for every character, we have a multitude of encoding types.
Unicode encoding forms use either a fixed or a variable number of bytes to represent each character. For example, UCS-2 is a fixed-width encoding where each character is 2 bytes in size. UTF-8 is a variable-width encoding where one character can be 1, 2, 3, or 4 bytes.
A consequence of the variable-width nature of some UTF encoding forms is the potential for illegal byte sequences, for example, a sequence that is cut off part-way through a multi-byte character.
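To make this concrete, the lead byte of a UTF-8 sequence tells you how many bytes the whole sequence should occupy, so an illegal sequence can be detected when the following bytes do not match. The sketch below is illustrative only; the helper name utf8_seq_len is ours and not part of any library.

#include <stdio.h>

/* Return the number of bytes in the UTF-8 sequence that starts with the
   given lead byte, or 0 if the byte cannot start a sequence (continuation
   bytes 0x80-0xBF and the illegal values 0xC0, 0xC1, and 0xF5-0xFF). */
int utf8_seq_len( unsigned char lead )
{
    if ( lead < 0x80 ) return 1;                    /* plain ASCII */
    if ( lead >= 0xC2 && lead <= 0xDF ) return 2;
    if ( lead >= 0xE0 && lead <= 0xEF ) return 3;
    if ( lead >= 0xF0 && lead <= 0xF4 ) return 4;
    return 0;                                       /* illegal lead byte */
}

int main( void )
{
    printf( "%d %d %d\n",
            utf8_seq_len( 0x41 ),    /* 1: 'A' */
            utf8_seq_len( 0xC5 ),    /* 2: lead byte of a two-byte sequence */
            utf8_seq_len( 0x80 ) );  /* 0: a continuation byte cannot start a sequence */
    return 0;
}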
Unicode and ODBC
Since the 3.51 release of the Microsoft Windows ODBC driver manager, ODBC has included support for Unicode. This support takes the form of Unicode data types (for example, SQL_WCHAR), and Unicode versions of the ODBC API that take and return Unicode data (for example, SQLPrepareW). For ODBC, Unicode means UCS-2: a fixed-width, single 16-bit character representation.
Because UTF-8 encoded data can be treated as a sequence of bytes, many of the ODBC functions that take single-byte characters can work with UTF-8 data. However, it must be remembered that ODBC does not know that the data is UTF-8. To ODBC it is just a sequence of single-byte characters, and ODBC knows nothing about the relationship between multi-byte sequences and the characters they represent.
This can lead to problems. For example, if a column in a database is defined as CHAR(20), the user would expect that to provide storage for twenty characters. However, if the database does not understand UTF-8, only twenty bytes may be stored; in the worst case of a sequence of four-byte characters, only five characters can actually be stored. Now suppose the database is aware of UTF-8, so that a column defined as CHAR(20) can actually contain twenty characters of up to four bytes each, and imagine an application using ODBC to return data from the table. The application may query the database to find the definition of the column, and will discover that its length is reported as twenty. So, it allocates space for twenty bytes and calls the database to return the data. Only the first five four-byte sequences can be returned, as there is only space for that number of bytes.
Also, because ODBC is not aware of the multi-byte sequences, it cannot maintain the boundaries between valid byte sequences. It may be that the twenty-byte buffer provided has space for the first nine characters because they are encoded as two-byte sequences, but the tenth character requires a three-byte sequence. The ODBC specification gives no guidance as to what should happen here. The driver cannot return a short buffer and return the next character later, as this would lead the application to believe that all the data had been returned. If it returns the first two bytes of the three-byte encoding, the last character in the buffer will be an illegal UTF-8 sequence.
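To illustrate the boundary problem, the following sketch (our own illustrative code, not part of ODBC or the driver) works out the largest number of bytes that can be copied into a fixed-size buffer without splitting a UTF-8 sequence, using the seven two-byte characters that appear in the examples later in this document.

#include <stdio.h>

/* Bytes needed for the UTF-8 sequence starting with this lead byte,
   or 0 if it is not a valid lead byte. */
static int seq_len( unsigned char lead )
{
    if ( lead < 0x80 ) return 1;
    if ( lead >= 0xC2 && lead <= 0xDF ) return 2;
    if ( lead >= 0xE0 && lead <= 0xEF ) return 3;
    if ( lead >= 0xF0 && lead <= 0xF4 ) return 4;
    return 0;
}

/* Return the largest number of bytes (<= limit) that can be copied from
   src without cutting a multi-byte sequence in half. */
static int safe_prefix( const unsigned char *src, int srclen, int limit )
{
    int i = 0;

    while ( i < srclen )
    {
        int n = seq_len( src[ i ] );

        if ( n == 0 || i + n > limit || i + n > srclen )
            break;
        i += n;
    }
    return i;
}

int main( void )
{
    /* '>' and '<' around seven two-byte UTF-8 sequences (16 bytes in all) */
    unsigned char utf8[] = { '>', 0xC5, 0xB0, 0xC5, 0x84, 0xC4, 0xA9, 0xC4, 0x87,
                             0xC5, 0x8D, 0xC4, 0x91, 0xC4, 0x95, '<' };

    /* a 10-byte target buffer can only take 9 of the 16 bytes without
       splitting the fifth two-byte sequence */
    printf( "safe prefix: %d bytes\n",
            safe_prefix( utf8, (int) sizeof( utf8 ), 10 ) );
    return 0;
}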
The best way to use Unicode with ODBC is to stick to the original design: use 8-bit ASCII encoding with the normal API calls, and UCS-2 with the W calls.
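As a sketch of that approach, the W entry points take SQLWCHAR strings. The example below is illustrative only: it assumes SQLWCHAR is a 16-bit type (as it normally is with unixODBC), an already-allocated statement handle, and the UCTEST test table created later in this document.

#include <sql.h>
#include <sqlucode.h>

/* Prepare a statement through the Unicode (W) API. The statement text is
   built as UCS-2 in an SQLWCHAR array; error handling is omitted. */
SQLRETURN prepare_unicode( SQLHSTMT hstmt )
{
    SQLWCHAR stmt[] = { 's', 'e', 'l', 'e', 'c', 't', ' ',
                        'C', '2', ' ',
                        'f', 'r', 'o', 'm', ' ',
                        'U', 'C', 'T', 'E', 'S', 'T', 0 };

    return SQLPrepareW( hstmt, stmt, SQL_NTS );
}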
Unicode and UNIX
Let's look at a small sample program:
#include <stdio.h>

int main( void )
{
    /* the ASCII bytes for ">Unicode<" followed by a terminating NUL */
    char seq[] = { '>', 0x55, 0x6E, 0x69, 0x63, 0x6F, 0x64, 0x65, '<', 0x00 };
    int i;

    i = puts( seq );
    printf( "put %d characters\n", i );

    return 0;
}
When compiled and run, this program takes the byte sequence, and displays it on the terminal. Then it displays the number of characters displayed (including one for the newline).
So, running this program produces (as expected):
>Unicode<
put 10 characters
Now we will create a sequence of characters that use values beyond 127, the point where ASCII and UTF-8 differ. The sequence is 0x0170, 0x0144, 0x0129, 0x0107, 0x014D, 0x0111, 0x0115. Converting that sequence to UTF-8 gives a sequence of bytes that we can insert into the test program.
#include <stdio.h>

int main( void )
{
    /* the UTF-8 encoding of the seven extended characters, wrapped in > and < */
    char seq[] = { '>', 0xC5, 0xB0, 0xC5, 0x84, 0xC4, 0xA9, 0xC4, 0x87,
                   0xC5, 0x8D, 0xC4, 0x91, 0xC4, 0x95, '<', 0x00 };
    int i;

    i = puts( seq );
    printf( "put %d characters\n", i );

    return 0;
}
This demonstrates how a character code such as 0x0170 becomes the two bytes 0xC5 and 0xB0. This is an example of a two-byte UTF-8 sequence.
Now let's run the program:
>       <
put 17 characters
What we see now is that the characters have not been displayed, and the call to puts returns the number of bytes that have been displayed. But looking at the distance (in character positions) between the > and < characters, we see that there is space for 7 characters. So at least part of the system has understood that the fourteen bytes represent seven characters. But something else in the system has not been able to render those seven characters as visible display elements. The name sometimes given to the visual representation of a character is a glyph. This term originates from printing, and is defined as "the shape given in a particular typeface to a specific grapheme or symbol."
So now we have seen that a particular character can take several forms, all being the same character, but appearing to be very different. We start with the character code 0x0170. This can be represented in a given encoding as a sequence of bytes, in the case of UTF-8 as 0xC5, 0xB0. And that encoding may be displayed (or not, in this case) as a particular visual symbol, or glyph.
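The mapping from code point to byte sequence is mechanical: the code point's bits are distributed over a lead byte and continuation bytes. The following sketch (our own helper, not part of any library, handling code points up to U+10FFFF) encodes a code point as UTF-8; for 0x0170 it prints C5 B0.

#include <stdio.h>

/* Encode a Unicode code point as UTF-8; returns the number of bytes
   written to buf (buf must have room for at least 4 bytes). */
int to_utf8( unsigned long cp, unsigned char *buf )
{
    if ( cp < 0x80 )
    {
        buf[ 0 ] = (unsigned char) cp;
        return 1;
    }
    else if ( cp < 0x800 )
    {
        buf[ 0 ] = 0xC0 | ( cp >> 6 );
        buf[ 1 ] = 0x80 | ( cp & 0x3F );
        return 2;
    }
    else if ( cp < 0x10000 )
    {
        buf[ 0 ] = 0xE0 | ( cp >> 12 );
        buf[ 1 ] = 0x80 | (( cp >> 6 ) & 0x3F );
        buf[ 2 ] = 0x80 | ( cp & 0x3F );
        return 3;
    }
    buf[ 0 ] = 0xF0 | ( cp >> 18 );
    buf[ 1 ] = 0x80 | (( cp >> 12 ) & 0x3F );
    buf[ 2 ] = 0x80 | (( cp >> 6 ) & 0x3F );
    buf[ 3 ] = 0x80 | ( cp & 0x3F );
    return 4;
}

int main( void )
{
    unsigned char buf[ 4 ];
    int n = to_utf8( 0x0170, buf );
    int i;

    for ( i = 0; i < n; i ++ )
    {
        printf( "%02X ", buf[ i ] );    /* prints: C5 B0 */
    }
    printf( "\n" );
    return 0;
}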
Just to show how different systems can display the same sequence of characters in different ways, we ran the above program on another Linux machine. This time we used version 2.3 of glibc instead of the version 2.2 used previously. The same program now shows the Unicode characters.
>Űńĩćōđě<
put 17 characters
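Whether the glyphs appear depends on what the C library and terminal believe the current character encoding to be. A quick way to check, shown in the sketch below, is to adopt the environment's locale and print its codeset; on a machine that displays the characters correctly this would typically report UTF-8.

#include <stdio.h>
#include <locale.h>
#include <langinfo.h>

/* Report the character encoding the C library believes the current
   locale uses; the terminal must also use (and have glyphs for) a
   Unicode encoding for the characters to display. */
int main( void )
{
    /* adopt the locale named by the environment (LANG, LC_ALL, ...) */
    setlocale( LC_ALL, "" );

    printf( "codeset: %s\n", nl_langinfo( CODESET ) );

    return 0;
}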
Unicode and Oracle
When dealing with Unicode, there are a number of layers, each of which plays a part in the support for Unicode in the Oracle database.
Since version 8 of Oracle, a number of Unicode encodings can be associated with a database. The terminology used for these is based on terms introduced earlier in this document.
Some of these encodings include:
- UTF8 (AL24UTF8 in Oracle's terminology), an encoding that can be applied to all languages, and can store UTF-8 sequences of up to 24 bits (three-byte characters). Because its sequences are limited to three bytes, it cannot encode the entire Unicode space.
- AL32UTF8, available from version 9i of Oracle. Being a 32-bit version, it can consist of up to four bytes of UTF-8 data per character and so can contain the entire Unicode set.
- AL16UTF16, a 16-bit UTF format that can also contain the entire Unicode sequence set.
An Oracle database can have two of these Unicode encoding forms associated with it:
- The form used for the CHAR type (CHAR(), VARCHAR(), VARCHAR2()).
- The form used for the NATIONAL CHAR type (NCHAR(), NVARCHAR(), NVARCHAR2()).
The CHAR type may be set to use UTF8 or AL32UTF8, but not AL16UTF16. The NCHAR type may be set to use UTF8 or AL16UTF16, but not AL32UTF8.
Non-Unicode character sets
Oracle also has a large number of non-Unicode character sets. Although at first sight these may appear to provide Unicode-like encoding, they do not. These character sets only provide support for a limited set of characters that fall within a particular group. They are not limited to 255 characters, and may use 16-bit encoding, but it must be remembered that they are not strictly Unicode sets.
Character set naming conventions
The naming convention used by Oracle for character sets is:
<region> <number of bits used to represent a character> <standard character set name>[S|C]
The region describes the geographical area the character set covers. For example, US7ASCII is US, WE8ISO8859P1 is Western European, JA16SJIS is Japanese, and AL16UTF16 covers All Languages (and so it is true Unicode).
The number of bits may be either the fixed bit width in the case of fixed length sets or the maximum sequence length in the case of variable length encoding. For example, WE8ISO8859P1 is 8-bit, US7ASCII is 7-bit, JA16SJIS is 16-bit, and AL32UTF8 is a maximum of 32 bits per character.
The standard character set name does not define which "standard" the name comes from, but it is intended to be descriptive. For example, WE8ISO8859P1 is ISO 8859 Part 1, AL16UTF16 is UTF-16, and JA16SJIS is the Japanese 16-bit Shifted Japanese Industrial Standard (SJIS).
The optional S or C specifies whether the character set may only be used on the client or the server. We will consider client character sets later in this document.
What character sets does my database use?
One way to find the character sets in use in a particular database is to query the schema with the following:
select * from NLS_DATABASE_PARAMETERS where PARAMETER in ( 'NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET' )
This produces a result something like this:
+-------------------------------+---------------------------------------+
| PARAMETER                     | VALUE                                 |
+-------------------------------+---------------------------------------+
| NLS_CHARACTERSET              | WE8ISO8859P1                          |
| NLS_NCHAR_CHARACTERSET        | AL16UTF16                             |
+-------------------------------+---------------------------------------+
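The same information can be retrieved programmatically through ODBC. The sketch below is illustrative only; it assumes the same DSN and credentials (ORACLE, system, master) used by the samples later in this document and omits error handling.

#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

/*
 * report the database and national character sets through ODBC
 */
int main( void )
{
    SQLHANDLE henv, hdbc, hstmt;
    SQLCHAR param[ 40 ], value[ 40 ];
    SQLLEN len1, len2;

    SQLAllocHandle( SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv );
    SQLSetEnvAttr( henv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) 3, 0 );
    SQLAllocHandle( SQL_HANDLE_DBC, henv, &hdbc );
    SQLConnect( hdbc, (SQLCHAR *) "ORACLE", SQL_NTS,
                (SQLCHAR *) "system", SQL_NTS,
                (SQLCHAR *) "master", SQL_NTS );
    SQLAllocHandle( SQL_HANDLE_STMT, hdbc, &hstmt );

    SQLExecDirect( hstmt,
        (SQLCHAR *) "select PARAMETER, VALUE from NLS_DATABASE_PARAMETERS "
                    "where PARAMETER in "
                    "( 'NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET' )",
        SQL_NTS );

    while ( SQLFetch( hstmt ) != SQL_NO_DATA )
    {
        SQLGetData( hstmt, 1, SQL_C_CHAR, param, sizeof( param ), &len1 );
        SQLGetData( hstmt, 2, SQL_C_CHAR, value, sizeof( value ), &len2 );
        printf( "%s = %s\n", param, value );
    }

    SQLFreeHandle( SQL_HANDLE_STMT, hstmt );
    SQLDisconnect( hdbc );
    SQLFreeHandle( SQL_HANDLE_DBC, hdbc );
    SQLFreeHandle( SQL_HANDLE_ENV, henv );

    return 0;
}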
Oracle client character sets
In addition to the database character sets, the Oracle client also has a number of National Language Support (NLS) settings, including character set. There are three ways that the client NLS settings can be specified. These are:
- As environment variables on the client, for example:
set NLS_SORT=FRENCH
export NLS_SORT
- With the ALTER SESSION statement, which overrides any local environment setting. For example:
ALTER SESSION SET NLS_SORT = FRENCH;
- In SQL functions. This only has an effect for the duration of the query. For example:
TO_CHAR(hiredate, 'DD/MON/YYYY', 'nls_date_language = FRENCH')
In the context of Unicode, the important variable is NLS_LANG. The format of this variable is:
NLS_LANG = language_territory.charset
The components have the following meaning:
- Language: Specifies conventions such as the language used for Oracle messages, sorting, day names, and month names. Each supported language has a unique name; for example, AMERICAN, FRENCH, or GERMAN. The language argument specifies default values for the territory and character set arguments. If the language is not specified, the value defaults to AMERICAN.
- Territory: Specifies conventions such as the default date, monetary, and numeric formats. Each supported territory has a unique name; for example, AMERICA, FRANCE, or CANADA. If the territory is not specified, the value is derived from the language value.
- Charset: Specifies the character set used by the client application. (Normally the Oracle character set that corresponds to the user's terminal character set or the operating system character set.) Each supported character set has a unique acronym, for example, US7ASCII, WE8ISO8859P1, WE8DEC, WE8MSWIN1252, or JA16EUC. Each language has a default character set associated with it.
Examples of NLS_LANG settings are:
NLS_LANG = AMERICAN_AMERICA.WE8MSWIN1252
NLS_LANG = FRENCH_CANADA.WE8ISO8859P1
NLS_LANG = JAPANESE_JAPAN.JA16EUC
All parts of the definition are optional; any item that is not specified uses its default value. If you specify territory or character set, you must include the preceding delimiter (underscore _ for territory, period . for character set). Otherwise, the value is parsed as a language name.
For example, to set only the territory portion of NLS_LANG, use the following format: NLS_LANG=_JAPAN.
Within the scope of this document, the relevant part of the above definition is the charset.
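If it is inconvenient to export NLS_LANG in the shell, it can also be placed in the process environment before the first connection is made, as in the sketch below. This is illustrative only: the Oracle client reads NLS_LANG from the environment, and whether a value set this way is picked up depends on when the client libraries are initialised, so exporting the variable before starting the application remains the most reliable approach.

#include <stdlib.h>
#include <stdio.h>

int main( void )
{
    /* set NLS_LANG before any ODBC handles are allocated or any
       connection to Oracle is made */
    if ( setenv( "NLS_LANG", "AMERICAN_AMERICA.UTF8", 1 ) != 0 )
    {
        perror( "setenv" );
        return 1;
    }

    printf( "NLS_LANG=%s\n", getenv( "NLS_LANG" ) );

    /* ... allocate ODBC handles and call SQLConnect here ... */

    return 0;
}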
Character set conversions
It's possible to have different character sets in use at the same time in the server and client parts of a session. This means that conversions may need to take place between the client and server (and vice versa) as data is transferred. Because an encoding that is valid in one character set may be invalid in another, data loss can occur in the conversion process. For example, if the client character set were JA16EUC but the database character set were US7ASCII, many of the valid characters in the client character set would have no meaningful representation in the database character set, so attempting to insert such characters would cause data corruption. By default, Oracle will not warn about the loss of data during a conversion. However, by setting the NLS_NCHAR_CONV_EXCP parameter to TRUE, either in the initialisation parameters for the database or through ALTER SESSION, the database will report an error if data loss occurs.
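For example, a session could turn that check on through ODBC with something like the following sketch (illustrative only; error handling is omitted, and hstmt is assumed to be a statement handle on an open connection).

#include <sql.h>
#include <sqlext.h>

/* Ask Oracle to raise an error instead of silently substituting
   replacement characters when data is lost in conversion. */
SQLRETURN enable_conversion_errors( SQLHSTMT hstmt )
{
    return SQLExecDirect( hstmt,
        (SQLCHAR *) "ALTER SESSION SET NLS_NCHAR_CONV_EXCP = TRUE",
        SQL_NTS );
}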
When converting from character set A to character set B, characters that are not available in character set B are converted to replacement characters. Replacement characters are often specified as ? or ¿, or as a character that is related to the unavailable character. For example, ä (a with an umlaut) can be replaced by a. Replacement characters are defined by the target character set.
Oracle, ODBC, and Unicode
Now that we have all the information needed, we can put it together and look at how Oracle, ODBC, and Unicode work in practice. We have seen that any attempt to display data that may contain Unicode encoding is likely to be confused by the local operating system attempting to convert the character encoding to a local glyph. This means that convenient text-based utilities such as the isql tool provided with unixODBC cannot reliably be used to investigate Unicode issues. Because of this, we will use sample C code to show how it all works.
To show the operation of Unicode, we will use a database that uses UTF8 (AL32UTF8) as its CHAR character set, and AL16UTF16 as its NCHAR character set.
We create a test table as follows:
CREATE TABLE UCTEST
(
    I INTEGER,
    C1 CHARACTER VARYING( 100 ),
    C2 NATIONAL CHARACTER VARYING( 100 )
)
So we have both a CHAR and an NCHAR field to insert into.
First, let's set the client character set to be UTF8 as we know that this is what will be needed to insert Unicode.
NLS_LANG=AMERICAN_AMERICA.UTF8
export NLS_LANG
Then we will use sample code to insert the 7-bit characters we used before into the table, using both the 8-bit SQL_C_CHAR and the 16-bit SQL_C_WCHAR types:
#include <stdio.h>
#include <sql.h>
#include <sqlucode.h>

/*
 * insert a number of character values into a database
 */
int main( void )
{
    SQLHANDLE henv, hdbc, hstmt;
    SQLRETURN ret;
    /* ">Unicode<" as 8-bit characters and as UCS-2 characters */
    SQLCHAR cval[] = { '>', 0x55, 0x6E, 0x69, 0x63, 0x6F, 0x64, 0x65, '<' };
    SQLWCHAR uval[] = { '>', 0x0055, 0x006E, 0x0069, 0x0063, 0x006F, 0x0064, 0x0065, '<' };
    SQLLEN len1, len2, len3;
    SQLINTEGER ival;

    ret = SQLAllocHandle( SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv );
    ret = SQLSetEnvAttr( henv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) 3, 0 );
    ret = SQLAllocHandle( SQL_HANDLE_DBC, henv, &hdbc );
    ret = SQLConnect( hdbc, "ORACLE", SQL_NTS, "system", SQL_NTS, "master", SQL_NTS );
    ret = SQLAllocHandle( SQL_HANDLE_STMT, hdbc, &hstmt );

    ret = SQLPrepare( hstmt, "insert into UCTEST values ( ?, ?, ? )", SQL_NTS );

    ret = SQLBindParameter( hstmt, 1, SQL_PARAM_INPUT, SQL_C_LONG, SQL_INTEGER,
                            0, 0, &ival, 0, &len1 );
    ret = SQLBindParameter( hstmt, 2, SQL_PARAM_INPUT, SQL_C_CHAR, SQL_CHAR,
                            100, 0, &cval, sizeof( cval ), &len2 );
    ret = SQLBindParameter( hstmt, 3, SQL_PARAM_INPUT, SQL_C_WCHAR, SQL_WCHAR,
                            100, 0, &uval, sizeof( uval ), &len3 );

    ival = 1;
    len1 = 4;
    len2 = sizeof( cval );
    len3 = sizeof( uval );

    ret = SQLExecute( hstmt );

    ret = SQLFreeHandle( SQL_HANDLE_STMT, hstmt );
    ret = SQLDisconnect( hdbc );
    ret = SQLFreeHandle( SQL_HANDLE_DBC, hdbc );
    ret = SQLFreeHandle( SQL_HANDLE_ENV, henv );

    return 0;
}
The code inserts low (<127) value characters, both as 8- and 16-bit encodings.
After we have run the test code, we can use isql to find out what has been inserted. (isql is safe to use at this point as we are only using 7-bit codes. The CASTs are just used to reduce the width of the display.)
SELECT CAST( I AS VARCHAR( 5 )) AS I,
       CAST( C1 AS VARCHAR( 20 )) AS C1,
       CAST( C2 AS VARCHAR( 20 )) AS C2
FROM UCTEST

+------+---------------------+---------------------+
| I    | C1                  | C2                  |
+------+---------------------+---------------------+
| 1    | >Unicode<           | >Unicode<           |
+------+---------------------+---------------------+
Now let's insert the extended characters we used before, both as UTF-8 and UCS-2.
The code to do this is the same as before, but the definition of the two arrays of character encodings now contains codes above 127.
SQLCHAR cval[] = { '>', 0xC5, 0xB0, 0xC5, 0x84, 0xC4, 0xA9, 0xC4, 0x87,
                   0xC5, 0x8D, 0xC4, 0x91, 0xC4, 0x95, '<' };
SQLWCHAR uval[] = { '>', 0x0170, 0x0144, 0x0129, 0x0107, 0x014D, 0x0111, 0x0115, '<' };
After we have run this code, when we try to look at the table by using isql, we may get strange characters displayed. This is because we are expecting the operating system to map extended characters to glyphs, and it may not be able to do this.
So we now write another test sample to read the data from the table:
#include <stdio.h>
#include <sql.h>
#include <sqlucode.h>

/*
 * read a number of character values from a database
 */
int main( void )
{
    SQLHANDLE henv, hdbc, hstmt;
    SQLRETURN ret;
    SQLCHAR cval[ 200 ];
    SQLWCHAR uval[ 200 ];
    SQLINTEGER ival;
    SQLLEN len;
    int i;

    ret = SQLAllocHandle( SQL_HANDLE_ENV, SQL_NULL_HANDLE, &henv );
    ret = SQLSetEnvAttr( henv, SQL_ATTR_ODBC_VERSION, (SQLPOINTER) 3, 0 );
    ret = SQLAllocHandle( SQL_HANDLE_DBC, henv, &hdbc );
    ret = SQLConnect( hdbc, "ORACLE", SQL_NTS, "system", SQL_NTS, "master", SQL_NTS );
    ret = SQLAllocHandle( SQL_HANDLE_STMT, hdbc, &hstmt );

    /* select each character column twice so that it can be read
       both as SQL_C_CHAR and as SQL_C_WCHAR */
    ret = SQLExecDirect( hstmt, "select I, C1, C1, C2, C2 from UCTEST", SQL_NTS );

    while(( ret = SQLFetch( hstmt )) != SQL_NO_DATA )
    {
        printf( "reading row: " );

        ret = SQLGetData( hstmt, 1, SQL_C_LONG, &ival, 4, &len );
        printf( "%d, ", ival );

        ret = SQLGetData( hstmt, 2, SQL_C_CHAR, &cval, 200, &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len; i ++ )
        {
            printf( "%02X ", cval[ i ] );
        }
        printf( "}, " );

        ret = SQLGetData( hstmt, 3, SQL_C_WCHAR, &uval, 200, &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len / sizeof( SQLWCHAR ); i ++ )
        {
            printf( "%04X ", uval[ i ] );
        }
        printf( "}, " );

        ret = SQLGetData( hstmt, 4, SQL_C_CHAR, &cval, 200, &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len; i ++ )
        {
            printf( "%02X ", cval[ i ] );
        }
        printf( "}, " );

        ret = SQLGetData( hstmt, 5, SQL_C_WCHAR, &uval, 200, &len );
        printf( "\n\tcolumn length %d { ", (int) len );
        for ( i = 0; i < len / sizeof( SQLWCHAR ); i ++ )
        {
            printf( "%04X ", uval[ i ] );
        }
        printf( "}, " );

        printf( "\n" );
    }

    ret = SQLFreeHandle( SQL_HANDLE_STMT, hstmt );
    ret = SQLDisconnect( hdbc );
    ret = SQLFreeHandle( SQL_HANDLE_DBC, hdbc );
    ret = SQLFreeHandle( SQL_HANDLE_ENV, henv );

    return 0;
}
This code reads the data from the tables, both as 8- and 16-bit characters.
Running the code allows us to find out what has been inserted into the database:
reading row: 1, 
    column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
    column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
    column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
    column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
reading row: 2, 
    column length 16 { 3E C5 B0 C5 84 C4 A9 C4 87 C5 8D C4 91 C4 95 3C }, 
    column length 18 { 003E 0170 0144 0129 0107 014D 0111 0115 003C }, 
    column length 16 { 3E C5 B0 C5 84 C4 A9 C4 87 C5 8D C4 91 C4 95 3C }, 
    column length 18 { 003E 0170 0144 0129 0107 014D 0111 0115 003C }, 
Here we can see that the data has been inserted as both UTF-8 and UCS-2, and is returned in the same way. It can also be seen that the Oracle ODBC driver and Oracle client will convert from one form to the other.
The code samples work because the client knows what character set to use. If we set the client character set to a non-Unicode set, for example:
NLS_LANG = AMERICAN_AMERICA.US7ASCII
then run the test code again, we get a very different result:
reading row: 1, 
    column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
    column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
    column length 9 { 3E 55 6E 69 63 6F 64 65 3C }, 
    column length 18 { 003E 0055 006E 0069 0063 006F 0064 0065 003C }, 
reading row: 2, 
    column length 9 { 3E 55 6E 3F 63 3F 64 3F 3C }, 
    column length 18 { 003E 0055 006E 003F 0063 003F 0064 003F 003C }, 
    column length 9 { 3E 55 6E 3F 63 3F 64 3F 3C }, 
    column length 18 { 003E 0055 006E 003F 0063 003F 0064 003F 003C }, 
We can see that all the characters less than 127 (row 1) are still fine, but the extended characters (row 2) are now mapped either to the nearest character the system can find, or to the replacement character ? (0x3F). Again, we can see this in isql. It is safe to use isql with NLS_LANG set to a 7-bit character set: we know we will only get 7-bit characters back, although they may not be the ones in the database.
SELECT CAST( I AS VARCHAR( 5 )) AS I,
       CAST( C1 AS VARCHAR( 20 )) AS C1,
       CAST( C2 AS VARCHAR( 20 )) AS C2
FROM UCTEST

+------+---------------------+---------------------+
| I    | C1                  | C2                  |
+------+---------------------+---------------------+
| 1    | >Unicode<           | >Unicode<           |
| 2    | >Un?c?d?<           | >Un?c?d?<           |
+------+---------------------+---------------------+
Summary
Unicode and Oracle can often be a subject of confusion. However, as this document has shown, once the basic principles of Unicode and the processes involved are understood, using Unicode with Oracle and ODBC can be simple and problem-free.
Key points:
- The Unicode Standard specifies a numeric value (code point) and a name for each of its characters. For example, the code point U+0041 is assigned the character name "LATIN CAPITAL LETTER A."
- There are a number of Unicode encoding forms. For example, UTF-8, UTF-16, UTF-32, and UCS-2. An encoding form defines how a Unicode code point is stored as a sequence of bytes in memory.
- Unicode encoding forms use either a fixed or a variable number of bytes to represent each character. For example, UTF-8 is a variable-width encoding where one character can be 1, 2, 3, or 4 bytes. UCS-2 is a fixed-width encoding where each character is 2 bytes in size.
- UCS-2 is a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports supplementary characters (which require 4 bytes per character).
- The Unicode Standard does not define the visual representation of a character. The rendering engine of a computer is responsible for the on-screen appearance of characters.
- A valid Unicode character may not display correctly because an appropriate font is not available. Characters that cannot be rendered are sometimes shown as an open or black rectangle.
- ODBC 3.51 supports Unicode in the form of Unicode data types and Unicode versions of the ODBC API. The encoding form that ODBC expects for data used with Unicode API functions is UCS-2.
- Although it's possible to use UTF-8 data with the standard ODBC API functions, ODBC does not know about the relationship between multi-byte sequences and single characters. For example, when reporting the length of a CHAR(20) column holding UTF-8 encoded data, ODBC is unaware that the data may require up to 80 (20×4) bytes of storage space.
- An Oracle database has two character sets. The database character set determines the encoding of the SQL CHAR data types. The national character set determines the character set of the SQL NCHAR types.
- Oracle supports Unicode through Unicode databases and Unicode data types. A Unicode database allows you to store UTF-8 encoded characters as SQL CHAR data types. The SQL NCHAR data types are called Unicode data types because they are only used for storing Unicode data.
- You need to set the NLS_LANG parameter to let Oracle know what character set the client machine is using. Setting NLS_LANG correctly enables Oracle to convert from the client operating system character encoding to the database character set (if necessary).
- When characters are converted between the client and the database character set, or vice versa, the character should exist in both. If a character does not exist in the destination character set, a replacement character such as a ? or a ¿ is used.