Some common Unicode problems and solutions demonstrated with Perl DBD::ODBC and Microsoft SQL Server
Contents
- Introduction
- Terminology
- Some DBD::ODBC background and history
- ODBC and Unicode
- How can you tell what support for the Unicode ODBC API you have already got?
- What happens if I try to use Unicode from Perl DBD::ODBC and a component in the chain does not support the Unicode APIs?
- Using VARCHAR columns instead of NVARCHAR columns for Unicode data
- The correct way to do Unicode with DBD::ODBC and SQL Server
Introduction
In this article, I have tried to collect some useful information to help you use Unicode in Microsoft SQL Server from Perl DBD::ODBC. Many of the problems listed here come from questions on the dbi-users mailing list, PerlMonks, or emails sent directly to me.
I have tried very hard to make all the examples work on Windows and UNIX but:
- Not all ODBC drivers are equal.
Many ODBC drivers for UNIX have multiple ways to configure Unicode support. (For example, returning wide characters as per the ODBC specification, or converting these characters to UTF-8.)
- SQL Server is not constant.
Different versions of Microsoft SQL Server add new features or bug fixes. For example, some versions of SQL Server don't support supplemental characters. SQL Server 2012 and later versions do support supplemental characters if you use the correct collation.
When writing this article, I discovered a bug in DBD::ODBC (when inserting into CHAR or VARCHAR columns) that affects all versions of DBD::ODBC that support Unicode, up until 1.46_1, when it was fixed. I've tried to highlight the issue in the following examples.
Terminology
- Encoding
By encoding, I mean how Unicode characters are encoded. For example, they could be encoded in UTF-8, UTF-16, or UCS-2. Internally, Perl encodes Unicode characters in UTF-8, but you mostly do not need to know that, although DBD::ODBC does.
- Wide characters
In the ODBC API, wide characters are 2 bytes and UCS-2 encoded. However, Microsoft's definition of a wide character has changed over time, and a wide character is now sometimes 4 bytes of UTF-16 (a surrogate pair).
- Wide APIs
The ODBC wide APIs are those called SQLxxxW. For example, SQLDriverConnectW. Any string arguments to wide APIs normally expect UCS-2 (and sometimes UTF-16).
- SQL_WCHAR and SQL_WVARCHAR
SQL_WCHAR and SQL_WVARCHAR are actually macros in the C ODBC API, which are assigned numbers and passed into some ODBC APIs to tell the ODBC driver to return wide characters. You can use these macros independently of using the wide ODBC APIs. You don't have to use SQLDriverConnectW, SQLPrepareW, and so on just to get wide characters back from an ODBC driver. SQLBindCol and SQLBindParameter often support SQL_WCHAR and SQL_WVARCHAR as well.
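To make these terms concrete, here is a minimal sketch using DBI's data_string_desc utility (which the examples below rely on) to show how Perl holds a Euro internally:

use DBI qw(:utils);

my $euro = "\x{20ac}";
# Perl stores the Euro as one character whose internal representation
# is the 3-byte UTF-8 sequence e2,82,ac, with the utf8 flag on:
print data_string_desc($euro), "\n";
# UTF8 on, non-ASCII, 1 characters 3 bytes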
Some DBD::ODBC background and history
DBD::ODBC has not always supported Unicode. It all started with a patch from Alexander Foken. Around version 1.14, Alexander's original patch was adapted to include optional Unicode support for UNIX.
For DBD::ODBC, building without Unicode support really means building DBD::ODBC as it worked before Unicode support was added, to maintain backwards compatibility.
The initial Unicode support was for Windows only and allowed you to send or retrieve NCHAR and NVARCHAR columns as SQL_WCHARs. DBD::ODBC's initial Unicode support:
- Uses Windows APIs like WideCharToMultiByte and MultiByteToWideChar to convert from UCS-2 to UTF-8 and vice versa.
- Passes SQL_WCHARs to SQLBindCol for NCHAR columns, meaning UCS-2 data can be retrieved from a column.
- Marks converted UTF-8 data returned from the database as UTF-8 for Perl.
- Binds Unicode data in Perl as SQL_WCHARs and uses SQLBindParameter, meaning you can write Unicode data into columns.
Since then Unicode support has grown to include:
- Unicode support in SQL (1.16_2).
- Unicode support in connection strings (1.16_2).
- Unicode support in column names.
- Unicode support in metadata calls (1.32_3).
For full documentation on Unicode in DBD::ODBC, refer to the Unicode section in the DBD::ODBC documentation.
ODBC and Unicode
The ODBC API has two main sets of APIs: the ANSI API (the SQLxxxA APIs) and the Unicode API (the SQLxxxW APIs). By default, DBD::ODBC uses the Unicode API on Windows and the ANSI API on non-Windows platforms. There are good historical reasons for this that are beyond the scope of this article. If you want to use DBD::ODBC to work with Unicode data in SQL Server from non-Windows platforms, you need:
- unixODBC
Use the unixODBC ODBC Driver Manager. You'll need the unixodbc and unixodbc-dev packages, or you can build it yourself.
- DBD::ODBC
Build DBD::ODBC for the Unicode API, that is, build it with:
perl Makefile.PL -u
If you install DBD::ODBC from the CPAN shell on UNIX, you will get, by default, a DBD::ODBC that does not use the wide ODBC APIs.
- ODBC driver
Use a SQL Server ODBC driver that supports the Unicode APIs. Our SQL Server ODBC driver supports all the wide ODBC APIs, as does Microsoft's ODBC driver.
How can you tell what support for the Unicode ODBC API you have already got?
- odbc_has_unicode
For DBD::ODBC, connect to your database and then test the odbc_has_unicode attribute:
perl -MDBI -le 'my $h = DBI->connect; print $h->{odbc_has_unicode};'
If this outputs 1, your DBD::ODBC was built for Unicode. Any false value means DBD::ODBC is not using the Unicode APIs.
- Use a recent version of unixODBC
Use the most recent unixODBC you can get hold of. Most packaged unixODBCs (for example, those for Ubuntu, Debian, and so on) are quite old, but even these support Unicode. We recommend a recent version of unixODBC because of important bug fixes contained in recent unixODBCs.
- Unicode Wide APIs supported by ODBC driver
Find out if your ODBC driver supports the Unicode APIs (all Easysoft ODBC drivers do). This is a lot harder than it sounds and you will most likely have to consult the ODBC driver documentation.
If you understand shared objects in UNIX, you can try looking for SQLxxxW APIs being exported by the driver shared library. For example:
nm /usr/local/easysoft/sqlserver/lib/libessqlsrv.so | grep SQLDriverConnectW
0001cf80 T SQLDriverConnectW
What happens if I try to use Unicode from Perl DBD::ODBC and a component in the chain does not support the Unicode APIs?
The simple answer is you won't be able to insert, update, delete, or select Unicode data from Microsoft SQL Server:
Example 1: Simple INSERT and SELECT with a non-Unicode DBD::ODBC
# ex1.pl
use 5.008001;
use strict;
use warnings;
use DBI qw{:utils};

my $unicode = "\x{20ac}";       # Unicode euro symbol

my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;
eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a nvarchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->execute($unicode);

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c (split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";
which outputs:
DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
Data Length: 3
hex ords: e2,82,ac,
We attempted to insert a Unicode Euro symbol and when we selected it back, we got 3 characters and 3 bytes instead of 1 character and 3 bytes. This is confirmed by the fact that the Perl data contains the UTF-8 encoding for a Euro.
An explanation of what happened:
- The column was created as an NVARCHAR, so Microsoft SQL Server should be happy to accept Unicode characters for the column data.
- DBD::ODBC prepared the SQL and asked the ODBC driver to describe the parameter. DBD::ODBC was told it was a UNICODE VARCHAR of length 20 characters. However, it bound the parameter with a value type of SQL_C_CHAR and a parameter type of SQL_WVARCHAR, so the driver interpreted each byte as a character.
- When we read the data back, we got the bytes back as Perl had encoded the Euro internally (UTF-8).
You might be asking yourself at this point why DBD::ODBC bound the data as a value type of SQL_C_CHAR, and the answer is backwards compatibility: that's what it did for a long time before support for the Unicode API was added.
So what if we force DBD::ODBC to bind the data as a SQL_WCHAR?
Example 2: INSERT and SELECT statements with a non-Unicode DBD::ODBC that force SQL_WCHAR
The code for this is nearly identical to the above, except we add a bind_param call and import :sql_types from DBI.
use 5.008001;
use strict;
use warnings;
use DBI qw{:utils :sql_types};

my $unicode = "\x{20ac}";       # Unicode euro symbol

my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;
eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a nvarchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$s->execute($unicode);

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c (split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";
and the output is:
DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
Data Length: 3
hex ords: e2,82,ac,
Exactly the same as before. Why? The TYPE argument passed to bind_param sets the SQL type (the parameter type) and not the value type in the SQLBindParameter call.
Reading properly written Unicode in a non-Unicode DBD::ODBC
Now what if the Unicode Euro was inserted correctly by something else and we want to read it using a non-Unicode DBD::ODBC?
Example 3: Unicode from a non-Unicode DBD::ODBC
We have got a valid Unicode Euro symbol in the database, stored in an NVARCHAR column. (Do not worry about how for now; this is just showing what happens when the data in the database is correct but you use the wrong method to get it.)
use 5.008001;
use strict;
use warnings;
use DBI qw{:utils};

my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c (split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";
which outputs:
DBI describes data as: UTF8 off, non-ASCII, 1 characters 1 bytes
Data Length: 1
hex ords: 80
Note that what you get back in data here very much depends on the ODBC driver and platform. On Windows, you would probably get the above because 0x80 is the Windows-1252 character for a Euro. (If it had been something not in Windows-1252, it would probably have returned a question mark.) With some Microsoft SQL Server ODBC drivers on UNIX, you could get any of the following (and perhaps more):
- 0xac (the low byte of 0x20ac).
- 0x3f (a question mark, because the driver cannot convert a wide character to a SQL_C_CHAR).
- 0x80 (if you set the client character set to Windows-1252).
- 0xe2, 0x82, 0xac (the UTF-8 encoded Euro).
The point of this illustration is that you do not really want to do any of the above unless you have no choice.
You might be saying to yourself, yes, but you can set a type in the bind_col method, so you can control how the data is returned to you. That's mostly not true for just about all the Perl DBDs I know of, and with DBD::ODBC, although you can override the default type in a bind_col call, you can only do it for decimals and timestamps.
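For completeness, here is a minimal sketch of what such an override looks like; the table and column are hypothetical, and remember DBD::ODBC only honours the override for decimal and timestamp columns:

use DBI qw(:sql_types);

my $s = $h->prepare(q/select price from products/);   # hypothetical table
$s->execute;
# DBI allows a TYPE attribute in bind_col; DBD::ODBC honours it only
# for decimals and timestamps; for char/wide char columns the default
# binding is used regardless.
$s->bind_col(1, \my $price, { TYPE => SQL_NUMERIC });
while ($s->fetch) {
    print "$price\n";
}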
Using VARCHAR columns instead of NVARCHAR columns for Unicode data
If you're using DBD::ODBC before 1.46_1, do not do this. There's a bug in DBD::ODBC before 1.46_1 that means it does not look at the Perl scalars you're binding for input and always binds them using the type the driver describes the column as (which will always be SQL_CHAR for a VARCHAR column).
Generally speaking, you should use NCHAR and NVARCHAR columns when you need to support multiple languages in the same column, although even that is not always necessary. For example, you can support English, German, and Italian in one Windows codepage. A better recommendation would be to use NCHAR/NVARCHAR columns for user-provided data that's unconstrained, and VARCHAR columns for data that is constrained and that you control: a number plate, a serial number, and so on. A sketch of such a schema follows.
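Here is a hypothetical schema following that recommendation (the table and column names are mine, purely for illustration):

# serial_no holds constrained data we control, so a codepage VARCHAR
# is fine; description is unconstrained user-provided text, so it
# gets an NVARCHAR.
$h->do(q/create table products (
    serial_no   varchar(20),
    description nvarchar(100)
)/);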
However, in the spirit of describing why, let's look at some examples. These examples assume we're now using a Unicode DBD::ODBC with an ODBC driver that supports the Unicode ODBC APIs.
So we return to our first simple example, but now run it with a Unicode DBD::ODBC, use a VARCHAR column, and try two different bind types (the default and an overridden one):
Example 4: Simple INSERT and SELECT with a Unicode DBD::ODBC using VARCHAR
use 5.008001;
use strict;
use warnings;
use DBI qw{:utils :sql_types};

my $unicode = "\x{20ac}";       # Unicode euro symbol

my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;
eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a varchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
# first insert: let DBD::ODBC pick the bind type (the buggy default)
$s->execute($unicode);
# second insert: override the bind type to SQL_WVARCHAR
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$s->execute($unicode);

my $r = $h->selectall_arrayref(q/select a from unicode_test/);
foreach my $row (@$r) {
    my $data = $row->[0];
    print "DBI describes data as: ", data_string_desc($data), "\n";
    print "Data Length: ", length($data), "\n";
    print "hex ords: ";
    foreach my $c (split(//, $data)) {
        print sprintf("%x,", ord($c));
    }
    print "\n";
}
which outputs:
DBI describes data as: UTF8 on, non-ASCII, 3 characters 6 bytes
Data Length: 3
hex ords: e2,201a,ac,
DBI describes data as: UTF8 on, non-ASCII, 1 characters 3 bytes
Data Length: 1
hex ords: 20ac,
Here again, you'll get different results depending on the platform and ODBC driver.
I imagine this is really going to make you wonder what on earth has happened here. Bear in mind that, in Perl, the Euro is internally encoded in UTF-8 as 0xe2, 0x82, and 0xac.
In the first INSERT, DBD::ODBC did what it always does and asked the database what the column type was. The database returned SQL_CHAR, and the Euro was bound as a SQL_CHAR (the bug). In the second case, we overrode DBD::ODBC and told it to bind the data as a SQL_WVARCHAR.
When we retrieved the data, DBD::ODBC bound the column as SQL_WCHAR (which it always does in a Unicode build).
As far as Microsoft SQL Server is concerned, this is a VARCHAR column and you wanted to insert three characters (codes 0xe2, 0x82, and 0xac). This is what ends up in the database, as confirmed when we read them back as binary data. However, where did the character with the code 0x201a come from? When DBD::ODBC read the data back, it bound the column as a SQL_C_WCHAR and therefore asked SQL Server to convert the characters in the VARCHAR column to wide (UCS-2 or UTF-16) characters. And guess what? Character 0x82 in the Windows-1252 character set (which I was using when running this code) is "curved quotes", with the Unicode value 0x201A. 0xe2 and 0xac in Windows-1252 map to the same character codes in Unicode (U+00E2 and U+00AC).
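You can reproduce that mapping with Perl's core Encode module; a minimal sketch:

use Encode qw(decode);

# Decode the UTF-8 bytes of a Euro (e2,82,ac) as if each byte were a
# Windows-1252 character, which is effectively what happened above:
my $chars = decode("cp1252", "\xe2\x82\xac");
printf "%x,%x,%x\n", map { ord } split //, $chars;    # e2,201a,ac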
In the second row, we bound the data as a SQL_WCHAR for the INSERT (and the database mapped those Unicode characters to the codepage for the table), and we bound the data as a SQL_WCHAR for the SELECT. The ODBC driver maps the codepage characters retrieved back to Unicode, and we get back what we inserted (as long as all the characters we inserted fit into the current codepage). However, had we tried to insert a character not in the Windows-1252 codepage, SQL Server would have substituted that character with a ?. We should not have had to override the bind type here, and that was the bug in DBD::ODBC pre-1.46_1.
This last point is really important. On Windows, the ODBC driver knows your codepage and can do this, but on UNIX it is not uncommon to have to tell the ODBC driver what codepage you want to use for the conversion. For example, with the Easysoft SQL Server ODBC driver, you can set Client_CSet = WINDOWS-1252, or you could add use_lcid=Yes to your connection and the ODBC driver will work out the correct codepage to use.
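As a sketch, such a connection might look like this (the server, user, and password values are placeholders; Client_CSet is an Easysoft driver attribute, and other drivers use different keywords):

# DSN-less connection naming the client codepage explicitly, so the
# driver knows how to convert between the codepage and Unicode.
my $h = DBI->connect(
    "dbi:ODBC:DRIVER={Easysoft ODBC-SQL Server};Server=myserver;" .
    "UID=myuser;PWD=mypassword;Client_CSet=WINDOWS-1252;",
    undef, undef, { RaiseError => 1 }) or die $DBI::errstr;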
Here's a Windows-specific version of the above test with a few more enhancements:
#
# A simple demonstration of why you cannot use char and varchar columns
# in Microsoft SQL Server to store Unicode. char and varchar columns use
# a codepage and Unicode characters inserted into them are converted to
# the codepage. If a conversion does not exist in the codepage, the
# characters which don't convert will become '?'.
#
# Shows the difference between binding as SQL_CHAR and SQL_WCHAR.
#
# Refer to http://msdn.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2
#
use 5.008.001;
use strict;
use warnings;
use DBI qw(:sql_types);
use Data::Dumper;
use Win32::API;

# chcp on my machines normally gives 850
# http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/chcp.mspx?mfr=true
#
# we want windows-1252 so run: chcp 1252 first
#use open qw( :encoding(Windows-1252) :std );
binmode STDOUT, ":encoding(cp1252)";

# get active codepage and ensure it is cp1252
# http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using
Win32::API::More->Import("kernel32", "UINT GetConsoleOutputCP()");
my $cp = GetConsoleOutputCP();
print "Current active console code page: $cp\n";
if ($cp != 1252) {
    print "Change to codepage 1252 - run chcp 1252\n";
    die "Incompatible active codepage - change to codepage 1252 by running chcp 1252\n";
}

my $h = DBI->connect() or die $DBI::errstr;
$h->{RaiseError} = 1;
$h->{PrintError} = 1;
eval {$h->do(q/drop table varchar_test/)};
$h->do(q/create table varchar_test(a varchar(20) collate Latin1_General_CI_AS)/);

# note codepage 1252 is 255 chrs including the euro at 0x80
# windows-1252 does not contain U+0187 but it does contain
# the euro symbol (U+20AC), the curved quotes (U+201A),
# Latin Small Letter F with hook (U+0192), dagger (U+2020)
# mixing code pages in SQL Server is not recommended
my $insert = $h->prepare(q/insert into varchar_test (a) values(?)/);
my $data = "\x{20ac}\x{201A}\x{192}\x{2020}\x{187}";
{
    use bytes;
    print "encoded length of our data is:", length($data), "\n";
    print "encoded data in hex is:";
    foreach my $b (split(//, $data)) {
        print sprintf("%x,", ord($b));
    }
    print "\n";
}

# this execute will discover the column is varchar and bind the perl scalar
# as SQL_CHAR meaning the UTF-8 encoded data in the perl scalar
# will be inserted as separate characters not all of which will even
# be translateable to the current codepage.
$insert->execute($data);

# Now we force DBD::ODBC to insert the parameter as SQL_WVARCHAR
$insert->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$insert->execute($data);

print "Notice in the first row (which was inserted as SQL_CHAR), the UTF-8 stored\n",
    "in the perl scalar is mostly stored as individual characters, but then you will be\n",
    "wondering why a few of the characters seem to come back as Unicode. Windows sees individual\n",
    "characters in the UTF-8 sequence as characters in the windows-1252 codepage, and the UTF-8 sequence\n",
    "contains some characters in windows-1252 which map back to Unicode chrs. e.g., the UTF-8 sequence for the\n",
    "euro is e2, 82, ac and windows sees the 82 as the curved quotes in windows-1252, but when you ask for it back\n",
    "as wide/Unicode characters it maps it to U+201A (the curved quotes Unicode point).\n";
print "Notice how in the second row the last character is a ?. That is because U+0187 does not exist in the windows-1252\n",
    "codepage our column is using.\n";

my $r = $h->selectall_arrayref(q/select a from varchar_test/);
print Dumper($r);
foreach my $row (@$r) {
    print $row->[0], "\n";
}
$h->disconnect;
The correct way to do Unicode with DBD::ODBC and SQL Server
When retrieving rows on Windows or UNIX (and using a Unicode DBD::ODBC), your CHAR, VARCHAR, NCHAR, and NVARCHAR columns should all be correct. Even when you use CHAR and VARCHAR columns that use a codepage, SQL Server will convert any character in the codepage to a Unicode codepoint and DBD::ODBC will encode them as UTF-8 and mark them as Unicode to Perl. This is because DBD::ODBC asks for the column data as SQL_WCHAR.
When inserting Unicode, DBD::ODBC will normally just do the right thing: use SQL_WCHAR for NCHAR and NVARCHAR columns. If your column is a CHAR or VARCHAR, then prior to 1.46_1 DBD::ODBC may do the wrong thing by default: it ignored your Perl data and bound it as the type that the ODBC driver reported for the parameter. In 1.46_1 and later, DBD::ODBC checks your scalar for the parameter first to see if it has the utf8 flag on it.
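You can inspect that flag yourself; a minimal sketch using the always-available utf8::is_utf8 function:

my $ascii   = "euro";
my $unicode = "\x{20ac}";
# utf8::is_utf8 reports Perl's internal utf8 flag, which is what
# DBD::ODBC 1.46_1 and later looks at when choosing the bind type.
print utf8::is_utf8($ascii)   ? "on\n" : "off\n";   # off
print utf8::is_utf8($unicode) ? "on\n" : "off\n";   # on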
Surrogate pairs (Unicode code points above U+FFFF)
ODBC supports Unicode in the form of SQL_WCHAR types and Unicode versions of the ODBC API. The encoding form that ODBC expects for data used with Unicode API functions is UCS-2 (or at least that is how it looked when Windows and Microsoft SQL Server were using UCS-2).
In anticipation of things changing, when Unicode support was added to DBD::ODBC, the internal code actually converts Perl UTF-8 encoded strings into UTF-16, which for characters in the Basic Multilingual Plane is identical to UCS-2. As there were no Unicode ODBC drivers supporting supplementary characters (above U+FFFF) at the time, this could not be proven to be a good decision. However, at worst, it meant code using Unicode outside the Basic Multilingual Plane would just insert what looked to the database like more characters.
Older versions of SQL Server and surrogate pairs
As it turned out, the decision in DBD::ODBC to use UTF-16 was exactly what Microsoft initially did, and versions of Microsoft SQL Server like 2000, 2005, and 2008 are described as surrogate-neutral rather than surrogate-aware. Microsoft had this to say at the time:
Both SQL Server 2000 and SQL Server 2005 can store surrogate pairs, even though UCS-2 is not aware of surrogates. SQL Server treats the surrogate pairs as two undefined Unicode characters rather than as a single character. Such applications are usually referred to as surrogate-neutral or surrogate-safe, meaning that there is no intrinsic ability to interact with the data, but at least the data can be stored without loss.
However, there are a few things you should be aware of when using these older, surrogate-neutral, SQL Server versions:
- For each surrogate pair inserted, you will need one extra character in the column. For example, inserting three surrogate pairs into an NVARCHAR column requires an NVARCHAR(6), not an NVARCHAR(3).
- String operations are not aware of supplementary characters. So, watch out if you're using the SUBSTRING or LEN functions in your SQL.
- Sorting and searching behaviour for supplementary characters may vary depending on the collation.
Newer versions of SQL Server and surrogate pairs
Newer versions of SQL Server (2012 and later, version >= 11) support surrogate pairs, but you must set the collation to one ending in _SC. For example, Latin1_General_100_CI_AS_SC. When you do this, string functions will recognise surrogate pairs, and all of the problems listed above for older versions of SQL Server are fixed.
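So, on SQL Server 2012 or later, you might create the example column like this (a sketch; pick the _SC collation that matches your language settings):

# With an _SC collation, LEN() counts a surrogate pair as one
# character and SUBSTRING() will not split a pair in half.
$h->do(q/create table unicode_test (
    a nvarchar(20) collate Latin1_General_100_CI_AS_SC
)/);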
Is my SQL Server surrogate-neutral or surrogate-aware?
Here's a small script you can use to test whether your SQL Server is surrogate-neutral or surrogate-aware:
# Test to see if your SQL Server is surrogate-aware or just surrogate-neutral
use 5.008.001;
use strict;
use warnings;
use DBI qw(:utils);

my $h = DBI->connect() or die $DBI::errstr;
$h->{PrintError} = 0;
eval {$h->do(q/drop table surrogate_pairs/)};
# It is possible to set the collation at instance or database level.
# Set it on the column to make sure that, initially, we are using a
# non supplementary character collation.
$h->do(q/create table surrogate_pairs (a nvarchar(20) collate Latin1_General_100_CI_AI)/);
my $insert = $h->prepare(q/insert into surrogate_pairs values(?)/);

# Insert test supplementary character
print "Inserting Unicode character U+2070E into db\n";
$insert->execute("\x{2070E}");

# now read it back and see what we get
print "Note when we select this character back it is still 1 Unicode character and 4 bytes, and the ord is correct at\n",
    "0x2070E. This is because DBD::ODBC received a buffer of SQL_WCHAR chrs back from SQL Server, which it then\n",
    "decoded as UTF-16, which recognises the surrogate pair. This is why SQL Servers using this collation (or older\n",
    "SQL Servers) are known as surrogate-neutral.\n";
my $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
print data_string_desc($r->[0]), "\n";
print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

# This is a non _SC collation, so the length function returns "2".
print "Note in the following that len(a) returns 2, not 1, as SQL Server has not recognised this as a surrogate pair.\n";
$r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
print "length in database is: ", $r->[0], "\n";

# now try to alter the table to change the collation to Latin1_General_100_CI_AS_SC
# which only later SQL Servers (>= version 11, i.e., 2012) can do.
# Unfortunately older SQL Servers don't error if you try to change the collation
# to one it does not support, so we cannot test by just trying to change to a
# surrogate aware collation.
$h->do(q/alter table surrogate_pairs alter column a nchar(20) collate Latin1_General_100_CI_AS_SC/);

$r = $h->selectrow_arrayref(q/SELECT SERVERPROPERTY('ProductVersion')/);
my $version = $r->[0];
print "Your SQL Server is version: $version\n\n";
if ((split(/\./, $version))[0] >= 11) {
    print "Your SQL Server is surrogate-aware\n";
    $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
    print data_string_desc($r->[0]), "\n";
    print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";
    print "Note in the following that len(a) returns 1, as SQL Server in this collation recognises surrogate pairs.\n";
    $r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
    print "length in database is: ", $r->[0], "\n";
} else {
    print "Your SQL Server is surrogate-neutral but not surrogate-aware\n";
}
$h->disconnect;