We have an MS SQL Server database that contains what we refer to as extended ASCII characters. Internally we also call them special characters. The database collation is set to SQL_Latin1_General_CP1_CI_AS and the data type is “Text”.
When we pull this data over to Hadoop via Sqoop, we end up with black diamonds with question marks mixed into the data. Here are examples of what the data look like in SQL Server, in the Impala/Hive editors in Hue, and in the Impala Shell. Notice the diamonds with question marks mixed into the data on the Hadoop side.
Our guess is that we are somehow failing to tell Sqoop that the character set we’re pulling in is not UTF-8, so the bytes get interpreted as UTF-8 somewhere along the way. Or something like that.
We’re not 100% sure what it does, but we’ve looked for something like the MySQL setting ‘characterEncoding=UTF-8’ and haven’t found anything similar for the MS SQL JDBC connection string.
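To illustrate the kind of mismatch we suspect, here is a minimal Python sketch. It assumes the source bytes are in Windows code page 1252 (which is what we understand the CP1 in the collation name to mean) and that something downstream decodes them as UTF-8. The sample string is made up for illustration and is not from our database, and this does not show what Sqoop or the JDBC driver actually do internally.

```python
# Hypothetical sample value, not taken from the real table.
original = "Café señor"

# Assumption: non-Unicode text under SQL_Latin1_General_CP1 is stored
# as single-byte code page 1252 characters.
raw_bytes = original.encode("cp1252")

# If a consumer wrongly treats those bytes as UTF-8, each invalid byte
# becomes U+FFFD, which many terminals and editors render as a black
# diamond with a question mark.
misread = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the matching code page recovers the original text.
correct = raw_bytes.decode("cp1252")

print(repr(misread))
print(repr(correct))
```

If this is what is happening, the accented characters are not lost in SQL Server itself; they are damaged at the point where the bytes are decoded with the wrong character set.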