Hi, I am importing from Oracle using Sqoop1 (version 1.4.6, CDH 5.7.4).
I cannot use Sqoop to write directly to the destination Hive table, since this table has a custom type mapping that differs from the one Sqoop uses.
Instead, I offload the data to a temporary uncompressed Parquet table, and then load the destination table using Hive (beeline). This way I can compress it (Snappy) and fix the datatypes on the fly.
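For reference, the two-step flow looks roughly like this. This is a sketch, not my exact commands: the connect string, table names, paths, and column casts are placeholders.

```shell
# Step 1 (sketch): offload from Oracle into an uncompressed Parquet staging dir.
# dbhost/ORCLPDB, MY_SCHEMA.MY_TABLE and the target dir are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCLPDB \
  --username etl_user -P \
  --table MY_SCHEMA.MY_TABLE \
  --target-dir /user/etl/stage_my_table \
  --as-parquetfile \
  -m 4

# Step 2 (sketch): load the destination table via beeline, fixing the
# datatypes with CAST and enabling Snappy compression on the fly.
beeline -u jdbc:hive2://hiveserver:10000 -e "
  SET parquet.compression=SNAPPY;
  INSERT OVERWRITE TABLE my_db.my_table
  SELECT CAST(id AS BIGINT),
         CAST(amount AS DECIMAL(18,2)),
         descr
  FROM my_db.stage_my_table;
"
```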
This works correctly.
The problem is that I have some tables where one or two fields contain special characters that break the loaded table.
I know this because, after a lot of debugging, I have a working solution that uses Oracle's REPLACE function to replace newlines, tabs and carriage returns with a space (' ').
So, for the fields I know to be problematic, I write a query that replaces those characters at extraction time, and it works fine.
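The per-field workaround looks roughly like this, using a free-form --query instead of --table. Again a sketch with placeholder names; CHR(10), CHR(13) and CHR(9) are the Oracle codes for newline, carriage return and tab.

```shell
# Sketch: scrub known-bad columns with Oracle's REPLACE at extraction time.
# With --query, Sqoop requires the $CONDITIONS token (escaped here because
# the query is in double quotes) and -m 1 unless a --split-by column is given.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCLPDB \
  --username etl_user -P \
  --query "SELECT id,
                  REPLACE(REPLACE(REPLACE(descr, CHR(10), ' '),
                                  CHR(13), ' '),
                          CHR(9), ' ') AS descr
           FROM my_schema.my_table
           WHERE \$CONDITIONS" \
  --target-dir /user/etl/stage_my_table \
  --as-parquetfile \
  -m 1
```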
However, this is clearly the wrong approach: other fields may contain dirty characters too, and there may be dirty characters beyond the three I am replacing.
Also, I moved from flat files to a binary format (Parquet) for this very reason: to avoid having to deal with special characters.
Isn't Parquet supposed not to care about the data contained in the fields? What am I missing?
What can I do to fix this?
For now, Avro is not an option (SAS is not compatible with it).