
[Sqoop1] Dirty values in Parquet tables


Hi, I am importing from Oracle using Sqoop1 (version 1.4.6, CDH 5.7.4).

I cannot use Sqoop to write directly to the destination Hive table, because that table uses a custom type mapping that differs from Sqoop's default one.

To work around this, I offload the data to a temporary uncompressed Parquet table and then load the destination table through Hive (beeline). This way I can compress it (Snappy) and fix the data types on the fly.

This works correctly.
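For context, the flow is roughly the following sketch; the connection details, database/table names and casts are placeholders, not my real ones:

# Step 1: Sqoop the data into a temporary, uncompressed Parquet table
sqoop import \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username myuser --password-file /user/me/.pw \
--table MYSCHEMA.MYTABLE \
--hive-import --hive-table staging.mytable_tmp \
--as-parquetfile

# Step 2: load the destination table via beeline, casting types and compressing
beeline -u jdbc:hive2://hiveserver:10000 -e "
SET parquet.compression=SNAPPY;
INSERT OVERWRITE TABLE prod.mytable
SELECT CAST(id AS BIGINT), CAST(amount AS DECIMAL(18,2)), description
FROM staging.mytable_tmp;"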

 

The problem is that in some tables, one or two fields contain special characters that break my table.

I know this because, after a lot of debugging, I got a working solution using Oracle's REPLACE function to replace newlines, tabs and carriage returns with a space (' ').

So, for the fields I know to be problematic, I write a query that replaces those characters at extraction time, and it works fine.
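Concretely, the workaround is a free-form query along these lines (table, column and paths are placeholders; CHR(10), CHR(13) and CHR(9) are LF, CR and tab):

# Per-column workaround: scrub LF/CR/tab on the Oracle side.
# Sqoop requires $CONDITIONS in the WHERE clause of --query imports.
sqoop import \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username myuser --password-file /user/me/.pw \
--query "SELECT id,
REPLACE(REPLACE(REPLACE(description, CHR(10), ' '), CHR(13), ' '), CHR(9), ' ') AS description
FROM MYSCHEMA.MYTABLE WHERE \$CONDITIONS" \
--split-by id \
--target-dir /tmp/mytable_parquet \
--as-parquetfile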

 

However, this is clearly the wrong approach: other fields could contain dirty characters too, and there could be dirty characters other than the three I am replacing.

Also, I moved from flat files to a binary format (Parquet) for this very reason: so I would not have to worry about special characters.

Isn't Parquet supposed to be indifferent to the data contained in the fields? What am I missing?

 

What can I do to fix this?

For now, Avro is not an option (SAS is not compatible with it).

 

Thanks
