Reply
Highlighted
Explorer
Posts: 32
Registered: ‎11-03-2015

[Sqoop1] Dirty values in Parquet tables

Hi, I am importing from Oracle using Sqoop1 (version 1.4.6, CDH 5.7.4).

I cannot use Sqoop to write directly to the destination Hive table, since this table has got a custom type mapping, that is different from the one used by Sqoop.

To achieve this, I offload data to a temporary uncompressed parquet table, and then I load the destination table using Hive (beeline). This way I can compress it (snappy) and fix the datatypes on the fly. 

This works correctly.

 

The problem is that I have some tables where one or two fields contains some special chars that break my table. 

I know this because, after a lot of debug, I have a working solution using Oracle's replace function to replace newlines, tabs and carriage return with a space (' ').

So, for the fields I know to be problematic, I write a query that replaces those chars upon extraction, and it works fine. 

 

However, this is clearly the wrong approach, since I can have other fields with dirty chars, and also other dirty chars apart from those 3 I am replacing. 

Also, I moved from flatfile to binary (parquet) for this very reason: to not have to bother with special chars.

Isn't Parquet supposed to not to care about the data contained in the fields ?? What am I missing ?

 

What can I do to fix this ?

For now, avro is not an option (SAS is not compatibile with it).

 

Thanks

Announcements
New solutions