Support Questions

I have a problem with table conversion (CSV -> gzipped Parquet / text file)




I have CSV files that I want to convert to gzip-compressed Parquet tables. I know the column names (headers) and the data type each column must have, so I created a Parquet table according to the CSV field information.


Then I ran a `set compression_codec=gzip` command and used an `insert overwrite ...` command to convert the data.
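For reference, the flow was roughly like this (table and column names below are placeholders, not my real ones):

```sql
-- Parquet target table created from the CSV column information
CREATE TABLE my_parquet_table LIKE my_csv_table STORED AS PARQUET;

-- Impala session option: write gzip-compressed Parquet
SET COMPRESSION_CODEC=gzip;

-- Copy the data, letting Impala cast each column to the target type
INSERT OVERWRITE my_parquet_table SELECT * FROM my_csv_table;
```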


When I ran the insert commands, I got the following 8 errors over a whole day. (I have to mask some of the data, so I use *** for masking.)


Error converting column: 15 TO BIGINT (Data is: numbers=lat
Error converting column: 18 TO INT (Data is: ***
Error converting column: 20 TO INT (Data is: ***)
Error converting column: 29 TO INT (Data is: **)
Error converting column: 35 TO BIGINT (Data is: Unspecified
Error converting column: 49 TO INT (Data is: Unspecified)


But lots of records were still inserted into the Parquet table, compressed.


When I compare the source table and the target Parquet table, the source table has 159183859 records but the target Parquet table has only 39328054. So 119855805 records are missing, even though I only got 8 errors during the conversion. This loss is huge and I don't know why.


I also tried a CSV --> partitioned CSV conversion and got the same result, i.e. the same number of lost records.


So I am looking for the reason why I lost this many records. Only about 25% of the data survived the conversion.

Do you have any idea why I lost this much data, and why I did not get any error other than the 8 column conversion errors?
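One possible cause of silent row loss like this (an assumption on my part, not confirmed from your data) is CSV records containing quoted embedded newlines or delimiters: a plain-text Hive/Impala table splits input on raw newlines, so one quoted multi-line CSV record can be read as several malformed fragments, or several records can collapse into one, without producing a conversion error per lost row. A small self-contained sketch of the symptom, using made-up data and Python's csv module:

```python
import csv
import io

# Three logical CSV records; the second contains a quoted newline.
raw = 'id,comment\n1,ok\n2,"line one\nline two"\n3,ok\n'

# Naive newline split (what a plain-text table scanner sees): 4 data "rows".
naive_rows = raw.strip().split("\n")[1:]  # drop the header line

# Proper CSV parsing that honors quoting: 3 records.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows), len(parsed_rows))  # row counts disagree: 4 vs 3
```

Comparing a proper CSV-parsed record count against the source table's row count would show whether the mismatch already exists at parse time, before any type conversion happens.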


Thanks for your help.





I would recommend using PySpark for converting CSV to Parquet.

Spark can store Parquet files with gzip compression (the codec is controlled by `spark.sql.parquet.compression.codec`, or per write with the `compression` option).

Is there any particular reason you want to stick with Hive?

Because, as far as I can tell, even with just the default settings Spark achieves better compression than Hive.
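A minimal PySpark sketch of the conversion (the paths, column names, and schema are placeholders you would replace with your real CSV layout); reading in PERMISSIVE mode with a corrupt-record column surfaces unparsable rows instead of dropping them silently:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Placeholder schema -- build it from your real CSV header/type information.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),  # holds unparsable rows
])

df = (spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")   # keep bad rows instead of dropping them
      .schema(schema)
      .csv("/path/to/csv/dir"))       # placeholder input path

# Inspect how many rows failed to parse before writing anything.
bad = df.filter(df["_corrupt_record"].isNotNull()).count()
print("unparsable rows:", bad)

(df.drop("_corrupt_record")
   .write
   .option("compression", "gzip")     # explicit gzip codec for Parquet
   .parquet("/path/to/parquet/out"))  # placeholder output path
```

Comparing `bad` plus the written row count against your source count should account for every record, which Hive/Impala's truncated error reporting does not.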

I have many internal customers who only know Impala/Hive, and they run lots of queries at the same time.

So I have to use it, and that is why I want gzipped Parquet.