I have CSV files that I want to convert to gzip-compressed Parquet tables. I know the column names (headers) and the required data type for every CSV column, so I created a Parquet table according to that CSV field information.
Then I ran the set compression_codec=gzip command and converted the data with an insert overwrite .... command.
When I run the insert commands, I get the following 8 errors over a whole day. (I have to mask some of the data, so I use *** for masking.)
Error converting column: 15 TO BIGINT (Data is: numbers=lat
Error converting column: 18 TO INT (Data is: ***
Error converting column: 20 TO INT (Data is: ***)
Error converting column: 29 TO INT (Data is: **)
Error converting column: 35 TO BIGINT (Data is: Unspecified
Error converting column: 49 TO INT (Data is: Unspecified)
Still, a lot of records were inserted into the Parquet table, compressed.
When I compare the source table with the target Parquet table, the source has 159183859 records but the target has only 39328054, so 119855805 records are missing even though I only got 8 errors during the conversion. This loss is huge and I don't know why.
I also tried a csv --> partitioned csv conversion and got the same result, I mean the same number of lost records.
So I am looking for the reason I lost this many records; only about 25% of the data survived.
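For what it's worth, one pattern that produces exactly this symptom is quoted fields containing embedded newlines: Hive's default text SerDe splits records on newlines, so such rows are broken apart and their fields shift into the wrong columns. A minimal Python sketch of that check (the sample data here is made up) compares the physical line count with the CSV-parsed record count; a mismatch means embedded newlines are present:

```python
import csv, io

# Made-up sample: the second record has a quoted field with an embedded newline.
sample = 'id,lat\n1,"40\n2"\n3,41\n'

physical_lines = sample.count("\n")          # raw text lines, as a line-based SerDe sees them
with io.StringIO(sample) as f:
    records = sum(1 for _ in csv.reader(f))  # records as a quote-aware CSV parser sees them
print(physical_lines, records)               # prints 4 3 -- a mismatch flags embedded newlines
```

Run against a real file, a mismatch between the two counts would point at the raw data rather than the Parquet conversion itself.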
Do you have any idea why this much data was lost, and why I did not get any errors other than the 8 column-conversion errors?
Thanks for your help.
I would recommend using PySpark to convert the CSV files to Parquet.
Spark writes Parquet out of the box, and the compression codec can be set to gzip (note that recent Spark versions default to snappy, so set it explicitly).
Is there any particular reason you want to stick with Hive?
Because, as far as I can tell, Spark with just its default settings compresses better than Hive does.