Hello,
I'm using Hive (v1.2.1) to convert our data files from CSV into Parquet for use in AWS Athena.
However, no matter what I try, the resulting Parquet file always has the column titles [_col0, _col1, ..., _colN].
After researching, I read that the line SET parquet.column.index.access=false was supposed to make Parquet use the column titles of my Hive table; however, it has not worked so far.
Below is an example script I use to create the Parquet files from the data:
SET parquet.column.index.access=false;
CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
`release` STRING,
`customer` STRING,
`cookie` STRING,
`category` STRING,
`end_time` STRING,
`start_time` STRING,
`first_name` STRING,
`email` STRING,
`phone` STRING,
`last_name` STRING,
`site` STRING,
`source` STRING,
`subject` STRING,
`raw` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
STORED AS PARQUET
SELECT *
FROM EVENTS;
Using parquet-tools, I inspected the resulting file; here is an example of the output:
_col0 = 0.1.2
_col1 = customer1
_col2 = NULL
_col3 = api
_col4 = 2018-01-21T06:57:57Z
_col5 = 2018-01-21T06:57:56Z
_col6 = Brandon
_col7 = brandon@fakesite.com
_col8 = 999-999-9999
_col9 = Pompei
_col10 = Boston
_col11 = Wifi
_col12 = NULL
_col13 = eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
This is problematic because it is impossible to load the file into an Athena table (or even back into Hive) without using these index-based column titles. I need Hive's column titles to carry over to the Parquet file.
I've searched for a very long time and have come up short. Am I doing something wrong?
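In case it's relevant: one workaround I'm considering (untested sketch; the EVENTS_PARQUET table name is a placeholder I made up) is to insert into an explicitly defined Parquet-backed table instead of a bare directory, on the theory that the destination table's schema would supply the column names:

```sql
SET parquet.column.index.access=false;

-- Placeholder destination table; same schema as EVENTS, but stored as Parquet
CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
  `release` STRING,
  `customer` STRING,
  `cookie` STRING,
  `category` STRING,
  `end_time` STRING,
  `start_time` STRING,
  `first_name` STRING,
  `email` STRING,
  `phone` STRING,
  `last_name` STRING,
  `site` STRING,
  `source` STRING,
  `subject` STRING,
  `raw` STRING
)
STORED AS PARQUET
LOCATION '${OUTPUT}/parquet';

-- Write into the named table rather than INSERT OVERWRITE DIRECTORY
INSERT OVERWRITE TABLE EVENTS_PARQUET
SELECT * FROM EVENTS;
```

Would writing through a named table like this preserve the column titles, or does the same _colN naming appear there too?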
Please let me know if I can provide more information. Thank you!