Created on 04-24-2018 12:07 AM - edited 09-16-2022 06:08 AM
After loading CSV data into HDFS, I have seen the approach below: first create a Hive external table (textfile format), then create an Impala internal table (Parquet format) like the Hive external table and populate it from the external one. And it works.
My question:
1) Why should one go this roundabout way of creating a Hive table and then an Impala table from it? Why can't we directly create an external Impala table (in Parquet format)?
2) Is there any issue with sticking to external tables only (without any internal tables), given that my data is always bulk loaded directly into HDFS?
3) When should one use "from_unixtime(unix_timestamp(as_of_date,"dd-MMM-yy"),'yyyy-MM-dd')" and store the date as a string, versus storing the date as a timestamp in Impala?
-- sample external table defined below
create EXTERNAL TABLE my_external_table (
Col1 string,
as_of_date string,
Col3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION "/data/my_data_files"
tblproperties("skip.header.line.count"="1");
-- sample internal table defined below
create table my_internal_table like my_external_table stored as parquet;
insert into table my_internal_table
select
Col1,
from_unixtime(unix_timestamp(as_of_date,"dd-MMM-yy"),'yyyy-MM-dd'),
Col3
from my_external_table;
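For reference, a minimal sketch of the same two steps collapsed into a single CTAS statement, reusing the table and column names above (the dd-MMM-yy month pattern is a Hive-style format string, so support for it can differ between Hive and Impala versions):
-- create the Parquet table and populate it in one statement
CREATE TABLE my_internal_table
STORED AS PARQUET
AS
SELECT
  Col1,
  from_unixtime(unix_timestamp(as_of_date, 'dd-MMM-yy'), 'yyyy-MM-dd') AS as_of_date,
  Col3
FROM my_external_table;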
Created 04-24-2018 03:05 AM
Hi @toamitjain,
Firstly, I advise you to create the external TEXTFILE table in Impala as well (it's faster than creating it in Hive!).
CREATE EXTERNAL TABLE table1 (columns def) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'hdfs_path';
1- Your initial HDFS files are probably in text format, while Parquet tables use the Parquet format, so you have to do this recreation to perform the TEXT => PARQUET conversion (but if your HDFS files are already in Parquet format, you can create an external Parquet table directly on them; see the sketch after this list).
2- Apart from slower queries (a Parquet table is much faster than a text table) there is no issue; the only difference is that the table metadata and the HDFS files are decoupled, so dropping the external table does not delete the data files.
3- I think storing the date as a timestamp is cleaner, so you can benefit from the timestamp functions.
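For point 1, a minimal sketch (path and column list are placeholders) of pointing an external Parquet table directly at existing Parquet files in Impala:
-- external table over files that are already Parquet: no conversion step needed
CREATE EXTERNAL TABLE my_parquet_table (
  Col1 string,
  as_of_date string,
  Col3 string
)
STORED AS PARQUET
LOCATION '/data/my_parquet_files';
Impala can also infer the column list from one of the existing data files with CREATE EXTERNAL TABLE ... LIKE PARQUET '/data/my_parquet_files/some_file.parq' STORED AS PARQUET LOCATION '/data/my_parquet_files'; (the file name here is hypothetical).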
Good luck.
Created 04-24-2018 03:59 AM
Thanks for your prompt response.
Regarding your response to Query1,
a) Can the conversion from TEXT=>PARQUET be done using Impala alone, without using Hive specific SerDe?
b) How can I create datafiles in HDFS directly in parquet format?
Regards.
Created 04-24-2018 04:03 AM
1. You can go ahead and create the external table directly in Impala; I don't see any issue there.
2. Use an external table when multiple client tools want to share centralized data; you have to decide whether the data files of your external table are also going to be read by other programs outside Impala, for example Pig, etc.
3. That totally depends on future requirements, but mostly people prefer timestamps over the string format when it comes to dates (see the sketch below).
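To illustrate the timestamp point, a minimal sketch (hypothetical table and values) of declaring the column as TIMESTAMP so the built-in date functions work without any string parsing:
-- with a TIMESTAMP column, range filters and date truncation need no string parsing
CREATE TABLE events (
  id BIGINT,
  as_of_date TIMESTAMP
) STORED AS PARQUET;

SELECT trunc(as_of_date, 'MM') AS month_start, count(*) AS cnt
FROM events
WHERE as_of_date >= CAST('2018-01-01' AS TIMESTAMP)
GROUP BY trunc(as_of_date, 'MM');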
Note - loading data into Parquet tables is a memory-intensive operation, so you have to keep an eye on it (a sketch of the relevant query options follows).
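On the memory note, a minimal sketch (the values are hypothetical and should be tuned for your cluster) of impala-shell query options one might set around a Parquet insert:
-- cap memory for the insert and control the size of the Parquet files it writes
SET MEM_LIMIT=2g;
SET PARQUET_FILE_SIZE=256m;
INSERT INTO my_internal_table
SELECT Col1, from_unixtime(unix_timestamp(as_of_date, 'dd-MMM-yy'), 'yyyy-MM-dd'), Col3
FROM my_external_table;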
Created 04-24-2018 05:41 AM