Created on 06-25-2019 10:40 AM - edited 09-16-2022 07:28 AM
Hello,
I am trying to add data into my table using the LOAD DATA method, but I was getting an error. To reduce the complexity and understand the concept better I simplified my table, yet I am still getting the error.
I have tried two methods; both throw the same error. There aren't many examples available-
My create query-
create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ',';
Now I tried two methods here -
1. I created a text file id.txt with the data-
2,2009
10,2009
15,2010
34,2010
My LOAD query-
[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010);
When I executed the above query I got-
Query: load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010)
ERROR: AnalysisException: Partition spec does not exist: (year=2010).
I thought it must be throwing the error because id.txt contains the year data, so I tried with a different file, id1.txt-
5
10
20
30
My load query-
[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010);
I am getting the same error-
Query: load data inpath '/user/impala/data/id1.txt' into table tbl_id_part partition(year=2010)
ERROR: AnalysisException: Partition spec does not exist: (year=2010).
My question is: what should my data text file look like? Should the partitioned column values be inside the file or not? I am pretty confused, please guide.
Thanks
Created on 06-25-2019 06:00 PM - edited 06-25-2019 06:01 PM
@punshi,
Looks like you want to load data with dynamic partitions. My suggestion is to load the data into a temp table without partitions and then insert it into another table with partitions:
1. create table tbl_id_part_temp (id int, year int) row format delimited fields terminated by ',';
2. load data inpath '/user/impala/data/id.txt' into table tbl_id_part_temp;
3. create table tbl_id_part (id int) partitioned by (year int) row format delimited fields terminated by ',';
4. insert into tbl_id_part partition (year) select id, year from tbl_id_part_temp;
This is because LOAD DATA does not support dynamic partition loading at the moment:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_load_data.html#load_data
>>> When the LOAD DATA statement operates on a partitioned table, it always operates on one partition at a time. Specify the PARTITION clauses and list all the partition key columns, with a constant value specified for each.
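If you would rather keep using LOAD DATA with a static partition, note that (as far as I know) Impala requires the partition to already exist before the statement runs, which is what the "Partition spec does not exist" error indicates, and the data file should then contain only the non-partition columns. A minimal sketch against your table above:
-- create the partition first; LOAD DATA will not create it for you
alter table tbl_id_part add partition (year=2010);
-- id.txt should contain only the id column, one value per line
load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition (year=2010);
-- verify that the partition now exists and holds the file
show partitions tbl_id_part;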
Cheers
Eric
Created on 06-25-2019 11:15 PM - edited 06-25-2019 11:45 PM
Hi @EricL
The dynamic part is not that important; my concern is how I should upload the data into a partitioned table using LOAD. I tried the regular method of using it directly-
[quickstart.cloudera:21000] > load data inpath '/user/impala/data/id.txt' into table tbl_id_part;
But it didn't work; my text file id.txt contained both the regular column and the partitioned column data.
So I tried the method of specifying the partition value in the query-
load data inpath '/user/impala/data/id.txt' into table tbl_id_part partition(year=2010);
But this is also not working.
So how does the given query work, on what kind of table, and what structure of data file does it need?
LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
I have created a basic partitioned table using the query below-
create table tbl_id_part(id int) partitioned by (year int) row format delimited fields terminated by ',';
I am able to use LOAD with regular tables, but with partitioned ones I am getting errors.
What should the data file that I will load into this table look like?
What should my load statement be so that I can load data into the partitioned table?
Edit-
I have also tried the following Hive example, but I am getting the same error as mentioned above.
https://bigdataprogrammers.com/load-files-into-hive-partitioned-table/
Thanks
Created on 06-26-2019 01:17 AM - edited 06-26-2019 01:22 AM
Hi @EricL
I think the issue here is something different; the above steps are not working, nor is anything else I tried from the internet. I dropped the table and created it again, but it is throwing the same error.
ERROR: AnalysisException: Partition spec does not exist: (year=2009).
My create query-
create table tbl_id(id int) partitioned by(year int);
Here are my table's full details, using desc formatted tbl_part-
+------------------------------+--------------------------------------------------------------+----------------------+
| name                         | type                                                         | comment              |
+------------------------------+--------------------------------------------------------------+----------------------+
| # col_name                   | data_type                                                    | comment              |
|                              | NULL                                                         | NULL                 |
| id                           | int                                                          | NULL                 |
|                              | NULL                                                         | NULL                 |
| # Partition Information      | NULL                                                         | NULL                 |
| # col_name                   | data_type                                                    | comment              |
|                              | NULL                                                         | NULL                 |
| year                         | int                                                          | NULL                 |
|                              | NULL                                                         | NULL                 |
| # Detailed Table Information | NULL                                                         | NULL                 |
| Database:                    | default                                                      | NULL                 |
| Owner:                       | root                                                         | NULL                 |
| CreateTime:                  | Wed Jun 26 13:33:02 IST 2019                                 | NULL                 |
| LastAccessTime:              | UNKNOWN                                                      | NULL                 |
| Protect Mode:                | None                                                         | NULL                 |
| Retention:                   | 0                                                            | NULL                 |
| Location:                    | hdfs://quickstart.cloudera:8020/user/hive/warehouse/tbl_part | NULL                 |
| Table Type:                  | MANAGED_TABLE                                                | NULL                 |
| Table Parameters:            | NULL                                                         | NULL                 |
|                              | transient_lastDdlTime                                        | 1561536182           |
|                              | NULL                                                         | NULL                 |
| # Storage Information        | NULL                                                         | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe           | NULL                 |
| InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                     | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   | NULL                 |
| Compressed:                  | No                                                           | NULL                 |
| Num Buckets:                 | 0                                                            | NULL                 |
| Bucket Columns:              | []                                                           | NULL                 |
| Sort Columns:                | []                                                           | NULL                 |
+------------------------------+--------------------------------------------------------------+----------------------+
I tried with two files; this one doesn't contain the partition column value-
2009.txt
10
20
30
40
This one contains partition column values-
2010.txt
5,2010
1,2010
3,2010
I used the same query to execute, just with different partition values.
I also created the same table by specifying "row format delimited fields terminated by ','".
This is a very simple thing and it should have worked; I don't know whether there is some configuration issue or something else unknown. Do you see anything wrong in the above steps?
Meanwhile, I will try the temp table method to see how it works.
Thanks
Created 06-26-2019 06:36 AM
Hi @EricL
I tried it with a very small table like you showed and it worked, but even with text files which contain roughly 50k values I was getting an error.
This method does not seem suitable for POC purposes or for production.
I tried copying a mere 50k records, which I had already loaded into another table, and it threw a memory error.
[quickstart.cloudera:21000] > insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata;
Query: insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) select * from tbl_motadata
Query submitted at: 2019-06-26 18:43:21 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=584e2c3facce84f0:a95b21f300000000
WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000
Memory left in process limit: -4.25 GB
Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
  Fragment 584e2c3facce84f0:a95b21f300000000: Reservation=0 OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
    HDFS_SCAN_NODE (id=0): Total=22.82 MB Peak=30.82 MB
    HdfsTableSink: Total=3.94 GB Peak=3.94 GB
    CodeGen: Total=6.60 KB Peak=690.00 KB
Process: memory limit exceeded. Limit=3.00 GB Total=7.25 GB Peak=7.40 GB
  Buffer Pool: Free Buffers: Total=0
  Buffer Pool: Clean Pages: Total=0
  Buffer Pool: Unused Reservation: Total=0
  RequestPool=root.root: Total=3.96 GB Peak=4.12 GB
    Query(584e2c3facce84f0:a95b21f300000000): Reservation=0 ReservationLimit=2.40 GB OtherMemory=3.96 GB Total=3.96 GB Peak=3.96 GB
  RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
  Untracked Memory: Total=3.29 GB
WARNING: The following tables are missing relevant table and/or column statistics.
default.tbl_motadata
Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 584e2c3facce84f0:a95b21f300000000 (1 of 2 similar)
The same doesn't throw an error for non-partitioned tables.
Yesterday I copied some 500 million records using "INSERT INTO table1 SELECT * FROM table2" on the same machine; though it took 4 hours, the task completed. But here I am getting an error for copying a mere 50k records.
I am actually testing Impala performance for our production system needs, and I believe we will need to tune it extensively for better performance. Even SELECT queries are taking too much time: one query took 46 minutes to execute, while the same query in ClickHouse took only 2 minutes on the same hardware configuration.
We were actually expecting results in a few seconds, as we thought Impala should be way faster than ClickHouse.
I don't know why "INSERT ... INTO [PARTITION] SELECT *" throws an error but without the partition it works fine.
Is there any other method to check?
I am actually testing against 500 million records.
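Edit- The WARNING in the output above mentions missing table/column statistics, so before rerunning I will also try computing stats and adding a shuffle hint to the partitioned insert. From what I have read this should reduce the per-node memory used by the HdfsTableSink, though I have not verified it yet-
-- gather stats so the planner can size the insert properly
compute stats tbl_motadata;
-- the [shuffle] hint repartitions rows so each node writes fewer partitions at once
insert into tbl_raw_part_v11 partition(source_ip,destination_ip,year,event_date) [shuffle] select * from tbl_motadata;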
Thanks
Created 06-26-2019 08:26 AM
Hello,
Finally, I was able to add values into the partitioned table after much reading into the concepts and trial & error.
However, the LOAD file method is still not working. I also came to know, while doing this, that when we specify a partition value in the query it is actually treated as a column.
insert into tbl_raw_v12 partition(source_ip="192.168.1.10",destination_ip="172.16.8.177",year,event_date) select * from tbl_123;
I managed to add the 50k values; tbl_raw_v12 contains 29 columns and tbl_123 contains 27 columns, and the remaining two column values are added through the insert query at runtime.
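As I understand it now (this is my own takeaway, so treat it as such): the static partition keys get constants in the PARTITION clause and must not appear in the select list, while the dynamic keys must come last in the select list, in partition order. A simplified sketch with made-up table names-
create table t_part (id int) partitioned by (region string, year int);
create table t_src (id int, year int);
-- region is static (constant, not selected); year is dynamic (last column in the select list)
insert into t_part partition (region="asia", year) select id, year from t_src;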
I will try with more values tomorrow to check performance.
Thanks