I am trying to execute multiple insert queries into a table which looks like the next one:
create external table test1(c1 int, c2 int) CLUSTERED BY(c1) SORTED BY(c1) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/companies/';
Although I noticed that Hive is overriding the existing data on each insert(). I am running Hive v.1.2.1000 with HDP-188.8.131.52
I already found this similar issue and I was wondering if someone has any suggestions or solution for this one
I could see the work around available in the link which you have mentioned. Anyways let me add few points on top of it.
Create a work table
Perform union between target and incremental data and insert into the newly created work table
Assuming that you are using only external table --> Drop the work table. re-create the target table pointing to the work table location so that you can avoid re-loading the target from the work table.
Hope it helps!
Hello @Bala Vignesh N V thank you for your reply and my excuses for the insufficient information but I missed out to mention that in our case we have this behavior on HDFS not S3. Also I really want to avoid such a union because our inserts are already invoking huge select with join so this work around would exhaust our cluster resources and would make the ETL process really slow. Are you aware of such a bug? or it could be a configuration issue as well?
@Alexandros Biratsis i believe that you are not using Insert Overwrite when inserting the incremental records into the target. Assuming that its weird how the data is being overridden.
For the union part --> If you wanted to avoid union then you may have to may have to perform left join between the incremental data and target to apply some transformations ( assuming that you are performing SCD type 1). If you wanted to just append the data then insert the incremental data through multiple queries into the target one by one. By if you are inserting the data multiple times then the no of jobs will be more which be more or less equal to performing union over it. Sorry for the late reply.
Hello @Bala Vignesh N V I am sorry for the late reply although I was too busy lately and not following the forum. So finally we solved this by making the table internal, keeping the TextFile format and storing data under default Hive directory. The definition of table look like this at the moment:
create table test1(c1 int, c2 int) CLUSTERED BY(c1) SORTED BY(c1) INTO 4 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
So every time I was specifying the LOCATION Hive would overwrite the data but when changing to internal everything worked fine under the /apps/hive/warehouse/ directory