Created on 02-27-2018 01:22 PM - edited 09-16-2022 05:54 AM
I am denormalizing a big table by joining several tables together. The total number of rows I am populating is 2B+ and the whole process took 3 hours. Although my cluster is relatively small (6 nodes with 48 virtual CPUs each), I still expected much better performance. The version is Impala 2.6.0-cdh5.7.1 (CDH 5.7.1).
Query info is as follows:
Created 03-06-2018 07:09 AM
Are you still looking for a solution? Could you let me know?
Created 03-06-2018 07:11 AM
Yes, I am. Actually, I found that even when I insert into an Impala table from another Impala table with INSERT ... SELECT, it is very slow too. I also compared the data volume of the underlying Parquet files written by Impala and by Spark; sometimes the Spark-written Parquet file is much smaller.
Created 03-06-2018 07:19 AM
Could you share the query explain plan?
Did you compute stats on the table?
What is the file format?
Is there any compression?
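For reference, computing statistics in impala-shell is a one-statement job; a sketch using the source table named later in this thread:

```sql
-- Compute table and column statistics so the planner has cardinality info
COMPUTE STATS zinc_bda_poc.zc_inst_snappy;

-- Verify the stats were recorded
SHOW TABLE STATS zinc_bda_poc.zc_inst_snappy;
SHOW COLUMN STATS zinc_bda_poc.zc_inst_snappy;
```

Without stats, the explain plan will report "table stats: unavailable" and the planner cannot size the insert properly.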
Created 03-06-2018 08:59 AM
I tried a simpler insert statement to make sure that the bottleneck is the Parquet writing. Here are the explain plan and query info.
Yes, I have the default Snappy compression. I tried gzip compression by setting COMPRESSION_CODEC=gzip. It improved, but only marginally.
INSERT INTO zc_inst SELECT * FROM zc_inst_snappy
----------------
Estimated Per-Host Requirements: Memory=5.69GB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
zinc_bda_poc.zc_inst_snappy

WRITE TO HDFS [zinc_bda_poc.zc_inst, OVERWRITE=false]
|  partitions=1
|  hosts=6 per-host-mem=1.00GB
|
00:SCAN HDFS [zinc_bda_poc.zc_inst_snappy, RANDOM]
   partitions=1/1 files=562 size=50.37GB
   table stats: unavailable
   column stats: unavailable
   hosts=6 per-host-mem=4.69GB
   tuple-ids=0 row-size=3.40KB cardinality=unavailable

Query Info
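For completeness, the gzip experiment described above can be reproduced per-session in impala-shell; this is a sketch (fully-qualified table names taken from the plan, and Snappy assumed as the default codec to restore):

```sql
-- Switch the Parquet codec for this session; gzip trades CPU for smaller files
SET COMPRESSION_CODEC=gzip;

INSERT INTO zinc_bda_poc.zc_inst SELECT * FROM zinc_bda_poc.zc_inst_snappy;

-- Restore the default codec afterwards
SET COMPRESSION_CODEC=snappy;
```

Query options set this way apply only to the current session, so the table's other writers are unaffected.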