Member since: 06-23-2016
Posts: 13
Kudos Received: 2
Solutions: 0
09-07-2017 08:32 PM
@Eugene Koifman That helped reduce the spilled records from 11 billion to 5 billion. I was under the impression that inserting data into a partition is faster with a DISTRIBUTE BY, so this was useful. I also heard that compressing the intermediate files helps reduce spilled records. Is that correct?

set mapreduce.map.output.compress=true
set mapreduce.output.fileoutputformat.compress=true

Is there anything else we can do to optimize the query?
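For reference, a minimal sketch of how these settings could be applied ahead of the insert, collected into a script and run through beeline. The JDBC URL is a placeholder, the Snappy codec is an assumed choice, and hive.exec.compress.intermediate is the Hive-level counterpart of the MapReduce flags:

cat > /tmp/compress.hql <<'HQL'
-- Hive-level flag for compressing intermediate files between stages
set hive.exec.compress.intermediate=true;
-- compress map-side spill/shuffle output (codec below is an assumed choice)
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- compresses the job's final output files; this one does not reduce spills
set mapreduce.output.fileoutputformat.compress=true;
-- the INSERT being tuned would follow here in the same script
HQL
beeline -u "$HIVE_JDBC_URL" -f /tmp/compress.hql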
09-06-2017 09:51 PM
Currently, we are running a Hive job that inserts around 2 billion rows into an ACID table that is partitioned and clustered. I see a huge number of SPILLED_RECORDS and I'm not exactly sure how to fix or improve this. My understanding is that the more records spilled, the higher the I/O and processing times. Any input is appreciated. Some Tez stats of the job:

org.apache.tez.common.counters.TaskCounter
  REDUCE_INPUT_GROUPS               1653127325
  SPILLED_RECORDS                  11490401485
  PHYSICAL_MEMORY_BYTES          6732617089024
  VIRTUAL_MEMORY_BYTES          13973924044800
  COMMITTED_HEAP_BYTES           6732617089024
  ADDITIONAL_SPILLS_BYTES_WRITTEN 572880403808
  ADDITIONAL_SPILLS_BYTES_READ   1540736899809
  ADDITIONAL_SPILL_COUNT                  6965

HIVE
  RECORDS_IN_Map_1                  1941777885

TaskCounter_Map_1_OUTPUT_Reducer_2
  SPILLED_RECORDS                   3739831692

TaskCounter_Reducer_2_INPUT_Map_1
  SPILLED_RECORDS                   1941777885

TaskCounter_Reducer_2_OUTPUT_Reducer_3
  SPILLED_RECORDS                   3867014023
  ADDITIONAL_SPILLS_BYTES_READ    387364743478
  ADDITIONAL_SPILLS_BYTES_WRITTEN 320256756650

TaskCounter_Reducer_3_INPUT_Reducer_2
  ADDITIONAL_SPILLS_BYTES_WRITTEN  11229906959
  SPILLED_RECORDS                   1941777885
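For context, a hedged sketch of the general shape of such an insert (table and column names here are hypothetical, not the actual job). hive.optimize.sort.dynamic.partition sorts rows by partition key so each reducer writes whole partitions, and DISTRIBUTE BY keeps rows for the same partition together, both of which tend to cut re-sorting and spill volume:

cat > /tmp/acid_insert.hql <<'HQL'
-- route each dynamic partition's rows through a sorted reducer path
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hypothetical table/columns standing in for the real 2-billion-row insert
insert into table target_acid partition (event_date)
select id, payload, event_date
from staging_raw
distribute by event_date;
HQL
beeline -u "$HIVE_JDBC_URL" -f /tmp/acid_insert.hql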
Labels:
- Apache Hive
- Apache Tez
08-29-2017 07:01 PM
@Dinesh Chitlangia I am aware of DistCp. As described in the question, I have to parse each row and do reverse geocoding using Solr, after which each record is enriched with geolocation information. I want to write the updated flow files directly into the other cluster rather than writing them into the HDF cluster and then running DistCp; I'm trying to avoid that unnecessary overhead.
08-29-2017 05:03 PM
I want to read a CSV file containing lat/long data through NiFi, with each record hitting a SolrCloud instance for reverse geocoding. After that, the enriched data needs to be loaded into a different HDP cluster for further Hive processing. I understand I can load the data directly into Hive tables using NiFi, which is also an option. However, I'm not sure how to load data directly into HDFS on a different cluster. Can anyone point me in the right direction? Any documents or blogs I can refer to?
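One hedged way to sketch this flow (processor choices and paths here are assumptions, not a verified design): PutHDFS can be pointed at the other cluster simply by handing it that cluster's client configs instead of the local ones.

GetFile (read the CSV)
  -> SplitText (one record per flow file)
  -> InvokeHTTP (call the Solr reverse-geocoding endpoint per record)
  -> MergeContent (re-batch the enriched records)
  -> PutHDFS, configured with:
       Hadoop Configuration Resources = /etc/remote-hdp-conf/core-site.xml,/etc/remote-hdp-conf/hdfs-site.xml
         (core-site.xml and hdfs-site.xml copied over from the target HDP cluster)
       Directory = /data/geo_enriched   (hypothetical target path)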
Labels:
- Apache NiFi
08-18-2017 02:29 PM
@Sridhar Reddy Can you please share the link?
08-17-2017 11:36 PM
Currently, I have an HDP 2.6.1 cluster with Ranger (LDAP). I disabled the Hive plugin, deleted the Hive service from the Ranger UI, and reset the Ranger admin password by following the instructions here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_security/content/updating_ranger_admin_passwords.html. I updated the Ranger UI admin password first, then set the same password for the following two values:
- Ranger Admin user: the credentials for this user are set in Configs > Advanced ranger-env, in the fields labeled admin_username (default value: admin) and admin_password (default value: admin).
- Admin user used by Ambari to create repo/policies: the user name for this user is set in Configs > Admin Settings, in the field labeled "Ranger Admin username for Ambari" (default value: amb_ranger_admin). The password for this user is set in the field labeled "Ranger Admin user's password for Ambari"; this password is specified during the Ranger installation.

Now, when I enable the Hive plugin and restart the Hive services, I get the following error:

2017-08-17 18:51:18,877 - Error getting HDP_Cluster2_hive repository for component hive. Http status code - 401.
{"statusCode":401,"msgDesc":"Authentication Failed"}
2017-08-17 18:51:53,664 - Error creating repository. Http status code - 401.
{"statusCode":401,"msgDesc":"Authentication Failed"}
2017-08-17 18:52:58,553 - Error getting HDP_Cluster2_hive repository for component hive. Http status code - 401.
{"statusCode":401,"msgDesc":"Authentication Failed"}
2017-08-17 18:53:33,505 - Error creating repository. Http status code - 401.
{"statusCode":401,"msgDesc":"Authentication Failed"}
2017-08-17 18:53:33,505 - Hive Repository creation failed in Ranger admin

Log file:

2017-08-17 18:44:05,039 - call returned (0, 'hive-server2 - 2.6.1.0-129')
2017-08-17 18:44:05,039 - RangeradminV2: Skip ranger admin if it's down !
2017-08-17 18:44:05,621 - amb_ranger_admin user already exists.
2017-08-17 18:44:06,160 - Will retry 4 time(s), caught exception: Error getting HDP_Cluster2_hive repository for component hive. Http status code - 401.
{"statusCode":401,"msgDesc":"Authentication Failed"}. Sleeping for 8 sec(s)
2017-08-17 18:44:14,741 - Will retry 3 time(s), caught exception: Error getting HDP_Cluster2_hive repository for component hive. Http status code - 401.
{"statusCode":401,"msgDesc":"Authentication Failed"}. Sleeping for 8 sec(s)

If I understand correctly, I need to update the Ranger plugin properties in Hive, but I'm not sure which one. Please let me know how I should go about fixing this. Thanks.
Labels:
- Apache Ranger
07-10-2017 04:14 PM
@Daniel Kozlowski We have Ranger (LDAP) for authorization and also LDAP/AD for Hive authentication, so the "-n admin" option doesn't work.
07-08-2017 05:03 AM
Yes, I tried both approaches, with double quotes and without.
07-07-2017 10:42 PM
When I try to connect to Hive directly using

beeline -u "jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2" -n username

the CLI throws the following error:

Connecting to jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/
17/07/07 18:11:35 [main]: WARN jdbc.HiveConnection: Failed to connect to node03.comp.net:2181
Error: Could not open client transport with JDBC Uri: jdbc:hive2://node03.comp.net:2181/: null (state=08S01,code=0)
Beeline version 1.2.1000.2.6.1.0-129 by Apache Hive

But when I just run beeline, press Enter, and use the same JDBC URL, I am able to connect:

!connect jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Connecting to jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2
Enter username for jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2: username
Enter password for jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2: *********
Connected to: Apache Hive (version 2.1.0.2.6.1.0-129)
Driver: Hive JDBC (version 1.2.1000.2.6.1.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ

Any idea why this is happening and how to fix it? A bit about our environment: we have Ranger (LDAP) for authorization and also LDAP/AD for Hive authentication. I plan to use commands like the one above to launch HQL scripts with parameters, for example:

beeline -u jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2 -n xxxx -w pfile -f /home/hdfs/hive_llap/script/test.hql --hivevar db=$dbname

Any help is appreciated, thanks.
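One detail worth sketching out: run unquoted, the shell treats each ';' in that URL as a command separator, so beeline only ever sees the URL up to the first one. A quoted form of the planned command (all values taken from the post above):

beeline -u "jdbc:hive2://node03.comp.net:2181,node02.comp.net:2181,node01.comp.net:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-hive2" \
  -n xxxx -w pfile -f /home/hdfs/hive_llap/script/test.hql --hivevar db="$dbname"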
Labels:
- Apache Hive