Member since 04-11-2016
174 Posts
29 Kudos Received
6 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3339 | 06-28-2017 12:24 PM
 | 2519 | 06-09-2017 07:20 AM
 | 7021 | 08-18-2016 11:39 AM
 | 5134 | 08-12-2016 09:05 AM
 | 5294 | 08-09-2016 09:24 AM
06-09-2016
03:38 PM
"Have you set JAVA_HOME correctly? Ambari by default installs Java at /usr/jdk64" Yes, JDK 1.8 exists at /usr/jdk64, but I assumed that Ambari sets JAVA_HOME, because if one selects the 'Custom JDK' option during ambari-server setup, it prompts for JAVA_HOME. I am just wondering how Ambari accesses Java.
"However, not sure if you have other dependencies on the internet" If the repositories and the JDK are now available locally, will Ambari still try to access the Internet? Can you elaborate on the other dependencies?
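One way to check which JDK the Ambari server is actually configured with (a quick sketch; the JDK directory name is an assumption based on the jdk-8u60 tarball referenced in this thread):

# Ambari records its configured JDK in ambari.properties; 'java.home'
# points at the directory Ambari uses for its own processes.
grep java.home /etc/ambari-server/conf/ambari.properties

# 'java -version' fails on the shell because Ambari does not add its JDK
# to the system PATH; invoke the binary directly (path is an assumption):
/usr/jdk64/jdk1.8.0_60/bin/java -version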
06-09-2016
02:49 PM
1 Kudo
Machines: 4 datanodes + 2 masters (HA) + 1 management node = 7 machines. Target: install Ambari 2.2 and use it to install HDP 2.4 (the automated install). The Ambari and HDP repositories are available locally over HTTP (the tarballs were extracted on the management node). The Ambari server is already running on the management machine, and HDP 2.4 now has to be installed. Questions:
- To avoid installing the JDK manually on the management machine (and the others too), Internet access to http://public-repo-1.hortonworks.com/ was enabled on all the machines for just a day. I set 'export http_proxy' and ran the Ambari server setup, which internally fetched the Oracle JDK 8. Somehow, 'java -version' still doesn't work; does Ambari really install the JDK?
- The Ambari agents will be installed automatically during the cluster install later, but by then there will be NO Internet connection. How is Java (the JDK) installed on the other nodes then? Does Ambari push /var/lib/ambari-server/resources/jdk-8u60-linux-x64.tar.gz to all the nodes (after all, Hadoop requires Java)?
- Is it safe to remove the Internet access now?
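For reference, a minimal sketch of the setup steps described above (the proxy host is hypothetical and the JDK path is an assumption):

# route outbound HTTP through the temporary proxy (hypothetical host/port)
export http_proxy=http://proxy.example.com:3128

# interactive setup; choosing the Oracle JDK 1.8 option makes ambari-server
# fetch the JDK tarball into /var/lib/ambari-server/resources and extract
# it under /usr/jdk64 on this host
ambari-server setup

# alternative: point Ambari at an already-installed JDK (path is an assumption)
# ambari-server setup -j /usr/jdk64/jdk1.8.0_60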
Labels:
- Apache Ambari
06-07-2016
04:11 PM
I think that doc addresses only the Ambari installation. Will the HDP installation be hampered for a non-root user? Can services like HDFS, YARN, and Hive run smoothly if the installation was done via Ambari running as non-root?
06-07-2016
01:29 PM
1 Kudo
Earlier, on the test machines, I had installed HDP 2.2 using Ambari; I had root credentials as well as Internet access, and the cluster and the services functioned properly. Now, on the prod machines (4 datanodes + 2 masters (HA) + 1 management node = 7 machines), each machine can be allowed access only to specific sites. Does this qualify as the 'Temporary Access to Internet' case in the Hortonworks doc? Is it possible to provide a complete list of the URLs that need to be accessible for the Ambari + HDP install?
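Before starting, one can verify that the whitelisted sites are actually reachable from each machine (a sketch; only the base URL is shown, not a complete list):

# a 2xx/3xx response confirms the firewall rule lets HTTP through
curl -I http://public-repo-1.hortonworks.com/

# if a proxy is mandatory, export it first (hypothetical host/port)
# export http_proxy=http://proxy.example.com:3128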
06-07-2016
12:44 PM
Earlier, on the test machines, I had installed HDP 2.2 using Ambari; I had root credentials as well as Internet access, and the cluster and the services functioned properly. Now, on the prod machines (4 datanodes + 2 masters (HA) + 1 management node = 7 machines), I wish to:
- Install Ambari 2.2 on a management node; the Ambari agents will be installed automatically (password-less SSH)
- Log in to the Ambari management console and install the HDP stack
The first challenge is that I no longer have root credentials for any of the machines; I can log in using my Linux account (connected to an LDAP) and install from there. As of now, the Internet access situation is unclear. I read several threads like this and this, but I am unsure whether I can proceed without root access. I suspect that a non-root installation will run into issues later, either at the Ambari or the HDP level (or both!). I don't have the liberty of trying out approaches 😞, so I need to get everything right until the cluster is installed. How shall I start?
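For the non-root route, Ambari requires passwordless sudo for a specific set of commands for the install account. A hypothetical sudoers fragment to illustrate the shape of it (the account name and command list are assumptions; the authoritative list is in the Hortonworks docs for your Ambari version):

# /etc/sudoers.d/ambari -- illustrative only; consult the Ambari docs for
# the exact command list a non-root agent needs on your version.
# 'ambari' is a hypothetical service account present on every node.
ambari ALL=(ALL) NOPASSWD:SETENV: /usr/bin/yum, /usr/bin/rpm, /usr/bin/ambari-agent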
Labels:
- Apache Ambari
06-01-2016
01:23 PM
I have read in several threads about the care to be exercised when using ext4 (noatime, etc.), but is there a concise guide or doc that can be used?
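In the meantime, the commonly cited ext4 setup for HDFS data disks looks roughly like this (a sketch; device names and mount points are assumptions):

# format a data disk with ext4; -m 1 keeps only 1% reserved blocks,
# since the default 5% is wasteful on a large non-OS disk
mkfs.ext4 -m 1 /dev/sdb1

# /etc/fstab entry: noatime avoids an inode write on every file read
# (noatime implies nodiratime on current kernels)
/dev/sdb1  /grid/0  ext4  defaults,noatime  0 0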
06-01-2016
12:41 PM
I suspected that the file-system doc is merely carried forward from previous versions; I hope Hortonworks invests some resources in updating it 🙂 The LVM part, I guess, is clear: use it for OS partitions but NOT for the datanodes, am I right? Can you help me understand more about your inputs:
"XFS is perfectly fine here, so you can let RHEL use the default. However, note that XFS filesystems cannot be shrunk, whereas with LVM + ext4, filesystems can be expanded and shrunk while online. This is a big gap for XFS." So what should I proceed with: ext4 everywhere, OR xfs everywhere, OR both (xfs for datanodes etc. and ext4 for OS partitions, or vice versa)?
"so moving this logging to one of the data disks may be necessary" Which is the better idea: a large, dedicated disk for the OS partition (adding more and resizing with LVM if required) so that logs, binaries, etc. have plenty of space, OR redirecting logs (YARN etc.) during the HDP installation itself to directories on the disks dedicated to the datanode? For example, this is how it is in the test cluster :
06-01-2016
10:42 AM
Following is the planned infrastructure for the prod cluster. Initially, 4 data/compute nodes, each with 2x12 cores, 256 GB RAM, and 24x2TB disks (plus 2x300 GB for Linux), and 3 name/admin nodes (with far fewer disks, configured as RAID 1). Later, 4-5 datanodes will be added. All nodes will run RHEL 7. We will be proceeding with the latest HDP 2.4 installation via Ambari. The HDP documentation has the following statements: "The ext4 file system may have potential data loss issues with default options because of the 'delayed writes' feature. XFS reportedly also has some data loss issues upon power failure. Do not use LVM; it adds latency and causes a bottleneck." I read several existing threads and docs, but I still don't have a clear understanding of what suits the latest editions of HDP and RHEL: ext4-vs-xfs-filesystem-survey-of-popularity, best-practices-linux-file-systems-for-hdfs, any-recommendation-on-how-to-partition-disk-space-1, and @Benjamin Leonhardi's insightful recommendation. Following are the possibilities:
- Have ext3 on all the partitions on all the nodes
- Have ext4 on all the partitions on all the nodes
- Have xfs (the default file system for RHEL) on all the partitions on all the nodes
- Have xfs on the boot disks (and for all disks on the head/management nodes), e.g. /boot, /var, /usr, etc., but use ext3/ext4 on the data disks (which are anyhow "special" compared to our normal install images) just to minimize risk, sticking to the proposed standard practices as much as possible
- Use LVM for ALL the volumes/partitions, OR selectively (for /var, /usr, etc. but NOT for the datanode and log directories; see the sketch below), OR not at all
Any suggestions/recommendations/further reading (suited to the latest HDP 2.4 and RHEL 7 environment)?
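To make the selective-LVM option concrete, a minimal sketch (device names, volume-group name, and sizes are assumptions):

# LVM only for the OS volumes -- allows online resize of /var, /usr, etc.
pvcreate /dev/sda2
vgcreate vg_os /dev/sda2
lvcreate -n var -L 100G vg_os
mkfs.xfs /dev/vg_os/var              # xfs on the OS partitions

# data disks stay plain partitions -- no LVM -- mounted directly for HDFS
mkfs.ext4 -m 1 /dev/sdb1
mount -o noatime /dev/sdb1 /grid/0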
Labels:
- Hortonworks Data Platform (HDP)
05-30-2016
08:57 AM
@mlanciaux As mentioned in the original query, DimSnapshot.snapshot_id is the PK of the DimSnapshot table, so its count equals the number of records in DimSnapshot, which is around 8 million. I did the following:
CREATE TABLE factsamplevalue_snapshot AS SELECT snapshot_id, COUNT(*) FROM factsamplevalue GROUP BY snapshot_id;
This resulted in a table with 7914806 rows. Sample data:
select * from factsamplevalue_snapshot limit 10;
OK
factsamplevalue_snapshot.snapshot_id factsamplevalue_snapshot._c1
643438 2170
643445 2023
643924 3646
644063 2448
644153 2837
644459 848
644460 3713
644541 2080
645243 725
645599 852
Unfortunately, the histogram would return a huge number of entries, so I cannot paste or provide the full output.
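One way to summarize it without pasting millions of rows is to bucket the per-snapshot counts (a sketch; the bucket width of 500 is arbitrary, and `_c1` is the auto-generated count column shown above):

# collapse ~7.9M per-snapshot counts into a compact histogram:
# bucket_start = rows-per-snapshot rounded down to a multiple of 500
hive -e '
SELECT FLOOR(`_c1` / 500) * 500 AS bucket_start,
       COUNT(*)                 AS num_snapshots
FROM   factsamplevalue_snapshot
GROUP  BY FLOOR(`_c1` / 500) * 500
ORDER  BY bucket_start;'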
05-26-2016
02:04 PM
Stack: HDP-2.3.2.0-2950 installed using Ambari 2.1. Nodes: 1 NN (8 x 1TB HDD, 16 x 2.53 GHz cores, 48GB RAM, RHEL 6.5) + 8 DN (8 x 600GB HDD, 16 x 2.53 GHz cores, 75GB RAM, RHEL 6.5), connected by a 10-gig network. I have a staging/vanilla/simple Hive table with 24 billion records. I created an empty ORC table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS FactSampleValue (
`Snapshot_Id` int
/*OTHER COLUMNS*/
)
PARTITIONED BY (`SmapiName_ver` varchar(30))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC LOCATION '/datastore/';
Some settings (as entered in the Hive CLI):
set optimize.sort.dynamic.partitioning=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=3000;
set hive.enforce.sorting=true;
Then I executed an insert:
INSERT INTO odp_dw_may2016_orc.FactSampleValue PARTITION (SmapiName_ver) SELECT * FROM odp_dw_may2016.FactSampleValue DISTRIBUTE BY SmapiName_ver SORT BY SmapiName_ver;
Query ID = hive_20160526125733_8834c7bc-b4f3-4539-8d48-fa46bba92a33
Total jobs = 1
Launching Job 1 out of 1
******REDUCERS NOT STARTING
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 RUNNING 3098 0 110 2988 0 0
Reducer 2 INITED 1009 0 0 1009 0 0
--------------------------------------------------------------------------------
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 12.70 s
--------------------------------------------------------------------------------
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
After a long time, the mappers completed but the reducers failed:
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 ........ RUNNING 3098 2655 94 349 0 0
Reducer 2 RUNNING 1009 45 110 854 49 91
--------------------------------------------------------------------------------
VERTICES: 01/02 [=================>>---------] 65% ELAPSED TIME: 8804.16 s
--------------------------------------------------------------------------------
As seen above, a FEW mappers started again; I guess it's a reattempt. Again some failures; the latest:
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 ........ RUNNING 3098 2773 113 212 0 8
Reducer 2 RUNNING 1009 45 110 854 57 119
--------------------------------------------------------------------------------
VERTICES: 01/02 [=================>>---------] 68% ELAPSED TIME: 10879.73 s
--------------------------------------------------------------------------------
I suspect memory or related issues, but I don't know which logs I should check. For example, under log/application_1446726117927_0092, I found several containers, and many of them had the following error in syslog_attempt_1446726117927_0092_1_01_000041_1:
2016-05-26 15:45:11,932 [WARN] [TezTaskEventRouter{attempt_1446726117927_0092_1_01_000041_1}] |orderedgrouped.ShuffleScheduler|: Map_1: Duplicate fetch of input no longer needs to be fetched: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=709], attemptNumber=1, pathComponent=attempt_1446726117927_0092_1_00_000709_1_10012, spillType=0, spillId=-1]
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2016-05-26 15:45:24,253 [INFO] [main] |task.TezTaskRunner|: Interrupted while waiting for task to complete. Interrupting task
2016-05-26 15:45:24,254 [INFO] [main] |task.TezTaskRunner|: Shutdown requested... returning
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Got a shouldDie notification via heartbeats for container container_1446726117927_0092_01_000187. Shutting down
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutdown invoked for container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutting down container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,255 [ERROR] [TezChild] |tez.ReduceRecordProcessor|: Hit error while closing operators - failing tree
2016-05-26 15:45:24,256 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-05-26 15:45:24,257 [INFO] [TezChild] |task.TezTaskRunner|: Encounted an error while executing task: attempt_1446726117927_0092_1_01_000041_1
java.lang.RuntimeException: java.lang.InterruptedException
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
... 14 more
2016-05-26 15:45:24,260 [INFO] [TezChild] |task.TezTaskRunner|: Ignoring the following exception since a previous exception is already registered
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for attempt_1446726117927_0092_1_01_000041_1: Counters: 71 [[File System Counters FILE_BYTES_READ=290688, FILE_BYTES_WRITTEN=227062571, FILE_READ_OPS=0, FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=0, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=53, GC_TIME_MILLIS=11162, CPU_MILLISECONDS=62510, PHYSICAL_MEMORY_BYTES=664797184, VIRTUAL_MEMORY_BYTES=2432901120, COMMITTED_HEAP_BYTES=664797184, OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, ADDITIONAL_SPILLS_BYTES_READ=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=39880956, NUM_MEM_TO_DISK_MERGES=2, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=0, MERGE_PHASE_TIME=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569][Shuffle Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1 ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569, MERGED_MAP_OUTPUTS=53, MERGE_PHASE_TIME=0, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=2, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_DISK_DIRECT=39880956, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_PHASE_TIME=0, SPILLED_RECORDS=0][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 OUTPUT_RECORDS=0]]
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
2016-05-26 15:45:24,276 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Closed processor for vertex=Reducer 2, index=1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting down Shuffle for source: Map_1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Already shutdown. Ignoring error
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: numDmeEventsSeen=3480, numDmeEventsSeenWithNoData=3395, numObsoletionEventsSeen=443, updateOnClose
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #1, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #2, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #3, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #4, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #5, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #6, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #7, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #8, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #9, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #10, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #11, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #12, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #13, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #14, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #15, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #16, status:false, isInterrupted:false
2016-05-26 15:45:24,291 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #17, status:false, isInterrupted:false
2016-05-26 15:45:24,302 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #18, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #19, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #20, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #21, status:false, isInterrupted:false
2016-05-26 15:45:24,318 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #22, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #23, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #24, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #25, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #26, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #27, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #28, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #29, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #30, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.MergeManager|: finalMerge called with 8 in-memory map-outputs and 14 on-disk map-outputs
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Merging 8 sorted segments
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Down to the last merge-pass, with 8 segments left of total size: 376486161 bytes
Which logs and what errors should I look for? How can Ambari help?
**********EDIT-1**********
The Hive query finally failed with the following error:
Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1446726117927_0092_1_01, diagnostics=[Task failed, taskId=task_1446726117927_0092_1_01_000066, diagnostics=[TaskAttempt 0 failed, info=[Container container_1446726117927_0092_01_000036 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000036
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 1 failed, info=[Container container_1446726117927_0092_01_000199 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000199
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 2 failed, info=[Error: Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:133)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
at java.io.Writer.write(Writer.java:157)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:48)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:218)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:156)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 3 failed, info=[Container container_1446726117927_0092_01_000299 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000299
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:944, Vertex vertex_1446726117927_0092_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:23, Vertex vertex_1446726117927_0092_1_00 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
As I suspected, there is a memory issue, BUT IT IS OCCURRING ON THE SECOND ATTEMPT; the FIRST one failed for an unknown reason:
Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
The question is: which parameters need to be changed?
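For what it's worth, the knobs usually adjusted for reducer heap-space failures in Hive-on-Tez are the container size and its JVM heap (a sketch; the values are assumptions to illustrate the ratios, not recommendations for this cluster):

# bigger Tez task containers with a matching heap (~80% of the container),
# then re-run the same INSERT
hive -e '
set hive.tez.container.size=4096;    -- MB per Tez task container
set hive.tez.java.opts=-Xmx3276m;    -- JVM heap, ~80% of the container
set tez.runtime.io.sort.mb=1024;     -- shuffle/sort buffer within the heap
INSERT INTO odp_dw_may2016_orc.FactSampleValue PARTITION (SmapiName_ver)
SELECT * FROM odp_dw_may2016.FactSampleValue
DISTRIBUTE BY SmapiName_ver SORT BY SmapiName_ver;'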
Labels:
- Apache Hadoop
- Apache Hive
- Apache Tez