Member since 04-11-2016
174 Posts
29 Kudos Received
6 Solutions
My Accepted Solutions

Views | Posted
---|---
3462 | 06-28-2017 12:24 PM
2629 | 06-09-2017 07:20 AM
7242 | 08-18-2016 11:39 AM
5469 | 08-12-2016 09:05 AM
5618 | 08-09-2016 09:24 AM
06-09-2016 03:38 PM
> Have you set JAVA_HOME correctly? Ambari by default installs Java at /usr/jdk64.

Yes, JDK 1.8 exists at /usr/jdk64, but I assumed that Ambari sets JAVA_HOME, because if one selects the 'Custom JDK' option during Ambari server setup, it prompts for JAVA_HOME. I am just wondering how Ambari accesses Java.
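One way to see which JDK the Ambari server itself is using is to read its configuration file. As far as I know, `ambari-server setup` records the selected JDK as the `java.home` property in `/etc/ambari-server/conf/ambari.properties`; treat both the path and the key as assumptions to verify on your own install. A minimal reader for that file:

```python
from __future__ import annotations

from pathlib import Path

# Usual location of the Ambari server configuration (an assumption; adjust if needed).
AMBARI_PROPERTIES = Path("/etc/ambari-server/conf/ambari.properties")


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def ambari_java_home(text: str) -> str | None:
    """Return the JDK path recorded as java.home, or None if it is absent."""
    return parse_properties(text).get("java.home")


if __name__ == "__main__":
    if AMBARI_PROPERTIES.exists():
        print("java.home =", ambari_java_home(AMBARI_PROPERTIES.read_text()))
    else:
        print(f"{AMBARI_PROPERTIES} not found on this host")
```

If `java.home` points under /usr/jdk64, Ambari is using its own JDK regardless of what the shell's JAVA_HOME says.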
> However, not sure if you have other dependencies on the Internet.

If the repositories and the JDK are now available locally, will Ambari still try to access the Internet? Can you elaborate on the other dependencies?
06-09-2016 02:49 PM
1 Kudo
Machines: 4 datanodes + 2 masters (HA) + 1 management node = 7 machines.

Target: install Ambari 2.2 and use it to install HDP 2.4 (the automated install path). The Ambari and HDP repositories are available locally over HTTP (the tarballs were extracted on the management node). The Ambari server is already running on the management machine, and HDP 2.4 now has to be installed.

Questions:

- To avoid installing the JDK on the management machine (and the others, too), Internet access to http://public-repo-1.hortonworks.com/ has been enabled on all the machines for one day only. I exported http_proxy and set up the Ambari server, which internally fetched the Oracle JDK 8. Somehow, 'java -version' still doesn't work. Does Ambari really install the JDK?
- The Ambari agents will be installed automatically during the cluster install later, but by then there will be NO Internet connection. How is Java (the JDK) installed on the other nodes then? Does Ambari itself push /var/lib/ambari-server/resources/jdk-8u60-linux-x64.tar.gz to all the nodes (after all, Hadoop requires Java)?
- Is it safe to remove the Internet access now?
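A possible explanation for 'java -version' failing, stated as an assumption rather than confirmed behaviour: Ambari unpacks the JDK under /usr/jdk64 (its default location, per the reply quoted earlier in this thread) and refers to it through its own configuration, without adding it to the system PATH. A quick sketch to check this on a host: it lists java binaries under /usr/jdk64 and compares them with what the current PATH and JAVA_HOME resolve to.

```python
from __future__ import annotations

import glob
import os
import shutil


def find_jdks(root: str = "/usr/jdk64") -> list[str]:
    """Return paths of java binaries found one directory level below `root`."""
    return sorted(glob.glob(os.path.join(root, "*", "bin", "java")))


def java_home_for(java_binary: str) -> str:
    """Map .../jdk1.8.0_60/bin/java to .../jdk1.8.0_60."""
    return os.path.dirname(os.path.dirname(java_binary))


if __name__ == "__main__":
    jdks = find_jdks()
    on_path = shutil.which("java")
    print("JDKs under /usr/jdk64:", jdks or "none")
    print("java on PATH:", on_path or "none")
    print("JAVA_HOME:", os.environ.get("JAVA_HOME", "unset"))
    if jdks and not on_path:
        # The JDK is on disk, but this shell's PATH does not include it.
        print(f"e.g. export JAVA_HOME={java_home_for(jdks[0])}; export PATH=$JAVA_HOME/bin:$PATH")
```

If a JDK shows up under /usr/jdk64 but not on PATH, the JDK is installed and only the shell environment is missing it.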
Labels: Apache Ambari
06-07-2016 04:11 PM
I think that doc addresses only the Ambari installation. Will the HDP installation be hampered if a non-root user is used? Can services like HDFS, YARN and Hive run smoothly if the installation was done via Ambari running as non-root?
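As I understand the Ambari documentation on non-root setups, even a non-root install still depends on the install account having specific sudo rights, because packages must be installed and service users created. Before starting, it may help to confirm what the account can actually do on each host. This minimal sketch checks only two preconditions: whether the process runs as root, and whether sudo works without a password prompt. It does not inspect which specific commands sudoers allows.

```python
from __future__ import annotations

import os
import shutil
import subprocess


def running_as_root() -> bool:
    """True if the effective user id is 0."""
    return os.geteuid() == 0


def has_passwordless_sudo(timeout_s: int = 10) -> bool:
    """True if `sudo -n true` succeeds; -n makes sudo fail instead of prompting."""
    if shutil.which("sudo") is None:
        return False
    try:
        result = subprocess.run(
            ["sudo", "-n", "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("root:", running_as_root())
    print("password-less sudo:", has_passwordless_sudo())
```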
06-07-2016 01:29 PM
1 Kudo
Earlier, on the test machines, I had installed HDP 2.2 using Ambari. I had the root credentials as well as Internet access, and the cluster and its services functioned properly. Now, on the production machines (4 datanodes + 2 masters (HA) + 1 management node = 7 machines), each machine can be allowed access only to specific sites. Does this qualify as the 'Temporary Access to Internet' case in the Hortonworks documentation? Is it possible to provide a complete list of the URLs that need to be accessed for an Ambari + HDP install?
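This is not the complete list being asked for, but for the package side of it, the hosts yum contacts are the ones named in `baseurl` entries of the .repo files under /etc/yum.repos.d/. The sketch below (standard library only, assuming the usual INI-style .repo layout) lists those hosts per repository and flags any that are not on an allowlist you supply. It does not cover downloads made outside yum, such as the JDK tarball that Ambari server setup fetches.

```python
from __future__ import annotations

import configparser
import glob
from urllib.parse import urlparse


def repo_hosts(repo_text: str) -> dict[str, str]:
    """Map each [section] of a yum .repo file to the host in its baseurl."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    hosts: dict[str, str] = {}
    for section in parser.sections():
        urls = parser.get(section, "baseurl", fallback="").split()
        if urls:
            # yum allows several URLs in baseurl; only the first is checked here.
            host = urlparse(urls[0]).hostname
            if host:
                hosts[section] = host
    return hosts


def hosts_outside(repo_text: str, allowed: set[str]) -> set[str]:
    """Hosts referenced by the repo file that are not in `allowed`."""
    return {h for h in repo_hosts(repo_text).values() if h not in allowed}


if __name__ == "__main__":
    for path in sorted(glob.glob("/etc/yum.repos.d/*.repo")):
        with open(path) as f:
            try:
                print(path, repo_hosts(f.read()))
            except configparser.Error as err:
                print(path, "could not be parsed:", err)
```

Run on each node after pointing the repos at the local mirror: any host still reported outside the allowlist is one yum would try to reach.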
06-07-2016 12:44 PM
Earlier, on the test machines, I had installed HDP 2.2 using Ambari. I had the root credentials as well as Internet access, and the cluster and its services functioned properly. Now, on the production machines (4 datanodes + 2 masters (HA) + 1 management node = 7 machines), I wish to:

- Install Ambari 2.2 on a management node.
- Have the Ambari agents installed automatically (password-less SSH).
- Log in to the Ambari management console and install the HDP stack.

The first challenge is that I don't have the root credentials for any of the machines; I can log in with my own Linux account (connected to LDAP) and install from there. As of now, Internet access is also unclear. I read several threads like this and this, but I am unsure whether I can proceed without root access. I suspect that a non-root installation will run into issues later, at the Ambari level, the HDP level, or both. I don't have the liberty of trying out approaches 😞, so I need to get it right until the cluster is installed. How should I start?
Labels: Apache Ambari
06-01-2016 01:23 PM
I have read about the care to be exercised when using ext4 (noatime etc.) in several threads, but is there a concise guide or doc that can be used?
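Not a guide, but a concrete check for the noatime point: the sketch below scans fstab-style text and lists ext4/XFS mount points mounted without `noatime`. Which mounts should carry the option (typically the HDFS data disks) is a judgment for your layout; the /grid/N mount points used when testing it are hypothetical.

```python
from __future__ import annotations

import os


def mounts_missing_option(
    fstab_text: str,
    option: str = "noatime",
    fs_types: tuple[str, ...] = ("ext4", "xfs"),
) -> list[str]:
    """Return mount points of the given file system types whose options lack `option`."""
    missing: list[str] = []
    for raw in fstab_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in fs_types and option not in options.split(","):
            missing.append(mount_point)
    return missing


if __name__ == "__main__":
    if os.path.exists("/etc/fstab"):
        with open("/etc/fstab") as f:
            print(mounts_missing_option(f.read()))
```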
06-01-2016 12:41 PM
I suspected that the file system documentation was merely carried forward from previous versions; I hope Hortonworks invests some resources in updating it 🙂 The LVM part, I guess, is clear: use it for OS partitions but NOT for the datanode disks, am I right? Can you help me understand your inputs better:

> XFS is perfectly fine here, so you can let RHEL use the default. However, note that XFS filesystems can not be shrunk, whereas with LVM + ext4, filesystems can be expanded and shrunk while online. This is a big gap for XFS.

So what should I proceed with: ext4 everywhere, XFS everywhere, or both (XFS for the datanodes etc. and ext4 for the OS partitions, or vice versa)?

> so moving this logging to one of the data disks may be necessary

What is the better idea: have a large, dedicated disk for the OS partition (adding more disks and resizing with LVM if required) so that logs, binaries etc. have plenty of space, or redirect the logs (YARN etc.) during the HDP installation itself to directories on the disks dedicated to the datanode? For example, this is how it is in the test cluster:
06-01-2016 10:42 AM
Following is the planned infrastructure for the production cluster. Initially there will be 4 data/compute nodes, each with 2x12 cores, 256 GB RAM and 24x2 TB disks (plus 2x300 GB for Linux), and 3 name/admin nodes (with far fewer disks, configured as RAID 1). Later, 4-5 more datanodes will be added. All nodes will run RHEL 7. We will proceed with the latest HDP 2.4 installation via Ambari.

The HDP documentation has the following statements:

> The ext4 file system may have potential data loss issues with default options because of the "delayed writes" feature. XFS reportedly also has some data loss issues upon power failure. Do not use LVM; it adds latency and causes a bottleneck.

I read several existing threads and docs, but I still don't have a clear understanding of what suits the latest editions of HDP and RHEL:

- ext4-vs-xfs-filesystem-survey-of-popularity
- best-practices-linux-file-systems-for-hdfs
- any-recommendation-on-how-to-partition-disk-space-1
- @Benjamin Leonhardi's insightful recommendation

Following are the possibilities:

- Have ext3 on all the partitions on all the nodes
- Have ext4 on all the partitions on all the nodes
- Have XFS (the default file system for RHEL) on all the partitions on all the nodes
- Have XFS on the boot disks (and on all disks of the head/management nodes), e.g. /boot, /var, /usr etc., but use ext3/ext4 on the data disks (which are anyhow “special” compared to our normal install images), just to minimize risk, since it is good to stick to the proposed standard practices as much as possible

Should LVM be used for ALL the volumes/partitions, selectively (for /var, /usr etc. but NOT for the datanode and log directories), or not at all?

Any suggestions, recommendations or further reading suited to the latest HDP 2.4 and RHEL 7 environment?
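Since the file system decision mainly concerns the 24x2 TB data disks, a quick capacity estimate for them puts the numbers in context. The sketch assumes HDFS's default replication factor of 3 (`dfs.replication`), leaves out the 2x300 GB OS disks, and ignores file system overhead; `reserved_fraction` is a hypothetical knob for space kept outside HDFS (YARN local dirs, logs).

```python
from __future__ import annotations


def hdfs_capacity_tb(
    nodes: int,
    disks_per_node: int,
    disk_tb: float,
    replication: int = 3,
    reserved_fraction: float = 0.0,
) -> tuple[float, float]:
    """Return (raw TB, usable TB after replication) for the HDFS data disks.

    reserved_fraction is the share of raw space kept for non-HDFS use;
    it is an assumption to tune, not a measured value.
    """
    raw = nodes * disks_per_node * disk_tb
    usable = raw * (1 - reserved_fraction) / replication
    return raw, usable


if __name__ == "__main__":
    print(hdfs_capacity_tb(4, 24, 2))  # initial 4 datanodes
    print(hdfs_capacity_tb(8, 24, 2))  # after adding 4 more datanodes
```

With the initial 4 datanodes this gives 192 TB raw and 64 TB of replicated capacity; setting aside a quarter of the raw space for non-HDFS use drops the usable figure to 48 TB.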
Labels: Hortonworks Data Platform (HDP)
05-30-2016 08:57 AM
@mlanciaux As mentioned in the original query, DimSnapshot.snapshot_id is the PK of the DimSnapshot table, so its count equals the number of records in DimSnapshot, which is around 8 million. I did the following:

```sql
CREATE TABLE factsamplevalue_snapshot AS SELECT snapshot_id, COUNT(*) FROM factsamplevalue GROUP BY snapshot_id;
```

which resulted in a table with 7914806 rows. Sample data:

```
select * from factsamplevalue_snapshot limit 10;
OK
factsamplevalue_snapshot.snapshot_id factsamplevalue_snapshot._c1
643438 2170
643445 2023
643924 3646
644063 2448
644153 2837
644459 848
644460 3713
644541 2080
645243 725
645599 852
```

Unfortunately, the histogram would return a huge number of entries, so I cannot paste or provide the output.
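If the aim is to see how skewed the row counts per snapshot_id are, a fixed-width bucketed histogram stays small even with ~7.9 million keys. A minimal sketch of the idea using the 10 sample rows above; for the full table, the same bucketing would be done in Hive (a GROUP BY on the bucket value) so that only the bucket counts come back. The bucket width of 1000 is an arbitrary choice.

```python
from __future__ import annotations

from collections import Counter


def bucketed_histogram(counts: list[int], bucket_width: int) -> dict[int, int]:
    """Group per-key row counts into fixed-width buckets.

    Returns {bucket lower bound: number of keys in that bucket}, sorted by bound.
    """
    buckets = Counter((c // bucket_width) * bucket_width for c in counts)
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    # Row counts (_c1) for the 10 sample snapshot_ids shown above.
    sample = [2170, 2023, 3646, 2448, 2837, 848, 3713, 2080, 725, 852]
    print(bucketed_histogram(sample, 1000))  # {0: 3, 2000: 5, 3000: 2}
    print("min", min(sample), "max", max(sample))
```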
05-26-2016 02:04 PM
Stack: HDP-2.3.2.0-2950, installed using Ambari 2.1.

Nodes: 1 NN (8 x 1 TB HDD, 16 x 2.53 GHz cores, 48 GB RAM, RHEL 6.5) + 8 DN (8 x 600 GB HDD, 16 x 2.53 GHz cores, 75 GB RAM, RHEL 6.5). The nodes are connected by a 10-gig network.

I have a staging/vanilla/simple Hive table with 24 billion records. I created an empty ORC table as follows:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS FactSampleValue (
`Snapshot_Id` int
/*OTHER COLUMNS*/
)
PARTITIONED BY (`SmapiName_ver` varchar(30))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC LOCATION '/datastore/';
```

Some settings:

```
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
hive> set optimize.sort.dynamic.partitioning=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=3000;
hive> set hive.enforce.sorting=true;
```

Executed an insert:

```sql
INSERT INTO odp_dw_may2016_orc.FactSampleValue PARTITION (SmapiName_ver) SELECT * FROM odp_dw_may2016.FactSampleValue DISTRIBUTE BY SmapiName_ver SORT BY SmapiName_ver;
```
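Two things in the session above may be worth checking (hedged, not verified against this cluster). First, as far as I know the Hive property is `hive.optimize.sort.dynamic.partition`, not `optimize.sort.dynamic.partitioning`; Hive accepts arbitrary `set` keys outside the `hive.` namespace without complaint, so the intended setting may never have taken effect. Second, the reducers later die with `OutOfMemoryError: Java heap space`, and the Tez container and heap sizes are controlled by `hive.tez.container.size` (in MB) and `hive.tez.java.opts`. The sketch below only generates `set` statements; the 4096 MB container and the 0.8 heap ratio are assumptions (a commonly cited rule of thumb), not values derived from this cluster.

```python
from __future__ import annotations


def tez_memory_settings(container_mb: int, heap_fraction: float = 0.8) -> list[str]:
    """Return Hive `set` statements for a Tez container size and its JVM heap.

    heap_fraction leaves the rest of the container for non-heap JVM memory;
    0.8 is a rule of thumb, not a tuned value.
    """
    if container_mb <= 0 or not 0 < heap_fraction < 1:
        raise ValueError("container_mb must be positive and heap_fraction in (0, 1)")
    heap_mb = int(container_mb * heap_fraction)
    return [
        # Sorts rows on the dynamic partition column so a reducer keeps only
        # one record writer open at a time, reducing reducer memory pressure.
        "set hive.optimize.sort.dynamic.partition=true;",
        f"set hive.tez.container.size={container_mb};",
        f"set hive.tez.java.opts=-Xmx{heap_mb}m;",
    ]


if __name__ == "__main__":
    for statement in tez_memory_settings(4096):  # hypothetical container size
        print(statement)
```

The container size must still fit within YARN's per-container maximum on the NodeManagers, so any larger value should be checked against the YARN settings in Ambari.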
```
Query ID = hive_20160526125733_8834c7bc-b4f3-4539-8d48-fa46bba92a33
Total jobs = 1
Launching Job 1 out of 1
******REDUCERS NOT STARTING
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
VERTICES         STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1           RUNNING   3098          0      110     2988       0       0
Reducer 2        INITED   1009          0        0     1009       0       0
--------------------------------------------------------------------------------
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 12.70 s
--------------------------------------------------------------------------------
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
```

After a long time, most of the mappers completed, but the reducers started failing:

```
Status: Running (Executing on YARN cluster with App id application_1446726117927_0092)
--------------------------------------------------------------------------------
VERTICES         STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ........  RUNNING   3098       2655       94      349       0       0
Reducer 2       RUNNING   1009         45      110      854      49      91
--------------------------------------------------------------------------------
VERTICES: 01/02 [=================>>---------] 65% ELAPSED TIME: 8804.16 s
--------------------------------------------------------------------------------
```

As seen above, a few mappers started again; I guess it's a reattempt. Again some failures; the latest:

```
--------------------------------------------------------------------------------
VERTICES         STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ........  RUNNING   3098       2773      113      212       0       8
Reducer 2       RUNNING   1009         45      110      854      57     119
--------------------------------------------------------------------------------
VERTICES: 01/02 [=================>>---------] 68% ELAPSED TIME: 10879.73 s
--------------------------------------------------------------------------------
```

I suspect a memory-related issue, but I don't know which logs I should check. For example, under log/application_1446726117927_0092, I found several containers, and many of them had the following error in syslog_attempt_1446726117927_0092_1_01_000041_1:

2016-05-26 15:45:11,932 [WARN] [TezTaskEventRouter{attempt_1446726117927_0092_1_01_000041_1}] |orderedgrouped.ShuffleScheduler|: Map_1: Duplicate fetch of input no longer needs to be fetched: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=709], attemptNumber=1, pathComponent=attempt_1446726117927_0092_1_00_000709_1_10012, spillType=0, spillId=-1]
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM
2016-05-26 15:45:24,251 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat
2016-05-26 15:45:24,253 [INFO] [main] |task.TezTaskRunner|: Interrupted while waiting for task to complete. Interrupting task
2016-05-26 15:45:24,254 [INFO] [main] |task.TezTaskRunner|: Shutdown requested... returning
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Got a shouldDie notification via heartbeats for container container_1446726117927_0092_01_000187. Shutting down
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutdown invoked for container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,254 [INFO] [main] |task.TezChild|: Shutting down container container_1446726117927_0092_01_000187
2016-05-26 15:45:24,255 [ERROR] [TezChild] |tez.ReduceRecordProcessor|: Hit error while closing operators - failing tree
2016-05-26 15:45:24,256 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-05-26 15:45:24,257 [INFO] [TezChild] |task.TezTaskRunner|: Encounted an error while executing task: attempt_1446726117927_0092_1_01_000041_1
java.lang.RuntimeException: java.lang.InterruptedException
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:171)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
at org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:90)
at org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:116)
at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:117)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:147)
... 14 more
2016-05-26 15:45:24,260 [INFO] [TezChild] |task.TezTaskRunner|: Ignoring the following exception since a previous exception is already registered
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for attempt_1446726117927_0092_1_01_000041_1: Counters: 71 [[File System Counters FILE_BYTES_READ=290688, FILE_BYTES_WRITTEN=227062571, FILE_READ_OPS=0, FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=0, HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=0, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, NUM_FAILED_SHUFFLE_INPUTS=0, MERGED_MAP_OUTPUTS=53, GC_TIME_MILLIS=11162, CPU_MILLISECONDS=62510, PHYSICAL_MEMORY_BYTES=664797184, VIRTUAL_MEMORY_BYTES=2432901120, COMMITTED_HEAP_BYTES=664797184, OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, ADDITIONAL_SPILLS_BYTES_READ=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=39880956, NUM_MEM_TO_DISK_MERGES=2, NUM_DISK_TO_DISK_MERGES=0, SHUFFLE_PHASE_TIME=0, MERGE_PHASE_TIME=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569][Shuffle Errors BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][Shuffle Errors_Reducer_2_INPUT_Map_1 BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0][TaskCounter_Reducer_2_INPUT_Map_1 ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=227062571, COMBINE_INPUT_RECORDS=0, FIRST_EVENT_RECEIVED=264, LAST_EVENT_RECEIVED=49569, MERGED_MAP_OUTPUTS=53, MERGE_PHASE_TIME=0, NUM_DISK_TO_DISK_MERGES=0, NUM_FAILED_SHUFFLE_INPUTS=0, NUM_MEM_TO_DISK_MERGES=2, NUM_SHUFFLED_INPUTS=73, NUM_SKIPPED_INPUTS=3022, REDUCE_INPUT_GROUPS=0, REDUCE_INPUT_RECORDS=0, SHUFFLE_BYTES=326055728, SHUFFLE_BYTES_DECOMPRESSED=2123360327, SHUFFLE_BYTES_DISK_DIRECT=39880956, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=286174772, SHUFFLE_PHASE_TIME=0, 
SPILLED_RECORDS=0][TaskCounter_Reducer_2_OUTPUT_out_Reducer_2 OUTPUT_RECORDS=0]]
2016-05-26 15:45:24,275 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Joining on EventRouter
2016-05-26 15:45:24,276 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Closed processor for vertex=Reducer 2, index=1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Shutting down Shuffle for source: Map_1
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Already shutdown. Ignoring error
2016-05-26 15:45:24,276 [INFO] [TezChild] |orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: Map 1: numDmeEventsSeen=3480, numDmeEventsSeenWithNoData=3395, numObsoletionEventsSeen=443, updateOnClose
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #1, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #2, status:false, isInterrupted:false
2016-05-26 15:45:24,277 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #3, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #4, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #5, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #6, status:false, isInterrupted:false
2016-05-26 15:45:24,278 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #7, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #8, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #9, status:false, isInterrupted:false
2016-05-26 15:45:24,279 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #10, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #11, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #12, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #13, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #14, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #15, status:false, isInterrupted:false
2016-05-26 15:45:24,280 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #16, status:false, isInterrupted:false
2016-05-26 15:45:24,291 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #17, status:false, isInterrupted:false
2016-05-26 15:45:24,302 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #18, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #19, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #20, status:false, isInterrupted:false
2016-05-26 15:45:24,314 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #21, status:false, isInterrupted:false
2016-05-26 15:45:24,318 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #22, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #23, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #24, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #25, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #26, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #27, status:false, isInterrupted:false
2016-05-26 15:45:24,319 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #28, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #29, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.Shuffle|: Map_1: Shutdown..fetcher {Map_1} #30, status:false, isInterrupted:false
2016-05-26 15:45:24,320 [INFO] [TezChild] |orderedgrouped.MergeManager|: finalMerge called with 8 in-memory map-outputs and 14 on-disk map-outputs
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Merging 8 sorted segments
2016-05-26 15:45:24,321 [INFO] [TezChild] |impl.TezMerger|: Down to the last merge-pass, with 8 segments left of total size: 376486161 bytes

Which logs should I check, and what errors should I look for? How can Ambari help?

**********EDIT-1**********

The Hive query finally failed with the following error:

Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1446726117927_0092_1_01, diagnostics=[Task failed, taskId=task_1446726117927_0092_1_01_000066, diagnostics=[TaskAttempt 0 failed, info=[Container container_1446726117927_0092_01_000036 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000036
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 1 failed, info=[Container container_1446726117927_0092_01_000199 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000199
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 2 failed, info=[Error: Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:133)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
at java.io.Writer.write(Writer.java:157)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:48)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:218)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:156)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 3 failed, info=[Container container_1446726117927_0092_01_000299 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000299
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:944, Vertex vertex_1446726117927_0092_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:23, Vertex vertex_1446726117927_0092_1_00 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00Vertex failed, vertexName=Reducer 2, vertexId=vertex_1446726117927_0092_1_01, diagnostics=[Task failed, taskId=task_1446726117927_0092_1_01_000066, diagnostics=[TaskAttempt 0 failed, info=[Container container_1446726117927_0092_01_000036 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000036
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 1 failed, info=[Container container_1446726117927_0092_01_000199 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000199
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]], TaskAttempt 2 failed, info=[Error: Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:133)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:220)
at java.io.Writer.write(Writer.java:157)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:48)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:310)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:162)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:218)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:156)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:137)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
], TaskAttempt 3 failed, info=[Container container_1446726117927_0092_01_000299 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
Container id: container_1446726117927_0092_01_000299
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:944, Vertex vertex_1446726117927_0092_1_01 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Map 1, vertexId=vertex_1446726117927_0092_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:23, Vertex vertex_1446726117927_0092_1_00 [Map 1] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1

As I suspected, there is a memory issue, but it occurs on the second attempt; the first one failed for an unknown reason. The error is:
Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
The question is: which parameters need to be changed?
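To see which attempts of task_1446726117927_0092_1_01_000066 actually ran out of heap and which ones only ended with a container exit code, the diagnostics can be scanned mechanically. The following is a minimal Python sketch, assuming the Hive/Tez diagnostics have been saved into a string; the function name `summarize_attempts` and the cause labels are hypothetical, and the sample text is an abbreviated excerpt of the attempt header lines from the log above. Tez numbers the attempts from 0, so in this log the OutOfMemoryError is reported by TaskAttempt 2, while TaskAttempts 0, 1 and 3 end with "Container failed, exitCode=1".

```python
import re
from typing import Dict

# Abbreviated excerpt of the attempt header lines from the log above
# (only the lines that start each TaskAttempt entry are kept).
SAMPLE = """\
diagnostics=[TaskAttempt 0 failed, info=[Container container_1446726117927_0092_01_000036 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
]], TaskAttempt 1 failed, info=[Container container_1446726117927_0092_01_000199 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
]], TaskAttempt 2 failed, info=[Error: Fatal Error cause TezChild exit.:java.lang.OutOfMemoryError: Java heap space
], TaskAttempt 3 failed, info=[Container container_1446726117927_0092_01_000299 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
"""

# Matches "TaskAttempt <n> failed, info=[<rest of line>" anywhere in a line.
ATTEMPT_RE = re.compile(r"TaskAttempt (\d+) failed, info=\[(.*)")


def summarize_attempts(diagnostics: str) -> Dict[int, str]:
    """Map each failed TaskAttempt number to a short, hypothetical cause label."""
    causes: Dict[int, str] = {}
    for line in diagnostics.splitlines():
        match = ATTEMPT_RE.search(line)
        if match is None:
            continue  # stack-trace frames and other lines are ignored
        attempt, info = int(match.group(1)), match.group(2)
        # Check for the heap error first: it is the more specific cause.
        if "OutOfMemoryError" in info:
            causes[attempt] = "OutOfMemoryError: Java heap space"
        elif "exitCode=1" in info:
            causes[attempt] = "container exited with code 1"
        else:
            causes[attempt] = info[:80]  # fall back to the raw diagnostic text
    return causes


if __name__ == "__main__":
    for attempt, cause in sorted(summarize_attempts(SAMPLE).items()):
        print(f"TaskAttempt {attempt}: {cause}")
```

Pasting the full diagnostics instead of the excerpt gives the same per-attempt view. Attempts that end only with exit code 1 usually need their container logs (for example via `yarn logs -applicationId ...`) to find the real cause.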
Labels: Apache Hadoop, Apache Hive, Apache Tez