Member since: 10-04-2016
Posts: 243
Kudos Received: 281
Solutions: 43
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1169 | 01-16-2018 03:38 PM
 | 6139 | 11-13-2017 05:45 PM
 | 3032 | 11-13-2017 12:30 AM
 | 1518 | 10-27-2017 03:58 AM
 | 28426 | 10-19-2017 03:17 AM
10-18-2017
11:39 PM
2 Kudos
Scenario: The cluster uses both the Hive and Atlas components. Sometimes a simple query like 'show databases' fails with the error stack shown below:

beeline> show databases;
Getting log thread is interrupted, since query is done!
Error: Error while processing statement: FAILED: Hive Internal Error: java.util.concurrent.RejectedExecutionException(Task java.util.concurrent.FutureTask@e871c01 rejected from java.util.concurrent.ThreadPoolExecutor@5b868755[Running, pool size = 1, active threads = 1, queued tasks = 10000, completed tasks = 14807]) (state=08S01,code=12)
java.sql.SQLException: Error while processing statement: FAILED: Hive Internal Error: java.util.concurrent.RejectedExecutionException(Task java.util.concurrent.FutureTask@e871c01 rejected from java.util.concurrent.ThreadPoolExecutor@5b868755[Running, pool size = 1, active threads = 1, queued tasks = 10000, completed tasks = 14807])
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:282)
at org.apache.hive.beeline.Commands.execute(Commands.java:848)
at org.apache.hive.beeline.Commands.sql(Commands.java:713)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:983)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:823)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:781)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:485)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:468)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

HiveServer2 Log:

2017-10-10 14:00:38,985 INFO [HiveServer2-Background-Pool: Thread-273112]: log.PerfLogger (PerfLogger.java:PerfLogBegin(135)) - <PERFLOG method=PostHook.org.apache.atlas.hive.hook.HiveHook from=org.apache.hadoop.hive.ql.Driver>
2017-10-10 14:00:38,986 ERROR [HiveServer2-Background-Pool: Thread-273112]: ql.Driver (SessionState.java:printError(962)) - FAILED: Hive Internal Error: java.util.concurrent.RejectedExecutionException(Task java.util.concurrent.FutureTask@3f389d45 rejected from java.util.concurrent.ThreadPoolExecutor@5b868755[Running, pool size = 1, active threads = 1, queued tasks = 10000, completed tasks = 14807])
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@3f389d45 rejected from java.util.concurrent.ThreadPoolExecutor@5b868755[Running, pool size = 1, active threads = 1, queued tasks = 10000, completed tasks = 14807]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at org.apache.atlas.hive.hook.HiveHook.run(HiveHook.java:174)

Root Cause

Users are often led to believe that this issue can be fixed by removing 'org.apache.atlas.hive.hook.HiveHook' from the hive.exec.post.hooks property:

hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.ATSHook, org.apache.atlas.hive.hook.HiveHook

However, when both Atlas and Hive are in use, 'org.apache.atlas.hive.hook.HiveHook' should not be removed. The error clearly indicates that the issue is due to an improperly sized thread pool: in this case the maximum pool size is 1 and the waiting queue size is 10000.

Solution

1. In hive-site.xml, verify the value of the property "hive.server2.async.exec.threads". If it is set to 1, increase it to 100.
2. Increase the thread pool values used by the Atlas hook in hive-site.xml, for example (a quick way to check the current values is sketched after the snippet below):

<property>
<name>atlas.hook.hive.maxThreads</name>
<value>5</value>
</property>
<property>
<name>atlas.hook.hive.minThreads</name>
<value>1</value>
</property>
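Before making changes, it can help to confirm what is currently configured. The commands below are a minimal sketch, assuming the typical HDP client configuration path /etc/hive/conf/hive-site.xml; adjust the path for your environment.

# Show the currently configured thread pool settings (path is an assumption)
grep -A1 -E 'hive.server2.async.exec.threads|atlas.hook.hive.maxThreads|atlas.hook.hive.minThreads' /etc/hive/conf/hive-site.xml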
07-25-2018
01:12 PM
Cluster is not kerberized in my case.
10-15-2017
10:08 PM
3 Kudos
When the FSImage file is large (30 GB or more), contributing factors such as RPC bandwidth, network congestion, and request queue length can make the upload/download take a long time. This in turn can lead the ZKFailoverController to believe that the NameNode is not responding: the health monitor reports the SERVICE_NOT_RESPONDING state, and a failover transition is triggered. The logs display the following statements:

2017-09-04 05:02:26,017 INFO namenode.TransferFsImage (TransferFsImage.java:receiveFile(575)) -
"Combined time for fsimage download and fsync to all disks took 237.14s.
The fsimage download took 237.14s at 141130.21 KB/s.
Synchronous (fsync) write to disk of /opt/hadoop/hdfs/namenode/image/current/fsimage.ckpt_0000000012106114957
took 0.00s. Synchronous (fsync) write to disk of
/var/hadoop/hdfs/namenode/image/current/fsimage.ckpt_0000000012106114957 took 0.00s..
2017-09-04 05:02:26,018 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) -
Remote journal 192.168.1.1:8485 failed to write txns 12106579989-12106579989.
Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
IPC's epoch 778 is less than the last promised epoch 779
2017-09-04 05:02:26,019 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) -
Transport-level exception trying to monitor health of namenode at nn1.test.com/192.168.1.2:8023:
java.io.EOFException End of File Exception between local host is: "nn1.test.com/192.168.1.2";
destination host is: "nn1.test.com/192.168.1.2":8023; : java.io.EOFException;
For more details see: http://wiki.apache.org/hadoop/EOFException
2017-09-04 05:02:26,020 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) -
Entering state SERVICE_NOT_RESPONDING
2017-09-04 05:02:26,021 INFO ha.ZKFailoverController (
ZKFailoverController.java:setLastHealthState(852)) -
Local service NameNode at nn1.test.com/192.168.1.2:8023 Entered state: SERVICE_NOT_RESPONDING

If the contributing factors are not addressed and the FSImage file continues to be large, such failovers become very frequent (three or more times a week).
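As a rough sanity check on the numbers in the log above: a download that runs for 237.14 s at 141,130.21 KB/s transfers roughly 237.14 × 141,130 KB ≈ 33,500,000 KB, that is, an FSImage of a little over 30 GB, which is consistent with the file sizes discussed earlier.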
Root Cause

This issue occurs in the following scenarios: the FSImage upload/download makes the disk or network too busy, which causes request queues to build up and the NameNode to appear unresponsive. Typically, in overloaded clusters where the NameNode is too busy to process heartbeats, it also spuriously marks DataNodes as dead; this scenario likewise leads to spurious failovers.

Solution

To resolve this issue, do the following:

1. Add image transfer throttling. Throttling uses less bandwidth for image transfers, so although the transfer takes longer, the NameNode remains more responsive throughout. Throttling is enabled by setting dfs.image.transfer.bandwidthPerSec in hdfs-site.xml. The value is expressed in bytes per second. The following example limits the transfer bandwidth to 50 MB/s.

<property>
<name>dfs.image.transfer.bandwidthPerSec</name>
<value>50000000</value>
</property>

2. Enable the DataNode Lifeline protocol. This reduces spurious failovers. The Lifeline protocol is a feature recently added by the Apache Hadoop community (see Apache HDFS Jira HDFS-9239). It introduces a new lightweight RPC message that DataNodes use to report their health to the NameNode. It was developed in response to problems seen in some overloaded clusters where the NameNode was too busy to process heartbeats and spuriously marked DataNodes as dead. For a non-HA cluster, the feature can be enabled with the following configuration in hdfs-site.xml:

<property>
<name>dfs.namenode.lifeline.rpc-address</name>
<value>mynamenode.example.com:8050</value>
</property>
(Replace mynamenode.example.com with the hostname or IP address of your NameNode. The port number can be different too.)

For an HA cluster, the lifeline RPC address can be enabled with the following setup, replacing mycluster, nn1 and nn2 appropriately:

<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
<value>mynamenode1.example.com:8050</value>
</property>
<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
<value>mynamenode2.example.com:8050</value>
</property>
Additional Lifeline protocol settings are documented in the HDFS-9239 release note; however, these can be left at their default values for most clusters.

Note: Changing the Lifeline protocol settings requires a restart of the NameNodes, DataNodes and ZooKeeper Failover Controllers to take full effect. If you have NameNode HA set up, you can restart the NameNodes one at a time, followed by a rolling restart of the remaining components, to avoid cluster downtime. A quick way to verify the new settings after the restart is sketched below.

For some great tips on scaling HDFS, refer to this four-part guide.
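A minimal verification sketch, assuming the properties were added to the hdfs-site.xml visible on the NameNode host:

# Verify throttling and Lifeline settings after the restart
hdfs getconf -confKey dfs.image.transfer.bandwidthPerSec
hdfs getconf -confKey dfs.namenode.lifeline.rpc-address
# For an HA cluster, query the per-NameNode keys instead, for example:
hdfs getconf -confKey dfs.namenode.lifeline.rpc-address.mycluster.nn1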
10-15-2017
09:08 PM
3 Kudos
Two YARN queues have been set up: default and llap. Whichever queue is marked as the Hive LLAP queue (Interactive Query Queue) does not process any queries/jobs. The following are the steps to reproduce the issue:
There are two YARN queues set up. In Ambari, the default queue is selected for LLAP from the Interactive Query Queue drop-down (Ambari > Hive > Configs > Interactive Query Queue). Any job/query submitted to the default queue does not run; however, all queries submitted to the llap queue run successfully. Using the same process, change the Interactive Query Queue to llap. Now any query/job submitted to the llap queue does not run, while all queries submitted to the default queue run successfully. In short, at any given time, the queue selected for LLAP does not run any job/query.

Root Cause: This issue occurs when there is a problem with queue prioritization: both queues have priority set to 0, which means they are of equal priority. This can be viewed from the Ambari > YARN Queue Manager view. Hive LLAP (Low-Latency Analytical Processing) enables us to run Hive queries with low latency, in near real time. To ensure low latency, set the priority of the queue used for LLAP higher than that of other queues, especially if the cluster includes long-running applications.

Solution: To resolve this issue, set the priority of the llap queue to a value higher than that of the default queue (see the sketch below for how this maps to the underlying configuration). After setting the higher priority, make sure to save and refresh the queues for the change to take effect. For YARN queue priorities to be applied, preemption must be enabled. To enable preemption, refer to this documentation.
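For reference, with the Capacity Scheduler the queue priority set through the YARN Queue Manager view is ultimately stored as a yarn.scheduler.capacity.<queue-path>.priority property. The sketch below is illustrative only and assumes a queue path of root.llap and the default configuration directory /etc/hadoop/conf:

# Check the configured priority of the llap queue (queue path is an assumption)
grep -A1 'yarn.scheduler.capacity.root.llap.priority' /etc/hadoop/conf/capacity-scheduler.xml
# Depending on the Hadoop version, the queue status output may also include the priority
yarn queue -status llap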
10-13-2017
03:56 PM
3 Kudos
Scenario 1: Only one instance of Spark Thrift Server is needed

Approach: If you are installing the Spark Thrift Server on a Kerberos-secured cluster, the following instructions apply:

1. The Spark Thrift Server must run on the same host as HiveServer2, so that it can access the hiveserver2 keytab.
2. Edit the permissions on /var/run/spark and /var/log/spark so that the Hive service account has read/write access. Simply being able to read the contents as user hive is not enough; hive must be able to write to those directories. One way is to give 77x permissions on these directories: since they are owned by spark:hadoop and hive belongs to group hadoop, hive gets write access with this setup.
3. Use the Hive service account to start the thriftserver process. It is recommended that you run the Spark Thrift Server as user hive instead of user spark. This ensures that the Spark Thrift Server can access the Hive keytabs, the Hive metastore, and data in HDFS that is stored under user hive. When the Spark Thrift Server runs queries as user hive, all data accessible to user hive will be accessible to the user submitting the query. For a more secure configuration, use a different service account for the Spark Thrift Server and provide it appropriate access to the Hive keytabs and the Hive metastore.

A consolidated sketch of these steps is shown below. If you still do not want to install the Spark Thrift Server on the same host as HiveServer2, follow the approach in Scenario 2.
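A minimal sketch of the Scenario 1 setup, assuming the HDP layout where /var/run/spark and /var/log/spark are owned by spark:hadoop and user hive is a member of group hadoop; the start script path is also an assumption, so verify it for your Spark version:

# Give group hadoop (which includes hive) write access to the Spark runtime directories
chown -R spark:hadoop /var/run/spark /var/log/spark
chmod -R 775 /var/run/spark /var/log/spark

# Start the Spark Thrift Server as user hive (script location assumed for HDP)
su - hive -c "/usr/hdp/current/spark-thriftserver/sbin/start-thriftserver.sh"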
Scenario 2: Install multiple Spark Thrift Server instances on hosts other than HiveServer2

Approach: Run all commands as the root user.

1. Back up hive.service.keytab in /etc/security/keytabs on the Hive Server host by making a copy of the file and moving the copy to a directory other than /etc/security/keytabs. If the Spark Thrift Server host also has hive.service.keytab in /etc/security/keytabs, make a copy of that file as well and move the copy to a different directory.
2. On the Ambari Server node, run the following command from the command line to obtain and cache a Kerberos ticket-granting ticket:
kinit [admin principal]

Type in the admin principal password when asked. The admin principal name and password are the ones used to enable Kerberos via Ambari. For example, if the admin principal used to enable Kerberos was root/admin and the corresponding password was abc123, run kinit root/admin and type abc123 when prompted for the password.

3. On the Ambari Server node, in a temporary directory, run the following command to open the kadmin shell:
kadmin
4. Add a new principal of the form hive/[spark_thrift_server_host]@[Kerberos realm]. Replace [spark_thrift_server_host] with the host name of the Spark Thrift Server on the cluster. Replace [Kerberos realm] with the Kerberos realm used when enabling Kerberos in Ambari; for example, if Kerberos was enabled in Ambari with Kerberos realm MyDomain.COM, use it to replace [Kerberos realm].
addprinc -randkey hive/[spark_thrift_server_host]@[Kerberos realm]
5. Add all Hive principals to the Hive service keytab file. This should include the existing principal for the Hive Server host and the one created in the previous step:

ktadd -k hive.service.keytab hive/[spark_thrift_server_host]@[Kerberos realm]
ktadd -k hive.service.keytab hive/[hive_server_host]@[Kerberos realm]

Replace [spark_thrift_server_host], [hive_server_host] and [Kerberos realm] with information specific to the cluster. The kadmin shell should print messages indicating that the principal has been added to the file. For example:

kadmin: ktadd -k hive.service.keytab hive/myserver1.mydomain.com@MyDomain.COM
Entry for principal hive/myserver1.mydomain.com@MyDomain.COM with kvno 3, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:hive.service.keytab.
Entry for principal hive/myserver1.mydomain.com@MyDomain.COM with kvno 3, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:hive.service.keytab.

6. Type exit to exit the kadmin shell. Find the newly generated hive.service.keytab in the current directory.
7. Copy it to /etc/security/keytabs on the Spark Thrift Server host, and use it to replace the existing /etc/security/keytabs/hive.service.keytab on the Hive Server host.
8. Update the permissions and ownership of the file on both the Spark Thrift Server host and the Hive Server host as shown below:

chmod 400 hive.service.keytab
chown [hive_user]:[hive_user_primary_group] hive.service.keytab

9. Stop all Spark components via the Ambari web UI and ensure there are no running Spark processes on the Spark component hosts. Restart Hive from the Ambari UI, then start the Spark service from the Ambari UI.

A consolidated sketch of the keytab steps above follows.
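This sketch condenses the keytab-related steps into one non-interactive sequence. It is only a sketch; STS_HOST, HS2_HOST, REALM and the hive:hadoop ownership are placeholders/assumptions to substitute for your cluster, and the scp copy is just one way to distribute the file.

# Run on the Ambari Server node as root, in a temporary working directory
kinit root/admin                                   # the admin principal used to enable Kerberos
kadmin -q "addprinc -randkey hive/STS_HOST@REALM"
kadmin -q "ktadd -k hive.service.keytab hive/STS_HOST@REALM"
kadmin -q "ktadd -k hive.service.keytab hive/HS2_HOST@REALM"

# Distribute the merged keytab to both hosts (copy mechanism is illustrative)
scp hive.service.keytab STS_HOST:/etc/security/keytabs/
scp hive.service.keytab HS2_HOST:/etc/security/keytabs/

# On each host, fix permissions and ownership (owner/group are assumptions)
chmod 400 /etc/security/keytabs/hive.service.keytab
chown hive:hadoop /etc/security/keytabs/hive.service.keytab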
10-18-2017
11:35 PM
thank you. I had to increase the maxThreads to get it working.
09-29-2017
04:07 PM
Setting

hive.support.concurrency=true
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager

will install a lock manager (there are several; the ZooKeeper-based one is the default) without enabling full ACID. If you instead use hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager, then hive.lock.manager is ignored and you will be using the metastore-based lock manager that ACID uses; but as long as you don't create your tables with "transactional=true", all your tables remain the same. I believe external tables should be locked in this case.
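For example, the effective values for a session can be checked from beeline (the JDBC URL below is just a placeholder):

# Print the effective settings; "set <key>;" echoes the current value
beeline -u "jdbc:hive2://localhost:10000" \
  -e "set hive.support.concurrency; set hive.txn.manager; set hive.lock.manager;"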
09-18-2017
06:52 PM
3 Kudos
Scenario: The Spark log4j properties (Ambari > Spark > Configs) are not configured to log to a file. When running a job in yarn-client mode, the driver logs are written to the console. For long-running jobs, it can be difficult to capture the driver logs for various reasons: the user may lose the connection to the terminal, may have closed the terminal, and so on. The driver log is a useful artifact when investigating a job failure. In such scenarios, it is better to have the Spark driver log to a file instead of the console. Here are the steps:

1. Place a driver_log4j.properties file in a location of your choice (say /tmp) on the machine where you will be submitting the job in yarn-client mode.

Contents of driver_log4j.properties:

#Set everything to be logged to the file
log4j.rootCategory=INFO,FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/SparkDriver.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=500MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
#Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Change the value of log4j.appender.FILE.File as needed.

2. Add the following to the spark-submit command so that it picks up the above log4j properties and makes the driver log to a file:

--driver-java-options "-Dlog4j.configuration=file:/tmp/driver_log4j.properties"

Example:

spark-submit --driver-java-options "-Dlog4j.configuration=file:/tmp/driver_log4j.properties" --class org.apache.spark.examples.JavaSparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 spark-examples*.jar 10

3. Now, once you submit this new command, the Spark driver will log to the location specified by log4j.appender.FILE.File in driver_log4j.properties; that is, it will log to /tmp/SparkDriver.log.

Note: The executor logs can always be fetched from the Spark History Server UI, whether you are running the job in yarn-client or yarn-cluster mode:

a. Go to the Spark History Server UI
b. Click on the App ID
c. Navigate to the Executors tab
d. The Executors page will list the links to the stdout and stderr logs

A command-line alternative for finished applications is sketched below.
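Assuming YARN log aggregation is enabled, aggregated driver and executor container logs can also be pulled with the YARN CLI once the application finishes; the application ID below is a placeholder:

# Dump aggregated container logs for a completed application (placeholder app ID)
yarn logs -applicationId application_1234567890123_0001 > /tmp/spark_app_logs.txt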
09-15-2017
02:37 AM
3 Kudos
The parent article discusses how to access the ams-hbase instance using the Phoenix client in a standalone environment. This article is an extension of that: it lists the steps to access the ams-hbase instance using the Phoenix client when ZooKeeper is installed on the cluster. As described in the parent article, log in to the AMS Collector host machine:

1. Check /etc/ams-hbase/conf/hbase-site.xml.
2. From that file, pick the hbase.zookeeper.quorum, hbase.zookeeper.property.clientPort and zookeeper.znode.parent values.
3. Go to the client directory: cd /usr/lib/ambari-metrics-collector/bin
4. Invoke the client with the connection URL built by substituting those values, in the form [hbase.zookeeper.quorum]:[hbase.zookeeper.property.clientPort]:[zookeeper.znode.parent].

Example:

./sqlline.py myzkqrm.com:61181:/ams-hbase-secure

These steps are also summarized in the sketch below.
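A minimal sketch of steps 1-4, run on the AMS Collector host; the grep pattern and the example connection string simply restate the values described above and must be replaced with the ones from your own hbase-site.xml:

# Pull the three relevant values out of the AMS HBase configuration
grep -A1 -E 'hbase.zookeeper.quorum|hbase.zookeeper.property.clientPort|zookeeper.znode.parent' \
  /etc/ams-hbase/conf/hbase-site.xml

# Build the URL as quorum:clientPort:znode.parent and launch the Phoenix client
cd /usr/lib/ambari-metrics-collector/bin
./sqlline.py myzkqrm.com:61181:/ams-hbase-secure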
09-14-2017
04:04 AM
4 Kudos
Often the NameNode log files grow large and contain many kinds of messages. One of the most commonly faced scenarios is that there have been multiple state changes in HDFS, and investigating them becomes a pain when there are multiple occurrences in huge log files. Luckily, a few configuration changes are all it takes to ensure that state change log statements get logged to a separate file. To isolate state change messages into another file, add the following to hdfs-log4j and restart the NameNodes. You can make this change from Ambari: Ambari > HDFS service > Configs tab > Advanced tab > Advanced hdfs-log4j section > hdfs-log4j template.

# StateChange log
log4j.logger.org.apache.hadoop.hdfs.StateChange=INFO,SCL
log4j.additivity.org.apache.hadoop.hdfs.StateChange=false
log4j.appender.SCL=org.apache.log4j.RollingFileAppender
log4j.appender.SCL.File=${hadoop.log.dir}/hdfs-state-change.log
log4j.appender.SCL.MaxFileSize=256MB
log4j.appender.SCL.MaxBackupIndex=20
log4j.appender.SCL.layout=org.apache.log4j.PatternLayout
log4j.appender.SCL.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} (%F:%M(%L)) - %m%n

With this in place, HDFS StateChange log messages will be written to ${hadoop.log.dir}/hdfs-state-change.log.
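After restarting the NameNodes, the new file can be checked directly. This is a sketch that assumes ${hadoop.log.dir} resolves to /var/log/hadoop/hdfs, a common HDP default; substitute your actual log directory:

# Confirm that state change messages now land in the dedicated file
ls -l /var/log/hadoop/hdfs/hdfs-state-change.log
tail -n 20 /var/log/hadoop/hdfs/hdfs-state-change.log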