Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5430 | 08-12-2016 01:02 PM |
 | 2204 | 08-08-2016 10:00 AM |
 | 2613 | 08-03-2016 04:44 PM |
 | 5518 | 08-03-2016 02:53 PM |
 | 1430 | 08-01-2016 02:38 PM |
02-09-2016
01:07 PM
Glad that you fixed it. Unfortunately I don't really know how commonly this error occurs; the patch code is pretty complex. You definitely do not want to disable the mapjoin conversion. I don't suppose an upgrade is in sight?
02-09-2016
10:21 AM
That is definitely not too many files. During init, all objects are initialized, tasks are set up, context objects are created, ... I was wondering whether dimension tables are also loaded during task creation ( in case of a map-side join ), but you don't have one, since your join happens in the reducer. So I am really at a loss as to what could be going wrong here. I think you might have to open a support ticket for this one.
02-09-2016
10:13 AM
2 Kudos
Not sure what the effect would be. Normally, commonly needed files like libraries are stored with a high replication level so every task has easy access to them. However, YARN actively places tasks on local nodes; in other words, you do not have to increase the replication level to make tasks local, because the system already places tasks next to the data. Increasing the replication level might actually reduce the ability of the operating system to cache the data in the OS read buffer ( i.e. if two tasks run on the same node, the data is read from the Linux FS buffer rather than from disk, and spreading more copies across the cluster makes that much less likely ). On the other hand, you give YARN/Tez more possibilities to place and reuse containers, so that might be good. In the end you would have to try it out to see how the effects turn out. Your use case is something that would be very well suited for LLAP, by the way: it keeps an ORC cache in memory.
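If you do want to experiment with a higher replication factor for a hot table, it is a one-off HDFS command rather than a table property; a minimal sketch ( the warehouse path and factor of 5 are just example values, adjust to your table and measure before/after ):

```bash
# Raise the replication factor for one table's files and wait until the copies exist;
# run it again with 3 to go back to the usual default.
hdfs dfs -setrep -w 5 /apps/hive/warehouse/mydb.db/my_hot_table
```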
02-08-2016
03:16 PM
http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data Are you using CDH or HDP? In HDP I would propose the ORC format; it is very similar to Parquet and just better supported and tested. If your load from SQL Server is slow, it is most likely not the Hive table creation but Sqoop, so you could increase the number of mappers, but there might not be an easy fix. If you have the problems in the INSERT INTO, you can look into the PPT for tips ( specifically the distribution methods near the end ).
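A minimal Sqoop sketch for the import side, assuming the bottleneck really is the pull from SQL Server ( the connection string, table, and split column are hypothetical; pick a well-distributed numeric column for --split-by ):

```bash
# More parallel mappers: each mapper imports one slice of the split column's range.
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=mydb" \
  --username myuser -P \
  --table my_table \
  --split-by id \
  --num-mappers 8 \
  --hive-import --hive-table my_table_staging
```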
02-08-2016
02:07 PM
Look at the presentation for settings to increase your task RAM. OOM can occur for example in ORC creation. But normally they only occur if a task needs to write to too many partitions at a time therefore using one of the redistribution mechanisms in the ppt is helpful. How do you load them? Sqoop directly? Is your target table ORC? Are you loading one partition (day, month) at a time or a full time range? As explained in the PPT dynamically loading into dozens of partitions easily leads into OOM exceptions. Redistribute your data to increase load performance and avoid OOM.
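A rough sketch of the redistribution idea from the PPT, assuming a dynamic-partition insert from a staging table ( table and column names are hypothetical ): DISTRIBUTE BY on the partition column sends each day to one reducer, so each task keeps only a single ORC writer open instead of dozens.

```bash
hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- one reducer per day => one partition writer per task instead of one per day loaded
INSERT INTO TABLE target_orc PARTITION (day)
SELECT col1, col2, day
FROM   staging_text
DISTRIBUTE BY day;
"
```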
02-08-2016
02:01 PM
How many files do you have in each partition folder? I have had problems before when too many files in too many partitions exceeded the size limit for Tez query definitions ( 64 MB I think ). I am not sure whether it failed immediately or whether I had to look into the application logs to figure that out; but you said there were no errors, so this is weird. Are these ORC files? If there are hundreds or thousands of files in each day and this is the problem, you could try ALTER TABLE ... CONCATENATE to merge them into fewer files ( or use something like Pig for text files, or change your loading ).
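If it does turn out to be lots of small ORC files per day, merging a partition is a single statement ( hypothetical table and partition value ):

```bash
# CONCATENATE merges the small ORC files of one partition into fewer, larger files.
hive -e "ALTER TABLE my_table PARTITION (day='2016-02-01') CONCATENATE"
```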
02-08-2016
01:52 PM
4 Kudos
Which version of HDP are you running? The hadoop command environment settings are set in the hadoop-env template in Ambari ( under the HDFS configs ), via HADOOP_OPTS and HADOOP_CLIENT_OPTS. Below is the template from HDP 2.3.4 running Java 1.8; perhaps you can compare it with yours. I see one line that carries those settings, but it is in the java_version < 8 section of the template:

export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m -XX:MaxPermSize=512m $HADOOP_CLIENT_OPTS"
```
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME={{java_home}}
export HADOOP_HOME_WARN_SUPPRESS=1
# Hadoop home directory
export HADOOP_HOME=${HADOOP_HOME:-{{hadoop_home}}}
# Hadoop Configuration Directory
{# this is different for HDP1 #}
# Path to jsvc required by secure HDP 2.0 datanode
export JSVC_HOME={{jsvc_path}}
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE="{{hadoop_heapsize}}"
export HADOOP_NAMENODE_INIT_HEAPSIZE="-Xms{{namenode_heapsize}}"
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
# Command specific options appended to HADOOP_OPTS when specified
HADOOP_JOBTRACKER_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile={{hdfs_log_dir_prefix}}/$USER/hs_err_pid%p.log -XX:NewSize={{jtnode_opt_newsize}} -XX:MaxNewSize={{jtnode_opt_maxnewsize}} -Xloggc:{{hdfs_log_dir_prefix}}/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xmx{{jtnode_heapsize}} -Dhadoop.security.logger=INFO,DRFAS -Dmapred.audit.logger=INFO,MRAUDIT -Dhadoop.mapreduce.jobsummary.logger=INFO,JSA ${HADOOP_JOBTRACKER_OPTS}"
HADOOP_TASKTRACKER_OPTS="-server -Xmx{{ttnode_heapsize}} -Dhadoop.security.logger=ERROR,console -Dmapred.audit.logger=ERROR,console ${HADOOP_TASKTRACKER_OPTS}"
{% if java_version < 8 %}
SHARED_HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile={{hdfs_log_dir_prefix}}/$USER/hs_err_pid%p.log -XX:NewSize={{namenode_opt_newsize}} -XX:MaxNewSize={{namenode_opt_maxnewsize}} -XX:PermSize={{namenode_opt_permsize}} -XX:MaxPermSize={{namenode_opt_maxpermsize}} -Xloggc:{{hdfs_log_dir_prefix}}/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms{{namenode_heapsize}} -Xmx{{namenode_heapsize}} -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT"
export HADOOP_NAMENODE_OPTS="${SHARED_HADOOP_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node\" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 ${HADOOP_NAMENODE_OPTS}"
export HADOOP_DATANODE_OPTS="-server -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms{{dtnode_heapsize}} -Xmx{{dtnode_heapsize}} -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_DATANODE_OPTS}"
export HADOOP_SECONDARYNAMENODE_OPTS="${SHARED_HADOOP_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-secondarynamenode/bin/kill-secondary-name-node\" ${HADOOP_SECONDARYNAMENODE_OPTS}"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m -XX:MaxPermSize=512m $HADOOP_CLIENT_OPTS"
{% else %}
SHARED_HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile={{hdfs_log_dir_prefix}}/$USER/hs_err_pid%p.log -XX:NewSize={{namenode_opt_newsize}} -XX:MaxNewSize={{namenode_opt_maxnewsize}} -Xloggc:{{hdfs_log_dir_prefix}}/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms{{namenode_heapsize}} -Xmx{{namenode_heapsize}} -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT"
export HADOOP_NAMENODE_OPTS="${SHARED_HADOOP_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node\" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 ${HADOOP_NAMENODE_OPTS}"
export HADOOP_DATANODE_OPTS="-server -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms{{dtnode_heapsize}} -Xmx{{dtnode_heapsize}} -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_DATANODE_OPTS}"
export HADOOP_SECONDARYNAMENODE_OPTS="${SHARED_HADOOP_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-secondarynamenode/bin/kill-secondary-name-node\" ${HADOOP_SECONDARYNAMENODE_OPTS}"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m $HADOOP_CLIENT_OPTS"
{% endif %}
HADOOP_NFS3_OPTS="-Xmx{{nfsgateway_heapsize}}m -Dhadoop.security.logger=ERROR,DRFAS ${HADOOP_NFS3_OPTS}"
HADOOP_BALANCER_OPTS="-server -Xmx{{hadoop_heapsize}}m ${HADOOP_BALANCER_OPTS}"
# On secure datanodes, user to run the datanode as after dropping privileges
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER:-{{hadoop_secure_dn_user}}}
# Extra ssh options. Empty by default.
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o SendEnv=HADOOP_CONF_DIR"
# Where log files are stored. $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR={{hdfs_log_dir_prefix}}/$USER
# History server logs
export HADOOP_MAPRED_LOG_DIR={{mapred_log_dir_prefix}}/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR={{hdfs_log_dir_prefix}}/$HADOOP_SECURE_DN_USER
# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1
# The directory where pid files are stored. /tmp by default.
export HADOOP_PID_DIR={{hadoop_pid_dir_prefix}}/$USER
export HADOOP_SECURE_DN_PID_DIR={{hadoop_pid_dir_prefix}}/$HADOOP_SECURE_DN_USER
# History server pid
export HADOOP_MAPRED_PID_DIR={{mapred_pid_dir_prefix}}/$USER
YARN_RESOURCEMANAGER_OPTS="-Dyarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY"
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
# Use libraries from standard classpath
JAVA_JDBC_LIBS=""
#Add libraries required by mysql connector
for jarFile in `ls /usr/share/java/*mysql* 2>/dev/null`
do
JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
done
# Add libraries required by oracle connector
for jarFile in `ls /usr/share/java/*ojdbc* 2>/dev/null`
do
JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
done
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${JAVA_JDBC_LIBS}
# Setting path to hdfs command line
export HADOOP_LIBEXEC_DIR={{hadoop_libexec_dir}}
# Mostly required for hadoop 2.0
export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}
export HADOOP_OPTS="-Dhdp.version=$HDP_VERSION $HADOOP_OPTS"
{% if is_datanode_max_locked_memory_set %}
# Fix temporary bug, when ulimit from conf files is not picked up, without full relogin.
# Makes sense to fix only when running DN as root
if [ "$command" == "datanode" ] && [ "$EUID" -eq 0 ] && [ -n "$HADOOP_SECURE_DN_USER" ]; then
ulimit -l {{datanode_max_locked_memory}}
fi
{% endif %}
```
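For reference, if you just want a bigger client heap on Java 8, the knob is the same pair of variables in the else branch above; a minimal sketch ( the 2048 is an arbitrary example value ):

```bash
# HADOOP_HEAPSIZE is in MB and feeds the -Xmx of HADOOP_CLIENT_OPTS;
# on Java 8 the MaxPermSize flag is gone, since PermGen was removed.
export HADOOP_HEAPSIZE="2048"
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m $HADOOP_CLIENT_OPTS"
```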
02-08-2016
12:55 PM
I do not understand the LOADBALANCER idea. Partitions only help you if you can filter on the partition column, but you will never run a query like SELECT * FROM TABLE WHERE LOADBALANCER = 1, right? Hive then needs to read all folders anyway, which actually results in worse performance in HiveServer ( lots of partitions mean overhead during split generation ). The day partitioning only helps you if you have queries like SELECT * FROM TABLE WHERE DAY = 3;
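To make the pruning point concrete, a quick sketch ( hypothetical table and column names ): only a filter on the partition column lets Hive skip folders.

```bash
# Filter on the partition column day -> Hive only lists and reads the matching folder:
hive -e "SELECT COUNT(*) FROM my_table WHERE day = 3"
# No filter on a hypothetical loadbalancer partition column -> every one of its
# sub-folders is listed and read anyway; you only pay extra split-generation overhead:
hive -e "SELECT COUNT(*) FROM my_table WHERE some_other_col = 'x'"
```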
02-08-2016
12:51 PM
How long will a reload take? That depends on your load settings ( how many reducers, see the PPT ) and the data volume. As I said, loading a large partitioned table is a bit tricky. Regarding the partitioning scheme: you can do month/day or just day, depending on which columns you want to filter on ( having just day is a bit easier for roll-in/roll-out, but multi-level partitioning is also possible ). Loadbalancer idea in a sec.
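A minimal sketch of the two layouts ( hypothetical table and columns ); which is better depends on the filters you actually run:

```bash
# Single-level daily partitions, easiest for roll-in / roll-out:
hive -e "CREATE TABLE events_day (id BIGINT, payload STRING)
         PARTITIONED BY (day STRING) STORED AS ORC"
# Multi-level month/day partitions, also fine if you regularly filter on month:
hive -e "CREATE TABLE events_month_day (id BIGINT, payload STRING)
         PARTITIONED BY (month STRING, day STRING) STORED AS ORC"
```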
02-08-2016
12:30 PM
2 Kudos
Could be a lot of things, anything from a misconfigured YARN queue to a job that simply takes a long time. Have a look at the ResourceManager UI (:8088) and get the logs for the job that is being kicked off. You can see whether the job started, whether containers have been allocated and are running, and whether it is reading/writing data. ( Don't use Tez in the beginning; I would first figure out what is wrong with the job or YARN. ) So: go to the ResourceManager UI -> click on the job that was kicked off ( you can see the job name in Pig, or just look for a running application ) -> click on Application Master -> you should see all containers started for it. If there are none, you might have a YARN queue problem. If mappers have started, click on Maps ( lower right ) and check whether they are running, and perhaps look into the logs to see what is going on.
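If you prefer the shell over the UI, the same information is available from the YARN client ( the application id below is a placeholder you copy from the list ):

```bash
# List running applications with their queue and progress:
yarn application -list -appStates RUNNING
# Once you have the id, pull the aggregated container logs for it:
yarn logs -applicationId <application_id>
```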