Member since: 03-01-2017
Posts: 58
Kudos Received: 5
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1574 | 11-05-2017 10:36 PM
07-26-2022
08:47 PM
With the need to keep metrics centralized in a single spot, people have sought a way to configure Grafana for observability, collecting the wide range of metrics available in a Datahub / Datalake so that their dashboards are arranged in a centralized hub. This tutorial will walk you through, step by step, how to configure Grafana to query the metrics available in the Cloudera Manager of a Datahub cluster.

First, we need to ensure that the machine where Grafana is running has a direct connection to the CM Server. In other words, it must be able to resolve the CM FQDN and establish a direct connection to the service, bypassing Knox completely. Unlike the Grafana used by other experiences, such as DWX and CDW, which integrate through Prometheus, this Grafana has to integrate with the CM Server for authentication and, when creating a datasource, for querying the metrics. This integration is available through a plugin that must be installed after the Grafana deployment and that is not maintained or developed by Cloudera.

Installing Grafana:
* yum -y install grafana
* systemctl start grafana-server

Make sure the service has proper access to the folder "/var/lib/grafana/".
* grafana-cli plugins install foursquare-clouderamanager-datasource

The Grafana server must be restarted before the plugin can be used.
* systemctl restart grafana-server

Next, locate the machine where the CM Server is running and fetch the TLS certificate presented by the service. This step can be accomplished from any server with proper access to the Cloudera Manager. Ex.

openssl s_client -showcerts -connect <datahub-name>.dperez-a.a465-9q4k.cloudera.site:7183

Port 7183 is the secured HTTPS endpoint used by CM, so we can use openssl to extract the certificate actively used by the service, a technique widely adopted by sysadmins when dealing with TLS/SSL (a short connectivity-check sketch is included after the references below).

As a next step, toggle on the option "With CA Cert" in the datasource configuration and paste the certificate acquired in the previous step, along with the workload user and password under "Basic Auth Details". The URL should follow the pattern below, where CM_SERVER_FQDN must be replaced with the CM Server in use by your Datahub. If there are any errors along the way, you should see a pop-up displaying the exact error message; you can also inspect the Grafana logs for further clarification if the error isn't intuitive at first sight.

https://CM_SERVER_FQDN:7183

After the procedure is completed, you should be able to start setting up charts by providing a valid tsquery. The tsquery language is used to specify statements for retrieving time-series data from the Cloudera Manager time-series datastore. You can check the tsquery behind the many charts available in your CM UI and use it as a reference when building your own set of charts.

Ref: https://grafana.com/grafana/plugins/foursquare-clouderamanager-datasource/?tab=installation
Ref: https://github.com/foursquare/datasource-plugin-clouderamanager
Ref: https://docs.cloudera.com/cloudera-manager/7.4.2/monitoring-and-diagnostics/topics/cm-tsquery-language.html
Ref: https://docs.cloudera.com/cloudera-manager/7.4.2/metrics/topics/cm-metrics-reference.html
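Connectivity check: before filling in the datasource form, it can help to verify the certificate and credentials from the Grafana host itself. This is a minimal sketch, assuming CM_SERVER_FQDN and workload_user are placeholders for your own environment and cm-server.pem is just an example file name:

# Extract the certificate presented by CM on the TLS port 7183 and save it as PEM
openssl s_client -showcerts -connect CM_SERVER_FQDN:7183 </dev/null 2>/dev/null | openssl x509 -outform PEM > cm-server.pem
# Confirm the workload credentials are accepted by the CM API before configuring the datasource
curl --cacert cm-server.pem -u workload_user "https://CM_SERVER_FQDN:7183/api/version"

If the curl call returns an API version, the same URL, certificate, and credentials should work in the Grafana datasource form. For a first chart, a tsquery as simple as "select cpu_user_rate where category = HOST" (adapted from the tsquery documentation examples) is enough to confirm that data is flowing.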
05-14-2018
09:42 PM
2 Kudos
By default, HiveServer2 doesn't have an entry in hive-env.sh that provides a way to change the type of GC. The CMS collector is designed to eliminate the long pauses associated with the full GC cycles of the throughput and serial collectors. CMS still stops all application threads during a minor GC, which it performs with multiple threads, but it collects the old generation concurrently with the application.

To change the type of GC, go to Ambari > Hive > Configs > Advanced hive-env > hive-env template and add the following snippet:

if [ "$SERVICE" = "hiveserver2" ]; then
if [ -z "$DEBUG" ]; then
export HADOOP_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hive/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms1024m -Xmx1024m"
else
export HADOOP_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hive/$USER/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms1024m -Xmx1024m"
fi
fi
After restarting HiveServer2, make sure that the new settings have been applied successfully, as well as the heap sizes:

/usr/jdk64/jdk1.8.0_112/bin/jcmd <hiveserver_pid> VM.flags
/usr/jdk64/jdk1.8.0_112/bin/jmap -heap <hiveserver_pid>

Note: This article is not meant to provide the best JVM flags; those will vary according to your environment. The general idea is to scale out by adding more HS2 instances when the existing ones are highly utilized. Please check with a HWX consultant to fine-tune this for your workload.
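If you are not sure of the HiveServer2 PID, here is a minimal sketch for locating it and running the checks above, assuming the process command line contains "hiveserver2" and the same JDK path as above:

# Find the HiveServer2 PID (assumes its command line contains "hiveserver2")
HS2_PID=$(pgrep -f hiveserver2 | head -n 1)
# Verify the collector in effect; look for -XX:+UseConcMarkSweepGC in the output
/usr/jdk64/jdk1.8.0_112/bin/jcmd "$HS2_PID" VM.flags
# Verify the configured heap sizes and current usage
/usr/jdk64/jdk1.8.0_112/bin/jmap -heap "$HS2_PID"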
04-18-2018
08:15 PM
Hello,

As some of you already know, Solr through Knox on the HDP platform isn't fully supported yet; however, it is possible to achieve it using an IBM IOP distribution. Problems in this area are usually related to Kerberos issues. Here are some steps.

Pre-reqs: You should have already configured Knox with the desired authentication mode.

Note: In order to use the flag "-Dsun.security.krb5.rcache", JDK 1.8 or above must be used.

Root causes of the most common errors:
a) You may not have configured your browser for authentication (SPNEGO).
b) You haven't included your users in the Solr plugin on Ranger.
c) You are hitting a known issue related to the parameter "-Dsun.security.krb5.rcache=none", which is better described in this forum post: https://community.hortonworks.com/content/supportkb/150162/error-gssexception-failure-unspecified-at-gss-api.html

1) Add the following parameters to your Hadoop core-site.xml:
hadoop.proxyuser.knox.groups = *
hadoop.proxyuser.knox.hosts = *
Note: You can adjust the impersonation requirements according to your environment.

2) Because of the known issue mentioned above, add the parameter "-Dsun.security.krb5.rcache=none" under Ambari > Solr > solr.env. The configuration needs to look like this:
SOLR_OPTS="-Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.confdir=/etc/hadoop/conf -Dsolr.hdfs.home={{fs_root}}{{solr_hdfs_home_dir}} -Dsolr.hdfs.security.kerberos.enabled={{security_enabled}} -Dsolr.hdfs.security.kerberos.keytabfile={{solr_kerberos_keytab}} -Dsolr.hdfs.security.kerberos.principal={{solr_kerberos_principal}} -Dsun.security.krb5.rcache=none -Dsolr.log4j.dir={{solr_log_dir}}"

3) Go to "Quick Links > Ranger > Ranger Admin UI > Solr" and add the user "knox".

After these steps, your Solr UI should work fine through Knox.
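To confirm the path end to end after the changes, you can hit Solr through the gateway from the command line. This is a minimal sketch, assuming a topology named "default", the default gateway port 8443, and KNOX_HOST / myuser as placeholders; adjust them to your own Knox topology, Solr service mapping, and users:

# Request the Solr endpoint through the Knox gateway with basic authentication
curl -ik -u myuser:mypassword "https://KNOX_HOST:8443/gateway/default/solr/"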
10-07-2018
09:31 AM
Thanks for the great information. I'm having trouble connecting to MongoDB with SSL (.pem configuration) from Spark and Scala via IDEA. Do you have any suggestions on this?
11-06-2017
03:19 PM
Hello,
I'm still seeing some people struggling to run their own MapReduce applications from the command line. For those who are not Java developers, here is some quick guidance.
Let's create a new directory and put our Java source file, WordCount.java, in it.
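A minimal sketch of the layout assumed by the compile and packaging commands further down; the directory name job/ comes from those commands, and the editor is your choice:

# Create a working directory and save the class below as job/WordCount.java
mkdir -p job
vi job/WordCount.java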
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
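// Mapper: splits each input line into tokens and emits (word, 1) for every token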
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
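// Reducer (also used as the combiner): sums the counts emitted for each word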
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
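// Driver: configures the job, sets the input/output paths, and submits it to the cluster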
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
From the client side, the compiler needs to be able to resolve the external classes and libraries referenced by the import lines. Let's find our Hadoop classpath and use it to resolve those dependencies:
-sh-4.1$ hadoop classpath
/usr/hdp/2.6.2.0-205/hadoop/conf:/usr/hdp/2.6.2.0-205/hadoop/lib/*:/usr/hdp/2.6.2.0-205/hadoop/.//*:/usr/hdp/2.6.2.0-205/hadoop-hdfs/./:/usr/hdp/2.6.2.0-205/hadoop-hdfs/lib/*:/usr/hdp/2.6.2.0-205/hadoop-hdfs/.//*:/usr/hdp/2.6.2.0-205/hadoop-yarn/lib/*:/usr/hdp/2.6.2.0-205/hadoop-yarn/.//*:/usr/hdp/2.6.2.0-205/hadoop-mapreduce/lib/*:/usr/hdp/2.6.2.0-205/hadoop-mapreduce/.//*::mysql-connector-java-5.1.17.jar:mysql-connector-java.jar:/usr/hdp/2.6.2.0-205/tez/*:/usr/hdp/2.6.2.0-205/tez/lib/*:/usr/hdp/2.6.2.0-205/tez/conf
Now compile the source against the Hadoop classpath:
/usr/jdk64/jdk1.8.0_112/bin/javac -classpath $(/usr/hdp/current/hadoop-client/bin/hadoop classpath) -d job/ job/WordCount.java
Now that all the classes have been compiled into .class files, let's package them into a single jar.
-sh-4.1$ /usr/jdk64/jdk1.8.0_112/bin/jar -cvf Test.jar -C job/ .
Execute the MapReduce program:
-sh-4.1$ hadoop jar Test.jar WordCount /tmp/sample_07.csv /tmp/output_mapred
17/11/05 23:41:50 INFO client.RMProxy: Connecting to ResourceManager at minotauro3.hostname.br/xxx.xx.xxx.xx:8050
17/11/05 23:41:51 INFO client.AHSProxy: Connecting to Application History server at minotauro3.hostname.br/xxx.xx.xxx.xx:10200
17/11/05 23:41:51 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 12603 for bob1 on ha-hdfs:cluster2
17/11/05 23:41:51 INFO security.TokenCache: Got dt for hdfs://cluster2; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:cluster2, Ident: (HDFS_DELEGATION_TOKEN token 12603 for bob1)
......
File Input Format Counters
Bytes Read=46055
File Output Format Counters
Bytes Written=36214
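If the job completes successfully, the word counts are written to the output directory given on the command line. A quick way to inspect them, assuming the same paths used above and the default reducer output file name:

# List the job output directory and print the first lines of the reducer output
hdfs dfs -ls /tmp/output_mapred
hdfs dfs -cat /tmp/output_mapred/part-r-00000 | head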
11-11-2017
12:39 PM
Using the sandbox 2.6 solved this problem. Thank you @Danilo Perez