Member since 10-13-2016
68 Posts
10 Kudos Received
3 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2111 | 02-15-2019 11:50 AM
 | 4641 | 10-12-2017 02:03 PM
 | 872 | 10-13-2016 11:52 AM
09-27-2017
12:25 PM
I have a Graphite server to which I want to send Hadoop metrics2 data. On paper it's easy: add log4j.logger.org.apache.hadoop.metrics2=DEBUG to the log4j template and update the hadoop-metrics2.properties template with:

*.sink.graphite.class=org.apache.hadoop.metrics2.sink.GraphiteSink
*.sink.graphite.server_host=10.x.x.x
*.sink.graphite.server_port=2003
datanode.sink.graphite.metrics_prefix=datanode
namenode.sink.graphite.metrics_prefix=namenode
resourcemanager.sink.graphite.metrics_prefix=resourcemanager
nodemanager.sink.graphite.metrics_prefix=nodemanager
jobhistoryserver.sink.graphite.metrics_prefix=jobhistoryserver
journalnode.sink.graphite.metrics_prefix=journalnode
maptask.sink.graphite.metrics_prefix=maptask
reducetask.sink.graphite.metrics_prefix=reducetask
applicationhistoryserver.sink.graphite.metrics_prefix=applicationhistoryserver

This works very well with one service (e.g. datanode). If I configure more than one, only 2 services show up in Graphite, and I cannot confirm that all metrics for those services are present. Not knowing what metrics to expect, and wanting to experiment, I do not want to filter on metric names to limit their number. On the collectd side I can see one metric dropped as invalid, but only one; it does not account for all the rest. Furthermore, setting CollectInternalStats to true shows me that no metrics are dropped. On the Hadoop side I could not find anything telling me whether metrics are actually sent, or whether sending succeeds or fails; nothing is logged anywhere. So my 2 questions are:

- How can I debug metrics2?
- Are there any known reasons why I am missing metrics?

Context: hdp2.6 on AWS.
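For anyone else trying to debug this: one way to at least see what each daemon actually puts on the wire, independent of what Graphite or collectd records, is to watch Graphite's plaintext protocol and to hand-feed a test metric. A minimal sketch, assuming tcpdump and nc are available on the Hadoop node and that 10.x.x.x:2003 matches the sink settings above:

sudo tcpdump -i any -A 'dst port 2003'
echo "debug.test.metric 1 $(date +%s)" | nc 10.x.x.x 2003

The first command prints every line sent to the Graphite plaintext port; the second confirms the network path end to end with a hand-crafted metric (debug.test.metric is just an illustrative name).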
Labels:
- Apache Hadoop
07-06-2017
10:44 AM
@Vani I am trying to understand what this memory will be used for. My understanding is that:

- any application will require its own AM
- one AM will use only 1 container
- tez-site/tez.am.resource.memory.mb defines the memory usable by the total of all AMs

So logically, all AM memory should never be more than half of the available memory (for the worst-case scenario where every application uses only one container), and I should allocate in tez-site/tez.am.resource.memory.mb (minimum container size * expected number of applications); a rough worked example is below. Could you confirm my understanding?
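To make the worst case concrete, here is the arithmetic with the numbers from my cluster (see my 07-03 post below); the resulting figure is only my own back-of-the-envelope reading, not a confirmed limit:

total YARN memory = 27660 MB
minimum container size (1 per AM) = 5532 MB
AM budget at half the cluster = 27660 / 2 = 13830 MB
worst-case concurrent applications = floor(13830 / 5532) = 2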
07-04-2017
01:22 PM
@Vani, Thanks for your answer. I do not see an immediate change, but I will carry on looking in this direction. What would be a good logical value for this maximum-am-resource-percent? Currently the AM memory (tez-site/tez.am.resource.memory.mb) is set to the min container size (5 GB in my case). Does that make sense?
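For reference, this is the property I am now experimenting with; the 0.5 here is only an illustrative value I am trying, not a recommendation:

capacity-scheduler/yarn.scheduler.capacity.maximum-am-resource-percent = 0.5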
07-03-2017
02:27 PM
I have a small one-node hdp2.6 cluster (8 CPUs, 32 GB RAM), and I cannot run more than 1 query at a time, although I was pretty sure that I had configured the relevant settings to allow more than one container. The relevant configs are:

yarn-site/yarn.nodemanager.resource.memory-mb = 27660
yarn-site/yarn.scheduler.minimum-allocation-mb = 5532
yarn-site/yarn.scheduler.maximum-allocation-mb = 27660
mapred-site/mapreduce.map.memory.mb = 5532
mapred-site/mapreduce.reduce.memory.mb = 11064
mapred-site/mapreduce.map.java.opts = -Xmx4425m
mapred-site/mapreduce.reduce.java.opts = -Xmx8851m
mapred-site/yarn.app.mapreduce.am.resource.mb = 11059
mapred-site/yarn.app.mapreduce.am.command-opts = -Xmx8851m -Dhdp.version=${hdp.version}
hive-site/hive.execution.engine = tez
hive-site/hive.tez.container.size = 5532
hive-site/hive.auto.convert.join.noconditionaltask.size = 1546859315
tez-site/tez.runtime.unordered.output.buffer.size-mb = 414
tez-interactive-site/tez.am.resource.memory.mb = 5532
tez-site/tez.am.resource.memory.mb = 5532
tez-site/tez.task.resource.memory.mb = 5532
tez-site/tez.runtime.io.sort.mb = 1351
hive-site/hive.tez.java.opts = -server -Xmx4425m -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps
capacity-scheduler/yarn.scheduler.capacity.resource-calculator = org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
yarn-site/yarn.nodemanager.resource.cpu-vcores = 6
yarn-site/yarn.scheduler.maximum-allocation-vcores = 6
mapred-site/mapreduce.map.output.compress = true
hive-site/hive.exec.compress.intermediate = true
hive-site/hive.exec.compress.output = true
hive-interactive-env/enable_hive_interactive = false
Which, if I understand it well, gives 5 GB per container. If I run a hive query, it will use 5 GB and 1 core, leaving about 15 GB and 5 cores for the rest (the arithmetic I have in mind is spelled out below). I do not understand why the next query cannot start at the same time. Any help would be very welcome.
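Spelling out the memory arithmetic I have in mind (the separate Tez AM container is my assumption about what a single query holds, not something I have confirmed):

total: 27660 MB / 5532 MB = 5 containers of 5532 MB
1 running query = 1 task container + 1 Tez AM container = 11064 MB
free = 27660 - 11064 = 16596 MB (~16 GB), i.e. 3 whole containers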
Labels:
- Apache Hive
06-15-2017
08:06 AM
I was using hive 1 with hive.server2.enable.doas=true. Now I want to use hive-interactive, but hive.server2.enable.doas apparently has to be false (that is what Ambari says). This of course breaks most of my queries because of wrong permissions. I am curious to know why this setting cannot be true, and whether there is a known workaround for this. Context: hdp 2.6 with hive and hive-interactive. Thanks!
Labels:
- Apache Hive
06-15-2017
05:34 AM
Thanks, but I am not interested in this surrogate key. The point of defining the PK was to help reporting tools, for example, automatically discover the joins between tables; a surrogate key would thus not do. Thanks!
06-14-2017
02:10 PM
The example I gave was a trimmed-down version of what I wanted to do, to show the technical problem. My expected PK is actually a compound PK, with a few partition columns and a few non-partition columns. But I am afraid that your answer says it all: no can do :(. Thanks!
06-14-2017
10:54 AM
I want to add primary key constraints to hive tables. The only thing is that my PK is actually a partition column. For instance:

CREATE TABLE pk
(
  id INT,
  PRIMARY KEY(part) DISABLE NOVALIDATE
)
PARTITIONED BY (part STRING)

This fails with the error message:

DBCException: SQL Error [10002] [42000]: Error while compiling statement: FAILED: SemanticException [Error 10002]: Invalid column reference part

Is there a way to use a partitioned column as PK? Context: hdp 2.6, hive 2.1 with llap.
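For comparison, a version of the same DDL where the constraint references a column from the column list instead of a partition column (pk_ok is just an illustrative name); as far as I can tell this passes the semantic check that rejects part:

CREATE TABLE pk_ok
(
  id INT,
  PRIMARY KEY(id) DISABLE NOVALIDATE
)
PARTITIONED BY (part STRING)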
Labels:
- Apache Hive
04-24-2017
12:06 PM
The answer is that it is not possible to set those parameters globally. @Murali Ramasami has the right workaround.