Member since: 05-20-2016
Posts: 155
Kudos Received: 220
Solutions: 30
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7246 | 03-23-2018 04:54 AM |
| | 2656 | 10-05-2017 02:34 PM |
| | 1480 | 10-03-2017 02:02 PM |
| | 8408 | 08-23-2017 06:33 AM |
| | 3239 | 07-27-2017 10:20 AM |
11-03-2016
06:07 PM
4 Kudos
@Timothy Spann Thanks for your reply. Running the bulk upload as below resolved the problem. This is a workaround for a bug, as described in this link:
HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv
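Once the load finishes, it can be sanity-checked from the Phoenix SQL client. A minimal sketch, assuming an HDP-style install path for sqlline.py and a placeholder ZooKeeper host (adjust both to your environment):
/usr/hdp/current/phoenix-client/bin/sqlline.py <zookeeper-host>:2181
-- inside sqlline, count the loaded rows
SELECT COUNT(*) FROM EXAMPLE;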
09-26-2016
09:22 AM
6 Kudos
@Santhosh B Gowda To increase the maximum size of the workflow job definition, add the following property to oozie-site.xml: oozie.service.WorkflowAppService.WorkflowDefinitionMaxLength=<the maximum length of the workflow definition in bytes> For example: oozie.service.WorkflowAppService.WorkflowDefinitionMaxLength=1000000
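If oozie-site.xml is edited directly rather than through Ambari, the entry would look roughly like the sketch below (the 1000000-byte value is just the example figure from above); the Oozie server needs a restart to pick up the change:
<property>
  <name>oozie.service.WorkflowAppService.WorkflowDefinitionMaxLength</name>
  <value>1000000</value>
</property>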
09-21-2016
02:37 PM
4 Kudos
Ambari 2.4.0.0 officially supports the Log Search component [Tech Preview]. To learn more about the Log Search component, please refer to the link Ambari LogSearch. While the Log Search component does support pushing logs to a Kafka topic [based on which real-time log analytics can be performed], this is officially not supported in Ambari 2.4.0.0; it might be addressed in Ambari 2.5.0.0. This article provides the details on how to configure Log Search [the LogFeeder component] to push to a Kafka topic if there is a need to capture and perform real-time analytics based on the logs in your cluster.
1. After installing the Log Search component from Ambari 2.4.0.0, go to the Log Search config screen and, under "Advanced logfeeder-properties", append the following to the property "logfeeder.config.files": {default_config_files},kafka-output.json
2. Create the kafka-output.json file with the content below under the directory /etc/ambari-logsearch-logfeeder/conf/ on the nodes that run LogFeeder [ideally all the nodes in your cluster]:
{
"output": [
{
"is_enabled": "true",
"destination": "kafka",
"broker_list": "ctr-e25-1471039652053-0001-01-000006.test.domain:6667",
"topic": "log-streaming",
"conditions": {
"fields": {
"rowtype": [
"service"
]
}
}
}
]
}
3. If the cluster is Kerberized, configure a Kafka PLAINTEXT listener as below, because a workaround to push to a PLAINTEXTSASL endpoint is not available. Make sure the broker endpoint configured in Step #2 is the PLAINTEXT one.
PLAINTEXT://localhost:6667,PLAINTEXTSASL://localhost:6668
4. Create the Kafka topic and grant ACLs to the ANONYMOUS user. The commands below help with the same.
./bin/kafka-topics.sh --zookeeper zookeeper-node:2181 --create --topic log-streaming --partitions 1 --replication-factor 1
./bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer --authorizer-properties zookeeper.connect=zookeeper-node:2181 --add --allow-principal User:ANONYMOUS --operation Read --operation Write --operation Describe --topic log-streaming
5. Restart the Log Search service from Ambari, and that's it! Logs should be flowing by now. Below is the command to check the same.
./bin/kafka-console-consumer.sh --zookeeper zookeeper-node:2181 --topic log-streaming --from-beginning --security-protocol PLAINTEXT
08-25-2016
12:24 AM
5 Kudos
@suresh krish The answer from Santhosh B Gowda could be helpful, but that is brute force with a 50-50 chance of luck. You need to understand the query execution plan, how much data is processed, and how many tasks execute the job. Each task has a container allocated. You could increase the RAM allocated for the container, but if you have a single task performing the map and the data is larger than the memory allocated to the container, you will still see "Out of memory". What you have to do is understand how much data is processed and how to chunk it for parallelism. Increasing the size of the container is not always needed. It is almost like saying that instead of tuning a bad SQL statement, let's throw more hardware at it. It is better to have reasonably sized containers and have enough of them to process your query data.
For example, let's take a cross join of two tables that are small, 1,000,000 records each. The Cartesian product will be 1,000,000 x 1,000,000 = 1,000,000,000,000 rows. That is a big input for the mappers. You need to translate that into GB to understand how much memory is needed. For example, assuming that the memory requirement is 10 GB and tez.grouping.max-size is set to the default 1 GB, 10 mappers will be needed. Those will use 10 containers. Now assume that each container is set to 6 GB. You would be wasting 60 GB for a 10 GB need. In that specific case, it would actually be better to have 1 GB containers. Now, if your data is 10 GB and you have only one 6 GB container, that will generate "Out of memory". If the execution plan of the query has one mapper, that means one container is allocated, and if that is not big enough, you will get your out-of-memory error. However, if you reduce tez.grouping.max-size to a lower value that forces the execution plan to have multiple mappers, you will have one container for each, and those tasks will work in parallel, reducing the time and meeting the memory requirements. You can override the global tez.grouping.max-size for your specific query. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html describes Tez parameters and some of them could help; for your case, you could give tez.grouping.max-size a shot.
Summary:
- Understand the data volume that needs to be processed.
- EXPLAIN SqlStatement to understand the execution plan -- tasks and containers.
- Use the ResourceManager UI to see how many containers are used and the cluster resources used for this query; the Tez View can also give you a good understanding of the Mapper and Reducer tasks involved. The more of them, the more resources are used, but the better the response time. Balance that to use a reasonable amount of resources for a reasonable response time.
- Set tez.grouping.max-size to a value that makes sense for your query; by default it is set to 1 GB. That is a global value.
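As a concrete illustration of overriding the global value for a single query, something like the following could be issued in the Hive session before running the problem statement (the 256 MB figure is only an example; choose a size based on the data-volume estimate above):
SET tez.grouping.max-size=268435456;  -- 256 MB splits instead of the 1 GB default, forcing more mappers
SET tez.grouping.min-size=134217728;  -- optional lower bound (128 MB) so splits do not become too small
EXPLAIN <your query>;                 -- re-check the plan to confirm the number of mapper tasks changed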
08-17-2016
01:38 PM
2 Kudos
Increasing the 'tickTime' value of ZooKeeper helps reduce ConnectionLoss errors caused by delayed or missed heartbeats; basically, it increases the session timeout. tickTime is the basic time unit in milliseconds used by ZooKeeper. It is used for heartbeats, and the minimum session timeout is twice the tickTime.
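For reference, a minimal zoo.cfg sketch (the 3000 ms value is purely illustrative; the stock default is 2000 ms). In an Ambari-managed cluster the same value is exposed in the ZooKeeper configuration section.
# zoo.cfg
tickTime=3000
# minimum session timeout = 2 * tickTime = 6000 ms
# maximum session timeout defaults to 20 * tickTime = 60000 ms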
08-11-2016
10:05 PM
4 Kudos
Kafka 0.9 onwards supports SASL_PLAINTEXT (authenticated but non-encrypted) communication between brokers, and between consumers/producers and the brokers. To know more about SASL, please refer to this link. 1. Maven Dependency Add the below Maven dependency to your pom.xml:
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.0.0</version>
</dependency>
2. Kerberos Setup Configure a JAAS configuration file with the contents below:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
principal="user@EXAMPLE.COM"
useKeyTab=true
serviceName="kafka"
keyTab="/etc/security/keytabs/user.headless.keytab";
};
The above configuration is set to use a keytab and the ticket cache. The Kafka producer client uses this information to obtain a TGT and authenticate with the Kafka broker. Note: a] Make sure /etc/krb5.conf has a realms mapping for "EXAMPLE.COM" and that default_realm is set to "EXAMPLE.COM" under the [libdefaults] section. Please refer to this link for more information. b] Run the below command and make sure it succeeds: kinit -kt /etc/security/keytabs/user.headless.keytab user@EXAMPLE.COM 3. Initializing the Kafka Producer The Kafka producer client needs certain information to initialize itself. This can be provided either as a property file or programmatically as a Properties object, as below.
Properties properties = new Properties();
properties.put("bootstrap.servers","comma-separated-list-of-brokers");
properties.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer"); // key serializer
properties.put("value.serializer","org.apache.kafka.common.serialization.StringSerializer"); // value serializer
properties.put("acks","1"); // message durability -- 1 means ack once the write to the leader succeeds; "all" means ack after the in-sync replicas have replicated it.
properties.put("security.protocol","SASL_PLAINTEXT"); // security protocol to use for communication
properties.put("batch.size","16384"); // upper bound, in bytes, of a batch of records sent per partition
KafkaProducer<String,String> producer = new KafkaProducer<String, String>(properties);
4. Push a Message
producer.send(new ProducerRecord<String, String>("topic-name", "key", "value"), new Callback() { // "topic-name" is a placeholder for the target topic
public void onCompletion(RecordMetadata metadata, Exception e) {
if (e != null) {
LOG.error("Send failed for record: {}", metadata);
}
else {
LOG.info("Message delivered to topic {} and partition {}. Message offset is {}",metadata.topic(),metadata.partition(),metadata.offset());
}
}
});
producer.close();
The above code pushes a message to the Kafka broker, and on completion (ack received or send failure) the "onCompletion" method is invoked. 5. Run While running this code, add the below VM parameters: -Djava.security.auth.login.config=<PATH_TO_JAAS_FILE_CREATED_IN_STEP2> -Djava.security.krb5.conf=/etc/krb5.conf
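To verify that messages actually reached the broker, the standard console consumer shipped with Kafka can read the topic back. A sketch only; the topic name is the placeholder used above, the broker/ZooKeeper host must match your cluster, and on a Kerberized cluster the SASL protocol name is used:
./bin/kafka-console-consumer.sh --zookeeper zookeeper-node:2181 --topic topic-name --from-beginning --security-protocol PLAINTEXTSASL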
11-05-2016
01:16 PM
Can this be done when the authorizer class being used is RangerKafkaAuthorizer and not SimpleAclAuthorizer?
08-12-2016
11:03 AM
1 Kudo
Found the Hive shell logs at /tmp/{USER}/hive.log and figured out that the Hive shell was OOMing. Increased the memory size and restarted it for now.
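For reference, one common way to give the Hive CLI a larger client-side heap is to raise the client JVM options before launching the shell (the 2g figure is just an example, and this assumes the launcher honours HADOOP_CLIENT_OPTS, as the stock Hive CLI script does):
export HADOOP_CLIENT_OPTS="-Xmx2g $HADOOP_CLIENT_OPTS"
hive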
08-04-2016
09:33 PM
1 Kudo
@sgowda, thanks for confirming you just want to mount the volumes at a new location. If you are just remounting, then your existing HDFS metadata and data files will still be present, but under new Linux paths. In that case decommissioning is not necessary. You just need to update the NameNode and DataNode configuration settings, such as dfs.namenode.name.dir and dfs.datanode.data.dir, to point to the new locations. See this link for a full list of settings; not all may apply to you. Do not reformat the NameNode, or you will lose all your data. The simplest approach is:
1. Take a full cluster downtime and bring down all HDFS services.
2. Remount the volumes at the new location on all affected nodes.
3. Update the NameNode and DataNode configurations via Ambari to point to the new storage roots, as sketched below.
4. Restart the services.
If you are not familiar with these settings, I recommend learning more about HDFS first, since it is easy to lose data through administrative mistakes.
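A minimal hdfs-site.xml sketch of what step 3 changes (the /data/new/... paths are purely illustrative placeholders for the new mount points; in an Ambari-managed cluster these are edited on the HDFS config screen rather than by hand):
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/new/hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/new/hadoop/hdfs/data</value>
</property>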
08-04-2016
06:58 AM
1 Kudo
So this is what I did: since the DataNode and ZooKeeper were writing to the same disk, the ZooKeeper writes were slowing down, due to which all the services dependent on ZooKeeper were going down. Solution: brought down the DataNodes on the ZooKeeper machines and started the job; this has solved the problem for now.