Member since: 05-20-2016
Posts: 155
Kudos Received: 220
Solutions: 30
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7246 | 03-23-2018 04:54 AM |
| | 2656 | 10-05-2017 02:34 PM |
| | 1480 | 10-03-2017 02:02 PM |
| | 8408 | 08-23-2017 06:33 AM |
| | 3239 | 07-27-2017 10:20 AM |
11-03-2016
06:07 PM
4 Kudos
@Timothy Spann Thanks for your reply. Running the bulk upload as below resolved the problem. This is a workaround for a bug, as described in this link:
HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv
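Once the load finishes, it can be sanity-checked from the Phoenix SQL client. A minimal sketch, assuming an HDP-style install path for sqlline.py and a placeholder ZooKeeper host (adjust both to your environment):
/usr/hdp/current/phoenix-client/bin/sqlline.py <zookeeper-host>:2181
-- inside sqlline, count the loaded rows
SELECT COUNT(*) FROM EXAMPLE;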
09-26-2016
09:22 AM
6 Kudos
@Santhosh B Gowda To increase the maximum size of the workflow job definition, add the following property to oozie-site.xml: oozie.service.WorkflowAppService.WorkflowDefinitionMaxLength=<the maximum length of the workflow definition in bytes> For example: oozie.service.WorkflowAppService.WorkflowDefinitionMaxLength=1000000
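If oozie-site.xml is edited directly rather than through Ambari, the entry would look roughly like the sketch below (the 1000000-byte value is just the example figure from above); the Oozie server needs a restart to pick up the change:
<property>
  <name>oozie.service.WorkflowAppService.WorkflowDefinitionMaxLength</name>
  <value>1000000</value>
</property>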
09-21-2016
02:37 PM
4 Kudos
Ambari 2.4.0.0 officially supports the Log Search component [Tech Preview]. To learn more about the Log Search component, please refer to the link Ambari LogSearch. While the Log Search component does support pushing logs to a Kafka topic [based on which real-time log analytics can be performed], this is officially not supported in Ambari 2.4.0.0; it might be addressed in Ambari 2.5.0.0. This article provides the details on how to configure Log Search [the LogFeeder component] to push to a Kafka topic if there is a need to capture and perform real-time analytics based on the logs in your cluster.
1. After installing the Log Search component from Ambari 2.4.0.0, go to the Log Search config screen and, under "Advanced logfeeder-properties", append the following to the property "logfeeder.config.files": {default_config_files},kafka-output.json
2. Create the kafka-output.json file with the content below under the directory /etc/ambari-logsearch-logfeeder/conf/ on the nodes that run LogFeeder [ideally all the nodes in your cluster]:
{
"output": [
{
"is_enabled": "true",
"destination": "kafka",
"broker_list": "ctr-e25-1471039652053-0001-01-000006.test.domain:6667",
"topic": "log-streaming",
"conditions": {
"fields": {
"rowtype": [
"service"
]
}
}
}
]
}
3. If the cluster is Kerberized, configure a Kafka PLAINTEXT listener as below, because a workaround to push to a PLAINTEXTSASL endpoint is not available. Make sure the broker endpoint configured in Step #2 is the PLAINTEXT one.
PLAINTEXT://localhost:6667,PLAINTEXTSASL://localhost:6668
4. Create the Kafka topic and grant ACLs to the ANONYMOUS user. The commands below help with the same.
./bin/kafka-topics.sh --zookeeper zookeeper-node:2181 --create --topic log-streaming --partitions 1 --replication-factor 1
./bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer --authorizer-properties zookeeper.connect=zookeeper-node:2181 --add --allow-principal User:ANONYMOUS --operation Read --operation Write --operation Describe --topic log-streaming
5. Restart the Log Search service from Ambari, and that's it! Logs should be flowing by now. Below is the command to check the same.
./bin/kafka-console-consumer.sh --zookeeper zookeeper-node:2181 --topic log-streaming --from-beginning --security-protocol PLAINTEXT
08-25-2016
12:24 AM
5 Kudos
@suresh krish The answer from Santhosh B Gowda could be helpful, but that is brute force with a 50-50 chance of luck. You need to understand the query execution plan, how much data is processed, and how many tasks execute the job. Each task has a container allocated. You could increase the RAM allocated for the container, but if you have a single task performing the map and the data is larger than the memory allocated to the container, you will still see "Out of memory". What you have to do is understand how much data is processed and how to chunk it for parallelism. Increasing the size of the container is not always needed. It is almost like saying that instead of tuning a bad SQL statement, let's throw more hardware at it. It is better to have reasonably sized containers and have enough of them to process your query data.
For example, let's take a cross join of two tables that are small, 1,000,000 records each. The Cartesian product will be 1,000,000 x 1,000,000 = 1,000,000,000,000 rows. That is a big input for the mappers. You need to translate that into GB to understand how much memory is needed. For example, assuming that the memory requirement is 10 GB and tez.grouping.max-size is set to the default 1 GB, 10 mappers will be needed. Those will use 10 containers. Now assume that each container is set to 6 GB. You would be wasting 60 GB for a 10 GB need. In that specific case, it would actually be better to have 1 GB containers. Now, if your data is 10 GB and you have only one 6 GB container, that will generate "Out of memory". If the execution plan of the query has one mapper, that means one container is allocated, and if that is not big enough, you will get your out-of-memory error. However, if you reduce tez.grouping.max-size to a lower value that forces the execution plan to have multiple mappers, you will have one container for each, and those tasks will work in parallel, reducing the time and meeting the memory requirements. You can override the global tez.grouping.max-size for your specific query. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html describes Tez parameters and some of them could help; for your case, you could give tez.grouping.max-size a shot.
Summary:
- Understand the data volume that needs to be processed.
- EXPLAIN SqlStatement to understand the execution plan -- tasks and containers.
- Use the ResourceManager UI to see how many containers are used and the cluster resources used for this query; the Tez View can also give you a good understanding of the Mapper and Reducer tasks involved. The more of them, the more resources are used, but the better the response time. Balance that to use a reasonable amount of resources for a reasonable response time.
- Set tez.grouping.max-size to a value that makes sense for your query; by default it is set to 1 GB. That is a global value.
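As a concrete illustration of overriding the global value for a single query, something like the following could be issued in the Hive session before running the problem statement (the 256 MB figure is only an example; choose a size based on the data-volume estimate above):
SET tez.grouping.max-size=268435456;  -- 256 MB splits instead of the 1 GB default, forcing more mappers
SET tez.grouping.min-size=134217728;  -- optional lower bound (128 MB) so splits do not become too small
EXPLAIN <your query>;                 -- re-check the plan to confirm the number of mapper tasks changed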
08-17-2016
01:38 PM
2 Kudos
Increasing the 'tickTime' value of ZooKeeper helps reduce ConnectionLoss errors caused by delayed or missed heartbeats; basically, it increases the session timeout. tickTime is the basic time unit in milliseconds used by ZooKeeper. It is used for heartbeats, and the minimum session timeout is twice the tickTime.
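For reference, a minimal zoo.cfg sketch (the 3000 ms value is purely illustrative; the stock default is 2000 ms). In an Ambari-managed cluster the same value is exposed in the ZooKeeper configuration section.
# zoo.cfg
tickTime=3000
# minimum session timeout = 2 * tickTime = 6000 ms
# maximum session timeout defaults to 20 * tickTime = 60000 ms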
08-11-2016
10:05 PM
4 Kudos
Kafka 0.9 onwards supports SASL_PLAINTEXT (authenticated but non-encrypted) communication between brokers, and between consumers/producers and the brokers. To know more about SASL, please refer to this link. 1. Maven Dependency Add the below Maven dependency to your pom.xml:
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.0.0</version>
</dependency>
2. Kerberos Setup Configure a JAAS configuration file with the contents below:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
principal="user@EXAMPLE.COM"
useKeyTab=true
serviceName="kafka"
keyTab="/etc/security/keytabs/user.headless.keytab";
};
The above configuration is set to use a keytab and the ticket cache. The Kafka producer client uses this information to obtain a TGT and authenticate with the Kafka broker. Note: a] Make sure /etc/krb5.conf has a realms mapping for "EXAMPLE.COM" and that default_realm is set to "EXAMPLE.COM" under the [libdefaults] section. Please refer to this link for more information. b] Run the below command and make sure it succeeds: kinit -kt /etc/security/keytabs/user.headless.keytab user@EXAMPLE.COM 3. Initializing the Kafka Producer The Kafka producer client needs certain information to initialize itself. This can be provided either as a property file or programmatically as a Properties object, as below.
Properties properties = new Properties();
properties.put("bootstrap.servers","comma-separated-list-of-brokers");
properties.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer"); // key serializer
properties.put("value.serializer","org.apache.kafka.common.serialization.StringSerializer"); // value serializer
properties.put("acks","1"); // message durability -- 1 means ack once the write to the leader succeeds; "all" means ack after the in-sync replicas have replicated it.
properties.put("security.protocol","SASL_PLAINTEXT"); // security protocol to use for communication
properties.put("batch.size","16384"); // upper bound, in bytes, of a batch of records sent per partition
KafkaProducer<String,String> producer = new KafkaProducer<String, String>(properties);
4. Push a Message
producer.send(new ProducerRecord<String, String>("topic-name", "key", "value"), new Callback() { // "topic-name" is a placeholder for the target topic
public void onCompletion(RecordMetadata metadata, Exception e) {
if (e != null) {
LOG.error("Send failed for record: {}", metadata);
}
else {
LOG.info("Message delivered to topic {} and partition {}. Message offset is {}",metadata.topic(),metadata.partition(),metadata.offset());
}
}
});
producer.close();
The above code pushes a message to the Kafka broker, and on completion (ack received or send failure) the "onCompletion" method is invoked. 5. Run While running this code, add the below VM parameters: -Djava.security.auth.login.config=<PATH_TO_JAAS_FILE_CREATED_IN_STEP2> -Djava.security.krb5.conf=/etc/krb5.conf
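To verify that messages actually reached the broker, the standard console consumer shipped with Kafka can read the topic back. A sketch only; the topic name is the placeholder used above, the broker/ZooKeeper host must match your cluster, and on a Kerberized cluster the SASL protocol name is used:
./bin/kafka-console-consumer.sh --zookeeper zookeeper-node:2181 --topic topic-name --from-beginning --security-protocol PLAINTEXTSASL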
11-05-2016
01:16 PM
Can this be done when the authorizer class being used is RangerKafkaAuthorizer and not SimpleAclAuthorizer?
08-12-2016
11:03 AM
1 Kudo
Found the Hive shell logs at /tmp/{USER}/hive.log and figured out that the Hive shell was OOMing. Increased the memory size and restarted it for now.
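For reference, one common way to give the Hive CLI a larger client-side heap is to raise the client JVM options before launching the shell (the 2g figure is just an example, and this assumes the launcher honours HADOOP_CLIENT_OPTS, as the stock Hive CLI script does):
export HADOOP_CLIENT_OPTS="-Xmx2g $HADOOP_CLIENT_OPTS"
hive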
08-04-2016
09:33 PM
1 Kudo
@sgowda, thanks for confirming you just want to mount the volumes at a new location. If you are just remounting, then your existing HDFS metadata and data files will still be present, but under new Linux paths. In that case decommissioning is not necessary. You just need to update the NameNode and DataNode configuration settings, such as dfs.namenode.name.dir and dfs.datanode.data.dir, to point to the new locations. See this link for a full list of settings; not all may apply to you. Do not reformat the NameNode, or you will lose all your data. The simplest approach is:
1. Take a full cluster downtime and bring down all HDFS services.
2. Remount the volumes at the new location on all affected nodes.
3. Update the NameNode and DataNode configurations via Ambari to point to the new storage roots, as sketched below.
4. Restart the services.
If you are not familiar with these settings, I recommend learning more about HDFS first, since it is easy to lose data through administrative mistakes.
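A minimal hdfs-site.xml sketch of what step 3 changes (the /data/new/... paths are purely illustrative placeholders for the new mount points; in an Ambari-managed cluster these are edited on the HDFS config screen rather than by hand):
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/new/hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/new/hadoop/hdfs/data</value>
</property>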
08-04-2016
06:58 AM
1 Kudo
So this is what I did: since the DataNode and ZooKeeper were writing to the same disk, the ZooKeeper writes were slowing down, due to which all the services dependent on ZooKeeper were going down. Solution: brought down the DataNodes on the ZooKeeper machines and started the job; this has solved the problem for now.