About sunile_manjee

sunile_manjee · ‎11-11-2016

@Timothy Spann Counts on ORC tables should be fast, since the query can use the stripe footer information. Have you run stats on the table?

sunile_manjee · ‎11-11-2016

As of the latest release of HDP, I see little to no reason to use MR over Tez. I would default to Tez and use MR only if and when required (not many use cases).

sunile_manjee · ‎11-10-2016

One easy way to do it is with hive-serde-schema-generator (https://github.com/strelec/hive-serde-schema-gen). Another way is to use the Hive JSON SerDe (https://github.com/rcongiu/Hive-JSON-Serde). The formatted JSON is below:

{
  "repoType":1,
  "repo":"abc_hadoop",
  "reqUser":"ams",
  "evtTime":"2016-09-19 13:14:40.197",
  "access":"READ",
  "resource":"/ambari-metrics-collector/hbase/data/hbase/meta/1588230740/info/ed3e52d8b86e4800801539fc4a7b1318",
  "resType":"path",
  "result":1,
  "policy":41,
  "reason":"/ambari-metrics-collector/hbase/data/hbase/meta/1588230740/info/ed3e52d8b86e4800801539fc4a7b1318",
  "enforcer":"ranger-acl",
  "cliIP":"123.129.390.140",
  "agentHost":"hostname.sample.com",
  "logType":"RangerAudit",
  "id":"94143368-600c-44b9-a0c8-d906b4367537",
  "seq_num":1240883,
  "event_count":1,
  "event_dur_ms":0
}

Since the JSON is not nested, both of the choices above are definitely doable. However, the easiest way may be the option described here: https://community.hortonworks.com/articles/37937/importing-and-querying-json-data-in-hive.html
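Because the record is flat, deriving the column list by hand is also straightforward. The following is only a minimal sketch of that idea, not what either tool above does internally: it maps each top-level JSON value to a Hive type and prints a CREATE TABLE statement that uses the SerDe class from the Hive-JSON-Serde project. The table name `ranger_audit`, the helper name `hive_ddl`, and the type mapping are assumptions made for illustration.

```python
import json

# Assumed, simplified mapping from JSON value types to Hive column types.
HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def hive_ddl(record_json, table="ranger_audit"):
    """Build a CREATE TABLE statement from one flat JSON record.

    `table` is a hypothetical table name; nested or null values are rejected
    because this sketch only handles flat records.
    """
    record = json.loads(record_json)
    columns = []
    for key, value in record.items():
        hive_type = HIVE_TYPES.get(type(value))
        if hive_type is None:
            raise ValueError(f"unsupported (nested or null) value for key {key!r}")
        # Column names are written in lowercase, as Hive stores them.
        columns.append(f"  `{key.lower()}` {hive_type}")
    return (
        f"CREATE TABLE {table} (\n"
        + ",\n".join(columns)
        + "\n)\nROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'"
    )


if __name__ == "__main__":
    sample = (
        '{"repoType": 1, "repo": "abc_hadoop", "access": "READ", '
        '"evtTime": "2016-09-19 13:14:40.197", "event_dur_ms": 0}'
    )
    print(hive_ddl(sample))
```

Timestamps such as evtTime are left as STRING here; converting them to a Hive TIMESTAMP is a separate choice that depends on how the data will be queried.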

sunile_manjee · ‎11-10-2016

As the root user you should be able to see the same files. You might be in a different directory when you use the web shell (port 4200) than when you ssh into the Linux box. Run the pwd command and verify you are in the same location when you issue ls.

sunile_manjee · ‎11-10-2016

I found the issue. 9092 was not my port. I went to Ambari and found that the listening port was set to 6667.
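A quick way to confirm which port actually accepts connections before pointing a producer at it is a plain TCP connect test. The sketch below is generic Python and not part of Kafka; the helper name `port_is_open` is made up for illustration, and the two ports checked are simply the ones mentioned in this thread.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # 9092 is the port that was tried first, 6667 the one configured in Ambari.
    for port in (9092, 6667):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"localhost:{port} is {state}")
```

An open port only shows that something is listening there; the value passed to --broker-list still has to match the broker's configured listener.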

sunile_manjee · ‎11-09-2016

On HDP 2.5 I am running a simple test to produce a message to a Kafka topic, test1, and it fails. I have one broker and am running this on the broker node.

[kafka@sunman0 bin]$ ./kafka-console-producer.sh --broker-list localhost:9092 --topic test1
jump
[2016-11-09 21:21:45,184] ERROR Error when sending message to topic test1 with key: null, value: 4 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.

Any ideas?

sunile_manjee · ‎11-09-2016

Is it possible to export lineage from Atlas via Kafka? I don't see how that is possible using the topics Atlas creates. However, it is worth asking on HCC.

sunile_manjee · ‎11-09-2016

Does HDP officially support multiple Kafka brokers on a single node? If so, can someone point me to how to set this up correctly so that it is supported?

sunile_manjee · ‎11-08-2016

Ah, my bad. I found the answer here: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html. It is not the same. The UI runs on all nodes.

Primary Node: Every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors" (see below). ZooKeeper is used to automatically elect a Primary Node. If that node disconnects from the cluster for any reason, a new Primary Node will automatically be elected. Users can determine which node is currently elected as the Primary Node by looking at the Cluster Management page of the User Interface.

Isolated Processors: In a NiFi cluster, the same dataflow runs on all the nodes. As a result, every component in the flow runs on every node. However, there may be cases when the DFM would not want every processor to run on every node. The most common case is when using a processor that communicates with an external service using a protocol that does not scale well. For example, the GetSFTP processor pulls from a remote directory, and if the GetSFTP Processor runs on every node in the cluster tries simultaneously to pull from the same remote directory, there could be race conditions. Therefore, the DFM could configure the GetSFTP on the Primary Node to run in isolation, meaning that it only runs on that node. It could pull in data and - with the proper dataflow configuration - load-balance it across the rest of the nodes in the cluster. Note that while this feature exists, it is also very common to simply use a standalone NiFi instance to pull data and feed it to the cluster. It just depends on the resources available and how the Administrator decides to configure the cluster.

sunile_manjee · ‎11-08-2016

I found the information here: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html

NiFi Cluster Coordinator: A NiFi Cluster Coordinator is the node in a NiFi cluster that is responsible for carrying out tasks to manage which nodes are allowed in the cluster and providing the most up-to-date flow to newly joining nodes. When a DataFlow Manager manages a dataflow in a cluster, they are able to do so through the User Interface of any node in the cluster. Any change made is then replicated to all nodes in the cluster.

Last Visited	‎05-25-2022 10:07 AM

Member Since	‎05-30-2018 10:40 PM
Posts	1,322
Kudos received	713

Cloudera Community

Re: Iterate over ADLS files using spark?

Re: Install NiFi CA service post nifi cluster inst...

Re: Which storage format is optimum for training m...

Re: Ambari custom alert failing

Re: df.cache() is not working on jdbc table

Re: How do you speed up count(*) on tables in Hive

Re: When Tez/MR is better for query execution?

Re: create hive table

Re: Difference between root in sandbox and HDP?

Re: Kafka fails to receive message via producer

Kafka fails to receive message via producer

Atlas export lineage via Kafka?

Does HDP support multiple Kafka brokers on single...

Re: Is NiFi Primary Node and Dataflow manager the ...

Re: Role of NiFi Cluster Coordinator