Member since: 12-14-2015
Posts: 70
Kudos Received: 94
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6127 | 03-14-2017 03:56 PM
 | 1346 | 03-07-2017 07:20 PM
 | 4318 | 01-23-2017 05:57 AM
 | 5624 | 01-23-2017 05:40 AM
 | 1713 | 10-18-2016 03:36 PM
09-21-2016
03:17 AM
Thanks @ajaysingh
09-21-2016
02:51 AM
2 Kudos
I know Syncsort is a possible solution here, but I wanted to check whether HDF can do the job, and whether there are any recommendations other than Syncsort?
09-01-2016
02:00 AM
1 Kudo
Just an update - the 'SelectHiveQL' processor has been added as part of NiFi 0.7.
08-30-2016
12:19 PM
3 Kudos
Before I answer the question specifically, let me address (based on my research) the fault tolerance of each of the components within Storm:

1) Nimbus - A stateless daemon that sits on the master node, deploys the job (topology), and keeps track of it. There are two scenarios. First, if Nimbus goes down after you submit the topology, it has no adverse effect on the running topology, because the topology runs on the worker nodes (not on the master node). If you have kept the Nimbus process under supervision, it will be restarted, and since Nimbus is fail-fast, when it comes back it retrieves the metadata of all active topologies on the cluster from Zookeeper and starts tracking them again. Second, if Nimbus goes down before you submit the topology, you simply have to restart it.

2) Supervisor - This daemon is responsible for tracking the worker processes (JVM processes) on the node where it runs and for coordinating their state with Nimbus through Zookeeper. If this daemon goes down, your worker processes are not affected and keep running unless they crash on their own. Once the Supervisor comes back (via supervisord or monit), it collects the state from Zookeeper and resumes tracking the worker processes. If a timeout occurs, Nimbus reschedules the topology on a different worker node.

3) Worker processes (JVM processes) - These container processes actually execute your topology's components (spouts + bolts). If one goes down, the Supervisor simply restarts it on a different port (on the same worker node); if the node is out of ports, the Supervisor notifies Nimbus, which reschedules the process on a different worker node.

4) Worker node (Supervisor + worker processes) - In this scenario Nimbus stops receiving heartbeats from the worker node (due to timeout) and simply reassigns the work to different worker node(s) in the cluster.

5) Zookeeper (ZK) - From all of the above you may have inferred that all the state is stored in ZK. What if it goes down, or can it go down? ZK is not a single-node process either; it runs as its own cluster, and the state stored in ZK is constantly replicated, so even if a single ZK node goes down, a new leader is elected and continues communicating with Apache Storm.

Now, going back to the specific question: when a supervisor with 4 slots (ports) goes down, the very first thing Nimbus tries to do is restart the processes on the SAME worker node on the available ports. Any processes for which it does not have a port are reassigned to different worker nodes - so yes, it will increase the executor threads on those worker nodes. And from a design perspective, you should not necessarily have to account for redundant ports, because Nimbus is designed to take care of this by either restarting the processes on that port or by redistributing them among the other worker nodes.
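For reference, the "slots" discussed above come from the Supervisor's port list in storm.yaml. A minimal sketch, assuming the default port range and a 3-node Zookeeper ensemble (the hostnames here are placeholders, not from the original thread):

```
# storm.yaml (sketch) - each port below is one worker slot on this Supervisor
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703

# Zookeeper ensemble that holds the cluster state described above
storm.zookeeper.servers:
    - "zk1.example.com"
    - "zk2.example.com"
    - "zk3.example.com"
```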
08-30-2016
12:14 PM
Thanks @Rajkumar Singh
08-30-2016
12:07 AM
2 Kudos
Say there were 3 extra slots in the cluster and a supervisor with 4 slots (supervisor.slots.ports) goes down - what happens? Does Storm automatically increase the number of executor threads in the worker processes of the other supervisors?
Labels:
- Apache Storm
08-20-2016
05:55 PM
3 Kudos
@gkeys The MergeContent processor has two properties that I normally use to determine the output file size: Minimum Number of Entries and Minimum Group Size.

For your question about how to increase the file size to reach a desired size (say 1 GB): set Minimum Group Size to the size you would like (i.e. 1 GB) AND set Minimum Number of Entries to 1. This will merge the content up to 1 GB before it writes out to the next processor.

Can you clarify a little more your other question about how to double the size of the existing setting? Do you mean double the size of the incoming file? That is straightforward: just set Minimum Number of Entries to 2 and Minimum Group Size to 0 B.
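As a quick illustration, the two settings described above would look something like this on the MergeContent processor (the values are the example ones from this answer, not defaults):

```
# Target ~1 GB output files
Minimum Number of Entries : 1
Minimum Group Size        : 1 GB

# Roughly double the incoming file size
Minimum Number of Entries : 2
Minimum Group Size        : 0 B
```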
08-06-2016
09:56 PM
1 Kudo
@Iyappan Gopalakrishnan Download the nifi-0.7.0-bin.zip file from the downloads page: https://nifi.apache.org/download.html After you unzip the file, you will see the standard NiFi folder structure (bin, conf, logs, etc.). Then, based on your OS, use 'bin/run-nifi.bat' for Windows or 'bin/nifi.sh start' for Mac/Linux. More details on how to start NiFi are here: https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#starting-nifi You can tail the logs from logs/nifi-app.log to check that it starts properly. OPTIONAL: By default, NiFi starts on port 8080 - if you see a port conflict or want to start it on a different port, you can change that by editing 'conf/nifi.properties': search for 8080 and update the port number. If you like the answer, please make sure to upvote or accept it.
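A minimal shell sketch of those steps on Mac/Linux, assuming the zip has already been downloaded to the current directory:

```
# Unpack the NiFi 0.7.0 distribution (URL from the downloads page above)
unzip nifi-0.7.0-bin.zip
cd nifi-0.7.0

# Optional: change the default 8080 web port before starting
# (edit the port value, e.g. nifi.web.http.port, in conf/nifi.properties)

# Start NiFi and watch the application log
bin/nifi.sh start
tail -f logs/nifi-app.log
```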
08-06-2016
03:19 AM
@Iyappan Gopalakrishnan Follow the steps below:
1. Save your HDF flow files as XML templates.
2. Download NiFi 0.7 from the Apache NiFi downloads site (https://nifi.apache.org/download.html).
3. Unzip the file, edit the port (if you would like), and start NiFi.
4. Import the templates.
If this answer and comment are helpful, please upvote my answer and/or select it as the best answer. Thank you!!
08-04-2016
04:40 AM
3 Kudos
There is now a new NiFi processor, 'SelectHiveQL', that queries data from Hive. Also, there is now a processor, 'PutHiveQL', to insert or update data directly in Hive.