Member since: 08-03-2019
Posts: 186
Kudos Received: 34
Solutions: 26
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1959 | 04-25-2018 08:37 PM
 | 5882 | 04-01-2018 09:37 PM
 | 1593 | 03-29-2018 05:15 PM
 | 6766 | 03-27-2018 07:22 PM
 | 2007 | 03-27-2018 06:14 PM
10-22-2018
12:40 PM
@Thomas Bazzucchi Were you able to fix the issue?
10-22-2018
12:40 PM
@Thomas Bazzucchi That's why ORC should be your preferred file format 🙂 OK, jokes apart, here is a similar issue that has already been reported. It describes an access issue with the log files, and the user was able to fix it. Please have a look and let me know if that fixes your issue.
07-02-2018
06:28 PM
1 Kudo
In this article, I will discuss one of the most exciting new features coming with NiFi 1.7.0: the ability to terminate running threads from the NiFi UI.

I can think of a couple of situations in previous NiFi versions where we either had to wait for a running thread to end before being able to make any changes to the processor configuration or, in worst-case scenarios, restart the NiFi cluster because some thread was in a deadlock. For example: an ExecuteSQL processor is stuck because the source RDBMS cannot handle the data pull and has yielded under pressure, or other processes cannot use the RDBMS because its resources are hogged by this full database scan. Either we wait for a literally infinite period of time or, if the problem is serious, stop the cluster altogether. Or a custom script/processor hits a deadlock and the thread will never stop; the only option we had in this scenario was to restart the machine/cluster running that process.

Thanks to NiFi 1.7.0, we now have a more elegant solution to these kinds of problems: terminate the thread from the UI itself. Here is a quick example of how we can do it.

For this demonstration, I created a sample flow with a GenerateFlowFile processor running continuously on all the available nodes, a single one in this case, my Mac 🙂 I have made the thread run for a long time once initiated, so even if I stop the processor, the thread keeps running. Have a look at the snapshots below. When I stopped the processor, the number of threads increased from 1 to 2, since the thread trying to stop the processor is now waiting for the actively running thread.

With this new version of NiFi, 1.7.0, we have the option of terminating threads explicitly from the UI itself; see snapshot #2 for the Terminate option. When Terminate is chosen, an interrupt is issued to the thread, a new instance of the processor is created, and the old instance is eventually shut down.

So here we are, with the new power to interrupt threads from the NiFi UI. But please be careful! With greater power comes greater responsibility! I will add more information on the possible issues, if any, of stopping threads midway. Please feel free to leave comments about the flow and for questions and queries. Hope that helps!
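For those who prefer scripting, the same Terminate action can be issued against the NiFi REST API. This is only a rough sketch from memory of the 1.7.0 API: the endpoint path, host/port, and processor ID below are assumptions/placeholders, so please verify against the REST API documentation of your NiFi version.

# Hypothetical example: terminate the active threads of a stopped processor by its ID
curl -X DELETE "http://localhost:8080/nifi-api/processors/<processor-id>/threads"

As in the article, the processor is stopped first and still shows an active thread; the DELETE corresponds to the Terminate menu option in the UI.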
06-15-2018
07:16 PM
@Sudheer Velagapudi What version of HDF/NiFi are you using? As @Shu mentioned, this is a known issue with HDF 3.1.1: the Kerberos ticket is not automatically renewed once it has expired, so the connection to Hive cannot be established and your query doesn't run. I would recommend upgrading to HDF 3.1.2 to fix this issue, provided you are on a lower HDF version and this is indeed the cause of your problem. Please have a look at this document for further details on the HDF 3.1.2 release and the issues it addresses.
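To confirm that an expired ticket really is the culprit, you can inspect the ticket on the NiFi host. The keytab path and principal below are placeholders for your environment:

# Check which tickets the nifi user currently holds and when they expire
sudo -u nifi klist
# List the principals in the keytab the Hive controller service points at (placeholder path)
klist -kt /etc/security/keytabs/nifi.service.keytab

A common short-term workaround until you can upgrade is to disable and re-enable the Hive connection pool controller service so that it logs in again and obtains a fresh ticket.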
05-11-2018
02:30 AM
2 Kudos
In this article, I'm going to cover a simple solution to control data processing in NiFi serially or based on an event trigger. It uses the Wait and Notify processors and the DistributedMapCache controller services to achieve this. See the usage guides for these processors and controller services: Wait Processor, Notify Processor, DistributedMapCache Server, DistributedMapCache Client Service.

NiFi Flow

The flow presented in this article has a very simple use case: get the data from the source at whatever rate, but process it one flow file at a time. Once a flow file is processed, it should trigger the processing of the next flow file. Here is a step-by-step explanation of what I have implemented to solve this problem.

First, the prerequisites to start the flow. We will be using a DistributedMapCache to store and retrieve the "signals" that release the flow files for processing, and hence we need the server and the client service for it. Here is a quick description.

DistributedMapCacheServer

We will store some information about the flow file that has been processed, which will help us trigger the processing of the next flow file. To do so, we use the DistributedMapCacheServer controller service. It provides a map (key/value) cache that can be accessed over a socket. Interaction with this service is typically accomplished via a DistributedMapCacheClient service, which is discussed below. A snapshot of this controller service follows; nothing fancy about it, I have used the default settings.

DistributedMapCacheClientService

Now, to access the DistributedMapCacheServer hosted at port 4557 above, we need a client service. This helps us store and retrieve the cache entries. I am keeping it simple and leaving the default settings again.

Now the NiFi flow details. See the snapshot of the flow for a quick overview.

Data generation: for this use case, I am generating the data using a GenerateFlowFile processor.

Flow file tagging: this is an important part of the processing. Here I am using an UpdateAttribute processor. It assigns a token to each flow file, incremented by 1 every time a flow file passes through it. The important part is that we store the state of this token variable and can therefore assign a unique, auto-incremented value to each of our flow files (a configuration sketch follows at the end of this article). This token will help us process the data serially. See the snapshot of this processor.

Tagged? Now let's make them wait! Once the flow files are tagged, they are redirected to the Wait processor. This processor makes the flow files wait and does not release them until a matching release signal is stored in the distributed cache. Have a look at the configuration of the Wait processor: we look in the DistributedMapCache server for a counter called tokenCounter, and when the value of tokenCounter equals the value of the Release Signal Identifier, which in this case is the token number of the flow file, that flow file is released.

So how does the DistributedMapCache get this token number? If you look at the NiFi flow, before the Wait processor we have a RouteOnAttribute processor. This is just to handle the very first flow file: it redirects the flow file with token #1 to the Notify processor. The Notify processor picks up the value from the token attribute and stores it in the DistributedMapCache against the key tokenCounter. This instructs the Wait processor to release the flow file with token #1 for further processing.

What's next? Next, the desired processing can be done on the flow file, and once it is done, simply increment the token attribute by one and feed it to the Notify processor to release the next flow file. For example, the flow file with token #1, once processed, is updated to increment the token to 2 and then sent to the Notify processor. This triggers the release of the flow file with token #2 by the Wait processor, and the cycle goes on.

So here we are, with our flow to control the processing of our data according to our needs in NiFi. Please feel free to leave comments about the flow and for questions and queries. Hope that helps!
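For reference, here is a sketch of how the token tagging described above can be configured with a stateful UpdateAttribute. The property names are from memory and the exact configuration in my flow may differ slightly, so treat this as an illustration rather than the literal settings:

# Hypothetical stateful UpdateAttribute configuration for the tagging step
Store State                        = Store state locally
Stateful Variables Initial Value   = 0
token (dynamic property)           = ${getStateValue('token'):plus(1)}

With this in place, every flow file passing through gets a token attribute of 1, 2, 3, and so on, which is exactly what the Wait/Notify pair keys on.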
05-06-2018
11:59 PM
A couple of questions. Is your InvokeHTTP trying a POST? Also, in the top-right corner you will see the hamburger menu; look for the Cluster option, click it, and please share the details. Are all your nodes up?
05-03-2018
05:13 PM
@NA The Hadoop Archive will create a HAR file from the input directories you mention. It will reduce both the number of files and the size of the data. If your use case is just reducing the file count / merging small files, and not compression, I would recommend having a look at the merge option. Try the following snippet to merge the files.
hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-<your version>.jar \
-Dmapred.reduce.tasks=<NUMBER OF FILES YOU WANT> \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat
Let me know if that helps!
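For completeness, in case you do want the HAR route mentioned above, the archive command itself looks roughly like this; the archive name, parent path, and output directory are placeholders:

# Archive everything under the parent directory into a single HAR file
hadoop archive -archiveName small-files.har -p /hdfs/input/dir /hdfs/output/dir
# The result can then be read back through the har:// filesystem scheme
hadoop fs -ls har:///hdfs/output/dir/small-files.har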
04-28-2018
07:00 PM
@Anurag Mishra When you use the following commands to push the data into your Hive table in the new cluster
hdfs dfs -mv /part2 /part2_old
hdfs dfs -mv /part1 /part2
your Hive engine and metastore are not notified that a new partition has been added. It is simply a data copy/move operation on HDFS, and Hive has no idea about it. A show partitions operation on your table in the new cluster won't show anything.
show partitions <your table name>; //Should not return anything
You can tell your Hive engine to look into HDFS and pick up data that was added outside of Hive by using the following command.
msck repair table <your table name>;
Now if you do a show partitions, you will be able to see the partitions that you just "created" using the HDFS commands. A select operation should also work fine. If this reply helps you understand and fix your issue, please mark it as Accepted so that other community users can benefit from it.
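On a side note, if only a single partition was moved and you know its exact spec, you can also register it explicitly instead of scanning with msck repair. The table name, partition column, and location below are placeholders for your own layout:

ALTER TABLE <your table name> ADD IF NOT EXISTS PARTITION (dt='2018-04-28') LOCATION '/part2';

This performs the same metastore bookkeeping for that one partition, which can be quicker than msck repair on tables with a very large number of partitions.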
04-25-2018
08:37 PM
1 Kudo
@Thiago Charchar HBase replication might not be the best approach to synchronize the data in the initial phase of the migration. I would have recommended snapshots, but since you are upgrading to a higher version, that may not work either. So here is a multi-step approach to migrate your HBase data over (commands for the first two steps are sketched at the end of this reply):
1. Bulk HBase export to HDFS (a point-in-time recovery approach).
2. Hadoop distcp of the sequence files to the remote cluster, where the HBase tables are already created.
3. Set up replication and let the tables catch up.
4. Choose a date/time and plan a staged cut-over of the applications.
Running replication once you have the majority of your data copied over puts far less stress on your cluster bandwidth, and you will easily be able to handle the migration with bandwidth to spare for other operations. As far as the migration of the "Hive structures" is concerned, do you mean the metadata or the underlying data? If you are talking about the underlying data, distcp is of course the best option available. For metadata migration, there are multiple options, and mapping the metastore to the new cluster is one of them. Let me know if this answer helped resolve your query.
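To make the first two steps concrete, the commands look roughly like this. The table name, HDFS paths, and NameNode addresses are placeholders, and if the two clusters run different Hadoop versions you may want webhdfs:// instead of hdfs:// for the distcp source:

# 1. Point-in-time export of the table to sequence files on HDFS
hbase org.apache.hadoop.hbase.mapreduce.Export my_table /backups/my_table
# 2. Copy the export to the new cluster
hadoop distcp hdfs://old-nn:8020/backups/my_table hdfs://new-nn:8020/backups/my_table
# Then, on the new cluster, load the data into the pre-created table
hbase org.apache.hadoop.hbase.mapreduce.Import my_table /backups/my_table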
04-16-2018
08:08 PM
@laiju cbabu You do not need to do anything "special or manual" for a NiFi flow to run on another machine in case of a node failure. NiFi employs a zero-master clustering paradigm: each node in the cluster performs the same tasks on the data, but each operates on a different set of data. So if a node fails, the others have "sufficient information" to keep going. You can get a more in-depth understanding here.