Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11261 | 04-15-2020 05:01 PM |
| | 7165 | 10-15-2019 08:12 PM |
| | 3146 | 10-12-2019 08:29 PM |
| | 11599 | 09-21-2019 10:04 AM |
| | 4374 | 09-19-2019 07:11 AM |
06-01-2018
07:10 PM
1 Kudo
@Mustafa Ali Qizilbash
You don't have to use the SelectHiveQL processor. Flow:
1. GenerateFlowFile: configure the processor with your insert statement, i.e. insert into <db.name>.<internal.tab.name> select * from <db.name>.<external.tab.name>
2. PutHiveQL: configure and enable the Hive controller service; this processor expects the flowfile content to be the HiveQL command to execute.
A rough property sketch is below.
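A minimal sketch of that configuration, assuming the standard GenerateFlowFile/PutHiveQL processors; the database/table names and the controller service name are placeholders to replace with your own:
GenerateFlowFile
  Data Format: Text
  Custom Text: insert into <db.name>.<internal.tab.name> select * from <db.name>.<external.tab.name>
  Scheduling: run once, or on whatever interval you want the load to repeat
PutHiveQL
  Hive Database Connection Pooling Service: <your enabled HiveConnectionPool controller service>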
06-01-2018
09:30 AM
1 Kudo
@Yu-An Chen Instead of using the GetHDFS processor, use the List/FetchHDFS processors and then a MergeContent processor for the merge. ListHDFS stores state and checks only for new files created after the stored state.
(or)
Create a Hive table on top of this HDFS directory, then use the SelectHiveQL processor with a query like select * from <db.name>.<tab_name> order by <field-name> asc. Then you don't need a MergeContent processor; you can send the result of SelectHiveQL directly to PutSFTP.
(or)
Once the merge is completed, if you want to order by some field in the flowfile content, use the QueryRecord processor and add a new dynamic property with a value like select * from flowfile order by <field-name> asc, then connect that relationship to the PutSFTP processor (see the sketch below). https://community.hortonworks.com/articles/121794/running-sql-on-flowfiles-using-queryrecord-process.html
(or)
Consider an EnforceOrder processor before the MergeContent processor; it enforces the order of the flowfiles reaching MergeContent. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.EnforceOrder/index.html
-
If the answer addressed your question, click the Accept button below to accept it; that helps community users find solutions to these kinds of issues quickly.
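A minimal sketch of the QueryRecord option, assuming CSV content, already-configured reader/writer controller services, and a hypothetical sort field named id (all placeholders to adapt):
QueryRecord
  Record Reader: <a configured CSVReader controller service>
  Record Writer: <a configured CSVRecordSetWriter controller service>
  sorted (dynamic property): select * from flowfile order by id asc
The dynamic property name becomes a relationship, so connect the 'sorted' relationship to PutSFTP.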
06-01-2018
01:43 AM
@Yu-An Chen Could you please add more details (flow/config screenshots, sample input data, expected output) about what you are trying to achieve, so that we can understand your requirements clearly.
06-01-2018
01:35 AM
@robert cavalcante You can use the Put/FetchDistributedMapCache processors to set a value for the variable (attribute) at the end of the flow and then retrieve the previously set value at the start of the flow. Flow:
1. Trigger the flow with cron.
2. FetchDistributedMapCache processor //fetch the value that was set during the last execution
3. Do some processing.
4. PutDistributedMapCache processor //set the new value for the variable
A minimal property sketch is below. Refer to this link for configuration/usage of the Put/FetchDistributedMapCache processors.
-
If the answer addressed your question, click the Accept button below to accept it; that helps community users find solutions to these kinds of issues quickly.
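A minimal sketch of the two processors, assuming a DistributedMapCacheClientService/Server pair is already configured and using a hypothetical cache key named last_run_value and attribute named stored.value:
FetchDistributedMapCache
  Distributed Cache Service: <DistributedMapCacheClientService>
  Cache Entry Identifier: last_run_value
  Put Cache Value In Attribute: stored.value
PutDistributedMapCache
  Distributed Cache Service: <DistributedMapCacheClientService>
  Cache Entry Identifier: last_run_value
PutDistributedMapCache stores the current flowfile content under the cache key, and on the next run FetchDistributedMapCache puts the previously stored value into the stored.value attribute.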
05-31-2018
11:56 AM
@Yu-An Chen You can use either of the Merge Strategies: the 'Defragment' strategy combines fragments that are associated by attributes back into a single cohesive FlowFile, while the 'Bin-Packing Algorithm' generates a FlowFile populated by arbitrarily chosen FlowFiles. There are a bunch of other properties that need to be configured based on what you are trying to achieve (see the sketch below). Please refer to the HCC threads below regarding MergeContent processor usage and configuration:
https://community.hortonworks.com/questions/64337/apache-nifi-merge-content.html
https://community.hortonworks.com/questions/161827/mergeprocessor-nifi-using-the-correlation-attribut.html
https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.html
Let us know if you are facing any issues..!!
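A rough illustration of the two strategies; the entry counts and bin age are placeholder values to tune for your data volumes:
MergeContent (Bin-Packing)
  Merge Strategy: Bin-Packing Algorithm
  Correlation Attribute Name: <optional attribute to group flowfiles by>
  Minimum Number of Entries: 1000
  Maximum Number of Entries: 10000
  Max Bin Age: 5 min
MergeContent (Defragment)
  Merge Strategy: Defragment
  (expects fragment.identifier, fragment.index and fragment.count attributes on the incoming flowfiles, e.g. as written by SplitText/SplitRecord)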
05-29-2018
03:20 PM
1 Kudo
@JAy PaTel The error that you are facing is because of the missing --hive-import argument. The Sqoop job is storing the data under /user/root/hivetable/ (because you are running the sqoop import as the root user and the table is named hivetable).
If you have already created the Hive external table, then your sqoop options file needs to look like below:
import
--connect
jdbc:sqlserver://<HOST>:<PORT>
--username
XXXX
--password
XXXX
--table
<mssql_table>
--hive-import
--hive-table
<hivedatabase.hivetable>
--fields-terminated-by
","
-m
1
By using this approach the data is appended into the Hive table on every run. A usage sketch for running the options file is below.
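Since the tool name (import) is the first line of the options file, you run sqoop with just the --options-file argument; a rough sketch, assuming the file is saved under a hypothetical path:
bash$ sqoop --options-file /path/to/hive-import-options.txt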
05-29-2018
01:21 PM
@Koly SALL
Q: After starting GetHBase, must we stop this processor?
A: Stopping the GetHBase processor is done by the left-hand-side flow: once you start that flow, the first step stops the GetHBase processor. On the left-hand side of the flow we schedule the processor using Cron (or) Timer driven:
First step: trigger a shell script, and the shell script stops the GetHBase processor (a sketch of such a script is below).
Second step: clear the state of the GetHBase processor.
Third step: start the GetHBase processor.
Once we start the GetHBase processor, it runs based on its scheduling strategy. Let's assume GetHBase is scheduled to run every 5 minutes; then the processor runs every five minutes and checks whether any new records were added to the HBase table. What happens when we schedule the processor to run every 10000 minutes, i.e. ~167 days? The processor runs once when we start it, and the next run is triggered after 10000 minutes, so the processor runs once in 167 days. By using this scheduling we make sure we are not running the processor again and again.
Q: Is it imperative to schedule the processor at 10000 min?
A: By using this scheduling strategy we run the processor once per 10000 minutes. In addition, if you want to make sure that all the records were pulled by the GetHBase processor before stopping it again, check the ActiveThreadCount value of the GetHBase processor; only when ActiveThreadCount is 0 should you stop the processor --> clear the state --> start the processor again.
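A minimal sketch of such a shell script; it assumes an unsecured NiFi on localhost:8080, jq installed for JSON parsing, and that you have looked up the GetHBase processor id in the UI (all of these are assumptions to adapt to your environment):
#!/bin/bash
# sketch only: stop GetHBase, clear its state, then start it again via the NiFi REST API
PROC_ID="replace-with-your-GetHBase-processor-id"   # assumption: taken from the processor's configuration dialog
NIFI="http://localhost:8080/nifi-api"               # assumption: unsecured NiFi
# every update call needs the current revision version of the processor
VERSION=$(curl -s "$NIFI/processors/$PROC_ID" | jq '.revision.version')
# 1. stop the processor
curl -s -X PUT -H 'Content-Type: application/json' \
  -d "{\"revision\":{\"version\":$VERSION},\"component\":{\"id\":\"$PROC_ID\",\"state\":\"STOPPED\"}}" \
  "$NIFI/processors/$PROC_ID" > /dev/null
# optional: wait until ActiveThreadCount drops to 0 before clearing state
while [ "$(curl -s "$NIFI/processors/$PROC_ID" | jq '.status.aggregateSnapshot.activeThreadCount')" != "0" ]; do
  sleep 5
done
# 2. clear the stored state
curl -s -X POST "$NIFI/processors/$PROC_ID/state/clear-requests" > /dev/null
# 3. start it again (the revision version changes after the stop call, so re-read it)
VERSION=$(curl -s "$NIFI/processors/$PROC_ID" | jq '.revision.version')
curl -s -X PUT -H 'Content-Type: application/json' \
  -d "{\"revision\":{\"version\":$VERSION},\"component\":{\"id\":\"$PROC_ID\",\"state\":\"RUNNING\"}}" \
  "$NIFI/processors/$PROC_ID" > /dev/null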
05-29-2018
11:35 AM
1 Kudo
@Koly SALL You can use the ScanHBase processor introduced in NiFi 1.6; this processor does not store state.
(or)
Using the REST API you can clear the stored state of the GetHBase processor before getting all the records from the HBase table again. In this approach we have to stop the GetHBase processor first, then clear its state, then start the GetHBase processor again.
Flow: on the left-hand side of the flow screenshot you have to do:
Step 1: Stop the GetHBase processor. Refer to this link on how to stop a processor using the REST API, and use the Chrome developer tools to see which API calls are made while stopping the processor.
Step 2: Clear the state of the GetHBase processor: curl -X POST http://localhost:8080/nifi-api/processors/<processor-id>/state/clear-requests (or) use an InvokeHTTP processor with the HTTP method set to POST to issue the clear-state request (a property sketch is below).
Step 3: Start the GetHBase processor. Refer to this link on how to start a processor using the REST API, again using the Chrome developer tools to view the API calls.
Now on the right-hand side of the flow, schedule the GetHBase processor to run only once, i.e. use Timer driven as the scheduling strategy and keep the run schedule at 10000 min (or similar). This way we only have to schedule the left-hand-side flow, since in Step 3 we start the GetHBase processor and it is scheduled to run only once.
Let us know if you are facing any issues..!!
-
If the answer addressed your question, click the Accept button below to accept it; that helps community users find solutions to these kinds of issues quickly.
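A minimal InvokeHTTP sketch for the clear-state call in Step 2; the host/port and processor id are placeholders for your environment:
InvokeHTTP
  HTTP Method: POST
  Remote URL: http://localhost:8080/nifi-api/processors/<processor-id>/state/clear-requests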
05-29-2018
10:58 AM
@JAy PaTel When running a Hive import, the --target-dir argument value controls where the data is stored temporarily before it is loaded into the Hive table; --target-dir does not create the Hive table in that location. If you want to import to a specific directory, use --target-dir without the --hive-import argument and create a Hive table on top of the HDFS directory (see the sketch below). (or) Create a Hive external table pointing to your target-dir, then in the sqoop import remove the --create-hive-table and --target-dir arguments.
-
The issue in your comments is because the --target-dir already exists, so comment out (or) remove the --target-dir argument in your options file and run the sqoop import again.
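A rough sketch of the first option (plain HDFS import into the directory you want, no --hive-import); the connection values mirror the placeholders used above and should be replaced with your own, after which you would create the Hive external table on top of /user/root/hivetable:
sqoop import \
  --connect 'jdbc:sqlserver://<HOST>:<PORT>' \
  --username XXXX \
  --password XXXX \
  --table <mssql_table> \
  --target-dir /user/root/hivetable \
  --fields-terminated-by ',' \
  -m 1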
05-28-2018
02:51 PM
@JAy PaTel In your file.par file, keep the --target-dir argument value on a new line, not on the same line as --target-dir. I think right now your options file has --target-dir like this:
--target-dir '/user/root/hivetable'
Change the --target-dir argument value to a new line, i.e.:
--target-dir
'/user/root/hivetable'
Example: a sample file.par I have tried:
bash$ cat file.par
import
--connect
'jdbc:sqlserver://<HOST>:<PORT>'
--username
XXXX
--password
XXXX
--table
<tab_name>
--hive-import
--hive-table
default.<tab_name>
--create-hive-table
--target-dir
'/user/root/hivetable'
--fields-terminated-by
','
-m
1