Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 11120 | 04-15-2020 05:01 PM |
|  | 7019 | 10-15-2019 08:12 PM |
|  | 3064 | 10-12-2019 08:29 PM |
|  | 11244 | 09-21-2019 10:04 AM |
|  | 4189 | 09-19-2019 07:11 AM |
10-24-2017
07:38 PM
@PJ
When there is a lot of data in an ORC table, it will take a while to convert all the results and store them as a CSV file. Here is what I tried, where foo is an ORC table:
hive# select * from foo;
+---------+--+
| foo.id |
+---------+--+
| 1 |
| 2 |
| 3 |
| 4 |
+---------+--+
bash# hive -e "select * from foo1" >> foo1.txt
bash# cat foo1.txt
+----------+--+
| foo1.id |
+----------+--+
| 1 |
| 2 |
| 3 |
| 4 |
+----------+--+
With a small set of data this finishes very quickly. If the number of records is really big, the ideal way to do it is as follows:
hive# INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;
Or you can write the data to a local directory instead; just add LOCAL:
hive# INSERT OVERWRITE LOCAL DIRECTORY '<Local-Dir-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;
Also note that INSERT OVERWRITE DIRECTORY removes all the existing files under the specified folder and then creates the data as part files. This may produce multiple files, and you may want to concatenate them on the client side once the export is done (see the sketch below). Using this approach means you don't need to worry about the format of the source tables, and you can choose your own delimiters and output formats.
** I would suggest avoiding saving large files to a local directory if possible; use INSERT OVERWRITE DIRECTORY and store the results in an HDFS directory. **
For more details refer to this link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
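If the export produces several part files, one option is hadoop fs -getmerge to pull them down as a single local file. Below is a small Python sketch of the client-side concatenation; the foo_export directory and output filename are placeholders, assuming the part files have already been copied locally (e.g. with hadoop fs -get):

```python
import glob

# Concatenate locally copied part files into one CSV.
# Assumes something like: hadoop fs -get <Hdfs-Directory-Path> foo_export
part_files = sorted(glob.glob("foo_export/*"))  # drop any non-data files here if present

with open("foo.csv", "w") as out:
    for path in part_files:
        with open(path) as part:
            out.write(part.read())
```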
10-24-2017
05:57 PM
@Karl Fredrickson
Yes, the ListSFTP processor only looks for new files created after the state the processor has stored. The state value is the maximum timestamp of the files created in that directory.
Example: let's assume the ListSFTP processor has listed all the files in the directory up to 4:10, and the processor is scheduled to run every 10 minutes, so the next run is at 4:20. New files (test1.txt, test2.txt) are created at 4:11; these new files will only be listed at the 4:20 run (because the processor runs every 10 minutes), and the processor then updates the state with the 4:11 timestamp (you can view it by right-clicking on the processor and clicking View state). Although the files were created at 4:11, they will only be listed at the 4:20 run, because in that run the processor checks for new files created after the state value. If you schedule this processor to run more frequently, i.e. at an interval of less than 10 minutes, it will look for newly created files more often.
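To make the state mechanics concrete, here is a small Python sketch of the same idea; it is only an illustration of timestamp-based listing, not NiFi's actual implementation, and /tmp simply stands in for the SFTP directory:

```python
import os

# State as ListSFTP keeps it: the max timestamp seen so far (here just a dict).
state = {"max_timestamp": 0.0}

def list_new_files(directory):
    """Return only files newer than the stored state, then advance the state."""
    new_files = [os.path.join(directory, name) for name in os.listdir(directory)
                 if os.path.getmtime(os.path.join(directory, name)) > state["max_timestamp"]]
    if new_files:
        state["max_timestamp"] = max(os.path.getmtime(p) for p in new_files)
    return new_files

# Each call stands in for one scheduled run (e.g. every 10 minutes).
print(list_new_files("/tmp"))
print(list_new_files("/tmp"))  # second run lists only files newer than the stored state
```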
10-24-2017
03:05 PM
@Narasimma varman
If you want only the nifi-app_2017 log files, set the Input Directory property to <directory-path> and change the File Filter property to nifi-app_2017.*\.log. If you want any logs that have app_2017 in the filename, use .*app_2017.*\.log instead. Keep in mind that GetFile's Keep Source File property is false by default, so once it pulls the files it deletes them from the directory. If you don't want the files deleted from the directory, change Keep Source File to true and the processor won't delete the files once they have been pulled. If you set the Recurse Subdirectories property to true, make sure NiFi has access to your Input Directory. In addition, if you want to do any tailing on the logs, follow the links below on how to configure processors to tail logs:
https://community.hortonworks.com/questions/141403/in-nifi-tailing-multiple-directories-with-the-same.html?childToView=141479#comment-141479
https://community.hortonworks.com/questions/141502/how-tailfile-works-with-multiple-files.html?childToView=141517#answer-141517
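If you want to double-check which filenames the two File Filter patterns above would match, here is a quick Python sketch; the sample filenames are made up, and GetFile itself uses a Java regex matched against the whole filename, which behaves the same as Python's re.fullmatch for these simple patterns:

```python
import re

filenames = ["nifi-app_2017-10-24.log", "nifi-app_2018-01-01.log", "other-app_2017-10-24.log"]

only_nifi_app_2017 = re.compile(r"nifi-app_2017.*\.log")
any_app_2017 = re.compile(r".*app_2017.*\.log")

for name in filenames:
    print(name,
          "nifi-app_2017 filter:", bool(only_nifi_app_2017.fullmatch(name)),
          "app_2017 filter:", bool(any_app_2017.fullmatch(name)))
```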
10-23-2017
08:40 PM
1 Kudo
@dhieru singh
The ListFile processor just lists the files; what it outputs are flowfiles with attributes (absolute.path, filename, etc.) and no content, and we make use of these attributes in the FetchFile processor to do the actual fetch of the data. So if you give both NiFi nodes access to your network location, then when the primary node changes, the new primary node will pick up where the previous node left off without duplicating any of the data.
For example, consider that you have given directory access to only one node, which is the current primary node (node 1). Node 1 has listed files in the directory up to 10/23/2017 16:30, and then the primary node changes to another node (node 2). Now the primary node is node 2, and our ListFile processor is configured to run only on the primary node. Node 2 tries to access those directories to list only the files created after 10/23/2017 16:30, but node 2 has no access because we gave directory access only to node 1, so the processor throws an error (the current primary node cannot reach the directory). So we need access to the directories from both nodes (1 and 2); then, if the primary node changes, the new primary node will pick up listing files from the directories where the old primary node left off.
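A rough Python sketch of that idea (an illustration only, not NiFi code): the listing state lives at the cluster level, so whichever node becomes primary continues from the stored timestamp, provided it can also reach the directory:

```python
# The last listed timestamp is shared cluster state, not per-node state.
cluster_state = {"last_listed": "10/23/2017 16:30"}

def run_listing(node, has_directory_access):
    if not has_directory_access:
        # This is the error case described above: the new primary node
        # cannot reach the network directory.
        raise PermissionError(f"{node} cannot reach the network directory")
    print(f"{node} lists only files created after {cluster_state['last_listed']}")

run_listing("node1", has_directory_access=True)  # original primary node
run_listing("node2", has_directory_access=True)  # new primary picks up where node1 left off
```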
10-23-2017
07:29 PM
2 Kudos
@Arsalan Siddiqi
One way to do this is to extract all the content of the JSON file as attributes on the flowfile. Then, once you make the InvokeHTTP call, extract the results of InvokeHTTP to attributes again. By following this process all the content ends up as attributes of the flowfile, i.e. the attributes associated with the flowfile include the content of your JSON file and the results of your InvokeHTTP processors, because we are extracting every piece of content and keeping it as flowfile attributes. Finally, use the AttributesToJSON processor and list all the attributes you need; this processor will produce a JSON message, which you can then store in MongoDB.
Flow:
1. EvaluateJsonPath // extract the content to flowfile attributes
2. InvokeHTTP
3. EvaluateJsonPath // extract the result of InvokeHTTP to flowfile attributes
4. AttributesToJSON // list all the attributes you need, so that this processor prepares the new JSON message
Example:
Input JSON file:
{
"id": 1,
"name": "HCC",
"age": 20
}
Extract all the content, i.e. id, name, age, using EvaluateJsonPath with the Destination property set to flowfile-attribute; all the content is now associated with the flowfile as attributes (EvaluateJsonPath configs screenshot). Then do the InvokeHTTP call; the result flowfile from InvokeHTTP will also have the id, name, age attributes associated with it. Use another EvaluateJsonPath (if the response from InvokeHTTP is JSON) and extract the InvokeHTTP results as attributes again. Let's say the InvokeHTTP result is { "dept": "community" }; extract dept as an attribute. Right now your flowfile has id, name, age, dept as attributes. You can do as many InvokeHTTP calls as you want; after each one, extract the content as attributes. Now you need to prepare a JSON message: use the AttributesToJSON processor and set the Attributes List property to id,name,age,dept. This processor prepares a new JSON message with all the attributes you listed in the processor:
{ "id": 1, "name": "HCC", "age": 20, "dept": "community" }
Use the JSON result from the AttributesToJSON processor to store into MongoDB (AttributesToJSON configs screenshot).
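The attribute flow can be summarised with a small Python sketch (an illustration of the data movement only, not NiFi code; note that real NiFi attribute values are strings, while this sketch keeps the original JSON types for readability):

```python
import json

attributes = {}

# 1. EvaluateJsonPath on the incoming content (Destination = flowfile-attribute)
incoming = {"id": 1, "name": "HCC", "age": 20}
attributes.update(incoming)

# 2./3. InvokeHTTP response, extracted by a second EvaluateJsonPath
invokehttp_response = {"dept": "community"}
attributes.update(invokehttp_response)

# 4. AttributesToJSON with Attributes List = id,name,age,dept builds the new flowfile content
attributes_list = ["id", "name", "age", "dept"]
print(json.dumps({k: attributes[k] for k in attributes_list}))
# {"id": 1, "name": "HCC", "age": 20, "dept": "community"}
```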
10-22-2017
03:51 AM
1 Kudo
Hi @Mohan, sure. We can get the results you expect by using:
EvaluateXQuery // keep all the required content as attributes of the flowfile
UpdateAttribute // update the contents of the attributes extracted in the EvaluateXQuery processor
ReplaceText // replace the flowfile content with the flowfile attributes
PutHDFS // store the files in HDFS
EvaluateXQuery configuration: change the existing properties
1. Destination to flowfile-attribute
2. Output: Omit XML Declaration to true
Add new properties by clicking the + sign:
1. author //author
2. book //book
3. bookstore //bookstore
Input:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>
Output: as you can see in the screenshot, all the content is present as attributes (book, bookstore, author) on the flowfile. (EvaluateXQuery processor configs screenshot.)
UpdateAttribute processor:
1. author
${author:replaceAll('<author>([\s\S]+.*)<\/author>','$1')}
Updating the author attribute. Input to the UpdateAttribute processor:
<author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author>
Output:
<first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award>
2. book
${book:replaceAll('<book\s(.*)>[\s\S]+<\/author>([\s\S]+)<\/book>','$1$2')}
Input:
<book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book>
Output:
style="autobiography" <price>12</price>
3. bookstore
${bookstore:replaceAll('.*<bookstore\s(.*?)>[\s\S]+.*','$1')}
Input:
<bookstore specialty="novel"> <book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book> </bookstore>
Output:
specialty="novel"
ReplaceText processor (configs screenshot): change the Replacement Strategy property to Always Replace and use your attributes (bookstore, book, author) in this processor; we overwrite the existing contents of the flowfile with the new content. Add two more ReplaceText processors for the book and author attributes.
Output:
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
PutHDFS processor: configure the processor and give the directory name where you want to store the data.
Flow screenshot: for testing purposes I used a GenerateFlowFile processor, but in your case the source processor will be wherever you are getting this XML data from.
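If you want to test the three replaceAll() expressions from the UpdateAttribute step outside NiFi, here is a Python sketch; the patterns are the same, except that NiFi's expression language uses Java-style $1 backreferences (written \1 in Python) and the \/ escapes are just literal / characters:

```python
import re

author = """<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>"""
print(re.sub(r"<author>([\s\S]+.*)</author>", r"\1", author))

book = '<book style="autobiography">\n' + author + "\n<price>12</price>\n</book>"
print(re.sub(r"<book\s(.*)>[\s\S]+</author>([\s\S]+)</book>", r"\1\2", book))

bookstore = '<bookstore specialty="novel">\n' + book + "\n</bookstore>"
print(re.sub(r".*<bookstore\s(.*?)>[\s\S]+.*", r"\1", bookstore))
```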
10-20-2017
07:40 PM
Hi @dhieru singh,
When you run the DistributeLoad processor with Next Available as the distribution strategy, it means that if one of the destinations (either 1 or 2) won't accept flowfiles (i.e. it has reached its max queue size, etc.), the processor transfers those flowfiles to the next available destination. If you keep the distribution strategy as Round Robin with Number of Relationships set to 2, the processor distributes the load only when both destinations (1 and 2) will accept flowfiles; if one destination's queue is full and the other destination's queue is empty, the processor still won't transfer flowfiles to the two destinations, because with the Round Robin strategy it transfers flowfiles only when both destinations accept them.
So if you configure Next Available as the strategy and one of the destinations is not accepting flowfiles, the order of incoming events changes (all the flowfiles go to the destination that is accepting them); otherwise it won't change the order of events. If you get 2 flowfiles, the first flowfile goes to one destination and the second flowfile goes to the other (or vice versa); there is no guarantee of the order of transfer. If you configure Round Robin, the processor evenly distributes the load to both destinations, and in addition it checks whether the destinations are accepting flowfiles before sending flowfiles to them. You need to decide which strategy fits your case.
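As a rough illustration of the difference between the two strategies (a sketch of the routing idea only, not NiFi's implementation):

```python
from itertools import cycle

destinations = ["relationship_1", "relationship_2"]

def round_robin(flowfiles):
    # Strict alternation: each destination gets every other flowfile.
    order = cycle(destinations)
    return [(ff, next(order)) for ff in flowfiles]

def next_available(flowfiles, accepting):
    # Each flowfile goes to whichever destination can currently accept it,
    # so a full queue on one side changes the delivery order.
    return [(ff, next(d for d in destinations if accepting[d])) for ff in flowfiles]

print(round_robin(["f1", "f2", "f3", "f4"]))
print(next_available(["f1", "f2"], {"relationship_1": False, "relationship_2": True}))
```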
10-20-2017
02:24 PM
@mayki wogno, we can alias the delta-1 column as delta_1; this alias will help you resolve the issue:
select country,year,`delta-1` as delta_1 from <table>
10-20-2017
12:58 PM
@mayki wogno Can you use the Hive escape character in your select statement? The Hive escape character is the backtick (`). Use it to escape the column name in your select statement and try to run the SelectHiveQL query again.
New SelectHiveQL statement:
select country,year,`delta-1` from <table>
(SelectHiveQL processor config screenshot.)
10-20-2017
12:55 PM
Hi @Mathi Murugan, the simple way to retrieve a particular row key from HBase to HDFS is to run the scan command by echoing it into the hbase shell, store the results locally, and then copy the results to HDFS. A sample shell script would be:
bash# cat hbase_scan.sh
echo "scan 'test_use',{FILTER =>\"(PrefixFilter ('4'))\"}"|hbase shell>hbase.txt
hadoop fs -put -f /<local-path-to>/hbase.txt /<hadoop-path>/
wait
rm <local-path-to>/hbase.txt In this script we are storing the scan results to hbase.txt file in local Then copying the file to HDFS Then deleting the local file.