Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
Views | Posted
---|---
11113 | 04-15-2020 05:01 PM
7010 | 10-15-2019 08:12 PM
3061 | 10-12-2019 08:29 PM
11231 | 09-21-2019 10:04 AM
4182 | 09-19-2019 07:11 AM
10-01-2017
02:54 PM
@Yair Ogen, can you do desc formatted csvdemo; ? You can get the location of the table, i.e. where it is stored in HDFS, and change permissions on that parent directory. Remove the recursive flag (-R) and try to run the below command:

hadoop fs -chmod 777 <table location directory>
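A minimal sketch of that sequence from the shell (the warehouse path below is hypothetical; use whatever Location your own desc formatted output prints):

# Pull the HDFS location out of the table metadata
hive -e "describe formatted csvdemo;" | grep -i location
# Suppose it prints /apps/hive/warehouse/csvdemo (hypothetical), then:
hadoop fs -chmod 777 /apps/hive/warehouse/csvdemo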
10-01-2017
01:57 PM
Hi @Yair Ogen, right now csvdemo is a Parquet table and expects the data being loaded into it to already be in Parquet format, but you are loading CSV data into a Parquet table.

1. Create a normal text table:

CREATE TABLE csvdemo (id Int, name String, email String)
row format delimited
fields terminated by ','
STORED AS TEXTFILE;

2. Load the data into the text table:

load data inpath '/user/admin/MOCK_DATA.csv' into table csvdemo;

3. Then create another table in Parquet format:

CREATE TABLE csvdemo_prq (id Int, name String, email String) stored as parquet;

4. Insert into the Parquet table by selecting from the text table:

insert into csvdemo_prq select * from csvdemo;

Then your csvdemo_prq table is in Parquet format; Hive doesn't convert CSV data directly into Parquet.
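As a quick sanity check after step 4, something along these lines should confirm the conversion (a sketch; the expected row count depends on your MOCK_DATA.csv):

# Row count should match the original CSV
hive -e "select count(*) from csvdemo_prq;"
# The table should report the Parquet input format
hive -e "describe formatted csvdemo_prq;" | grep -i inputformat
# expected: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat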
09-30-2017
08:51 PM
Hi @Simon Jespersen, you can use an ExtractText processor to get the status value onto the flowfile as an attribute, then convert the CSV to Avro, and then use a RouteOnAttribute processor to split the data into 2 routes.

GetFile-----> SplitText-----> ExtractText----> InferAvroSchema-----> ConvertCSVToAvro-----> RouteOnAttribute----> (status=6),(status=7)

SplitText Processor:- we need to extract the status value as an attribute; for this purpose we split the file so that each record becomes a separate flowfile, and the input to the ExtractText processor is one record per flowfile. Connect the splits relation to the ExtractText processor.

Input:-
id=10,age=10,status=6,salary=90000
id=11,age=11,status=7,salary=100000
id=12,age=12,status=8,salary=110000

Output:- with Line Split Count set to 1 in the processor configs, SplitText splits the file into individual records, one record per flowfile.
ff1:- id=10,age=10,status=6,salary=90000
ff2:- id=11,age=11,status=7,salary=100000
ff3:- id=12,age=12,status=8,salary=110000

ExtractText Processor:- evaluates one or more regular expressions against the content of a flowfile; the results of those regular expressions are assigned to flowfile attributes. We need to extract the status value from the flowfile content into an attribute by using a regex.

Config:- add a new property named status to extract the value as an attribute of the flowfile:
status=(\d*)

With this processor we have extracted the status value as an attribute of every flowfile. Then use the InferAvroSchema and ConvertCSVToAvro processors, and then use RouteOnAttribute to split the flowfiles by adding the below properties:
status=6 ${status:equals("6")}
status=7 ${status:equals("7")}

Then use these two relationships to connect to the next processors.

Flow:- make sure in your flow that you have connected only the exact same relations that are in the screenshot to the next processors. Hope this helps...!!!
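If you want to sanity-check that regex outside NiFi first, a one-liner like this (GNU grep assumed) mimics what ExtractText captures:

# \K drops the literal "status=", so only the capture-group value prints
echo 'id=10,age=10,status=6,salary=90000' | grep -oP 'status=\K\d*'
# prints: 6 -- the value ExtractText would assign to the status attribute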
09-29-2017
10:57 PM
1 Kudo
@manisha jain, in your process group there are an ExecuteProcess processor and an Output port, and you push the data out through the Output port. We need to receive the data pushed out of the Output port using one of:
1. a processor
2. another process group
3. a remote process group

An Output port does not have the capability to transmit data outside of NiFi on its own; we use an Output port inside a process group only to push data out of that group. Here I have used an Output port inside Process Group 1 and connected it to:

1. an UpdateAttribute processor: you can use any other processor that allows incoming connections in place of UpdateAttribute.
2. another process group: in this group you need an Input port that gets the data from the Output port in PG1; we use the Input port inside PG2 to receive the data from the Output port.
3. a remote process group.

These are the only ways you can get the data pushed from the Output port; if the output (or) input ports are not inside the process groups, they will not transmit or receive any data.
09-29-2017
08:26 PM
@manisha jain, an Output port only transfers data from a process group to outside processors (or) process groups. Can you add a process group and use these two processors inside it? Then you can enable transmission from the Output port.

Flow:- keep all your processors inside the process group.
09-29-2017
03:45 PM
@Yacine Belhoul, as per the Apache Hadoop model we cannot create directories whose names have / or : characters in them: HDFS path elements MUST NOT contain the characters {'/', ':'}. The only way is to replace the colon (:) with %3A:

hadoop fs -mkdir /2017-09-28\ 12%3A00%3A09.0

If you do dynamic partitioning by a timestamp field, Hive also stores these colons (:) replaced with %3A in the HDFS directories. My dynamic-partition timestamp values are:

2011-07-07 02:04:51.0
2011-07-07 02:04:52.0
2013-01-30 08:27:16.0

Once I'm done creating dynamic partitions for the table, if I list out the directories, the colons (:) are replaced with %3A.

HDFS directories for the dynamic partitions:

/apps/hive/warehouse/test_fac/dat=2011-07-07 02%3A04%3A51.0
/apps/hive/warehouse/test_fac/dat=2011-07-07 02%3A04%3A52.0
/apps/hive/warehouse/test_fac/dat=2013-01-30 08%3A27%3A16.0

Show partitions for the dynamically partitioned table:- if you list the partitions that are in the table, Hive shows those partitions with %3A as a replacement for the colon (:).

show partitions test_fac;
+--------------------------------+--+
|           partition            |
+--------------------------------+--+
| dat=2011-07-07 02%3A04%3A51.0  |
| dat=2011-07-07 02%3A04%3A52.0  |
| dat=2013-01-30 08%3A27%3A16.0  |
+--------------------------------+--+

I tried to add a partition to the table:

alter table test_fac add partition(dat='2017-09-29 90:00:00');

It still replaces the colon (:) with %3A.

show partitions test_fac;
+------------------------------+--+
|          partition           |
+------------------------------+--+
| dat=2017-09-29 90%3A00%3A00  |
+------------------------------+--+

But in the local file system we can create directories with colon (:) characters in them.

Example:-
[~]$ mkdir 2017-09-28\ 12\:00\:09
[~]$ ls
2017-09-28 12:00:09
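A small shell sketch of that encoding step (the /tmp path here is hypothetical):

# Replace every colon with %3A before creating the HDFS directory
ts='2017-09-28 12:00:09.0'
encoded=$(printf '%s' "$ts" | sed 's/:/%3A/g')
hadoop fs -mkdir "/tmp/dat=$encoded"
hadoop fs -ls /tmp    # shows /tmp/dat=2017-09-28 12%3A00%3A09.0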
09-28-2017
11:06 PM
Hi @Sanaz Janbakhsh, this behaviour is surely because of the Destination property in your AttributesToJSON processor: the default is flowfile-attribute, but we need to change that property to flowfile-content.

AttributesToJSON configs screenshot:- change the highlighted property. You are using flowfile-attribute, which means the processor writes the list of attributes to a flowfile attribute, but we need all the attributes written into the flowfile content. Hope this helps....!!
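For illustration (the attribute names here are hypothetical), with Destination set to flowfile-content the outgoing flowfile's content itself becomes the JSON document, e.g.:

{"filename":"sample.csv","status":"6"}

whereas with flowfile-attribute that same JSON string would only land in a new attribute (JSONAttributes) and the content would stay unchanged.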
09-28-2017
01:25 AM
@Adrian Oprea, Flow:-
09-28-2017
01:23 AM
1 Kudo
@Adrian Oprea, hi, I have the same input JSON as you in a GenerateFlowFile processor to test the entire flow:

{
"ARTGEntryJsonResult": {
"AnnualChargeExemptWaverDate": "null",
"Conditions": [
""
],
"ConsumerInformation": {
"DocumentLink": ""
},
"EntryType": "Medicine",
"LicenceClass": "",
"LicenceId": "152567"
},
"Products": [
{
"AdditionalInformation": [],
"Components": [
{
"DosageForm": "Drug delivery system, transdermal",
"RouteOfAdministration": "Transdermal",
"VisualIdentification": "Dull, homogenous"
}
],
"Containers": [
{
"Closure": "",
"Conditions": [
"Store at room temperature"
],
"LifeTime": "2 Years",
"Material": null,
"Temperature": "Store below 25 degrees Celsius",
"Type": "Sachet"
}
],
"EffectiveDate": "2017-09-18",
"GMDNCode": "",
"GMDNTerm": "",
"Ingredients": [
{
"Name": "Fentanyl",
"Strength": "6.3000 mg"
}
],
"Name": "FENTANYL SANDOZ ",
"Packs": [
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "1"
},
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "10"
},
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "2"
},
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "3"
},
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "4"
},
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "5"
},
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "7"
},
{
"PoisonSchedule": "(S8) Controlled Drug",
"Size": "8"
}
],
"SpecificIndications": [
"Management of chronic pain requiring opioid analgesia."
],
"StandardIndications": [],
"Type": "Single Medicine Product",
"Warnings": []
}
]
}

JoltTransformation:- I use this processor because we need to extract the Name attribute from the Products array.

[
{
"operation": "shift",
"spec": {
"*": "&",
"Products": {
"*": "Products"
}
}
}
]

After this processor we get output like below:

{
"ARTGEntryJsonResult" : {
"AnnualChargeExemptWaverDate" : "null",
"Conditions" : [ "" ],
"ConsumerInformation" : {
"DocumentLink" : ""
},
"EntryType" : "Medicine",
"LicenceClass" : "",
"LicenceId" : "152567"
},
"Products" : {
"AdditionalInformation" : [ ],
"Components" : [ {
"DosageForm" : "Drug delivery system, transdermal",
"RouteOfAdministration" : "Transdermal",
"VisualIdentification" : "Dull, homogenous"
} ],
"Containers" : [ {
"Closure" : "",
"Conditions" : [ "Store at room temperature" ],
"LifeTime" : "2 Years",
"Material" : null,
"Temperature" : "Store below 25 degrees Celsius",
"Type" : "Sachet"
} ],
"EffectiveDate" : "2017-09-18",
"GMDNCode" : "",
"GMDNTerm" : "",
"Ingredients" : [ {
"Name" : "Fentanyl",
"Strength" : "6.3000 mg"
} ],
"Name" : "FENTANYL SANDOZ ",
"Packs" : [ {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "1"
}, {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "10"
}, {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "2"
}, {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "3"
}, {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "4"
}, {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "5"
}, {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "7"
}, {
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "8"
} ],
"SpecificIndications" : [ "Management of chronic pain requiring opioid analgesia." ],
"StandardIndications" : [ ],
"Type" : "Single Medicine Product",
"Warnings" : [ ]
}
}

without Products as an array; now it's easy to get the Name attribute from the JSON message.

EvaluateJsonPath:- to extract the licenseid and name attributes from the content, I added the below properties:
licenseid as $.ARTGEntryJsonResult.LicenceId
name as $.Products.Name

Split the Packs array using a SplitJson processor:- change the JsonPath Expression property to
$.Products.Packs
Once you split the Packs array, every message in the array becomes a separate flowfile. As my input has 8 messages in the Packs array, we get 8 flowfiles, each with the licenseid and name attributes associated with it.

Extract PoisonSchedule and Size using EvaluateJsonPath:- now we need to extract the contents of each flowfile as attributes, as each flowfile now holds
{
"PoisonSchedule" : "(S8) Controlled Drug",
"Size" : "1"
}
so we add 2 properties in the processor and change the Destination property to flowfile-attribute:
PoisonSchedule as $.PoisonSchedule
Size as $.Size

Right now we have all the desired contents of the message as attributes, so we can use a ReplaceText processor (if you need the output as text) (or) an AttributesToJSON processor (if you want a JSON message).

ReplaceText Processor:- we have the list of attributes
${licenseid},${name},${PoisonSchedule},${Size}
Here I kept , as the separator; you can use whatever you like. Put them in the Replacement Value property and change Replacement Strategy to Always Replace.
Output:-
flowfile1:- 152567,FENTANYL SANDOZ ,(S8) Controlled Drug,4
flowfile2:- 152567,FENTANYL SANDOZ ,(S8) Controlled Drug,3

AttributesToJSON:- if you want to convert the results to JSON documents, use this processor and in the Attributes List property keep
licenseid,name,PoisonSchedule,Size
It converts the attributes into a JSON message.
Output:-
{"Size":"10","PoisonSchedule":"(S8) Controlled Drug","name":"FENTANYL SANDOZ ","licenseid":"152567"}

If you want to merge these flowfiles together, use a MergeContent processor and change its properties to fit your requirements.

Flow screenshot:- attached in the comments. Hope this helps...!!
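If you want to sanity-check those JSON paths outside NiFi, a rough jq equivalent of the whole flow would be (a sketch; it assumes the input JSON above is saved as input.json):

# One output line per pack: licenseid,name,PoisonSchedule,Size
jq -r '.ARTGEntryJsonResult.LicenceId as $id
       | .Products[0].Name as $name
       | .Products[0].Packs[]
       | "\($id),\($name),\(.PoisonSchedule),\(.Size)"' input.json
# prints lines like: 152567,FENTANYL SANDOZ ,(S8) Controlled Drug,1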
09-27-2017
09:12 PM
1 Kudo
@Foivos A, New_Flow:- in this screenshot I have used a GetHDFS processor, but you can replace it with:

ListHDFS --> FetchHDFS --> PutFTP (files without parsing) --> UpdateAttribute (files that got parsed) --> PutFTP

Once you have listed all the files in the HDFS directory, use a PutFTP processor, and connect the success relation of the first PutFTP processor to the next UpdateAttribute processor.

Here I have looped the failure and rejected relations back to the same PutFTP processor, so from the first PutFTP processor we only move forward with the files that actually got stored to the remote location; all the files that got rejected or failed to store are routed back to retry one more time with the same processor.

Only the files that got successfully stored to the remote location are routed to the UpdateAttribute processor; there you can parse and change the names of the files before storing them to the remote location with the second PutFTP processor. In this method we only process the files that got successfully stored to the remote location by the first PutFTP processor (as we use only the success relation from the first PutFTP to UpdateAttribute). All the other relations, like rejected or failure, you can auto-terminate or retry.

Note:- we have looped failure and rejected back to the same processor; if some files stay in these relations they will keep retrying against the same processor, which puts more load on the cluster.