Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11300 | 04-15-2020 05:01 PM |
| | 7195 | 10-15-2019 08:12 PM |
| | 3171 | 10-12-2019 08:29 PM |
| | 11672 | 09-21-2019 10:04 AM |
| | 4406 | 09-19-2019 07:11 AM |
03-19-2018
03:44 AM
@Kok Ching Hoo You don't even need the JoltTransformJSON processor to get only the entry array as the flowfile content. You can achieve the same result more easily with the SplitJson processor. Configure SplitJson with the JsonPath Expression $.*.*.*.* With this wildcard expression it doesn't matter if the header value changes or the array key changes from entry to entry1, exit, etc.; as long as the JSON message keeps the same structure (the same dependency applies to the Jolt approach as well), this method works, because we split the array itself and then route the splits relationship to the next processors. If you want to do this dynamically, without depending on the attribute names defined in the incoming JSON message/object, go with this approach. A sketch of what the wildcard path matches is shown below.
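To see what that expression selects, here is a plain-Python sketch (not NiFi code) of the same four-wildcard descent, run against a hypothetical flowfile content shaped like the messages discussed above:

```python
import json

def wildcard(nodes):
    """One '*' step: fan out to every child of every dict or list node."""
    out = []
    for node in nodes:
        if isinstance(node, dict):
            out.extend(node.values())
        elif isinstance(node, list):
            out.extend(node)
    return out

# Hypothetical flowfile content with the nesting described above.
content = json.loads('{"XMLfile_2234": {"xsi:schemaLocation": "http://xml.mscibarra.com/random.xsd",'
                     '"dataset_12232": {"entry": ['
                     '{"record_date": "2017-03-01","country": "USA","funds": "100"},'
                     '{"record_date": "2018-03-01","country": "Chile","funds": "10000"}]}}}')

# $.*.*.*.*  ==  four wildcard steps from the root, regardless of key names
matches = [content]
for _ in range(4):
    matches = wildcard(matches)

for split in matches:              # SplitJson emits one flowfile per match
    print(json.dumps(split))
```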
03-18-2018
07:38 AM
@Kok Ching Hoo For Method 1: in the SplitJson processor use a JsonPath Expression like $.['XMLFile_2234.DAILY'].dataset_12232.entry so that the period in XMLFile_2234.DAILY is treated as part of the key rather than a path separator. For Method 2: increase the values of the properties below in your ExtractText processor according to your flowfile size and capture group length. Maximum Buffer Size
1 MB
Maximum Capture Group Length
1024
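A quick plain-Python illustration of why the bracket notation matters (the sample content here is hypothetical, shaped like the message above):

```python
import json

# "XMLFile_2234.DAILY" is a single JSON key, so the JsonPath has to treat it as
# one field ($.['XMLFile_2234.DAILY']...) instead of two dot-separated steps.
content = json.loads('{"XMLFile_2234.DAILY": {"dataset_12232": {"entry": ['
                     '{"record_date": "2017-03-01","country": "USA","funds": "100"}]}}}')

entry = content["XMLFile_2234.DAILY"]["dataset_12232"]["entry"]   # one lookup, not two
print(json.dumps(entry))
```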
03-17-2018
08:00 PM
@Sai Krishna Makineni
Option 1: HBase. We can avoid lookup joins by denormalizing the tables while storing into HBase. Say you have an employee record with employeeid 1, name hcc, salary 1000, departmentid 100, and you need the department name added to that record. Keep a departments table with departmentid as the rowkey; for example departmentid 100 has departmentname kb. Once you get the employee record in Avro format, use the ConvertAvroToJSON processor, extract all the JSON values into attributes, then use the FetchHBaseRow processor to look up the departmentid and keep the departmentname as an attribute. Finally, use the AttributesToJSON processor to recreate the final record as follows: {"employeeid" :"1", "name": "hcc", "salary" :"1000","departmentid": "100","departmentname" :"kb"} (a small sketch of this enrichment appears after this post). For more reference: http://hbase.apache.org/0.94/book/joins.html
Option 2: Phoenix. You can create a Phoenix table that uses indexing on top of the HBase table; queries will be much faster than Hive, as you can see when comparing Hive-on-HBase vs Phoenix-on-HBase. https://phoenix.apache.org/performance.html https://stackoverflow.com/questions/29405629/apache-phoenix-vs-hive-spark https://www.quora.com/How-does-Apache-Phoenix-reduce-latency-over-Hive https://stackoverflow.com/questions/28360888/apache-phoenix-vs-hbase-native-api
Option 3: Spark on HBase. You can also load HBase tables directly into Spark, without creating Phoenix (or) Hive tables on top, by using the Spark-on-HBase connector. With this connector you can load all the required tables into Spark and perform all the joins there. Please refer to the link below for more details: https://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
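Here is a minimal plain-Python sketch of the Option 1 enrichment, standing in for the EvaluateJsonPath -> FetchHBaseRow -> AttributesToJSON chain; the departments dict plays the role of the HBase lookup table keyed by departmentid:

```python
import json

# Lookup table standing in for the HBase departments table (rowkey -> columns).
departments = {"100": {"departmentname": "kb"}}

# Employee record as it would arrive after ConvertAvroToJSON.
employee = json.loads('{"employeeid": "1", "name": "hcc", "salary": "1000", "departmentid": "100"}')

# "FetchHBaseRow": look up the department row by the extracted departmentid.
lookup = departments.get(employee["departmentid"], {})

# "AttributesToJSON": rebuild the record with the enriched attribute added.
employee["departmentname"] = lookup.get("departmentname", "")
print(json.dumps(employee))
# {"employeeid": "1", "name": "hcc", "salary": "1000", "departmentid": "100", "departmentname": "kb"}
```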
03-17-2018
10:08 AM
1 Kudo
@Pavan M As you are not transferring any of the flowfiles to REL_FAILURE, transfer the flowfiles from the else branch to the failure relationship and auto-terminate that relationship:
else:
    session.transfer(flowFile, REL_FAILURE)
(or) You can use session.remove to drop the flowfile instead:
else:
    session.remove(flowFile)
By using either of the above ways you can achieve the result you are expecting. A fuller sketch of the script body is below.
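For context, a minimal sketch of how the whole Jython script body could look inside ExecuteScript; session, REL_SUCCESS and REL_FAILURE are bindings the processor provides, and matches_condition stands in for whatever check your script already performs:

```python
# Runs inside NiFi ExecuteScript (Jython); session / REL_SUCCESS / REL_FAILURE
# are provided by the processor, matches_condition is a hypothetical predicate.
flowFile = session.get()
if flowFile is not None:
    if matches_condition(flowFile):
        session.transfer(flowFile, REL_SUCCESS)
    else:
        # Option 1: route to failure and auto-terminate that relationship
        session.transfer(flowFile, REL_FAILURE)
        # Option 2 (use instead of the line above): drop the flowfile entirely
        # session.remove(flowFile)
```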
03-17-2018
09:42 AM
@Jandoubi Chaima
By using the EvaluateJsonPath processor you can extract all the values of the JSON keys and keep them as attributes of the flowfile. EvaluateJsonPath configs: keep the Destination property as flowfile-attribute and add a property for each JSON key in the message, as shown in the screenshot above. Change the property below:
Destination
flowfile-attribute
Add these new properties to the processor:
header_noun
$.header_noun
header_verb
$.header_verb
Once you have configured this EvaluateJsonPath processor, we can create the insert statement in a ReplaceText processor. ReplaceText processor configs:
Search Value
(?s)(^.*$)
Replacement Value
insert into <db-name>.<table-name> values('${header_verb}','${header_noun}'...);
Maximum Buffer Size
1 MB //change if the flowfile size is greater than 1 MB
Replacement Strategy
Always Replace
Evaluation Mode
Entire text
In this processor we build insert statements for the Hive table from all the attribute values extracted by the EvaluateJsonPath processor (a small sketch of the resulting statement is shown below). Then use the PutHiveQL processor to execute all the insert statements. (or) Method 2: the easier/better way is to use the ConvertRecord processor, which also accepts an array of JSON messages/objects. Since you have a JSON message, use ConvertRecord with a JSON reader and an Avro record set writer; in this processor we convert the JSON array of messages/objects into Avro. Then use the ConvertAvroToORC processor to convert the Avro format to ORC (ORC is optimized for the Tez execution engine). Use the PutHDFS processor to store the data into an HDFS directory and create a Hive table on top of that directory (you can use the hive.ddl attribute from ConvertAvroToORC to create the table). Flow: GetHDFS --> ValidateXML ---> TransformXMLToJson ---> JoltTransformation ---> ConvertRecord ---> ConvertAvroToORC ---> PutHDFS. If you want to create the Hive table in the flow itself, add new processors after the PutHDFS processor: ReplaceText ---> PutHiveQL. References for the ConvertRecord processor: https://community.hortonworks.com/articles/115311/convert-csv-to-json-avro-xml-using-convertrecord-p.html Create Hive table references: https://community.hortonworks.com/articles/87632/ingesting-sql-server-tables-into-hive-via-apache-n.html Load data into Hive references: https://community.hortonworks.com/questions/81749/what-is-the-best-approach-to-load-data-into-hive-u.html Let us know if you are facing any issues..!!
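To illustrate what the ReplaceText step produces, here is a small local Python sketch of the substitution that the Expression Language references ${header_verb} and ${header_noun} perform; the attribute values and the table name are made up for the example:

```python
# Attribute values as EvaluateJsonPath would have extracted them (hypothetical values).
attributes = {"header_verb": "create", "header_noun": "order"}

# The ReplaceText replacement value, with Python string formatting standing in
# for NiFi Expression Language substitution.
template = "insert into mydb.mytable values('{header_verb}','{header_noun}');"
statement = template.format(**attributes)

print(statement)   # insert into mydb.mytable values('create','order');
```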
03-17-2018
09:00 AM
@Kok Ching Hoo If you want to split the JSON content on the entry array, you don't have to use regex at all. Method 1: use the SplitJson processor with the config below. JsonPath Expression
$.XMLfile_2234.dataset_12232.entry
Then use the splits relationship from the SplitJson processor to connect to the next processor.
Input JSON content (the flowfile content fed into SplitJson): {"XMLfile_2234": {"xsi:schemaLocation": "http://xml.mscibarra.com/random.xsd","dataset_12232": {"entry": [{"record_date": "2017-03-01","country": "USA","funds": "100"},{"record_date": "2018-03-01","country": "Chile","funds": "10000"}]}}}
Output from the SplitJson processor:
flowfile1: {"record_date":"2017-03-01","country":"USA","funds":"100"}
flowfile2: {"record_date":"2018-03-01","country":"Chile","funds":"10000"}
(or) Method 2: you can use an ExtractText processor to extract the entry array and keep it as an attribute, then use a ReplaceText processor to overwrite the existing content of the flowfile with the new array attribute value, and finally use a SplitJson processor to split the array. ExtractText configs: add a new property to the ExtractText processor by clicking the + sign at the top right corner, then add the property below.
array
"entry": (.*])
Now we extract the whole entry array message and keep it in the array attribute of the flowfile. Then use the ReplaceText processor to replace the contents of the flowfile with the array attribute value. ReplaceText configs:
Replacement Value
${array}
Replacement Strategy
Always Replace
Change the above property values in the ReplaceText processor; in this processor we write the array attribute value as the contents of the flowfile. Now the entry array message is our flowfile content, so we can use a SplitJson processor to split the array into individual messages.
Input flowfile content: {"XMLfile_2234": {"xsi:schemaLocation": "http://xml.mscibarra.com/random.xsd","dataset_12232": {"entry": [{"record_date": "2017-03-01","country": "USA","funds": "100"},{"record_date": "2018-03-01","country": "Chile","funds": "10000"}]}}}
Output flowfile content: [{"record_date": "2017-03-01","country": "USA","funds": "100"},{"record_date": "2018-03-01","country": "Chile","funds": "10000"}]
As you can notice, the output flowfile content changes in this processor. SplitJson processor: JsonPath Expression
$.*
Flow: 1. ExtractText processor
2. ReplaceText processor
3. SplitJson processor
Both methods produce the same output, but Method 1 is the easier way to complete this task. A local sketch of both methods is shown below.
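Here is a local plain-Python sketch of both methods against the sample content above; it is only meant to show that they end up with the same splits, not how NiFi runs them:

```python
import json
import re

content = ('{"XMLfile_2234": {"xsi:schemaLocation": "http://xml.mscibarra.com/random.xsd",'
           '"dataset_12232": {"entry": ['
           '{"record_date": "2017-03-01","country": "USA","funds": "100"},'
           '{"record_date": "2018-03-01","country": "Chile","funds": "10000"}]}}}')

# Method 1: navigate straight to the entry array, like $.XMLfile_2234.dataset_12232.entry
entry = json.loads(content)["XMLfile_2234"]["dataset_12232"]["entry"]
splits = [json.dumps(record) for record in entry]          # one flowfile per record

# Method 2: regex-extract the array text, like the ExtractText pattern "entry": (.*])
array_text = re.search(r'"entry": (.*])', content).group(1)
assert json.loads(array_text) == entry                     # same records either way

for flowfile in splits:
    print(flowfile)
```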
03-16-2018
01:34 PM
@Pavan M Could you please see my edited answer? Let me know if you are still having issues.
03-16-2018
12:55 PM
@hema moger Yes, you are right. Once you successfully fetch any file from the SFTP server, the data provenance will be updated accordingly.
03-16-2018
12:48 PM
1 Kudo
@Pavan M
Method 1: you can do this with the RouteOnAttribute processor and a negated expression using the not function.
exclude tables
${db.table.name:in("table_name1,table_name2,table_name3"):not()}
With the above expression we exclude all the table names that are listed. (or) Using the startsWith function:
${db.table.name:startsWith("table_"):not()}
With the above expression we match all table names that start with table_ and use not to exclude them. A small sketch of this routing logic is shown after this post.
Method 2: include all the possible table name patterns, combining them with the or operator (|), in the Table schema pattern property: a pattern for matching tables in the database. Within a pattern, "%" means match any substring of 0 or more characters, and "_" means match any one character. The pattern must match the table name as it is stored in the database. If the property is not set, all tables will be retrieved.
Example: if you want only the tables whose names start with ua (or) ub, use the pattern below.
[ua|ub]%
Now we get the list of tables whose names start with ua (or) ub followed by 0 or more characters.
[ua|ub]_
With the above pattern we match all table names starting with ua (or) ub followed by any one character.
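A plain-Python sketch of the routing decision the two Method 1 expressions make per flowfile; the table names here are just examples:

```python
# Mirrors ${db.table.name:in("table_name1,table_name2,table_name3"):not()}
excluded = {"table_name1", "table_name2", "table_name3"}

def keep_by_list(table_name):
    return table_name not in excluded

# Mirrors ${db.table.name:startsWith("table_"):not()}
def keep_by_prefix(table_name):
    return not table_name.startswith("table_")

for name in ["table_name1", "customers", "table_orders", "ua_sales"]:
    print(name, keep_by_list(name), keep_by_prefix(name))
```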
03-16-2018
09:48 AM
@hema moger Writes Attributes means that once the processor fetches the files from the remote path, the GetSFTP processor adds the listed write attributes to each flowfile that was pulled from the Remote Path.
Writes Attributes:
| Name | Description |
|---|---|
| filename | The filename is set to the name of the file on the remote server |
| path | The path is set to the path of the file's directory on the remote server. For example, if the <Remote Path> property is set to /tmp, files picked up from /tmp will have the path attribute set to /tmp. If the <Search Recursively> property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to /tmp/abc/1/2/3 |
| file.lastModifiedTime | The date and time that the source file was last modified |
| file.owner | The numeric owner id of the source file |
| file.group | The numeric group id of the source file |
| file.permissions | The read/write/execute permissions of the source file |
| absolute.path | The full/absolute path from where a file was picked up. The current 'path' attribute is still populated, but may be a relative path |
Example: I configured a GetSFTP processor and fetched a file from the remote path. If you do List queue (or) Data Provenance on the GetSFTP processor and click the little i icon at the top left corner of the flowfile, you get Details and Attributes tabs. Go to the Attributes tab and you will see all of the attributes listed above added to the flowfile. As you can see in the screenshot, attributes such as absolute.path and path were added to the flowfile by the GetSFTP processor. We can use those attributes with Expression Language, e.g. ${absolute.path}, for our purposes (a small script sketch is below).
If the answer addressed your question, click the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to these kinds of issues.
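If it helps, here is a small Jython sketch for an ExecuteScript processor placed after GetSFTP that reads a couple of those write attributes; the same values are available to any processor through Expression Language such as ${absolute.path}:

```python
# Runs inside NiFi ExecuteScript (Jython); session, log and REL_SUCCESS are
# bindings provided by the processor.
flowFile = session.get()
if flowFile is not None:
    path = flowFile.getAttribute("path")
    filename = flowFile.getAttribute("filename")
    log.info("Fetched {0} from {1}".format(filename, path))
    session.transfer(flowFile, REL_SUCCESS)
```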