Created 10-13-2017 03:41 PM
Hello, I'm very new to NiFi and to programming. Here is my scenario: I receive different types of nested XML files over HTTP, SFTP, or from local drives. I have to split those XML files by their nested (child) elements. All instances of the same child element should be saved in the same file, and each record needs a primary or unique key so that the relationship between parent and child is known.
EX: A(root)-> B(Child)-> C (Child of B)-> D(Child of C)
A(root)-> B(Child) C (Child of B) D(Child of C) B(Child) C (Child of B) D(Child of C)
Now I need all the B data saved in one table, all the C data in another table, and so on. The relationship between children and parents must be maintained with a unique key (use the existing key if the data has one; otherwise we have to generate unique keys to identify the relationship between the tables).
Same as the below example:
<customer>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
</customer>
I already saw a solution for the same problem at "https://community.hortonworks.com/questions/70087/complex-xml-to-hive-table-using-nifi.html", but I didn't understand it. Can someone walk me through, step by step, how to split the XML into separate files?
Thank you in advance.
Created on 10-22-2017 03:51 AM - edited 08-17-2019 07:34 PM
Hi @Mohan Sure,
We can get the results you expect by using these processors:
EvaluateXQuery // keeps all the required content as attributes of the flowfile
UpdateAttribute // updates the attributes that were extracted by the EvaluateXQuery processor
ReplaceText // replaces the flowfile content with the attributes of the flowfile
PutHDFS // stores the files in HDFS
EvaluateXQuery Configurations:-
Change the existing properties
1. Destination to flowfile-attribute
2. Output: Omit XML Declaration to true
Add new properties by clicking the + sign (property name, then value)
1. author
//author
2. book
//book
3. bookstore
//bookstore
Input:-
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author> <first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>
Output:-
As the screenshot shows, all the content is now held in flowfile attributes (book, bookstore, author).
EvaluateXquery Processor configs screenshot:-
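To make it easier to see what those three properties extract, here is a minimal Python sketch that mimics the result outside NiFi. It is not how EvaluateXQuery is implemented: the function name `extract_attributes` is hypothetical, the stylesheet instruction is left out of the sample, and `root.iter(name)` stands in for the `//name` expression (taking only the first match).

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<bookstore specialty="novel">
<book style="autobiography">
<author> <first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>"""


def extract_attributes(xml_text, names):
    """Return {name: serialized first matching element}, like flowfile attributes."""
    root = ET.fromstring(xml_text)
    attributes = {}
    for name in names:
        # iter() also yields the root itself, so "bookstore" is found too.
        match = next(root.iter(name), None)
        if match is not None:
            # tostring() omits the XML declaration when encoding="unicode".
            attributes[name] = ET.tostring(match, encoding="unicode").strip()
    return attributes


attrs = extract_attributes(XML, ["author", "book", "bookstore"])
for key, value in attrs.items():
    print(key, "=>", value)
```

Each attribute holds the whole element including its children, which is why the next step has to trim the nested parts away.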
UpdateAttribute Processor:-
1.author
${author:replaceAll('<author>([\s\S]+.*)<\/author>','$1')}
This updates the author attribute.
Input to the UpdateAttribute processor:-
<author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author>
Output:-
<first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award>
2.book
${book:replaceAll('<book\s(.*)>[\s\S]+<\/author>([\s\S]+)<\/book>','$1$2')}
Input:-
<book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book>
Output:-
style="autobiography" <price>12</price>
3.bookstore
${bookstore:replaceAll('.*<bookstore\s(.*?)>[\s\S]+.*','$1')}
Input:-
<bookstore specialty="novel"> <book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book> </bookstore>
Output:-
specialty="novel"
Configs:-
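The three replaceAll expressions can be tried outside NiFi. The sketch below is an approximation in Python's `re` module, not NiFi Expression Language: `\1` replaces `$1`, and the helper `normalize` (a hypothetical name) collapses whitespace so the results compare cleanly with the outputs listed above. It assumes the attribute values keep the line breaks of the original file. That matters for the book pattern, because its greedy `(.*)` only stops at `style="autobiography"` when the opening `<book ...>` tag ends its line.

```python
import re

BOOKSTORE = """<bookstore specialty="novel">
<book style="autobiography">
<author> <first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>"""

# Derive the nested values the same way the attributes nest in the flow.
BOOK = BOOKSTORE.split("\n", 1)[1].rsplit("\n", 1)[0]
AUTHOR = BOOK[BOOK.index("<author>"): BOOK.index("</author>") + len("</author>")]


def normalize(text):
    """Collapse runs of whitespace to single spaces (for display only)."""
    return " ".join(text.split())


# Same patterns as in the UpdateAttribute properties above.
clean_author = normalize(
    re.sub(r"<author>([\s\S]+.*)<\/author>", r"\1", AUTHOR))
clean_book = normalize(
    re.sub(r"<book\s(.*)>[\s\S]+<\/author>([\s\S]+)<\/book>", r"\1\2", BOOK))
clean_bookstore = normalize(
    re.sub(r".*<bookstore\s(.*?)>[\s\S]+.*", r"\1", BOOKSTORE))

print(clean_author)
print(clean_book)
print(clean_bookstore)
```

If the XML arrives on a single line, switching the book pattern to a non-greedy `(.*?)` is probably safer, but test it against your own data first.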
ReplaceText Processor:-
Change the property
Replacement Strategy to
Always Replace
and reference one of your attributes (bookstore, book, author) in the Replacement Value property, for example ${bookstore}. This overwrites the existing content of the flowfile with the new content.
Add 2 more ReplaceText processors for the book and author attributes.
Output:-
<first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award>
PutHDFS processor:-
Configure the processor and give the directory name where you want to store the data.
Flow Screenshot:-
For testing purposes I used a GenerateFlowFile processor, but in your case replace it with the source processor that actually receives the XML data.
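The flow above does not generate the parent/child keys asked for in the question. As an illustration of that idea only, here is a small Python sketch (outside NiFi) that groups elements by tag name and gives every row a generated id plus the id of its parent. The function name `split_by_element`, the row layout, and the sequential integer keys are all assumptions made for this example; real data may already carry a usable key.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import defaultdict

CUSTOMER_XML = """<customer>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
</customer>"""


def split_by_element(xml_text):
    """Return {tag: [rows]}; each row has a generated id and its parent's id."""
    root = ET.fromstring(xml_text)
    ids = itertools.count(1)          # generated keys: 1, 2, 3, ...
    tables = defaultdict(list)

    def walk(element, parent_id):
        row_id = next(ids)
        tables[element.tag].append({
            "id": row_id,
            "parent_id": parent_id,   # None for the root element
            "attributes": dict(element.attrib),
            "text": (element.text or "").strip(),
        })
        for child in element:
            walk(child, row_id)

    walk(root, None)
    return dict(tables)


tables = split_by_element(CUSTOMER_XML)
for tag, rows in tables.items():
    print(tag, [(row["id"], row["parent_id"]) for row in rows])
```

Each list in `tables` corresponds to one output file or table (group, site, userline), and `parent_id` is the column that links a row back to its parent.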
Created 10-21-2017 09:14 PM
@Shu Thank you for the response. Here is my input file (xml.xml), and I have attached my expected output files: output-author.txt, output-books.txt, and output-bookstore.txt. I hope this makes my problem clear.
Created 10-21-2017 09:17 PM
@Shu: I created the output as .txt files only to make it easier to understand. I actually need to store the data in Hive or in an HDFS location.
Created 10-21-2017 09:37 PM
@Shu
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author> <first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>