Split each XML attribute into separate tables stored in Hive using NiFi

Explorer

Hello, I'm very new to NiFi and to programming. Here is my scenario: I receive different types of nested XML files over HTTP or SFTP, or from local drives. I have to split those XML files by their nested (child) elements. All instances of the same child element have to be saved in the same file, and each needs a primary key or unique key so that the relationship between parent and child is known.

Example: A (root) -> B (child) -> C (child of B) -> D (child of C)

A (root)
  B (child)
    C (child of B)
      D (child of C)
  B (child)
    C (child of B)
      D (child of C)

Now I need all of the B data saved in one table, all of the C data in another table, and so on. The relationship between child and parent must be maintained with a unique key (use an existing key if there is one; otherwise we have to generate unique keys to identify the relationship between the tables).

Same as the below example:

<customer>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
</customer>
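To make the requirement concrete, here is a minimal Python sketch (not NiFi) of one way the parent/child relationship could be kept: every element gets a generated key and carries the key of its parent, and rows are grouped per element name. The function name `split_by_element` and the column names `row_id` and `parent_row_id` are assumptions made up for this illustration.

```python
import itertools
import xml.etree.ElementTree as ET

XML = """<customer>
<group>
<site>
<userline></userline>
<userline></userline>
<userline></userline>
</site>
</group>
<group>
<site>
<userline></userline>
<userline></userline>
<userline></userline>
</site>
</group>
</customer>"""


def split_by_element(xml_text):
    """Return {tag: [rows]}; each row holds a generated key and its parent's key."""
    counter = itertools.count(1)  # generated keys, assumed when the data has none
    tables = {}

    def visit(node, parent_row_id):
        row_id = next(counter)
        row = dict(node.attrib)          # XML attributes become columns
        text = (node.text or "").strip()
        if text:
            row["text"] = text
        row["row_id"] = row_id           # hypothetical column names
        row["parent_row_id"] = parent_row_id
        tables.setdefault(node.tag, []).append(row)
        for child in node:
            visit(child, row_id)

    visit(ET.fromstring(xml_text), None)
    return tables


tables = split_by_element(XML)
for tag, rows in tables.items():
    print(tag, rows)
```

Each key of `tables` would correspond to one output file or table, and `parent_row_id` is what lets a userline row be joined back to its site.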

I have already seen a solution to the same problem at "https://community.hortonworks.com/questions/70087/complex-xml-to-hive-table-using-nifi.html", but I didn't understand it. Can someone help me with step-by-step instructions to split the XML into separate files?

Thank you in advance.

1 ACCEPTED SOLUTION

Master Guru

Hi @Mohan Sure,

You can get the results you expect by using the following processors:

EvaluateXQuery //keeps all the required content as attributes of the flowfile.
UpdateAttribute //updates the attribute values that were extracted by the EvaluateXQuery processor.
ReplaceText //replaces the flowfile content with the flowfile attributes.
PutHDFS //stores the files in HDFS.

EvaluateXQuery configuration:-

Change the existing properties

1.Destination to

flowfile-attribute

2.Output: Omit XML Declaration to

true

Add new properties by clicking the + sign

1.author

//author

2.book

//book

3.bookstore

//bookstore

Input:-

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author> 
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>

Output:-

41465-attrjs.png

As the screenshot shows, all the content is now held in flowfile attributes (book, bookstore, author).

EvaluateXQuery processor configuration screenshot:-

41464-evaluatexquery.png
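For readers who cannot see the screenshots, the extraction step can be approximated outside NiFi. This is only a Python sketch using the standard library, not the processor itself: `extract_attributes` is a made-up helper, the dictionary stands in for the flowfile attributes, and only the first match per element name is kept.

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>"""


def extract_attributes(xml_text, names):
    """Serialize the first element found for each name, similar in spirit to //name."""
    root = ET.fromstring(xml_text)
    attributes = {}
    for name in names:
        # iter() visits the root and every descendant, so the root can match too.
        node = next(root.iter(name), None)
        if node is not None:
            # strip() drops the whitespace that followed the element in the source.
            attributes[name] = ET.tostring(node, encoding="unicode").strip()
    return attributes


attrs = extract_attributes(XML, ["author", "book", "bookstore"])
for key, value in attrs.items():
    print(key, "=>", value.splitlines()[0])
```

Note that each value still contains the full serialized element, nested children included, which is why the next step trims them.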

UpdateAttribute processor:-

1.author

${author:replaceAll('<author>([\s\S]+.*)<\/author>','$1')}

This updates the author attribute.

Input to the UpdateAttribute processor:-

<author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author>

Output:-

<first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award>

2.book

${book:replaceAll('<book\s(.*)>[\s\S]+<\/author>([\s\S]+)<\/book>','$1$2')}

Input:-

<book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book>

Output:-

style="autobiography" <price>12</price>

3.bookstore

${bookstore:replaceAll('.*<bookstore\s(.*?)>[\s\S]+.*','$1')}

Input:-

<bookstore specialty="novel"> <book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book> </bookstore>

Output:-

specialty="novel"
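The three expressions can be tried outside NiFi with the same patterns. The sketch below uses Python's `re` module as a stand-in (with `\1` in place of `$1`, and `/` left unescaped) and assumes the attribute values keep the line breaks of the source XML; `squash` is a made-up helper that only collapses whitespace for display.

```python
import re

# Attribute values as they might arrive from EvaluateXQuery, line breaks included.
author = (
    "<author>\n"
    "<first-name>Joe</first-name>\n"
    "<last-name>Bob</last-name>\n"
    "<award>Trenton Literary Review Honorable Mention</award>\n"
    "</author>"
)
book = '<book style="autobiography">\n' + author + "\n<price>12</price>\n</book>"
bookstore = '<bookstore specialty="novel">\n' + book + "\n</bookstore>"


def squash(text):
    """Collapse runs of whitespace so the results are easy to compare."""
    return " ".join(text.split())


author_out = re.sub(r"<author>([\s\S]+.*)</author>", r"\1", author)
book_out = re.sub(r"<book\s(.*)>[\s\S]+</author>([\s\S]+)</book>", r"\1\2", book)
bookstore_out = re.sub(r".*<bookstore\s(.*?)>[\s\S]+.*", r"\1", bookstore)

print(squash(author_out))
print(squash(book_out))
print(squash(bookstore_out))

# Caveat: on a value flattened to one line, the greedy (.*) in the book pattern
# runs past the opening tag. Bounding it with [^>]* (a suggested variation,
# not part of the original answer) keeps the capture inside the tag.
flat_book = squash(book)
greedy = re.sub(r"<book\s(.*)>[\s\S]+</author>([\s\S]+)</book>", r"\1\2", flat_book)
bounded = re.sub(r"<book\s([^>]*)>[\s\S]+</author>([\s\S]+)</book>", r"\1\2", flat_book)
print(squash(greedy))
print(squash(bounded))
```

The book expression depends on `.` not matching line breaks, so if your values ever arrive on a single line, the bounded form may be the safer choice.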

Configs:-

41466-update-attr.png

ReplaceText processor:-

Change the Replacement Strategy property to Always Replace, and reference one of your attributes (bookstore, book, author) in this processor. This overwrites the existing content of the flowfile with the attribute value.

41468-replace-text.png

Add two more ReplaceText processors for the book and author attributes.

Output:-

<first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award>

PutHDFS processor:-

Configure the processor and give the directory name where you want to store the data.
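As a rough picture of what the three ReplaceText processors plus PutHDFS achieve together, the Python sketch below writes each attribute value to its own file. It is only an illustration: a temporary local directory stands in for HDFS, and `write_split_outputs` and the one-subdirectory-per-attribute layout are assumptions, not something the processors impose.

```python
import tempfile
from pathlib import Path


def write_split_outputs(attributes, target_dir):
    """Write each attribute value to <target_dir>/<name>/<name>.txt and return the paths."""
    target = Path(target_dir)
    written = {}
    for name, value in attributes.items():
        # Assumed layout: one directory per element, so each could back its own table.
        table_dir = target / name
        table_dir.mkdir(parents=True, exist_ok=True)
        path = table_dir / f"{name}.txt"
        path.write_text(value + "\n", encoding="utf-8")
        written[name] = path
    return written


# Values as they look after the UpdateAttribute step.
attributes = {
    "bookstore": 'specialty="novel"',
    "book": 'style="autobiography" <price>12</price>',
    "author": "<first-name>Joe</first-name> <last-name>Bob</last-name> "
              "<award>Trenton Literary Review Honorable Mention</award>",
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_split_outputs(attributes, tmp)
    contents = {name: p.read_text(encoding="utf-8") for name, p in paths.items()}
    relative = {name: p.relative_to(tmp).as_posix() for name, p in paths.items()}

for name in contents:
    print(relative[name], "->", contents[name].strip())
```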

Flow Screenshot:-

41469-flow.png

For testing purposes I used a GenerateFlowFile processor; in your case, replace it with whichever source processor delivers your XML data.

4 REPLIES 4

Explorer

@Shu Thank you for the response. Here is my input file (xml.xml), and I have attached my expected output files: output-author.txt, output-books.txt, output-bookstore.txt. I hope this makes my problem clear.

Explorer

@Shu: I created the output as .txt files only to make it easier to understand. I actually need to store the data in Hive or in an HDFS location.

Explorer

@Shu

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>
