Split each xml attribute into separate tables stores in hive using nifi..

New Contributor

Hello, I'm very new to NiFi and to programming. Here is my scenario: I receive different types of nested XML files over HTTP, SFTP, or from local drives. I have to split those XML files by their nested (child) elements. Elements of the same type should be saved in the same file, and each needs a primary or unique key so that the relationship between parent and child is known.

EX: A(root)-> B(Child)-> C (Child of B)-> D(Child of C)

A(root)-> B(Child) C (Child of B) D(Child of C) B(Child) C (Child of B) D(Child of C)

Now I need the data for all the B's saved in one table, all the C's in another table, and so on. The relationship between child and parent must be maintained with a unique key (use an existing one if the data has it; otherwise we have to generate unique keys to identify the relationship between the tables).

Same as the below example:

<customer>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
  <group>
    <site>
      <userline></userline>
      <userline></userline>
      <userline></userline>
    </site>
  </group>
</customer>

I have already seen a solution to the same problem at "https://community.hortonworks.com/questions/70087/complex-xml-to-hive-table-using-nifi.html" but I didn't understand it. Can someone walk me through, step by step, how to split the XML into separate files?

Thank you in Advance..

Accepted Solutions

Re: Split each xml attribute into separate tables stores in hive using nifi..

Super Guru

Hi @Mohan Sure,

You can get the results you expect with the following processors:

EvaluateXQuery //extracts the required content into flowfile attributes.
UpdateAttribute //cleans up the attribute values extracted by the EvaluateXQuery processor.
ReplaceText //replaces the flowfile content with an attribute value.
PutHDFS //stores the files in HDFS.

EvaluateXQuery Configurations:-

Change the existing properties

1. Destination to flowfile-attribute
2. Output: Omit XML Declaration to true

Add new properties by clicking + sign

1. author = //author
2. book = //book
3. bookstore = //bookstore
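To check what those three properties select before building the flow, here is a rough Python stand-in using only the standard library. It is a sketch, not how NiFi evaluates XQuery: `extract_as_attributes` is a hypothetical helper name, it keeps only the first match for each name, and the sample omits the XML declaration and stylesheet line for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<bookstore specialty="novel">
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>"""


def extract_as_attributes(xml_text, names=("author", "book", "bookstore")):
    """Hypothetical helper: approximate //name by serializing the first matching element."""
    root = ET.fromstring(xml_text)
    attributes = {}
    for name in names:
        # root.iter() visits the root itself and every descendant, similar to //name.
        element = next(root.iter(name), None)
        if element is not None:
            # strip() drops the trailing whitespace that follows the element in the source.
            attributes[name] = ET.tostring(element, encoding="unicode").strip()
    return attributes


if __name__ == "__main__":
    for key, value in extract_as_attributes(SAMPLE).items():
        print(key, "=>", value)
```

Each value is the whole element including its children, which is why the later UpdateAttribute step has to trim the nested parts away.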

Input:-

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author> 
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>

Output:-

41465-attrjs.png

As you can see in the screenshot, all the content is now available as flowfile attributes (book, bookstore, author).

EvaluateXquery Processor configs screenshot:-

41464-evaluatexquery.png

Update Attribute Processor:-

1.author

${author:replaceAll('<author>([\s\S]+.*)<\/author>','$1')}

This updates the author attribute by stripping the enclosing <author> tags.

Input to the UpdateAttribute processor:-

<author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author>

Output:-

<first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award>

2.book

${book:replaceAll('<book\s(.*)>[\s\S]+<\/author>([\s\S]+)<\/book>','$1$2')}

Input:-

<book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book>

Output:-

style="autobiography" <price>12</price>

3.bookstore

${bookstore:replaceAll('.*<bookstore\s(.*?)>[\s\S]+.*','$1')}

Input:-

<bookstore specialty="novel"> <book style="autobiography"> <author> <first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award> </author> <price>12</price> </book> </bookstore>

Output:-

specialty="novel"

Configs:-

41466-update-attr.png
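The three expressions can be tried outside NiFi with a small Python sketch. It assumes the attribute values keep the line breaks of the source XML, and it uses Python's `re.sub`, which writes back-references as `\1` instead of `$1`. The function names and the `squash` helper are made up for illustration.

```python
import re

# Multi-line attribute values, assuming EvaluateXQuery keeps the source line breaks.
AUTHOR = (
    "<author>\n"
    "<first-name>Joe</first-name>\n"
    "<last-name>Bob</last-name>\n"
    "<award>Trenton Literary Review Honorable Mention</award>\n"
    "</author>"
)
BOOK = '<book style="autobiography">\n' + AUTHOR + "\n<price>12</price>\n</book>"
BOOKSTORE = '<bookstore specialty="novel">\n' + BOOK + "\n</bookstore>"


def clean_author(value):
    # Same pattern as the author property: keep everything between the <author> tags.
    return re.sub(r"<author>([\s\S]+.*)<\/author>", r"\1", value)


def clean_book(value):
    # Keep the <book> tag's attributes and whatever follows </author>.
    return re.sub(r"<book\s(.*)>[\s\S]+<\/author>([\s\S]+)<\/book>", r"\1\2", value)


def clean_bookstore(value):
    # Keep only the attributes of the <bookstore> tag.
    return re.sub(r".*<bookstore\s(.*?)>[\s\S]+.*", r"\1", value)


def squash(text):
    # Collapse whitespace so results can be compared with the one-line outputs above.
    return " ".join(text.split())


if __name__ == "__main__":
    print(squash(clean_author(AUTHOR)))
    print(squash(clean_book(BOOK)))
    print(squash(clean_bookstore(BOOKSTORE)))
```

One thing the sketch makes visible: the greedy `(.*)` in the book pattern stops at the first line break, so it captures only the style attribute when the value is multi-line. On a value flattened to a single line it captures much more, so it is worth checking how your attribute values actually look.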

ReplaceText Processor:-

Change the Replacement Strategy property to

Always Replace

and reference one of your attributes (bookstore, book, author) in this processor, for example ${bookstore} as the Replacement Value. This overwrites the existing content of the flowfile with the attribute value.

41468-replace-text.png

Add two more ReplaceText processors, one each for the book and author attributes.

Output:-

<first-name>Joe</first-name> <last-name>Bob</last-name> <award>Trenton Literary Review Honorable Mention</award>

PutHDFS processor:-

Configure the processor and give the directory name where you want to store the data.
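The idea behind the three ReplaceText and PutHDFS branches is that each attribute ends up as the content of its own file in its own directory, which a Hive table can then point at. A minimal local sketch of that fan-out, with a temporary directory standing in for HDFS; `write_branches` and the file name `part-0001.txt` are made up for illustration:

```python
import tempfile
from pathlib import Path


def write_branches(attributes, base_dir):
    """Hypothetical stand-in for three ReplaceText -> PutHDFS branches."""
    written = {}
    for name, value in attributes.items():
        target_dir = Path(base_dir) / name           # one directory per future table
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / "part-0001.txt"        # made-up file name
        target.write_text(value, encoding="utf-8")   # content is just the attribute value
        written[name] = target
    return written


if __name__ == "__main__":
    cleaned = {
        "bookstore": 'specialty="novel"',
        "book": 'style="autobiography" <price>12</price>',
        "author": "<first-name>Joe</first-name> <last-name>Bob</last-name> "
                  "<award>Trenton Literary Review Honorable Mention</award>",
    }
    with tempfile.TemporaryDirectory() as base:
        for name, path in write_branches(cleaned, base).items():
            print(name, "->", path.read_text(encoding="utf-8"))
```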

Flow Screenshot:-

41469-flow.png

For testing purposes I used a GenerateFlowFile processor. In your case, replace it with the source processor that actually receives this XML data.

Re: Split each xml attribute into separate tables stores in hive using nifi..

New Contributor

@Shu Thank you for the response. Here is my input file (xml.xml), and I have attached my expected output files: output-author.txt, output-books.txt, output-bookstore.txt. I hope this makes my problem clear.

Re: Split each xml attribute into separate tables stores in hive using nifi..

New Contributor

@Shu: I created the output as .txt files only to make it easier to understand. What I actually need is to store the data in Hive or in an HDFS location.

Re: Split each xml attribute into separate tables stores in hive using nifi..

New Contributor

@Shu

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
</bookstore>
