How to merge many json files together using one common field?

Explorer

Hello everyone!

I have many JSON files like this:

{
"table_name" : "train_vd",
"data" : [ {
"battery_power" : 1954,
"clock_speed" : 0.5
} ]
}

 

{
"table_name" : "train_vd",
"data" : [ {
"battery_power" : 842,
"clock_speed" : 2.2
} ]
}

...

 

I tried the MergeContent and MergeRecord processors with table_name as the Correlation Attribute Name (I have a ${table_name} attribute). However, this does not work, and the result is as follows:

[{
"table_name" : "train_vd",
"data" : [ {
"battery_power" : 509,
"clock_speed" : 0.6
} ]
}{
"table_name" : "train_vd",
"data" : [ {
"battery_power" : 842,
"clock_speed" : 2.2
} ]
}]

...

 

However, I want to get the following result:

[{
"table_name" : "train_vd",
"data" : [ {
"battery_power" : 509,
"clock_speed" : 0.6
},

{
"battery_power" : 842,
"clock_speed" : 2.2
}]
}]

Could you tell me how to solve this problem? Do I need to use a complex Jolt transformation, or do I need to configure the incoming Avro schema in the MergeRecord processor so that everything is then combined on a single field?
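To make the desired result concrete, here is a minimal sketch in plain Python of the transformation being asked for: documents that share a table_name are collapsed into one object whose "data" array holds all of their records. This is not a NiFi configuration; the function name merge_by_table_name is hypothetical and exists only for illustration.

```python
import json

def merge_by_table_name(documents):
    """Group documents by "table_name" and concatenate their "data" arrays."""
    merged = {}  # table_name -> merged document (dicts keep insertion order)
    for doc in documents:
        name = doc["table_name"]
        entry = merged.setdefault(name, {"table_name": name, "data": []})
        entry["data"].extend(doc.get("data", []))
    return list(merged.values())

# Two input documents shaped like the ones in the question.
docs = [
    json.loads('{"table_name": "train_vd", "data": [{"battery_power": 509, "clock_speed": 0.6}]}'),
    json.loads('{"table_name": "train_vd", "data": [{"battery_power": 842, "clock_speed": 2.2}]}'),
]

result = merge_by_table_name(docs)
print(json.dumps(result, indent=2))
```

The output is a single-element array with one "train_vd" object containing both records, which matches the structure shown above.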


2 REPLIES

Super Mentor

@Protector 

Do all your JSON FlowFiles have a FlowFile attribute on them for "table_name"? MergeContent does not pull table_name from the FlowFile content (your JSON content) itself.

The Correlation Attribute Name property in the MergeContent processor looks for this FlowFile attribute on each incoming FlowFile and allocates FlowFiles that have the same value assigned to that attribute to the same bin. A bin is then merged when it meets the other configured minimums on MergeContent, when the max bin age is reached, or when all bins have files allocated to them and another bin is needed, which forces the merge of the oldest bin.
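The binning idea described above can be sketched in plain Python. This is a simplified model, not NiFi's actual implementation: the FlowFile class and bin_by_attribute function are hypothetical names, and grouping FlowFiles that lack the attribute under a None key is an assumption of the sketch, not documented MergeContent behavior. The point it illustrates is that the bin is chosen from the attribute map, never from the content.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: opaque content plus an attribute map."""
    content: str
    attributes: dict = field(default_factory=dict)

def bin_by_attribute(flowfiles, correlation_attribute):
    """Group FlowFiles by the value of one attribute; content is never read."""
    bins = defaultdict(list)
    for ff in flowfiles:
        # Assumption of this sketch: a missing attribute maps to the None bin.
        key = ff.attributes.get(correlation_attribute)
        bins[key].append(ff)
    return dict(bins)

flowfiles = [
    FlowFile('{"table_name": "train_vd"}', {"table_name": "train_vd"}),
    # The content says "other", but the attribute decides the bin.
    FlowFile('{"table_name": "other"}', {"table_name": "train_vd"}),
    FlowFile('{"table_name": "test_vd"}', {"table_name": "test_vd"}),
]

bins = bin_by_attribute(flowfiles, "table_name")
for key, members in bins.items():
    print(key, len(members))
```

In this model the second FlowFile lands in the "train_vd" bin even though its JSON content names a different table, which is why the attribute must be set on every FlowFile before the merge.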

 

If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post.

Thank you,

Matt

Explorer

Thank you for your answer! All my JSON FlowFiles have a FlowFile attribute on them for "table_name". There may be a problem with the JSON schema itself. The task has now changed, so I have created a new question about Jolt.

https://community.cloudera.com/t5/Support-Questions/Jolt-transform/td-p/330850

 

If you know the answer to it, I would be very grateful!