Support Questions

Find answers, ask questions, and share your expertise

How can I sort record in parquet file?

avatar
Explorer

Hello,

I'm triyng to sort record in parquet file like below in order of timestamp.

{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:58.440995\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message AAA\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:58.429579\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message BBB\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:08.441709\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message CCC\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:08.428501\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message DDD\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:48.440624\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message EEE\"}

I confugured the following query in QueryRecord processor and it passed validation in the processor.

SELECT * from flowfile ORDER BY JSON_EXTRACT(Src_obj__event_metadata, "$.timestamp") ASC

But when I run it, it failed with the following error.
It seems to fail to find timestamp field.

errorerror

Could someone please point out what is wrong with my query?

Thanks.

1 ACCEPTED SOLUTION

avatar
Expert Contributor

JSON_EXTRACT function in the QueryRecord processor may not be interpreting Src_obj__event_metadata as a JSON object. Instead, it likely sees Src_obj__event_metadata as a plain string, so it cannot directly access the "$.timestamp" field. 

We may need to use EvaluateJsonPath processor first 

to extract timestamp from Src_obj__event_metadata into a new attribute:

Destination: flowfile-content

Return Type: json

JSON Path Expression: Use the following configuration in the Properties tab:

Property   Value

timestamp   $.Src_obj__event_metadata.timestamp

Once we extracted timestamp as a separate column, then we could call it directly in QueryRecord processor:

SELECT *
FROM flowfile
ORDER BY timestamp ASC

View solution in original post

10 REPLIES 10

avatar
Community Manager

Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts @MattWho @ckumar @SAMSAL  who may be able to assist you further.

Please feel free to provide any additional information or details about your query. We hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Expert Contributor

The field Src_obj__event_metadata is a JSON string, so to access fields within it, you might need to parse it into a JSON object first. Some systems may require you to explicitly parse JSON strings before extracting fields.

Please try:

SELECT *
FROM flowfile
ORDER BY CAST(JSON_EXTRACT(Src_obj__event_metadata, "$.timestamp") AS TIMESTAMP) ASC

avatar
Explorer

@ywu 
Thank you for the advice.
I tried with the query you suggested but the situation has not changed.

スクリーンショット 2024-11-07 151704.png

Thanks.

avatar
Expert Contributor

If CAST and JSON_PARSE functions are not supported in the Nifi processor you're using, we may try extracting the timestamp value as a string and sorting alphabetically

SELECT *

FROM flowfile

ORDER BY JSON_EXTRACT_SCALAR(Src_obj__event_metadata, "$.timestamp") ASC

avatar
Explorer

@ywu 
Thank you for your prompto reply.
I tried with that query but unfortunately the same result.
Both of 2 queries you provided passed the validation in the processor.

スクリーンショット 2024-11-07 154049.png

Thanks.

 

avatar
Expert Contributor

JSON_EXTRACT function in the QueryRecord processor may not be interpreting Src_obj__event_metadata as a JSON object. Instead, it likely sees Src_obj__event_metadata as a plain string, so it cannot directly access the "$.timestamp" field. 

We may need to use EvaluateJsonPath processor first 

to extract timestamp from Src_obj__event_metadata into a new attribute:

Destination: flowfile-content

Return Type: json

JSON Path Expression: Use the following configuration in the Properties tab:

Property   Value

timestamp   $.Src_obj__event_metadata.timestamp

Once we extracted timestamp as a separate column, then we could call it directly in QueryRecord processor:

SELECT *
FROM flowfile
ORDER BY timestamp ASC

avatar
Explorer

Hello @ywu ,

Thank you for your advice.
Yes, as you noted I don't think JSON_EXTRACT is working as expected in QueryRecord processor.

I converted my parquet file to json format and tried with EvaluateJsonPath processor as you advised.

When I set $.Src_obj__event_metadata.timestamp as timestamp, EvaluateJsonPath processor terminated by unmatched relationship.

スクリーンショット 2024-11-11 175858.png

 

When I changed timestamp value to .Src_obj__event_metadata.timestamp, only [] was genarated as output.

[]


When I changed the value to .Src_obj__event_metadata, the following output was genarated.

["{\"timestamp\":\"2024-11-01T00:23:58.440995\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:58.429579\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:08.441709\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:08.428501\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:48.440624\",\"severity\":\"Info\"}"]

From these results, it seems we need another consideration to correctly specify timestamp field.
Do you have any idea or insights?

Regards,

avatar
Community Manager

@tono425, I noticed that @ywu's response has helped resolve your issue. If it did, please mark the relevant reply as a solution, as it will help others find the answer more easily in the future.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Explorer

Hello,

After trials and errors, I succeeded to sort records by the following steps.


1. Convert flow file to json by ConvertRecord processor
Convert flow file from parquet format to json format so that we can modify it in next step.


2. Format records by ExecuteStreamCommand processor
As double quotation marks outside the {} were preventing us from processing records as json, I removed them and escape characters by sed then sorted records by jq from script.

Sample script

#!/bin/bash

/usr/bin/sed -e 's/\"{/{/g' -e 's/}\"/}/g' -e 's/\\"/"/g' $1 | /usr/bin/jq '. | sort_by(.Src_obj__event_metadata.timestamp)'


3. Convert flow file to parquet by ConvertRecord processor
Convert flow file from json format to parquet format.

 

@ywu 
Your advice was very helpful in resolving the issue.
Thanks a lot.