Support Questions

Find answers, ask questions, and share your expertise

How can I sort record in parquet file?

avatar
Explorer

Hello,

I'm triyng to sort record in parquet file like below in order of timestamp.

{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:58.440995\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message AAA\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:58.429579\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message BBB\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:08.441709\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message CCC\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:08.428501\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message DDD\"}
{"Src_obj__event_metadata":"{\"timestamp\":\"2024-11-01T00:23:48.440624\",\"severity\":\"Info\"}","Src_obj__user_data":"{\"message\":\"Message EEE\"}

I confugured the following query in QueryRecord processor and it passed validation in the processor.

SELECT * from flowfile ORDER BY JSON_EXTRACT(Src_obj__event_metadata, "$.timestamp") ASC

But when I run it, it failed with the following error.
It seems to fail to find timestamp field.

errorerror

Could someone please point out what is wrong with my query?

Thanks.

7 REPLIES 7

avatar
Community Manager

Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts @MattWho @ckumar @SAMSAL  who may be able to assist you further.

Please feel free to provide any additional information or details about your query. We hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Expert Contributor

The field Src_obj__event_metadata is a JSON string, so to access fields within it, you might need to parse it into a JSON object first. Some systems may require you to explicitly parse JSON strings before extracting fields.

Please try:

SELECT *
FROM flowfile
ORDER BY CAST(JSON_EXTRACT(Src_obj__event_metadata, "$.timestamp") AS TIMESTAMP) ASC

avatar
Explorer

@ywu 
Thank you for the advice.
I tried with the query you suggested but the situation has not changed.

スクリーンショット 2024-11-07 151704.png

Thanks.

avatar
Expert Contributor

If CAST and JSON_PARSE functions are not supported in the Nifi processor you're using, we may try extracting the timestamp value as a string and sorting alphabetically

SELECT *

FROM flowfile

ORDER BY JSON_EXTRACT_SCALAR(Src_obj__event_metadata, "$.timestamp") ASC

avatar
Explorer

@ywu 
Thank you for your prompto reply.
I tried with that query but unfortunately the same result.
Both of 2 queries you provided passed the validation in the processor.

スクリーンショット 2024-11-07 154049.png

Thanks.

 

avatar
Expert Contributor

JSON_EXTRACT function in the QueryRecord processor may not be interpreting Src_obj__event_metadata as a JSON object. Instead, it likely sees Src_obj__event_metadata as a plain string, so it cannot directly access the "$.timestamp" field. 

We may need to use EvaluateJsonPath processor first 

to extract timestamp from Src_obj__event_metadata into a new attribute:

Destination: flowfile-content

Return Type: json

JSON Path Expression: Use the following configuration in the Properties tab:

Property   Value

timestamp   $.Src_obj__event_metadata.timestamp

Once we extracted timestamp as a separate column, then we could call it directly in QueryRecord processor:

SELECT *
FROM flowfile
ORDER BY timestamp ASC

avatar
Explorer

Hello @ywu ,

Thank you for your advice.
Yes, as you noted I don't think JSON_EXTRACT is working as expected in QueryRecord processor.

I converted my parquet file to json format and tried with EvaluateJsonPath processor as you advised.

When I set $.Src_obj__event_metadata.timestamp as timestamp, EvaluateJsonPath processor terminated by unmatched relationship.

スクリーンショット 2024-11-11 175858.png

 

When I changed timestamp value to .Src_obj__event_metadata.timestamp, only [] was genarated as output.

[]


When I changed the value to .Src_obj__event_metadata, the following output was genarated.

["{\"timestamp\":\"2024-11-01T00:23:58.440995\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:58.429579\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:08.441709\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:08.428501\",\"severity\":\"Info\"}","{\"timestamp\":\"2024-11-01T00:23:48.440624\",\"severity\":\"Info\"}"]

From these results, it seems we need another consideration to correctly specify timestamp field.
Do you have any idea or insights?

Regards,