Member since: 09-18-2018
Posts: 92
Kudos Received: 5
Solutions: 0
07-24-2019
07:29 PM
I've verified that if the key is set to a field at the root level of my JSON, then I am able to update/upsert properly. For example, I tried this with the evt_type field and it worked fine. So my only question is how to handle a field that is one level down in the JSON, as is the case with stdds_id: {"evt_data": {"stdds_id": "stdds_id_value", ...}}
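One avenue I'm considering is a sketch along these lines — assuming PutMongo exposes an Update Query property in this NiFi version and that it evaluates Expression Language against flowfile attributes (the STDDS_ID attribute would come from EvaluateJsonPath, as described in my earlier post below):

    Mode:              update
    Upsert:            true
    Update Query Key:  (leave empty)
    Update Query:      { "evt_data.stdds_id": "${STDDS_ID}" }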
07-24-2019
07:16 PM
By the way, if I need to, I am able to use EvaluateJsonPath to extract the stdds_id and place it as an attribute on the flowfile (see STDDS_ID). I'd use STDDS_ID as the key into Mongo, but I'm not sure how to add it to the flowfile content so that I can update/upsert using this key.
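For reference, the EvaluateJsonPath setup is roughly the following (a sketch only; STDDS_ID is the user-defined/dynamic property holding the JsonPath):

    Destination:   flowfile-attribute
    Return Type:   auto-detect
    STDDS_ID:      $.evt_data.stdds_id    (user-defined property)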
07-24-2019
05:57 PM
I have a sequence of flowfiles that I need to put to Mongo; a sample is at the bottom of this question. The flowfile contains JSON with a field "evt_data": {"stdds_id": "stdds_id_value", ...}. I need that stdds_id_value to be the key for the update/upsert into Mongo, and I'm looking for help with the flow configuration that will make this work. My current PutMongo configuration (a non-working attempt) is this: I'm trying to put a record that is keyed on the evt_data.stdds_id value, as seen in the following flowfile JSON; in other words, the key for the document in Mongo would be "TARGET-STDDS-KCLT-157481920517". This doesn't work: I end up with multiple documents in Mongo for the same key, when I should only see one per distinct key. What is the proper way to set the update key?
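In mongo shell terms, the behavior I'm after is roughly this (a sketch; the collection name "events" is a placeholder):

    // Upsert keyed on the nested stdds_id; "events" is a placeholder collection name
    db.events.update(
      { "evt_data.stdds_id": "TARGET-STDDS-KCLT-157481920517" },  // match on the nested id
      { $set: { /* fields from the flowfile JSON */ } },          // fields to write
      { upsert: true }                                            // insert if no match exists
    )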
Labels:
- Apache NiFi
07-23-2019
06:51 PM
PutMongo does not like my $or
07-23-2019
05:38 PM
I need to do an update/upsert into Mongo. Essentially, the command I need to run is the following (this works in the mongo command-line client): Notice that in the first update I search for the document that matches the specified STDDS_DST_ID. In the second update, I match any of several IDs, including the one that was already matched. In this simple example I have a set of linked IDs: TFM_ID, FVIEW_ID, STDDS_DST_ID. The set of linked IDs is unique; that per-set distinction guarantees that you won't find STDDS_DST_ID 100 associated with another FVIEW_ID or TFM_ID. You'll only find it with FVIEW_ID 3000 and TFM_ID 300000. So, assuming that I have a flowfile that contains some number of fields (e.g. fld1, fld2, fld3) and one or more of the IDs TFM_ID, STDDS_DST_ID, FVIEW_ID, how can I configure PutMongo so that it will update/upsert the appropriate document (the one that matches one or more of these IDs)? Again, in my mind, PutMongo simply needs to be configured consistent with the update you see in the image above; I just don't have much experience with PutMongo. Looking at the documentation, I believe I must do the following:
- Set Mode to update
- Set Upsert to true
- Leave Update Query Key empty
- Set the Update Query to something such as the first argument in the sample command: { $or: [{"TFM_ID": "300000"}, {"FVIEW_ID": "3000"}, {"STDDS_DST_ID": "100"}]}
- Set the flowfile content to the data to place in the document (including the $set): {$set: {"fld2": "fld2_val", "fld3": "fld3_val", "TFM_ID": "300000", "FVIEW_ID": "3000", "STDDS_DST_ID": "100"}}
Are my assumptions on the mark?
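For reference, a mongo shell sketch of the upsert described above, reconstructed from the pieces in this post (the collection name "tracks" is a placeholder):

    // Reconstructed sketch of the described update/upsert
    db.tracks.update(
      // match a document holding any of the linked IDs
      { $or: [ { "TFM_ID": "300000" }, { "FVIEW_ID": "3000" }, { "STDDS_DST_ID": "100" } ] },
      // set the fields carried by the flowfile
      { $set: { "fld2": "fld2_val", "fld3": "fld3_val",
                "TFM_ID": "300000", "FVIEW_ID": "3000", "STDDS_DST_ID": "100" } },
      { upsert: true }
    )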
Labels:
- Apache NiFi
07-16-2019
04:23 PM
Thanks @Nico Verwer. I'll give that a shot, and will be sure to accept your response as an answer, once I verify.
07-04-2019
10:29 AM
Hi @Shu. Could you please explain what sysdate, current_date, etc. would do for me with the Spark job? I don't fully understand how to use them or the benefits that this technique would offer.
07-03-2019
01:00 PM
@Shu I like your idea of creating daily archives (Option 3 above). How do I ensure that the Spark jobs I create to process those daily files run on the datanode they are stored on? Does YARN do this by default? I've not yet used YARN; I've only used HDFS. I am hoping to eventually use Kubernetes (k8s).
07-03-2019
12:56 PM
Hi @Shu. Thank you very much for your thoughts; this is the kind of feedback that I was hoping for. I'll absolutely do my best to understand your recommendation. It sounds like I am not completely off-base in the way that I hope to use HDFS, and that you are confirming I must figure out how to accumulate large files prior to driving them into HDFS. I will look at the tools and methods that you suggest. Thanks for your insights.
07-02-2019
06:29 PM
I've used the PutHDFS processor as I've started to understand how to deal with big data environments. Up until now I've been putting very small files into HDFS, which seems to be architecturally bad practice: the HDFS block size defaults to 128 MB, and the Hadoop community recommendation seems to be that applications writing to HDFS should produce files that are gigabytes, or even terabytes, in size. I'm trying to understand how to do this with NiFi. Part of my concern is for the data analysts: what is the best way to logically structure files so that they are appropriate for HDFS? Currently the files that I am writing contain small JSON objects or lists. I use MergeRecord to intentionally make the files I write larger; however, my JSON objects accumulate fast, potentially thousands of JSON records per second. For the big data/NiFi experts, I'd appreciate any thoughts on the best way to use NiFi to support streaming large data objects into HDFS.
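A sketch of the kind of MergeRecord settings I'm referring to (the values and the reader/writer controller services here are illustrative placeholders, not my actual configuration):

    MergeRecord (illustrative values only)
      Record Reader:             JsonTreeReader         (assumed controller service)
      Record Writer:             JsonRecordSetWriter    (assumed controller service)
      Merge Strategy:            Bin-Packing Algorithm
      Minimum Number of Records: 100000
      Maximum Number of Records: 1000000
      Minimum Bin Size:          256 MB
      Max Bin Age:               5 min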
Labels:
- Apache NiFi