Member since: 07-19-2018
Posts: 613
Kudos Received: 101
Solutions: 117
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4901 | 01-11-2021 05:54 AM |
| | 3337 | 01-11-2021 05:52 AM |
| | 8645 | 01-08-2021 05:23 AM |
| | 8158 | 01-04-2021 04:08 AM |
| | 36039 | 12-18-2020 05:42 AM |
09-08-2020 12:26 AM
1 Kudo
"It sounds like your testing solution is exceeding the inbound capabilities of the flow tuning (NiFi config, processor/queue config)" Correct assessment. It showed that the pipeline was not properly sized for the amount of data, which led to back pressure in the ingest component.
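For anyone who lands here later, the settings involved are the back-pressure thresholds on each connection (queue). As far as I know these are NiFi's defaults, shown here only as a sketch; size them to your flow rather than the other way around:

```
Back Pressure Object Threshold:     10000   # queued flowfiles before upstream processors are throttled
Back Pressure Data Size Threshold:  1 GB    # queued bytes before upstream processors are throttled
```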
09-04-2020 06:25 AM
@P_Rat98 You need parquet-tools to read Parquet files from the command line; there is no built-in way to view Parquet content in NiFi. https://pypi.org/project/parquet-tools/
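Usage is roughly as follows (a sketch: the file path is hypothetical, and the subcommand names are as I recall them from the project's docs):

```
pip install parquet-tools
parquet-tools show /path/to/example.parquet     # print rows as a table
parquet-tools inspect /path/to/example.parquet  # print schema and row-group metadata
```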
09-04-2020 06:20 AM
@DanMcCray1 Once you have the content from Kafka as a flowfile, your options are not limited to ExecuteScript. Depending on the type of content, consider the following:

- EvaluateJsonPath - if the content is a single JSON object and you need one or more values inside it, this is an easy way to get those values into attributes.
- ExtractText - if the content is text or some raw format, ExtractText lets you regex-match against the content to pull values into attributes.
- QueryRecord with Record Readers and a Record Writer - this is the most recommended method. Assuming your data has structure (text, CSV, JSON, etc.) and/or multiple rows/objects, you can define a reader with a schema and an output format (record writer), and query the results very effectively.

If you do want to work with ExecuteScript, start here (a small sketch follows the links):

https://community.cloudera.com/t5/Community-Articles/ExecuteScript-Cookbook-part-1/ta-p/248922
https://community.cloudera.com/t5/Community-Articles/ExecuteScript-Cookbook-part-2/ta-p/249018
https://community.cloudera.com/t5/Community-Articles/ExecuteScript-Cookbook-part-3/ta-p/249148

If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
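To get you started, here is a minimal Jython sketch following the cookbook pattern above: it reads a JSON flowfile and copies one value into an attribute. The attribute name `my.value` and the field `someField` are placeholders, not anything from your flow:

```python
import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

# Callback that reads the full flowfile content into a string
class ReadCallback(InputStreamCallback):
    def __init__(self):
        self.text = None
    def process(self, inputStream):
        self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

flowFile = session.get()
if flowFile is not None:
    callback = ReadCallback()
    session.read(flowFile, callback)
    data = json.loads(callback.text)
    # Promote one JSON value to a flowfile attribute (names are placeholders)
    flowFile = session.putAttribute(flowFile, 'my.value', str(data.get('someField')))
    session.transfer(flowFile, REL_SUCCESS)
```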
09-02-2020 11:24 PM
Hi everyone, sorry about the confusion. It was late and I was actually looking at the wrong flowfile output, i.e. the top one on the list (oldest) instead of the bottom one (newest). @stevenmatison thank you for your reply and for the effort of making a template.
09-01-2020 09:16 AM
@stevenmatison Thanks for your answer. As my tables are relatively small and only used to duplicate existing data, is there any way to remove the existing folders before importing new data? Regards
08-28-2020 08:52 AM
@P_Rat98 The error above is saying there is an issue with the Schema Name in your record reader or writer. Inside the ConvertRecord properties, click the --> arrow through to the reader/writer and make sure they are configured correctly. You will need to provide the correct schema name (if it already exists as an attribute) or provide the schema text (sketch below). If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
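If you go the Schema Text route, here is a minimal sketch of an Avro schema you could paste into the reader/writer's Schema Text property. The record and field names are placeholders; match them to your actual data:

```json
{
  "type": "record",
  "name": "example_record",
  "fields": [
    { "name": "id",   "type": "long" },
    { "name": "name", "type": ["null", "string"], "default": null }
  ]
}
```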
08-28-2020 08:47 AM
@P_Rat98 You need to set the filename (Object Key) of each Parquet file uniquely to save separate S3 objects. If that processor is configured with just ${filename}, it will overwrite the object on subsequent executions (see the expression sketch below). For the second option: if you have a split in your data flow, the split parts should carry key/value pairs for the split index and the total number of splits. Inspect your queue and list the attributes on the split flowfiles to find them. Use these attributes with MergeContent to merge everything back together into a single flowfile. You need to do this before converting to Parquet, not after. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
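As a sketch, one common way to guarantee uniqueness is an UpdateAttribute processor ahead of PutS3Object that rewrites filename with expression language (the .parquet suffix is an assumption about your flow):

```
filename = ${filename:substringBeforeLast('.')}-${uuid}.parquet
```

Here ${uuid} is the flowfile's uuid attribute, so every execution produces a distinct object key.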
08-27-2020 08:22 AM
1 Kudo
Oh, that's great. Thanks for your response; that answers my question.
08-27-2020 07:04 AM
I'd recommend these customers work with their account team to plan their CDP journey. I've dug into a number of customers facing this and found strategies for migrating/upgrading them to public cloud, on-prem, or the recently released private cloud offering.
08-27-2020 06:24 AM
@derisrayan Your question is impossible to answer without a very detailed inspection of the following:

- NiFi cluster size (number of nodes) and the spec of each node (CPU/RAM/disk)
- The size of the data processed per flowfile
- The number of pieces of data arriving per execution of the flow

After that, the data flow's concurrency and parallelism are tuned to the NiFi cluster's performance capabilities. This comes down to the total NiFi nodes, total cores, the configuration, and how many active threads the cluster can handle (rough arithmetic below). With a well-configured NiFi cluster (3+ nodes) with as much RAM and as many cores as possible, the transaction rates will be quite impressive. Scaling to 5, 10, or 15+ nodes increases this to an impressive, production-ready scale. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic please comment here or feel free to private message me. If you have new questions related to your Use Case please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
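As a rough illustration of the thread math (the 2-4x cores figure is a common community rule of thumb, not an official formula; the node counts are made up):

```
Max Timer Driven Thread Count (per node) ≈ 2-4 × cores on that node
e.g. 3 nodes × 16 cores × 2 ≈ 96 concurrent timer-driven threads cluster-wide
```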