Member since: 08-08-2024
Posts: 55
Kudos Received: 9
Solutions: 4

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 41 | 12-02-2025 08:17 AM |
|  | 164 | 11-27-2025 10:02 AM |
|  | 634 | 09-08-2025 01:12 PM |
|  | 695 | 08-22-2025 03:01 PM |
10-16-2025
09:42 AM
1 Kudo
Understood. Hopefully the missing NARs pointed out in the previous update help you figure out the issue.
10-14-2025
11:41 AM
Hello @AlokKumar, Thanks for using the Cloudera Community. As I understand it, you need to add one more step to your flow: HandleHttpRequest -> MergeContent -> ExecuteScript (Groovy) -> HandleHttpResponse. Since you have both JSON fields and files, you're getting multiple FlowFiles, so the extra MergeContent step combines the JSON and the file into a single FlowFile. On MergeContent, set Merge Strategy to "Defragment" and set Correlation Attribute Name to http.request.id, which is unique for each HandleHttpRequest.
10-06-2025
01:18 PM
Hello @Brenda99, The question is quite broad; there are many things that can help improve performance. Some basic recommendations are documented here: https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/tuning-spark/topics/spark-admin_spark_tuning.html Take a look at the documentation; it should help. It would also be worth talking with the team in charge of your account for a deeper performance tuning analysis.
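As an illustration only, here is a minimal PySpark sketch of setting a few of the commonly tuned properties from that guide; the application name and the values are placeholders, not recommendations for your workload:

```python
from pyspark.sql import SparkSession

# Illustrative values only; tune against your own data volumes and cluster size.
spark = (
    SparkSession.builder
    .appName("tuning-example")
    .config("spark.executor.memory", "8g")            # heap per executor
    .config("spark.executor.cores", "4")              # concurrent tasks per executor
    .config("spark.sql.shuffle.partitions", "400")    # shuffle parallelism
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```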
09-21-2025
02:55 AM
I guess my problem has no solution from the NiFi side, and we just need to correct the HDFS settings to accept other encryption types in addition to arcfour-hmac-md5.
09-15-2025
09:44 PM
Hello @Jack_sparrow, That should be possible. You don't need to manually specify partitions or HDFS paths; Spark handles this automatically when you use a DataFrameReader. First, read the source table with "spark.read.table()". Since the table is a Hive partitioned table, Spark will automatically discover and read all 100 partitions in parallel, as long as you have enough executors and cores available. Spark then builds a logical plan to read the data. Next, repartition the data: to get exactly 10 output partitions and to control the parallelism of the write operation, use the "repartition(10)" method. This shuffles the data into 10 new partitions, which will be processed by 10 different tasks. Finally, write the table with "write.saveAsTable()", specifying the format with ".format("parquet")".
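A minimal PySpark sketch of those steps; the table names and the overwrite mode are placeholders/assumptions:

```python
from pyspark.sql import SparkSession

# Hive support so the partitioned source table can be read by name.
spark = (
    SparkSession.builder
    .appName("copy-partitioned-table")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark discovers and reads all partitions of the Hive table in parallel.
df = spark.read.table("db.source_table")

# Shuffle into exactly 10 partitions so the write runs as 10 tasks.
df_10 = df.repartition(10)

# Write the result as a Parquet table.
(
    df_10.write
    .format("parquet")
    .mode("overwrite")  # assumption: overwrite semantics; adjust as needed
    .saveAsTable("db.target_table")
)
```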
09-15-2025
10:55 AM
Thank you for replying, that's the exact solution I eventually settled on. Best, Shelly
09-11-2025
03:03 PM
I created the resource as a file, because python-env resources are specifically for managing Python packages via requirements.txt, according to the documentation. Thanks!
09-11-2025
11:04 AM
Hello @Jack_sparrow, Spark should do this automatically; you can control it with these settings:

- Input splits are controlled by spark.sql.files.maxPartitionBytes (default 128 MB). Smaller values produce more splits and more parallel read tasks.
- spark.sql.files.openCostInBytes (default 4 MB) influences how aggressively Spark coalesces small files.
- Shuffle parallelism is controlled by spark.sql.shuffle.partitions (default 200). Set it to roughly 2-3 times the total executor cores.

Also, make sure df.write.parquet() doesn't collapse everything into only a few files; you can use .repartition(n) to increase parallelism before writing.
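A minimal PySpark sketch of those settings, shown at their default values; the input and output paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-read-write").getOrCreate()

# Input split size: smaller values -> more input partitions and more parallel read tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB (default)

# File open cost: influences how aggressively Spark packs small files into one split.
spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))      # 4 MB (default)

# Shuffle parallelism: roughly 2-3x the total executor cores.
spark.conf.set("spark.sql.shuffle.partitions", "200")

df = spark.read.parquet("/data/input")  # placeholder path

# Repartition before writing so the output is not collapsed into only a few files.
df.repartition(200).write.mode("overwrite").parquet("/data/output")  # placeholder path
```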
09-08-2025
09:01 PM
1 Kudo
Thank you for the response.
09-05-2025
05:16 AM
1 Kudo
@yoonli This thread is growing into multiple queries that are not directly related. Please start a new community question so the information is easier for our community members to follow when they have similar issues. Thank you, Matt