Member since: 08-08-2024
Posts: 64
Kudos Received: 10
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 336 | 12-02-2025 08:17 AM |
| | 480 | 11-27-2025 10:02 AM |
| | 755 | 09-08-2025 01:12 PM |
| | 839 | 08-22-2025 03:01 PM |
10-06-2025
01:18 PM
Hello @Brenda99, The question is very broad; there are many things that can help improve performance. Some basic recommendations are documented here: https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/tuning-spark/topics/spark-admin_spark_tuning.html Take a look at the documentation; it should help. It is also worth talking with the team in charge of your account for a deeper performance tuning analysis.
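To make the starting point concrete, here is a minimal PySpark sketch of the kind of settings the tuning guide covers (executor sizing, shuffle parallelism, serialization). The values are illustrative assumptions only, not recommendations for any specific workload:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the right numbers depend on your cluster and
# workload, which is why the tuning guide and your account team should
# drive the actual choices.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.executor.memory", "8g")           # executor heap size
    .config("spark.executor.cores", "4")             # cores per executor
    .config("spark.sql.shuffle.partitions", "200")   # shuffle parallelism
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```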
09-21-2025
02:55 AM
I guess my problem has no solution from the NiFi side, and we just need to correct the HDFS settings to accept other encryption types in addition to arcfour-hmac-md5.
09-15-2025
09:44 PM
Hello @Jack_sparrow That should be possible. You don't need to manually specify partitions or HDFS paths; Spark handles this automatically when you use a DataFrameReader. First, read the source table with "spark.read.table()". Since the table is a Hive partitioned table, Spark will automatically discover and read all 100 partitions in parallel, as long as you have enough executors and cores available. Spark then builds a logical plan to read the data. Next, repartition the data: to ensure exactly 10 output partitions and to control the parallelism of the write operation, use the "repartition(10)" method. This shuffles the data into 10 new partitions, which will be processed by 10 different tasks. Finally, write the table with "write.saveAsTable()", specifying the format with ".format("parquet")".
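A minimal sketch of those three steps, using hypothetical table names that you would replace with your own:

```python
from pyspark.sql import SparkSession

# Hive support is needed so spark.read.table() can resolve the Hive metastore table.
spark = (
    SparkSession.builder
    .appName("copy-partitioned-table")
    .enableHiveSupport()
    .getOrCreate()
)

# 1. Read the partitioned Hive table; Spark discovers all partitions automatically.
df = spark.read.table("source_db.source_table")

# 2. Shuffle into exactly 10 partitions so the write runs as 10 tasks.
# 3. Write the result as a Parquet table.
(
    df.repartition(10)
      .write
      .format("parquet")
      .mode("overwrite")
      .saveAsTable("target_db.target_table")
)
```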
09-15-2025
10:55 AM
Thank you for replying, that's the exact solution I eventually settled on. Best, Shelly
09-11-2025
03:03 PM
I created the resource as a file, because the python-env resources are specifically for managing Python packages via requirements.txt, according to the documentation. Thanks!
09-11-2025
11:04 AM
Hello @Jack_sparrow, Spark should handle this automatically, and you can control it with these settings: Input splits are controlled by spark.sql.files.maxPartitionBytes (default 128 MB); a smaller value produces more splits and therefore more parallel tasks. spark.sql.files.openCostInBytes (default 4 MB) influences how Spark coalesces small files. Shuffle parallelism is controlled by spark.sql.shuffle.partitions (default 200); configure it to roughly 2-3 times the total executor cores. Also, make sure df.write.parquet() doesn't collapse everything into only a few files. For that, you can use .repartition(n) to increase the parallelism before writing.
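Putting those settings together, here is a small PySpark sketch; the table name, output path, and partition count are illustrative assumptions, and the config values shown are simply the defaults mentioned above:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-sketch")
    .config("spark.sql.files.maxPartitionBytes", "134217728")  # 128 MB input splits
    .config("spark.sql.files.openCostInBytes", "4194304")      # 4 MB small-file open cost
    .config("spark.sql.shuffle.partitions", "200")             # shuffle parallelism
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical source table for illustration.
df = spark.read.table("source_db.source_table")

# Explicitly raise write parallelism (e.g., ~2-3x total executor cores)
# so the output isn't collapsed into a handful of files.
df.repartition(48).write.mode("overwrite").parquet("/tmp/example_output")
```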
09-08-2025
09:01 PM
1 Kudo
Thank you for the response.
09-05-2025
05:16 AM
1 Kudo
@yoonli This thread is growing into multiple queries that are not directly related. Please start a new community question so the information is easier for our community members to follow when they have similar issues. Thank you, Matt
08-25-2025
09:18 PM
That did it! I changed my tracking strategy to "Tracking timestamps" and it now populates the "View State" window. Thank you very much for your assistance!
08-25-2025
05:20 AM
@HoangNguyen Keep in mind that the Apache NiFi Variable Registry no longer exists in Apache NiFi 2.x releases, and there is no further development of the Apache NiFi 1.x versions. NiFi Parameter Contexts, which were introduced in later versions of Apache NiFi 1.x, provide similar capability going forward and should be used instead of the Variable Registry. You will be required to transition to Parameter Contexts in order to move to Apache NiFi 2.x versions. Please help our community grow. If any of the suggestions/solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them. Thank you, Matt