Member since: 08-08-2024
Posts: 44
Kudos Received: 2
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 223 | 09-08-2025 01:12 PM
 | 345 | 08-22-2025 03:01 PM
09-11-2025
02:37 PM
Hello @ariajesus, Welcome to our community. Glad to see you here. How did you create the resource: as a File Resource or as a Python Environment? Here are the steps for creating a Python virtual environment resource: https://docs.cloudera.com/data-engineering/1.5.4/use-resources/topics/cde-create-python-virtual-env.html
09-11-2025
11:04 AM
Hello @Jack_sparrow, Spark should do this automatically; you can control it with these settings:
- Input splits are controlled by spark.sql.files.maxPartitionBytes (default 128 MB). A smaller value produces more splits, and therefore more parallel tasks.
- spark.sql.files.openCostInBytes (default 4 MB) influences how Spark coalesces small files into a single partition.
- Shuffle parallelism is controlled by spark.sql.shuffle.partitions (default 200); a common starting point is 2-3 times the total number of executor cores.
Also, make sure df.write.parquet() does not collapse everything into just a few output files. You can call .repartition(n) to increase parallelism before writing, as in the sketch below.
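A minimal PySpark sketch of these settings, assuming hypothetical input_path and output_path locations; the values shown are illustrative starting points and should be tuned to your data volume and cluster size:

```python
from pyspark.sql import SparkSession

# Hypothetical paths; replace with your own locations.
input_path = "/data/in"
output_path = "/data/out"

spark = (
    SparkSession.builder
    .appName("parallel-read-write")
    # Smaller max partition size -> more input splits / parallel read tasks.
    .config("spark.sql.files.maxPartitionBytes", "64m")
    # Estimated cost of opening a file; affects how small files are coalesced.
    .config("spark.sql.files.openCostInBytes", "4m")
    # Shuffle parallelism; roughly 2-3x the total executor cores.
    .config("spark.sql.shuffle.partitions", 400)
    .getOrCreate()
)

df = spark.read.parquet(input_path)

# Repartition before writing so the output is not collapsed into a few files.
df.repartition(200).write.mode("overwrite").parquet(output_path)
```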
09-09-2025
09:26 AM
Hello @Jack_sparrow, Glad to see you again in the forums.
1. Resource allocation is hard to prescribe because it depends on many factors: how big your data is, how much memory the cluster has, executor overhead, and so on. There is very useful information here: https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/tuning-spark/topics/spark-admin-tuning-resource-allocation.html Under that parent section there are more tuning suggestions for each topic.
2. From the second option I understand that you want to read each file type separately. That should be possible with something like this:
if input_path.endswith(".parquet"):
    df = spark.read.parquet(input_path)
elif input_path.endswith(".orc"):
    df = spark.read.orc(input_path)
elif input_path.endswith(".txt") or input_path.endswith(".csv"):
    df = spark.read.text(input_path)  # or .csv with options
else:
    raise Exception("Unsupported file format")
Then you can handle each type of data separately.
3. Data movement should avoid going through the driver, to avoid issues and extra work, so collect() or .toPandas() are not the best options. If you want to move data without transformations, distcp is a good option. To write you can use this: df.write.mode("overwrite").parquet("ofs://ozone/path/out") Other suggestions are tuning the partition size with "spark.sql.files.maxPartitionBytes" and changing the compression to snappy using "spark.sql.parquet.compression.codec", as in the sketch below.
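A minimal sketch of point 3, assuming a hypothetical HDFS source path and Ozone target path; the config keys are standard Spark SQL settings, but the paths and values are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-ozone-copy")
    # Size of each input split; tune to balance task count vs. per-task overhead.
    .config("spark.sql.files.maxPartitionBytes", "128m")
    # Write Parquet with snappy compression.
    .config("spark.sql.parquet.compression.codec", "snappy")
    .getOrCreate()
)

# Hypothetical source and target locations.
src = "hdfs:///data/source"
dst = "ofs://ozone/path/out"

# Read and write entirely on the executors; nothing is collected to the driver.
df = spark.read.parquet(src)
df.write.mode("overwrite").parquet(dst)
```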
09-08-2025
01:12 PM
1 Kudo
Hello @Jack_sparrow, Glad to see you on the Community. As far as I know, df.write cannot be used inside rdd.foreach or rdd.foreachPartition. The reason is that df.write is a driver-side action that triggers a Spark job, while the functions passed to rdd.foreach or rdd.foreachPartition run on the executors, and executors cannot trigger jobs. Check these references: https://stackoverflow.com/questions/46964250/nullpointerexception-creating-dataset-dataframe-inside-foreachpartition-foreach https://sparkbyexamples.com/spark/spark-foreachpartition-vs-foreach-explained The option that looks like it would work for you is df.write.partitionBy, something like this: df.write.partitionBy("someColumn").parquet("/path/out") (see the sketch below).
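A minimal, self-contained sketch of the partitionBy approach, using a hypothetical small DataFrame; the column names and output path are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

# Hypothetical sample data; in practice this would be your real DataFrame.
df = spark.createDataFrame(
    [("2024-01-01", "a", 1), ("2024-01-01", "b", 2), ("2024-01-02", "a", 3)],
    ["dt", "key", "value"],
)

# The driver triggers a single write job; the output is split into one
# subdirectory per distinct value of the partition column, e.g. dt=2024-01-01/.
df.write.mode("overwrite").partitionBy("dt").parquet("/path/out")
```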
08-27-2025
01:26 PM
Hi @MattWho, I think you tagged the wrong person. @yoonli, take a look at @MattWho's update.
08-26-2025
11:06 AM
Hello @yoonli, I was checking the configuration and comparing it with other threads, and it looks fine to me. One thing to note is that users.xml and authorizations.xml are only generated when they do not already exist. You will need to stop NiFi and then move those files aside:
mv conf/authorizations.xml conf/authorizations.xml.backup
mv conf/users.xml conf/users.xml.backup
Then you can retry. It is also worth checking this thread, which contains a lot of information on this same issue: https://community.cloudera.com/t5/Support-Questions/Untrusted-proxy-error-Authentication-Failed-o-a-n-w-s/m-p/399540
08-22-2025
03:01 PM
Hello @HoangNguyen, If I understand correctly, what you want is not possible. ListFile does not support an incoming FlowFile as a source; its Input Directory property can only be set through the variable registry. Look here:
Display Name: Input Directory
API Name: Input Directory
Description: The input directory from which files to pull files
Supports Expression Language: true (will be evaluated using variable registry only)
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ListFile/index.html
Looking at that, FetchFile should do what you need:
Display Name: File to Fetch
API Name: File to Fetch
Default Value: ${absolute.path}/${filename}
Description: The fully-qualified filename of the file to fetch from the file system
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.FetchFile/index.html
08-22-2025
01:50 PM
Hello @Amry, Strange, you should be able to see something like this: Do you have that button? Can you confirm your CFM version so I can take a look at that version as well?
08-21-2025
09:28 AM
1 Kudo
Hello @MoJadallah, Sorry that no one has answered so far; we do want you to have a good experience with your trial. I assume you requested your free trial from here: https://www.cloudera.com/products/cloudera-public-cloud-trial.html?internal_keyplay=ALL&internal_campaign=FY25-Q1-GLOBAL-CDP-5-Day-Trial&cid=FY25-Q1-GLOBAL-CDP-5-Day-Trial&internal_link=WWW-Nav-u01 When did you hit that error? Right after sign-up?
08-18-2025
08:13 AM
Default configurations are the tested ones, which is why they are set that way, but they are configurable for a reason: depending on the environment, the use case, and much more, they sometimes need to be tuned in a specific way.