Member since: 12-02-2017
Posts: 9
Kudos Received: 0
Solutions: 0
10-14-2019
02:25 AM
I'm trying to write an ordered DataFrame/Dataset into multiple CSV files while preserving both global and local sort order. I have the following code:

df
  .orderBy("date")
  .coalesce(100)
  .write
  .csv(...)

Does this code guarantee that:
- I will have 100 output files?
- Each single CSV file is locally sorted, i.e. by the "date" column ascending?
- The files are globally sorted, i.e. every "date" in CSV part-0000 is lower than every "date" in CSV part-0001, every "date" in part-0001 is lower than every "date" in part-0002, and so on?

Thanks
Labels:
Apache Spark
09-07-2018
05:57 PM
Hi community, I'm trying to build Cloudera Oozie 5 from the Apache web site for educational purposes, against Hadoop 3. Every time I run the mkdistro.sh command with parameters like "-Dhadoop.version=3.0.0", the build fails on Oozie Core. So I'm trying the first option for adding the Hadoop jars:

1. At install time, copy the hadoop and hcatalog libraries to libext and run oozie-setup.sh to set up Oozie. This is suitable when the same Oozie package needs to be used in multiple set-ups with different hadoop/hcatalog versions.

Could you please explain how this latter option works? Or could you give me a way that does work to build Oozie with Hadoop 3? Thanks
Labels:
Apache Oozie
01-12-2018
02:34 AM
Hi @Tim Armstrong Thanks for the quality answer 🙂 As you mention, we don't use the latest Impala version in production, so it is indeed possible that there are bugs in Impala or in the UDFs/UDAFs. I will check the changelog and evaluate possible issues with those very large requests. Regarding our self-made UDF, the good news is that, after reviewing our log history, this error was also triggered before the deployment of that UDF. So if there are memory leaks at the moment, they might be unrelated to our work, and it might be a minor issue since we just have to wait until we upgrade the cluster (and the Impala version). Otherwise, many thanks for the implementation details you gave me; they help me understand better!
01-04-2018
07:49 AM
Hello, I see the following error triggered regularly on my cluster:

> Memory limit exceeded
> FunctionContextImpl::AllocateLocal's allocations exceeded memory limits

This error appears when:
- the requests address a very large number of tuples
- the requests seem to use only built-in UDFs/UDAFs like SUM, COUNT, or DISTINCT

If I rerun the request, it then succeeds almost every time. Note that I have HAProxy in use with a lot of impalads behind it.

Questions:
1) Is FunctionContextImpl::AllocateLocal called by these built-in UDAFs? Is this a typical error meaning that the impalad is running out of memory for the current request? If yes, could increasing the memory of each impalad easily solve the problem?
2) Could the problem be related to memory leaks? I mean that I also have self-made C++ UDFs in use with other Impala requests. Those requests (which address a smaller number of tuples) succeed. However, in those UDFs I use FunctionContextImpl::Allocate/Free (not the AllocateLocal one), and the StringVal constructor with a context parameter. So basically, if a memory leak is actually happening in the self-made UDFs, could it be related to the previous error? I mean the "AllocateLocal's allocations exceeded memory" error that occurs on requests which do not use the self-made UDFs. Thanks!
Labels:
Apache Impala
12-10-2017
11:17 AM
Thanks again!
12-10-2017
11:14 AM
Hi @Tim Armstrong Thank you very much for the reply!
12-08-2017
04:53 AM
Hello, We have multiple Impala C++ UDFs that we want to deploy on our production Cloudera cluster. We have carefully reviewed the source code in order to avoid memory leaks, segmentation faults, and race conditions. However, if we have missed something and a segmentation fault, memory leak, or race condition still occurs, what are the risks for the entire cluster? If an error like that occurs, could the corresponding impalad crash? If an impalad crashes because of a UDF, is restarting it enough to get back to good health? What about impalad isolation? Again, if a segmentation fault, memory leak, or race condition occurs, can other Cloudera service instances be affected (HDFS, Hive, ...)? Could you please quickly summarize the risks associated with a buggy C++ UDF? Thanks!
Labels:
Apache Impala
12-02-2017
11:18 AM
I have the following UDF:

CREATE FUNCTION myudf(string) RETURNS string
LOCATION '/user/cloudera/myudflib.so'
SYMBOL='Process'
PREPARE_FN='PrepareLibrariesAndDataStructures'
CLOSE_FN='CloseLibrariesAndCleanupDataStructures';

As you can see, for each Impala thread my C++ UDF needs to initialize some libraries and data structures with the PrepareLibrariesAndDataStructures function BEFORE the Process function starts to be called multiple times. On the other hand, CloseLibrariesAndCleanupDataStructures always needs to be called once the corresponding Impala thread has no further Process calls to make, in order to free the data structures and clean up the libraries. In order to avoid memory leaks, does Cloudera Impala guarantee that the CLOSE_FN will still be called when either the user cancels the query, or the Process function fails with SetError()? In other words, can we trust Cloudera Impala to always call CLOSE_FN whenever the corresponding PREPARE_FN has been called? Or must we put the data structure/library initialization and cleanup directly in the SYMBOL Process function to minimize the risk of memory leaks? Thank you very much!
Labels:
Apache Impala