Member since: 03-23-2015
Posts: 1288
Kudos Received: 114
Solutions: 98
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3295 | 06-11-2020 02:45 PM |
| | 5012 | 05-01-2020 12:23 AM |
| | 2815 | 04-21-2020 03:38 PM |
| | 2619 | 04-14-2020 12:26 AM |
| | 2315 | 02-27-2020 05:51 PM |
10-14-2019
04:46 PM
@Plop564 I am not an expert in Spark, but my understanding is below:

1. "I will have 100 output files" >>> This depends on how many partitions your original DataFrame has. "coalesce" can only reduce the number of partitions, so if you had fewer than 100 partitions to begin with it won't do anything, as "coalesce" does not shuffle. If you want to guarantee the number of output files, I believe the "repartition" function is the better choice.
2. "Each single CSV file is locally sorted, I mean by the 'date' column ascending" >>> Yes.
3. "Files are globally sorted, I mean CSV part-0000 has 'date' values lower than CSV part-0001, part-0001 lower than part-0002, and so on" >>> I believe this is also yes, but I will wait for other Spark experts to confirm. A sketch of what I mean is below.

Cheers,
Eric
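For illustration, here is a minimal PySpark sketch of points 1-3, assuming a DataFrame with a "date" column; the paths and the input format are made up, not from the original question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sorted-csv-sketch").getOrCreate()

# Hypothetical input; replace with the real source DataFrame.
df = spark.read.parquet("/tmp/input")

(df.repartitionByRange(100, "date")   # range shuffle: part-00000 gets the lowest dates, part-00099 the highest
   .sortWithinPartitions("date")      # each output CSV file is locally sorted ascending by "date"
   .write
   .option("header", "true")
   .csv("/tmp/output_csv"))

# By contrast, coalesce(100) only lowers the partition count without shuffling,
# so it cannot increase the number of files or control how rows are distributed.
```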
10-10-2019
10:41 PM
@sbn, /etc/spark2/conf should be a symlink to /etc/spark2/conf.cloudera.spark2_on_yarn. Can you confirm by running the following?

ls -al /etc/spark2

Cheers,
Eric
10-06-2019
03:32 PM
1 Kudo
@Mekaam, glad that it helped. Cheers, Eric
10-06-2019
03:20 PM
@pramana, it looks like you are using Ubuntu "bionic", which is not supported in CDH/CM 5.16.x. Bionic is only supported from CDH 6.2 onwards:

https://docs.cloudera.com/documentation/enterprise/release-notes/topics/rn_consolidated_pcm.html#c516_supported_os
https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_os_requirements.html#c63_supported_os

So you need to either try the 6.2 release of CM/CDH or change your Ubuntu OS version.

Hope that helps.

Cheers,
Eric
10-06-2019
03:13 PM
@gimp077, did you mean that "REFRESH" takes time and you eventually see the updated data, just with some delay? How big is the table, in terms of the number of partitions and the number of files in HDFS?

Eric
10-06-2019
03:11 PM
@priyanka1_munja, are you saying that the same partition appears multiple times? Did you notice the extra space before some of the partition keys? For example, "03-04-2015" vs " 03-04-2015". I think that is the reason for the duplicates. A small illustration is below.

Cheers,
Eric
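A minimal PySpark illustration of what I mean, using made-up data and a made-up output path (not from the original question):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-space-demo").getOrCreate()

# Note the leading space in the second row's "dt" value.
df = spark.createDataFrame(
    [("a", "03-04-2015"), ("b", " 03-04-2015")],
    ["id", "dt"],
)

df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/partition_space_demo")

# The output ends up with two distinct partition directories, one for "03-04-2015"
# and one for " 03-04-2015", because the partition values differ by the leading space.
# Trimming the column first (pyspark.sql.functions.trim) collapses them into one partition.
```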
10-04-2019
04:02 AM
1 Kudo
Hmm, you missed the database name in the connection string. Try the below:

beeline -u 'jdbc:hive2://slave1:10000/default;ssl=true;sslTrustStore=/var/run/cloudera-scm-agent/process/72-hive-HIVESERVER2/cm-auto-host_keystore.jks;trustStorePassword=yeap4IhJzRvK5gBGVMeTahoL21BNmBF2TSi46pbQTP6'

Cheers,
Eric
10-04-2019
12:04 AM
@Mekaam, can you please add quotes around the JDBC connection string? Like below:

beeline -u 'jdbc:hive2://slave1:10000;ssl=true;sslTrustStore=/var/run/cloudera-scm-agent/process/72-hive-HIVESERVER2/cm-auto-host_keystore.jks;trustStorePassword=yeap4IhJzRvK5gBGVMeTahoL21BNmBF2TSi46pbQTP6'

I believe that without quotes it will cause issues. If it is still not working, check the HS2 log to see what it complains about on the server side.

Cheers,
Eric
09-30-2019
04:33 PM
1 Kudo
@parthk, there is no date locked in yet for the new Impala release that will support Ranger. However, I would like to ask why you do not want to have Kerberos. Authorization does not work properly without authentication in front of it. Think about an online application: you surely want users to log in first, before you can decide what level of access they should have. The same applies in the CDH world. Kerberos acts as the front-end login, and Sentry/Ranger acts as the back-end authorization control, so without Kerberos you are allowing everyone to access CDH.

I strongly suggest you implement Kerberos before Sentry; the same story applies to Ranger.

Cheers,
Eric