Member since: 06-09-2016
Posts: 529
Kudos Received: 129
Solutions: 104
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1669 | 09-11-2019 10:19 AM |
| | 9197 | 11-26-2018 07:04 PM |
| | 2387 | 11-14-2018 12:10 PM |
| | 5089 | 11-14-2018 12:09 PM |
| | 3051 | 11-12-2018 01:19 PM |
08-10-2018
12:58 PM
@Girish Khole How did you install the Spark client on the node that is not part of the cluster? There are a few considerations when the node is not managed by Ambari:
1. The Spark client version should be the same as the one in the cluster.
2. You need to make sure all the configuration files for HDFS/YARN/Hive are copied from the cluster.
3. When you launch the client against a Spark standalone master, the application does not run on the cluster; it runs in standalone mode. To test against the cluster you need to use --master yarn (which can be used with either client or cluster deploy mode).
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
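As an illustration (not from the original thread), here is a minimal Scala sketch, assuming HADOOP_CONF_DIR/YARN_CONF_DIR on the external node point at the configuration files copied from the cluster. It builds a session with master "yarn" in client deploy mode and runs a trivial job to confirm the external client really talks to the cluster:
import org.apache.spark.sql.SparkSession
// Minimal smoke test for an external Spark client, assuming the cluster's
// HDFS/YARN configuration files have been copied and HADOOP_CONF_DIR points at them.
val spark = SparkSession.builder()
  .appName("ExternalClientSmokeTest")
  .master("yarn")  // client deploy mode; cluster deploy mode requires spark-submit
  .getOrCreate()
// If this count returns, executors were actually launched on the cluster's NodeManagers.
println(spark.range(0L, 1000L).count())
spark.stop()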
08-10-2018
12:15 PM
@Mark Sure, here is the link to the PySpark network word count example: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
HTH
08-09-2018
08:15 PM
@Harald Berghoff I checked the docker-deploy script for HDP 2.6.5, and we do a docker pull of hortonworks/sandbox-hdp from Docker Hub. However, the deploy script does more than just that. Having said that, you might want to wait until the sandbox for 3.0 is added to the Hortonworks portal along with the corresponding scripts & instructions.
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
08-09-2018
08:03 PM
1 Kudo
@Matt Krueger you should look at the Spark history server / Spark UI to see the actual environment settings being used. Setting executor cores to 3 means each executor will run 3 concurrent task threads. AFAIK this might not be the same as the YARN vcore concept.
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
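To illustrate, here is a minimal sketch (not the asker's exact job; the executor count is hypothetical) that sets the value explicitly and prints what actually took effect. You can cross-check the same value in the Spark UI / history server under the Environment tab:
import org.apache.spark.sql.SparkSession
// Each executor JVM runs up to spark.executor.cores concurrent task threads.
val spark = SparkSession.builder()
  .appName("ExecutorCoresCheck")
  .config("spark.executor.cores", "3")      // 3 concurrent tasks per executor
  .config("spark.executor.instances", "4")  // hypothetical executor count
  .getOrCreate()
// Print the effective setting; it should match the Environment tab in the Spark UI.
println(spark.conf.get("spark.executor.cores"))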
08-09-2018
06:16 PM
1 Kudo
@Harun Zengin
By default Livy will launch an application on YARN, and usually the default master is set to yarn-cluster. This means an authenticated user could push code that could potentially run on any cluster worker node that has a running NodeManager. These containers are launched by YARN, and the container process is always owned by the calling user (in this case the user that made the request to Livy), so the container process runs as the caller and only has access to that user's authorized resources. There is no way such a user could read a keytab from the /etc/security/keytab directory. The same applies to HDFS data: unless the user has permissions on the files, they won't be able to access them. This is also true without Livy, since a user could use the hdfs/webhdfs clients to read data directly. At the same time, there are other ways to push application code that are not limited to Livy, such as spark-submit/spark-shell, which work in a similar fashion, except those tend to be used from edge nodes that only a few users have access to. Having said all that, if you would like to restrict access to Livy and not rely only on authentication, look at the Knox, Livy and Ranger integration. That way you can reduce the number of users that can use Livy's REST API by authorizing only specific groups/users.
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
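As a quick illustration of the point about container ownership (a minimal sketch you could run inside a Livy session; not part of the original answer), the Hadoop UserGroupInformation API reports which user the session is actually running as:
import org.apache.hadoop.security.UserGroupInformation
// The YARN containers backing this Livy session run as the caller, so HDFS
// permission checks apply to that user, not to a privileged service account.
val effectiveUser = UserGroupInformation.getCurrentUser.getShortUserName
println(s"This Livy session is running as: $effectiveUser")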
08-09-2018
05:44 PM
@Takefumi Oide No, but you can have multiple HiveServer2 processes configured with different authentication mechanisms. Let's say you need all the auth mechanisms listed above: you add one HiveServer2 process and configure it with SIMPLE+LDAP, and then add another HiveServer2 process and configure it with LDAP+Kerberos. With Ambari this can be done using config groups.
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
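For illustration only, a minimal Scala JDBC sketch with hypothetical hostnames (hs2-ldap / hs2-krb), assuming the Hive JDBC driver is on the classpath: clients simply connect to whichever HiveServer2 instance is configured with the mechanism they need.
import java.sql.DriverManager
// Hypothetical LDAP-authenticated HiveServer2 instance: user name and password
// are passed on the connection call.
val ldapConn = DriverManager.getConnection(
  "jdbc:hive2://hs2-ldap.example.com:10000/default", "ldapUser", "ldapPassword")
// Hypothetical Kerberos-authenticated instance: the server principal goes in the
// URL and the client credentials come from a prior kinit.
val krbConn = DriverManager.getConnection(
  "jdbc:hive2://hs2-krb.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM")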
08-09-2018
05:41 PM
@Sudharsan Ganeshkumar if my answer has helped you, please remember to login and mark it as accepted.
08-09-2018
01:51 PM
@Is Ta
null means the conversion failed. I think this is because your initial creationDate is actually a timestamp, not a date. The following code is Scala Spark, as I'm not as used to Java Spark; hopefully you can adapt it to Java:
// dataframe is the original DataFrame containing the creationDate column
import org.apache.spark.sql.functions.{to_timestamp, date_format}
import spark.implicits._  // provides the $"column" syntax (auto-imported in spark-shell)
val ds = dataframe.withColumn("timestamp", to_timestamp($"creationDate", "dd/MM/yyyy HH:mm:ss"))
val result = ds.withColumn("date_formatted", date_format($"timestamp", "dd/MM/yyyy HH:mm:ss"))
result.show()
This is an example of the output:
+-------------------+-------------------+-------------------+
|         input_date|          timestamp|     date_formatted|
+-------------------+-------------------+-------------------+
|15/06/2018 09:15:28|2018-06-15 09:15:28|15/06/2018 09:15:28|
|03/06/1982 09:15:28|1982-06-03 09:15:28|03/06/1982 09:15:28|
+-------------------+-------------------+-------------------+
This is also saved correctly when you write to a file, since the date_formatted column is actually a string.
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
08-08-2018
03:34 PM
@harish you can use WebHDFS to save the necessary files to HDFS, then use the Oozie REST API over Knox to run your Oozie workflows: https://oozie.apache.org/docs/4.0.1/WebServicesAPI.html
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
08-07-2018
01:13 PM
@Sudharsan Ganeshkumar
Out of the box Spark provides fileStream. You can read more here: https://spark.apache.org/docs/latest/streaming-programming-guide.html
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
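For example, here is a minimal sketch using the text flavour of the file stream (the directory path and batch interval are placeholders): Spark Streaming picks up any new files that land in the monitored directory after the stream starts.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Watch an HDFS directory and count words in files created after the stream starts.
val conf = new SparkConf().setAppName("FileStreamWordCount")
val ssc = new StreamingContext(conf, Seconds(30))
val lines = ssc.textFileStream("hdfs:///tmp/incoming")  // placeholder directory
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()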