Member since: 05-10-2016
Posts: 97
Kudos Received: 19
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3008 | 06-13-2017 09:20 AM |
| | 9021 | 02-02-2017 06:34 AM |
| | 3944 | 12-26-2016 12:36 PM |
| | 2468 | 12-26-2016 12:34 PM |
| | 50734 | 12-22-2016 05:32 AM |
07-27-2018
08:24 AM
1 Kudo
Since this question has already been marked resolved and you are looking for Python examples rather than PySpark, you may want to ask a new question. You may also want to look at the various Python libraries that already implement access to HDFS data.
10-11-2017
01:18 PM
Can you verify that the shuffle auxiliary service is enabled within YARN?
06-22-2017
01:41 PM
1 Kudo
You still need the spark-streaming dependency, but instead of version 2.1.1 you will want to match your Spark core version, 1.6.3.
06-15-2017
06:37 AM
1 Kudo
Why are you trying to connect to Impala via JDBC to write the data? You can write the data directly to storage through Spark and still access it through Impala after running "REFRESH <table>" in Impala. This avoids the issues you are having and should be more performant.
06-13-2017
09:20 AM
Hi Sidhartha,
It appears you are using a newer version of spark-streaming (2.1.1). The spark-streaming-twitter artifact includes spark-streaming 1.6.3, and since you are using Spark 1.6, the 2.1.1 dependency may be causing conflicts. There is no need to include the spark-streaming dependency explicitly; it will be pulled in as a transitive dependency of spark-streaming-twitter.
Jason
06-13-2017
07:59 AM
Hi Msdhan,
What are the schema and file format of the Impala table? Why not write the data directly and avoid a JDBC connection to Impala?
Jason
02-02-2017
06:34 AM
1 Kudo
This is currently an issue with numeric data types. It is resolved in 2.0, but you can work around the issue by casting to VARCHAR, or by importing the data into an RDD and then converting it to a DataFrame.
01-03-2017
08:15 AM
1 Kudo
Yes, this would typically not be recommended, but it is a workaround for a bug. This is fixed in CM 5.9, so when using 5.9 and newer you should not need to disable parcel relation validation.
12-26-2016
01:26 PM
No problem. The name is a bit misleading, but 5.7 is the minimum version required; installing that parcel won't be a problem with 5.9. The requirements section [1] has a bit more information on supported versions. 1. http://www.cloudera.com/documentation/spark2/latest/topics/spark2_requirements.html
12-26-2016
12:36 PM
Spark 2.0 is available as a parcel as well, so you shouldn't need to move to packages unless you have another reason. Spark 2.0 is out of beta now and is GA. Here is more information on how to install Spark 2 with Cloudera Manager: http://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html