Member since: 05-10-2016
Posts: 97
Kudos Received: 19
Solutions: 13
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3011 | 06-13-2017 09:20 AM
 | 9024 | 02-02-2017 06:34 AM
 | 3945 | 12-26-2016 12:36 PM
 | 2469 | 12-26-2016 12:34 PM
 | 50749 | 12-22-2016 05:32 AM
12-26-2016
12:34 PM
Spark includes Akka, and it looks like you are getting a version mismatch. You will need to look into the versions, aligning the version of Akka you are trying to use with the one bundled in Spark, and possibly shade the dependency so the correct versions are used.
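For example, if you are building a fat jar with sbt-assembly, a shade rule roughly like the following can relocate your Akka classes so they don't collide with Spark's copy. This is only a sketch; the "shadedakka" prefix is an arbitrary placeholder and your build will differ.

```scala
// build.sbt (sbt-assembly) -- relocate Akka classes inside the assembled jar
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("akka.**" -> "shadedakka.@1").inAll
)
```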
12-22-2016
05:32 AM
1 Kudo
It would depend somewhat on your requirements, but Impala/Hive and Parquet don't store the time zone with the timestamp, so you may be losing data if you don't have a separate time zone column (i.e. what time zone the event took place in). I'll typically leave the timestamp as-is in Parquet, include the time zone, and allow conversions to be made at query time. If the users of the table need all timestamps to be in the same time zone, you could make the change before saving to Parquet. You could also consider doing both: storing the timestamp as-is and also storing a converted timestamp manually.
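As a rough sketch of the first approach (the column names, zone, and path are made up for illustration, assuming a Spark 2.x SparkSession named spark):

```scala
import org.apache.spark.sql.functions._

// Hypothetical example data; in practice this would come from your source system.
val events = spark.createDataFrame(Seq(
  ("order-1", java.sql.Timestamp.valueOf("2016-12-22 05:32:00"))
)).toDF("id", "event_ts")

// Keep the timestamp as-is and carry the originating time zone alongside it.
val withTz = events.withColumn("event_tz", lit("America/New_York"))
withTz.write.parquet("/tmp/events_parquet")

// Readers can then convert at query time as needed, e.g. in Hive/Impala:
//   SELECT to_utc_timestamp(event_ts, event_tz) FROM events_parquet;
```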
12-21-2016
07:03 AM
1 Kudo
It depends on whether you would prefer your timestamps to be converted to UTC (knowing the HiveServer would need to reside in the time zone the data was generated in) or left alone. I personally prefer not modifying data automatically and controlling any time zone corrections within my ETL process. I also prefer to include a time zone column specifying what time zone the data originated in, so that any corrections can be made later. As a note, Hive also has a setting to disable the time zone conversion, hive.parquet.timestamp.skip.conversion, which was added in 1.2 as part of https://issues.apache.org/jira/browse/HIVE-9482. Finally, I'll note that both Hive and Impala try to be smart about when to apply the conversion when reading, because multiple engines handle it differently: Hive will detect whether the file was written by a Hive process or by Impala, and Impala will do the same. It makes sense to play around with the settings, inserting and reading under different configurations, to make sure you fully understand how it works.
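As a sketch of what that ETL-side correction could look like (column names and paths are hypothetical, and the Column-based overload of to_utc_timestamp assumed here only exists in Spark 2.4+):

```scala
import org.apache.spark.sql.functions._

// Timestamps stored exactly as written, plus the zone they originated in.
val raw = spark.read.parquet("/tmp/events_parquet")

// Normalize to UTC explicitly in the ETL step rather than relying on
// engine-level conversion behaviour when the files are read.
val normalized = raw.withColumn("event_ts_utc",
  to_utc_timestamp(col("event_ts"), col("event_tz")))

normalized.write.parquet("/tmp/events_utc")
```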
12-14-2016
07:10 PM
When Spark determines it needs to use YARN's localizer, it will always upload the jar to HDFS; it does not attempt to check whether the file changed before uploading. When using the Spark distribution included with CDH, the Spark jar is already present on all nodes and the configuration specifies that the jar is local. When it is specified as local, Spark will not upload the jar and YARN's localizer is not used.
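As a minimal sketch for Spark 1.x, a local: URI tells Spark the jar already exists at that path on every node (the path below is only a typical CDH location, not necessarily yours):

```scala
import org.apache.spark.SparkConf

// With a local: URI Spark assumes the jar is already on every node,
// so nothing is uploaded to HDFS and YARN's localizer is bypassed.
val conf = new SparkConf()
  .set("spark.yarn.jar",
    "local:/opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar")
```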
12-13-2016
11:11 AM
It's also possible to establish an SSH tunnel in order to connect to a remote debug session. Take a look at the -L option for ssh; it lets you open a local port and forward it to a remote host and port within the ssh command. This will work for private IPs as long as you can connect from a public IP to a server that has access to the private network. Note, though, that latency can be extreme in setups like this, which can still make debugging difficult.
12-13-2016
11:01 AM
Hi Ranan, because this is an older thread and already marked as solved, let's keep this conversation on the other thread you opened: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Debug-Spark-program-in-Eclipse-Data-in-AWS/m-p/48472#U48472
12-12-2016
07:48 PM
When running Spark on YARN, the Spark archive gets distributed to worker nodes via the ContainerLocalizer (aka the distributed cache). Spark first uploads the files to HDFS, and worker nodes then download the jar from HDFS when needed. The localizer has some checks to only download the jar when it has changed or has been removed from the worker, so it can reuse the jar and avoid downloading it again if it still exists locally.
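One way to take advantage of that caching (a sketch; the HDFS path is just an example location) is to stage the assembly on HDFS once and point spark.yarn.jar at it, so the client doesn't re-upload it on every submission and the localizer can reuse its cached copy on each worker:

```scala
import org.apache.spark.SparkConf

// Hypothetical HDFS location where the assembly was staged once up front.
// The localizer pulls it to each worker and reuses the cached copy until it changes.
val conf = new SparkConf()
  .set("spark.yarn.jar", "hdfs:///user/spark/share/lib/spark-assembly.jar")
```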
11-29-2016
07:43 AM
Hi zhuangmz, it looks like you found the Hive service option within the Spark configuration and that solved your problem, based on your other post. Feel free to mark this as solved with the link provided or another brief description here.
11-04-2016
06:17 AM
I've seen issues with some hardware where using local[*] doesn't use the number of cores you would expect. The Java method used to get the number of cores available to the process isn't always consistent. Instead, try specifying the number explicitly, like local[6], and try again.
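For example (a sketch; the app name and core count are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pin the local core count explicitly instead of relying on local[*],
// which depends on what the JVM reports for available processors.
val conf = new SparkConf()
  .setAppName("example-app")
  .setMaster("local[6]")
val sc = new SparkContext(conf)
```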
11-03-2016
05:57 PM
There are two tasks for this Spark job, and both should be able to run in parallel. It looks like there is an issue while running this locally. How are you launching the job? If you are setting the master on the command line, could there be somewhere in the code that overrides it to a single thread? When launching this within YARN with multiple executors, you should see the two tasks run in parallel, but it should be possible locally as well.
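One common way this happens (a hypothetical sketch, not taken from your code): a master hard-coded on the SparkConf takes precedence over --master on the command line, so something like local[1] in the application would serialize the two tasks onto a single thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// setMaster here overrides whatever --master was passed to spark-submit;
// with local[1] the two tasks cannot run in parallel.
val conf = new SparkConf()
  .setAppName("example-app")
  .setMaster("local[1]")  // remove this, or use local[2] or more
val sc = new SparkContext(conf)
```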