Member since: 05-10-2016
Posts: 97
Kudos Received: 19
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3052 | 06-13-2017 09:20 AM |
| | 9184 | 02-02-2017 06:34 AM |
| | 4041 | 12-26-2016 12:36 PM |
| | 2809 | 12-26-2016 12:34 PM |
| | 51278 | 12-22-2016 05:32 AM |
12-26-2016
12:34 PM
Spark includes Akka, and it looks like you are getting a version mismatch. You will need to look into the versions: either align the version of Akka you are using with the one bundled inside Spark, or shade the dependency so the correct versions are used.
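If aligning the versions isn't an option, shading would look roughly like the sketch below. This assumes the application is built with sbt and the sbt-assembly plugin, and the shaded package prefix is just an example:

```scala
// build.sbt (requires the sbt-assembly plugin)
// Rename the Akka packages bundled with the application so they no longer
// collide with the Akka classes shipped inside the Spark assembly.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("akka.**" -> "shaded.akka.@1").inAll
)
```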
12-22-2016
05:32 AM
1 Kudo
It would depend somewhat on your requirements, but Impala/Hive and Parquet don't store the time zone with the timestamp, so you may be losing data if you don't have a separate time zone column (i.e., which time zone the event took place in). I'll typically leave the timestamp as is in Parquet, include the time zone column, and allow conversions to be made at query time. If the users of the table need all timestamps to be in the same time zone, you could make the change before saving to Parquet. You could also consider doing both: storing the timestamp as is and also storing a converted timestamp manually.
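A rough sketch of the "store as is plus a time zone column" approach, with conversion left to query time. The column names, output path, and example zone are all made up for illustration, and this assumes a Spark application or shell with Spark SQL available:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_utc_timestamp}

val spark = SparkSession.builder()
  .appName("timestamp-timezone-example")   // placeholder name
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Keep the raw timestamp untouched and record which time zone it came from.
val events = Seq(
  ("e1", java.sql.Timestamp.valueOf("2016-12-22 05:32:00"), "America/New_York"),
  ("e2", java.sql.Timestamp.valueOf("2016-12-22 11:32:00"), "Europe/Berlin")
).toDF("event_id", "event_ts", "event_tz")

events.write.mode("overwrite").parquet("/tmp/events")   // hypothetical path

// Convert at query time instead of rewriting what is stored in Parquet.
// The zone is a literal here for simplicity; the event_tz column is what
// lets consumers pick the correct zone per row.
val inUtc = spark.read.parquet("/tmp/events")
  .withColumn("event_ts_utc", to_utc_timestamp(col("event_ts"), "America/New_York"))
```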
12-21-2016
07:03 AM
1 Kudo
It depends on whether you would prefer your timestamps to be converted to UTC (knowing the HiveServer would need to reside in the time zone the data was generated in) or left alone. I personally prefer not modifying data automatically and controlling any time zone corrections within my ETL process. I also prefer to include a time zone column recording which time zone the data originated in, so corrections can still be made later. Just a note: Hive now also has a setting to disable the time zone conversion, hive.parquet.timestamp.skip.conversion, which was added in 1.2 as part of https://issues.apache.org/jira/browse/HIVE-9482. Finally, I'll note that both Hive and Impala try to be smart about when to apply the conversion on read, because the engines handle it differently: Hive detects whether the file was written by a Hive process or by Impala, and Impala does the same. It makes sense to play around with the settings, inserting and reading under different configurations, to ensure you fully understand how it works.
12-14-2016
07:10 PM
When Spark determines it needs to use YARN's localizer, it will always upload the jar to HDFS; it does not check whether the file has changed before uploading. When using the Spark distribution included with CDH, the Spark assembly jar is already present on all nodes and the configuration specifies that the jar is local. When the jar is specified as local, Spark will not upload it and YARN's localizer is not used.
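As a concrete illustration of "specifying it is local", here is a minimal sketch for Spark 1.x. The jar path is an assumption (it depends on the CDH parcel layout), and on a CDH cluster this property is normally set in spark-defaults.conf by Cloudera Manager rather than in application code:

```scala
import org.apache.spark.SparkConf

// The "local:" scheme tells Spark the assembly jar already exists at this
// path on every node, so it is neither uploaded to HDFS nor localized by YARN.
val conf = new SparkConf()
  .setAppName("local-assembly-example")   // placeholder name
  .set("spark.yarn.jar",
       "local:/opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar")
```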
12-13-2016
11:11 AM
It's also possible to establish an SSH tunnel in order to connect to a remote debug session. Take a look at the -L option for ssh: it lets you open a local port and set up the remote host and port within the ssh command. This will work for private IPs as long as you can connect to a publicly reachable server that has access to the private network. Note, though, that there can be extreme latency, and it can still be difficult to debug in setups like this.
12-13-2016
11:01 AM
Hi Ranan, Because this is an older thread that is already marked as solved, let's keep this conversation on the other thread you opened: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Debug-Spark-program-in-Eclipse-Data-in-AWS/m-p/48472#U48472
12-12-2016
07:48 PM
When running Spark on YARN, the Spark archive gets distributed to worker nodes via the ContainerLocalizer (a.k.a. the distributed cache). Spark first uploads the files to HDFS, and worker nodes then download the jar from HDFS when needed. The localizer has checks so it only downloads the jar when it has changed or has been removed from the worker, allowing it to reuse the jar and skip the download if it still exists locally.
11-29-2016
07:43 AM
Hi zhuangmz, It looks like you found the Hive service option within the Spark configuration, and based on your other post that solved your problem. Feel free to mark this thread as solved with the link provided, or add another brief description here.
11-04-2016
06:17 AM
I've seen issues on some hardware where using local[*] doesn't use the number of cores you would expect. The Java method used to determine the number of cores available to the process isn't always consistent. Instead, try specifying the number explicitly, like local[6], and try again.
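For example, a minimal sketch (the app name and the thread count are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[*] asks the JVM how many processors are available; requesting a
// fixed number of threads avoids relying on that lookup.
val conf = new SparkConf()
  .setAppName("explicit-core-count")   // placeholder name
  .setMaster("local[6]")               // 6 is just an example value
val sc = new SparkContext(conf)
```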
11-03-2016
05:57 PM
There are two tasks for this Spark job, and both should be able to run in parallel. It looks like there is an issue when running this locally. How are you launching the job? If you are setting the master on the command line, could there be somewhere in the code that overrides it to a single thread? When launching this on YARN with multiple executors you should see the two tasks run in parallel, but it should be possible locally as well.
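One hypothetical way that can happen: a master hard-coded in the application takes precedence over the --master flag passed to spark-submit, so something like the sketch below would force a single thread no matter what is on the command line (the app name is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local" with no thread count runs tasks on a single worker thread, and a
// master set directly on the SparkConf wins over the spark-submit flag.
val conf = new SparkConf()
  .setAppName("parallelism-check")   // placeholder name
  .setMaster("local")                // forces one task at a time
val sc = new SparkContext(conf)

// Dropping the setMaster call (or using e.g. local[2]) lets the command-line
// --master value control the level of parallelism instead.
```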