Member since: 09-24-2015
Posts: 105
Kudos Received: 82
Solutions: 9

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2120 | 04-11-2016 08:30 PM |
 | 1749 | 03-11-2016 04:08 PM |
 | 1749 | 12-21-2015 09:51 PM |
 | 1021 | 12-18-2015 10:43 PM |
 | 8632 | 12-08-2015 03:01 PM |
12-18-2015
10:43 PM
Hi @Srinivasarao Daruna, HDP does not support Spark in Standalone mode; you need to use Spark on YARN. When running Spark in YARN cluster mode you can specify the number of executors with the parameter: --num-executors 6 This will give you 6 executors. For additional information on running in YARN cluster mode, please see http://spark.apache.org/docs/latest/running-on-yar... Cheers, Andrew
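For reference, a spark-submit invocation along those lines might look like the sketch below (the class name, JAR, and resource sizes are placeholders, not from the original thread):

```bash
# Illustrative only: submit an application to YARN in cluster mode with 6 executors
# (Spark 1.x era syntax; class, JAR, and resource settings are placeholders)
spark-submit \
  --master yarn-cluster \
  --num-executors 6 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class com.example.MyApp \
  my-app.jar
```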
12-14-2015
03:08 PM
3 Kudos
What is the Hive column count limit? When should you consider moving a table into HBase/Phoenix due to performance issues? Is 100,000 columns too many for Hive to handle? Thanks,
Labels:
- Apache HBase
- Apache Hive
- Apache Phoenix
12-08-2015
05:31 PM
@Laurence Da Luz Are you talking about Spark as a whole (e.g. Spark Core) or Spark SQL? Either way, @Guilherme Braccialli's links seem to hit all the key topics.
12-08-2015
03:04 PM
@Neeraj Sabharwal see my comment below. If you want to reproduce it: create an external table that references a directory higher than the directory containing the data, don't specify partitions, and try querying it. CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory'; Then put data in /location/to/parentdirectory/2015/01 and try to query.
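A minimal repro sketch of those steps (the sample file name is illustrative; it assumes an HDFS client and the Hive CLI are available):

```bash
# Put data two levels below the table's location, without declaring partitions
hdfs dfs -mkdir -p /location/to/parentdirectory/2015/01
hdfs dfs -put data.txt /location/to/parentdirectory/2015/01/

# External table defined at the parent directory, with no PARTITIONED BY clause
hive -e "CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';"

# Per the thread: this query works from Hive, but the same table queried
# through Spark SQL fails with "java.io.IOException: Not a file: ..."
hive -e "SELECT * FROM TEST1;"
```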
12-08-2015
03:01 PM
2 Kudos
Okay, we figured it out. I was talking about creating an external table in Hive and then using Spark SQL to query it. The external table had sub-directories (e.g. ParentDirectory/2015/01/data.txt) that Hive was easily able to traverse and query; however, Spark SQL (and Presto) weren't able to, and Spark SQL would give the error mentioned above. It wasn't until we properly defined the sub-directories as partitions in Hive (e.g. ParentDirectory/year=2015/month=01) and added them to the metastore (ALTER TABLE ... ADD PARTITION) that Spark SQL (and Presto) were finally able to query the table without issues.
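For illustration, the partition-based layout and DDL described above might look like the sketch below (table and column names are placeholders; only the year=/month= directory convention and the ALTER TABLE ... ADD PARTITION step come from the post):

```bash
# Data laid out as ParentDirectory/year=2015/month=01/data.txt
hive -e "
CREATE EXTERNAL TABLE parent_table (col1 STRING)
PARTITIONED BY (year STRING, month STRING)
LOCATION '/ParentDirectory';

-- Register each sub-directory as a partition in the metastore
ALTER TABLE parent_table ADD PARTITION (year='2015', month='01');
"
```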
12-04-2015
07:02 PM
1 Kudo
@Neeraj Sabharwal I don't see why not. Check out this SequenceIQ blog showing how to use GCP and AWS as the Hive metastore warehouse directory: http://blog.sequenceiq.com/blog/2014/11/17/datalake-cloudbreak-2/
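As a rough illustration (not from the linked blog), the location Hive uses for its warehouse is governed by the hive.metastore.warehouse.dir property, so pointing it at cloud storage could look like the sketch below; the bucket path is a placeholder, and it assumes the object-store connector and credentials are already configured on the cluster:

```bash
# Override the warehouse directory for one session; bucket and path are placeholders
hive --hiveconf hive.metastore.warehouse.dir=s3a://my-bucket/hive/warehouse \
     -e "CREATE TABLE cloud_test (c1 STRING);"
```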
12-04-2015
06:45 PM
Hi, I am currently trying to query, via Spark SQL, an external Hive table that points to a directory. When I attempt to do a SELECT * FROM TABLE, I get the following error:
15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68
15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
Labels:
- Apache Hive
- Apache Spark
12-02-2015
08:28 PM
Hi, I am trying to do some Kafka performance testing and want to pass in JMX parameters. What is the best way to do that: in the .sh file or via Ambari?
Labels:
- Apache Kafka
12-01-2015
03:22 PM
Thanks @bsaini. Do you know how HDFS rebalancing would work during an OS upgrade/reboot? When would HDFS start trying to rebalance the data residing on the DataNode being bounced?
11-30-2015
08:32 PM
3 Kudos
Are there any best practices/documentation around patching or upgrading an OS (e.g. upgrading CentOS 6 --> 7, or security patching) while the cluster is running? Thanks,
Labels: