Member since: 09-24-2015
Posts: 105
Kudos Received: 82
Solutions: 9

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2120 | 04-11-2016 08:30 PM |
 | 1749 | 03-11-2016 04:08 PM |
 | 1749 | 12-21-2015 09:51 PM |
 | 1021 | 12-18-2015 10:43 PM |
 | 8632 | 12-08-2015 03:01 PM |
12-18-2015
10:43 PM
Hi @Srinivasarao Daruna, HDP does not support Spark in Standalone mode; you need to use Spark on YARN. When running Spark in YARN cluster mode you can specify the number of executors with the parameter: --num-executors 6 This will give you 6 executors. For additional information on running in YARN cluster mode, please see http://spark.apache.org/docs/latest/running-on-yar... Cheers, Andrew
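For reference, a spark-submit invocation along those lines might look like the sketch below (the class name, JAR, and resource sizes are placeholders, not from the original thread):

```bash
# Illustrative only: submit an application to YARN in cluster mode with 6 executors
# (Spark 1.x era syntax; class, JAR, and resource settings are placeholders)
spark-submit \
  --master yarn-cluster \
  --num-executors 6 \
  --executor-memory 2g \
  --executor-cores 2 \
  --class com.example.MyApp \
  my-app.jar
```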
12-14-2015
03:08 PM
3 Kudos
What is the Hive column count limit? When should you consider moving a table into HBase/Phoenix due to performance issues? Is 100,000 columns too many for Hive to handle? Thanks,
Labels:
- Apache HBase
- Apache Hive
- Apache Phoenix
12-08-2015
05:31 PM
@Laurence Da Luz Are you talking about Spark as a whole (e.g. Spark Core) or Spark SQL? Either way, @Guilherme Braccialli's links seem to hit all the key topics.
12-08-2015
03:04 PM
@Neeraj Sabharwal see my comment below. If you want to reproduce it: create an external table that references a directory higher than the directory containing the data, don't specify partitions, and try querying it. CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory'; Then put data in /location/to/parentdirectory/2015/01 and try to query.
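A minimal repro sketch of those steps (the sample file name is illustrative; it assumes an HDFS client and the Hive CLI are available):

```bash
# Put data two levels below the table's location, without declaring partitions
hdfs dfs -mkdir -p /location/to/parentdirectory/2015/01
hdfs dfs -put data.txt /location/to/parentdirectory/2015/01/

# External table defined at the parent directory, with no PARTITIONED BY clause
hive -e "CREATE EXTERNAL TABLE TEST1 (COL1 STRING) LOCATION '/location/to/parentdirectory';"

# Per the thread: this query works from Hive, but the same table queried
# through Spark SQL fails with "java.io.IOException: Not a file: ..."
hive -e "SELECT * FROM TEST1;"
```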
12-08-2015
03:01 PM
2 Kudos
Okay, we figured it out. I was talking about creating an external table in Hive and then using Spark SQL to query it. The external table had sub-directories (e.g. ParentDirectory/2015/01/data.txt) that Hive was easily able to traverse and query; however, Spark SQL (and Presto) weren't able to, and Spark SQL would give the error mentioned above. It wasn't until we properly defined the sub-directories as partitions in Hive (e.g. ParentDirectory/year=2015/month=01) and added them to the metastore (ALTER TABLE ... ADD PARTITION) that Spark SQL (and Presto) were finally able to query the table without issues.
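For illustration, the partition-based layout and DDL described above might look like the sketch below (table and column names are placeholders; only the year=/month= directory convention and the ALTER TABLE ... ADD PARTITION step come from the post):

```bash
# Data laid out as ParentDirectory/year=2015/month=01/data.txt
hive -e "
CREATE EXTERNAL TABLE parent_table (col1 STRING)
PARTITIONED BY (year STRING, month STRING)
LOCATION '/ParentDirectory';

-- Register each sub-directory as a partition in the metastore
ALTER TABLE parent_table ADD PARTITION (year='2015', month='01');
"
```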
12-04-2015
07:02 PM
1 Kudo
@Neeraj Sabharwal I don't see why not. Check out this SequenceIQ blog showing how to use GCP and AWS as the Hive metastore warehouse directory: http://blog.sequenceiq.com/blog/2014/11/17/datalake-cloudbreak-2/
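As a rough illustration (not from the linked blog), the location Hive uses for its warehouse is governed by the hive.metastore.warehouse.dir property, so pointing it at cloud storage could look like the sketch below; the bucket path is a placeholder, and it assumes the object-store connector and credentials are already configured on the cluster:

```bash
# Override the warehouse directory for one session; bucket and path are placeholders
hive --hiveconf hive.metastore.warehouse.dir=s3a://my-bucket/hive/warehouse \
     -e "CREATE TABLE cloud_test (c1 STRING);"
```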
12-04-2015
06:45 PM
Hi, I am currently trying to query, via Spark SQL, an external Hive table that points to a directory. When I attempt to do a SELECT * FROM TABLE, I get the following error:
15/11/30 15:25:01 INFO DefaultExecutionContext: Created broadcast 3 from broadcast at TableReader.scala:68
15/11/30 15:25:01 INFO FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://clster/data/raw/EDW/PROD/Prod_DB/test/20151124/2014
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:320)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:218)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:218)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
Labels:
- Apache Hive
- Apache Spark
12-02-2015
08:28 PM
Hi, I am trying to do some Kafka performance testing and want to pass in JMX parameters. What is the best way to do that: in the .sh file or via Ambari?
Labels:
- Apache Kafka
12-01-2015
03:22 PM
Thanks @bsaini. Do you know how HDFS rebalancing would work during an OS upgrade/reboot? When would HDFS start trying to rebalance the data residing on the DataNode being bounced?
11-30-2015
08:32 PM
3 Kudos
Are there any best practices/documentation around patching or upgrading an OS (e.g. upgrading CentOS 6 --> 7, or security patching) while the cluster is running? Thanks,
Labels: