Member since: 09-25-2015
Posts: 230
Kudos Received: 276
Solutions: 39
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 25247 | 07-05-2016 01:19 PM
 | 8581 | 04-01-2016 02:16 PM
 | 2175 | 02-17-2016 11:54 AM
 | 5737 | 02-17-2016 11:50 AM
 | 12781 | 02-16-2016 02:08 AM
12-11-2015
01:25 PM
2 Kudos
@Divya Gehlot If you want the table to be accessible from Hive as well, you cannot use saveAsTable; a table created with saveAsTable can be read only by Spark SQL. There are two ways to create ORC tables from Spark that are compatible with Hive. I tested the code below with the HDP 2.3.2 sandbox and Spark 1.4.1.

1- Save an ORC file from Spark and create an external table directly in Hive:

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val people = sc.textFile("/tmp/people.txt")
val schemaString = "name age"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val df = sqlContext.createDataFrame(rowRDD, schema);
sqlContext.sql("drop table if exists personhivetable")
sqlContext.sql("create external table personhivetable (name string, age string) stored as orc location '/tmp/personhivetable/'")
df.write.format("orc").mode("overwrite").save("/tmp/personhivetable")
sqlContext.sql("show tables").collect().foreach(println);
sqlContext.sql("select * from personhivetable").collect().foreach(println);
2- Register your DataFrame as a temporary table and perform a CREATE TABLE AS SELECT (this requires sqlContext to be a HiveContext, which is the default in the HDP spark-shell):

spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val people = sc.textFile("/tmp/people.txt")
val schemaString = "name age"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val df = sqlContext.createDataFrame(rowRDD, schema);
df.registerTempTable("personhivetable_tmp")
sqlContext.sql("drop table if exists personhivetable2")
sqlContext.sql("CREATE TABLE personhivetable2 STORED AS ORC AS SELECT * from personhivetable_tmp")
sqlContext.sql("show tables").collect().foreach(println);
sqlContext.sql("select * from personhivetable2").collect().foreach(println);
Also, check this question for more discussion about ORC + Spark: https://community.hortonworks.com/questions/4292/how-do-i-create-an-orc-hive-table-from-spark.html
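For contrast, here is the saveAsTable call I advised against at the top; a minimal sketch with a hypothetical table name (in Spark 1.4, saveAsTable records Spark SQL-specific metadata in the metastore, so Hive cannot query the result):

// creates a table readable only by Spark SQL, not by Hive
df.write.format("orc").mode("overwrite").saveAsTable("person_spark_only")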
12-11-2015
03:28 AM
1 Kudo
Yes, it is picked up automatically. You can use the same code to run in yarn-cluster, yarn-client, or standalone mode. If you want, you can define the app name as well:

import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf().setAppName("app-name")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
I copied my code and pom.xml from @Randy Gelhausen's project: https://github.com/randerzander/HiveToPhoenix/blob/master/src/main/scala/com/github/randerzander/HiveToPhoenix.scala
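If you want to test the same code in local mode, you can set the master explicitly instead; a minimal sketch (variable names are hypothetical, and a master set in code takes precedence over the one passed to spark-submit):

// explicit master, handy for local testing only
val localConf = new SparkConf().setAppName("app-name").setMaster("local[*]")
val localSc = new SparkContext(localConf)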
12-11-2015
03:15 AM
1 Kudo
@Jun Chen I see... I know Tez has a new way to determine the number of mapper tasks, described in the link below; I'm not sure about the number of reducers. Usually we set a high default number of reducers (in Ambari) and rely on the auto reducer parallelism parameter, which works well. https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
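If you want to experiment with the mapper-side grouping described in that link, the knobs are the Tez grouping sizes; a hedged sketch for a Hive session (the values below are illustrative examples, not a recommendation, tune them for your cluster):

set tez.grouping.min-size=16777216;    -- example: ~16 MB lower bound per grouped split
set tez.grouping.max-size=1073741824;  -- example: ~1 GB upper bound per grouped split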
12-11-2015
02:43 AM
@Andrea D'Orio Thanks for sharing, we really do need information like this. Keep sharing 🙂
12-11-2015
02:39 AM
1 Kudo
@Luis Antonio Torres I'm glad it worked! It took me a while to make it work too. You can also check this Scala code I created, which takes the Hive commands from the command line instead of hard-coding them: https://github.com/gbraccialli/SparkUtils/blob/master/src/main/scala/com/github/gbraccialli/spark/HiveCommand.scala
12-11-2015
02:24 AM
1 Kudo
@Luis Antonio Torres Check your command: you are using /etc/hive/conf/hive-site.xml instead of /usr/hdp/current/spark-client/conf/hive-site.xml. I think this is the issue.
12-11-2015
02:20 AM
1 Kudo
@Jun Chen Check if you have the parameter below turned on: hive.tez.auto.reducer.parallelism. When it is on, Tez automatically decreases the number of reducer tasks based on the output of the map stage. You can disable it if you need to.
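To toggle it just for a session, a minimal sketch in the Hive shell:

-- let Tez shrink the reducer count at runtime
set hive.tez.auto.reducer.parallelism=true;
-- or disable it for this session only:
set hive.tez.auto.reducer.parallelism=false;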
12-11-2015
02:12 AM
@Luis Antonio Torres
Please do not use the hive-site.xml from Hive. You need a clean hive-site.xml for Spark; it should contain only this:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
</configuration>
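Once you restart spark-shell with that minimal hive-site.xml in place, a quick way to confirm it is reaching the metastore (same pattern as the earlier examples):

// should print the tables registered in the Hive metastore
sqlContext.sql("show tables").collect().foreach(println)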