Member since 06-18-2015
Posts: 55
Kudos Received: 34
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1314 | 03-04-2016 02:39 AM
 | 1858 | 12-29-2015 09:42 AM
12-17-2015
07:28 AM
1 Kudo
Hi, I am confused about Hue. 1. This link says I have to use the Hue that ships with HDP 2.3.2. 2. When I Google it, I get this link. Which option is correct? I would really appreciate it if somebody could help me with the steps. My cluster is on EC2 running RHEL.
Labels:
- Cloudera Hue
12-17-2015
05:01 AM
Hi @Neeraj Sabharwal @Jeremy Dyer
Processing and inserting data in hive without schema
//Processing and inserting data in hive without schema
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df = hiveContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/cars.csv")
val selectedData = df.select("year", "model")
selectedData.write.format("orc").option("header", "true").save("/tmp/newcars_orc_cust17")
//permission issues as user hive
// org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=hive, access=WRITE, inode="/tmp/newcars_orc_cust17":hdfs:hdfs:drwxr-xr-x
//Updated /tmp/newcars_orc_cust17 directory permissions
hiveContext.sql("create external table newcars_orc_ext_cust17(year string,model string) stored as orc location '/tmp/newcars_orc_cust17'")
hiveContext.sql("show tables").collect().foreach(println)
[cars_orc_ext,false]
[cars_orc_ext1,false]
[cars_orc_exte,false]
[newcars_orc_ext_cust17,false]
[sample_07,false]
[sample_08,false]
hiveContext.sql("select * from newcars_orc_ext_cust17").collect().foreach(println)
Took 1.459321 s
[2012,S]
[1997,E350]
[2015,Volt]
Hive console
hive> show tables ;
OK
cars_orc_ext
cars_orc_ext1
cars_orc_exte
newcars_orc_ext_cust17
sample_07
sample_08
Time taken: 12.185 seconds, Fetched: 6 row(s)
hive> select * from newcars_orc_ext_cust17 ;
OK
2012 S
1997 E350
2015 Volt
Time taken: 48.922 seconds, Fetched: 3 row(s)
Now when I try the same code but define a custom schema, I get the errors below:
Processing and inserting data in hive with custom schema
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val customSchema = StructType( StructField("year", IntegerType, true),StructField("make", StringType, true),StructField("model", StringType, true),StructField("comment", StringType, true),StructField("blank", StringType, true))
scala> val customSchema = StructType( StructField("year", IntegerType, true),StructField("make", StringType, true),StructField("model", StringType, true),StructField("comment", StringType, true),StructField("blank", StringType, true))
<console>:24: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
val customSchema = StructType( StructField("year", IntegerType, true),StructField("make", StringType, true),StructField("model", StringType, true),StructField("comment", StringType, true),StructField("blank", StringType, true))
Any help/pointers appreciated
Thanks
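For reference, the overload error above suggests that StructType expects the fields wrapped in a Seq or Array rather than passed individually. A minimal sketch of that form (assuming Spark 1.4.x and the same column names):
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
// Wrap the fields in an Array (or Seq) so a matching StructType.apply overload is found
val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))
// The schema can then be passed to the reader instead of using inferSchema
val dfWithSchema = hiveContext.read.format("com.databricks.spark.csv").option("header", "true").schema(customSchema).load("/tmp/cars.csv")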
12-17-2015
01:56 AM
@vshukla
I am also facing the same issue. I saved the data in ORC format from a DataFrame and created an external Hive table over it. When I do show tables in the HiveContext in Spark, it shows me the table, but I could not see any table in my Hive warehouse, so querying the Hive external table does not work as expected. When I just create the Hive table (no DataFrame, no data processing) using HiveContext, the table gets created and I am able to query it as well. I am unable to understand this strange behaviour. Am I missing something? For example: hiveContext.sql("CREATE TABLE IF NOT EXISTS TestTable (name STRING, age STRING)") shows me the table in Hive too.
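As a side note, one way to compare the two cases is to look at where each table's data actually lives; a minimal sketch, assuming the TestTable name from the example above (describe formatted reports the table type, MANAGED_TABLE vs EXTERNAL_TABLE, and its HDFS location):
// Print the table type and data location for the table created via HiveContext
hiveContext.sql("describe formatted TestTable").collect().foreach(println)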
12-14-2015
09:31 AM
Hi, I am a newbie to Spark and am using Spark 1.4.1. How can I save the output to Hive as an external table? For instance, I have a CSV file which I am parsing with the spark-csv package, which gives me a DataFrame. Now how do I save this DataFrame as a Hive external table using HiveContext? Would really appreciate your pointers/guidance. Thanks, Divya
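A minimal sketch of one way this is commonly done (read the CSV, write the selected columns out as ORC, then point an external table at that location; the paths and column names are placeholders, and it mirrors the approach in the 12-17 post above):
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
val df = hiveContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/tmp/cars.csv")
// Persist the data as ORC files in HDFS
df.select("year", "model").write.format("orc").save("/tmp/cars_orc")
// Create an external Hive table over the ORC directory
hiveContext.sql("create external table cars_ext(year string, model string) stored as orc location '/tmp/cars_orc'")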
Labels:
- Apache Hive
- Apache Spark
12-14-2015
03:13 AM
@Neeraj Sabharwal
Thanks a lot for the prompt response.
I am using the HDP 2.3.2 VMware version (Link). Is there any workaround to make it work?
12-14-2015
01:55 AM
Is the spark-csv package not supported on HDP 2.3.2?
I am getting the error below when I try to run spark-shell with the spark-csv package.
[hdfs@sandbox root]$ spark-shell --packages com.databricks:spark-csv_2.10:1.1.0 --master yarn-client --driver-memory 512m --executor-memory 512m
Ivy Default Cache set to: /home/hdfs/.ivy2/cache
The jars for the packages stored in: /home/hdfs/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/2.3.2.0-2950/spark/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
:: resolution report :: resolve 332ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: com.databricks#spark-csv_2.10;1.1.0
==== local-m2-cache: tried
file:/home/hdfs/.m2/repository/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.pom
-- artifact com.databricks#spark-csv_2.10;1.1.0!spark-csv_2.10.jar:
file:/home/hdfs/.m2/repository/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.jar
==== local-ivy-cache: tried
/home/hdfs/.ivy2/local/com.databricks/spark-csv_2.10/1.1.0/ivys/ivy.xml
==== central: tried
https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.pom
-- artifact com.databricks#spark-csv_2.10;1.1.0!spark-csv_2.10.jar:
https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.jar
==== spark-packages: tried
http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.pom
-- artifact com.databricks#spark-csv_2.10;1.1.0!spark-csv_2.10.jar:
http://dl.bintray.com/spark-packages/maven/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.databricks#spark-csv_2.10;1.1.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.pom (java.net.ConnectException: Connection refused)
Server access error at url https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.1.0/spark-csv_2.10-1.1.0.jar (java.net.ConnectException: Connection refused)
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.databricks#spark-csv_2.10;1.1.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:995)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:263)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:145)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/12/14 01:49:39 INFO Utils: Shutdown hook called
[hdfs@sandbox root]$
Would really appreciate your help.
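One workaround if the sandbox has no outbound internet access (the Connection refused errors above point to that rather than an HDP limitation): download spark-csv_2.10-1.1.0.jar and its commons-csv dependency on a machine that does have access, copy them to the sandbox, and pass them with --jars instead of --packages. A sketch with placeholder paths:
spark-shell --jars /tmp/spark-csv_2.10-1.1.0.jar,/tmp/commons-csv-1.1.jar --master yarn-client --driver-memory 512m --executor-memory 512m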
Labels:
- Apache Spark
12-11-2015
05:46 AM
Hi,
I am using HDP 2.3.2 with Spark 1.4.1 and trying to insert data into a Hive table using HiveContext. Below is the sample code:
spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m
//Sample code
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sc.textFile("/user/spark/people.txt")
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
//Create hive context
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
//Apply the schema to the RDD of Rows
val df = hiveContext.createDataFrame(rowRDD, schema);
val options = Map("path" -> "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/personhivetable")
df.write.format("org.apache.spark.sql.hive.orc.DefaultSource").options(options).saveAsTable("personhivetable")
Getting the error below:
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$writeRows$1(commands.scala:191)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$anonfun$insert$1.apply(commands.scala:160)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$anonfun$insert$1.apply(commands.scala:160)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at $line30.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC$anonfun$2.apply(<console>:29)
at $line30.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC$anonfun$2.apply(<console>:29)
at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$writeRows$1(commands.scala:182)
... 8 more
Is it a configuration issue? When I googled it, I found that an environment variable named HIVE_CONF_DIR should be set in spark-env.sh. I then checked spark-env.sh in HDP 2.3.2 and could not find the HIVE_CONF_DIR environment variable. Do I need to add the above-mentioned variable to insert Spark output data into Hive tables? Would really appreciate pointers.
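One thing worth ruling out, independent of any HIVE_CONF_DIR setting: the ArrayIndexOutOfBoundsException: 1 in the stack trace is typically thrown when a line in people.txt has no comma, so p(1) does not exist. A defensive sketch of the row mapping, assuming each valid line has two comma-separated fields:
// Skip malformed lines before building Rows
val rowRDD = people.map(_.split(",")).filter(_.length >= 2).map(p => Row(p(0), p(1).trim))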
Thanks,
Divya
Labels:
- Apache Hive
- Apache Spark
12-10-2015
04:32 AM
1 Kudo
Hi @AliBajwa, I tried the same steps mentioned in BUG-46851 with the VMware sandbox HDP 2.3.2. Voila, I am able to view the Zeppelin page. Thanks a lot for all your help. Still trying to figure out what is wrong with the VirtualBox HDP 2.3.2 sandbox.
12-10-2015
03:35 AM
1 Kudo
Hi @Ali Bajwa, I am using the HDP 2.3.2 sandbox for VirtualBox. I tried the option you mentioned and edited my Windows hosts file with the entry 127.0.0.1 sandbox.hortonworks.com. When I try accessing the Zeppelin page now, I get "Unable to resolve the server's DNS address." Screenshot for your reference. @Neeraj Sabharwal: I am less familiar with port forwarding. Can you please elaborate on the exact steps I need to follow for Zeppelin to work? Thanks in advance, Divya
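For what it is worth, a NAT port-forwarding rule can be added from the host command line roughly like this; the VM name and the Zeppelin port 9995 are assumptions, so adjust them to match your sandbox, and the VM should be powered off when running modifyvm:
VBoxManage modifyvm "Hortonworks Sandbox with HDP 2.3.2" --natpf1 "zeppelin,tcp,,9995,,9995"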
12-10-2015
01:45 AM
Hi, I have installed the HDP 2.3.2 sandbox for VirtualBox, and when I try to access Zeppelin through Ambari I get the screen below. Am I missing any configuration? Would really appreciate your help. Thanks, Divya
Labels:
- Apache Spark
- Apache Zeppelin