09-24-2018 03:00 PM (1 Kudo)
I was able to resolve the problem by manually editing my Python script to incorporate the settings from /etc/hive/conf/hive-site.xml. What worked is the following code:

from pyspark import SparkConf
from pyspark.sql import SparkSession

settings = [
("hive.exec.dynamic.partition", "true"),
("hive.exec.dynamic.partition.mode", "nonstrict"),
("spark.sql.orc.filterPushdown", "true"),
("hive.msck.path.validation", "ignore"),
("spark.sql.caseSensitive", "true"),
("spark.speculation", "false"),
("hive.metastore.authorization.storage.checks", "false"),
("hive.metastore.cache.pinobjtypes", "Table,Database,Type,FieldSchema,Order"),
("hive.metastore.client.connect.retry.delay", "5s"),
("hive.metastore.client.socket.timeout", "1800s"),
("hive.metastore.connect.retries", "12"),
("hive.metastore.execute.setugi", "false"),
("hive.metastore.failure.retries", "12"),
("hive.metastore.kerberos.keytab.file", "/etc/security/keytabs/hive.service.keytab"),
("hive.metastore.kerberos.principal", "hive/_HOST@EXAMPLE.COM"),
("hive.metastore.pre.event.listeners", "org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener"),
("hive.metastore.sasl.enabled", "false"),
("hive.metastore.schema.verification", "false"),
("hive.metastore.schema.verification.record.version", "false"),
("hive.metastore.server.max.threads", "100000"),
("hive.metastore.uris", "thrift://sandbox-hdp.hortonworks.com:9083"),
("hive.metastore.warehouse.dir", "thrift://sandbox-hdp.hortonworks.com:9083"),
("hive.metastore.authorization.storage.checks", "/apps/hive/warehouse")
]
spark_conf = SparkConf().setAppName("sampleApp").setAll(settings)
self.spark = SparkSession.builder. \
config(conf = spark_conf). \
enableHiveSupport(). \
    getOrCreate()

What did *not* work was copying /etc/hive/conf/hive-site.xml to /etc/spark2/conf/hive-site.xml. I don't know why that approach failed, but it did. I hope this resolution helps someone else! Chris Falter
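P.S. If anyone wants to avoid hand-copying the values, the same settings list can probably be built by parsing hive-site.xml directly. This is only an untested sketch (the helper name hive_site_settings is made up here); it assumes the standard <configuration><property><name>/<value> layout of the file:

from pyspark import SparkConf
import xml.etree.ElementTree as ET

def hive_site_settings(path="/etc/hive/conf/hive-site.xml"):
    # Collect every <property> element as a (name, value) pair.
    settings = []
    for prop in ET.parse(path).getroot().findall("property"):
        name = prop.findtext("name")
        if name:
            settings.append((name, prop.findtext("value", default="")))
    return settings

spark_conf = SparkConf().setAppName("sampleApp").setAll(hive_site_settings())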
09-19-2018 03:45 PM
Hi Sandeep, I appreciate your response. I followed both of your suggestions.

# cp /etc/hive/conf/hive-site.xml /etc/spark/conf/

This did not result in a change in behavior.

# spark-submit --py-files ./spark.zip --files /etc/spark/conf/hive-site.xml sample.py

No change there, either. I also tried to create sampletable in the default database instead. Same result, different error message:

'UnresolvedRelation `default`.`sampletable`

I suspect that the job being submitted is looking at a different Hive metastore than the one I am viewing in Ambari > Hive 2.0 View, even though I am seeing output that would indicate, IMO, that the Spark job is using the correct Hive metastore:

SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('/apps/hive/warehouse').
2018-09-19 15:29:59 INFO SharedState:54 - Warehouse path is '/apps/hive/warehouse'
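To confirm which metastore and warehouse the submitted job is actually talking to, I could add a quick check to the script. This is just a sketch (it assumes the self.spark session from my initialization code):

# Diagnostic sketch: print the catalog implementation and warehouse dir the
# running session resolved, and list the databases it can actually see.
print(self.spark.conf.get("spark.sql.catalogImplementation", "unknown"))
print(self.spark.conf.get("spark.sql.warehouse.dir", "unknown"))
self.spark.sql("SHOW DATABASES").show()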
I am using the default master, which is 'local[*]'. Any other ideas?
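One thing I have not tried yet is passing the metastore URI and warehouse directory directly on the spark-submit command line through Spark's spark.hadoop.* passthrough. An untested sketch, with values taken from /etc/hive/conf/hive-site.xml:

# spark-submit --py-files ./spark.zip \
    --conf spark.hadoop.hive.metastore.uris=thrift://sandbox-hdp.hortonworks.com:9083 \
    --conf spark.sql.warehouse.dir=/apps/hive/warehouse \
    sample.py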
09-19-2018 10:36 AM
I am using the HDP 2.6.5 Sandbox on Docker/Windows 10. Using the Hive 2 view, I created a database and table. Let's call them "sampledb" with a table called "sampletable" in HDFS location /sampledb/sampletable/. I stored data in the form of an ORC file in the appropriate directory, and invoked `msck repair table sampledb.sampletable`. From Ambari Hive View and Hive View 2.0 I am able to successfully read data from sampletable.

In Zeppelin I wrote a Pyspark script that uses the Spark SQL interface to read data from sampletable. It works, no problem. Note, this is Spark 2, not Spark 1.6.

We do not want to run Zeppelin scripts in production, so I converted the script to standard Python for use with spark-submit. The Spark initialization code is below:

from pyspark import SparkConf
from pyspark.sql import SparkSession

settings = [
("hive.exec.dynamic.partition", "true"),
("hive.exec.dynamic.partition.mode", "nonstrict"),
("spark.sql.orc.filterPushdown", "true"),
("hive.msck.path.validation", "ignore"),
("spark.sql.caseSensitive", "true"),
("spark.speculation", "false"),
]
spark_conf = SparkConf().setAppName("sampleApp").setAll(settings)
self.spark = SparkSession.builder. \
enableHiveSupport(). \
config(conf = spark_conf). \
getOrCreate()
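The read itself is a plain Spark SQL call; the exact query is not important, but it is along these lines (a sketch only, assuming the self.spark session above):

# Sketch of the table read; the real script's query is not shown here.
result_df = self.spark.sql("SELECT * FROM sampledb.sampletable")
result_df.show()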
Python dependencies are zipped into spark.zip and the script is called sample.py. After copying the script to my /home/ directory, I attempt to run it as follows:

# spark-submit --py-files ./spark.zip sample.py

The relevant error output, so far as I can tell, is the following console output:

SharedState:54 - Warehouse path is 'file:/home/raj_ops/spark-warehouse'.
...
HiveClientImpl:54 - Warehouse location for Hive client (version 1.2.2) is file:/home/raj_ops/spark-warehouse
...
Failed to get database sampledb, returning NoSuchObjectException
u"Table or view not found: `sampledb`.`sampletable`
The only thing I had to show for my effort was a new /spark-warehouse sub-directory within /home/raj_ops/. So I tried adding the following line to the Python script:

("spark.sql.warehouse.dir", "/apps/hive/warehouse")

So now the full configuration is as follows:

settings = [
("hive.exec.dynamic.partition", "true"),
("hive.exec.dynamic.partition.mode", "nonstrict"),
("spark.sql.orc.filterPushdown", "true"),
("hive.msck.path.validation", "ignore"),
("spark.sql.caseSensitive", "true"),
("spark.speculation", "false"),
("spark.sql.warehouse.dir","/apps/hive/warehouse")
]
spark_conf = SparkConf().setAppName("sampleApp").setAll(settings)
The invocation of spark-submit was the same. The outcome was similarly unsuccessful. What else do I need to do to get the queries from Pyspark scripts to access the Hive database and tables successfully? Thanks in advance! Chris Falter
Labels:
- Apache Hive
- Apache Spark