
load data into hive partitioned table using pyspark

load data into hive partitioned table using pyspark

I want to load data into a dynamically partitioned Hive table using PySpark. The table is already created in Hive; only the data load has to be done with PySpark.

I am using the code below for this requirement, but would appreciate more suggestions:

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# spark is an existing SparkSession
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

1. The INPATH should be dynamic: how do I pass it into spark.sql?
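One thing worth noting for a partitioned table: a plain LOAD DATA statement needs a static PARTITION (...) clause, so for truly dynamic partitioning a common alternative is to enable dynamic partition mode and use INSERT ... SELECT from a staging table. A minimal sketch (the staging table name `src_staging` and partition column `dt` are hypothetical, not from the thread):

```python
def dynamic_insert_sql(target, staging, partition_col):
    # Build an INSERT ... SELECT with dynamic partitioning; in Hive,
    # the partition column must be the last column of the SELECT list.
    return ("INSERT INTO TABLE {t} PARTITION ({p}) "
            "SELECT * FROM {s}").format(t=target, p=partition_col, s=staging)

# Dynamic partitioning must be enabled on the session first, e.g.:
#   spark.sql("SET hive.exec.dynamic.partition=true")
#   spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
# and then:
#   spark.sql(dynamic_insert_sql("src", "src_staging", "dt"))
print(dynamic_insert_sql("src", "src_staging", "dt"))
```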


Kind Regards

Anurag


Re: load data into hive partitioned table using pyspark

Super Guru

@Anurag Mishra

You can pass the INPATH as an argument to spark-submit, then read it in the script:

import sys
print(sys.argv)  # prints the arguments passed to the script

Once you have the passed arguments, you can substitute those values into spark.sql using the str.format function:

spark.sql("LOAD DATA LOCAL INPATH '{}' INTO TABLE src".format(variable_name))
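Putting the pieces together, a minimal driver script might look like the sketch below. The table name `src` comes from the thread; the argument handling and app name are assumptions:

```python
import sys

def build_load_sql(inpath, table):
    # Build the LOAD DATA statement from a path supplied at runtime
    return "LOAD DATA LOCAL INPATH '{}' INTO TABLE {}".format(inpath, table)

def main():
    # Invoked under spark-submit, e.g. via: if __name__ == "__main__": main()
    from pyspark.sql import SparkSession
    inpath = sys.argv[1]  # first argument after the script name
    spark = (SparkSession.builder
             .appName("Load data into Hive table")
             .enableHiveSupport()
             .getOrCreate())
    spark.sql(build_load_sql(inpath, "src"))
```

It would be submitted as `spark-submit load_script.py /path/to/kv1.txt` (the script name is illustrative).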

Refer to these links for more details on argument parsing in PySpark.
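If the script takes more than one argument, the standard-library argparse module is cleaner than indexing sys.argv directly. A sketch (the option names and defaults below are illustrative):

```python
import argparse

def parse_args(argv=None):
    # Parse the input path and target table from the command line
    parser = argparse.ArgumentParser(description="Load data into a Hive table")
    parser.add_argument("--inpath", required=True, help="local path to load")
    parser.add_argument("--table", default="src", help="target Hive table")
    return parser.parse_args(argv)

args = parse_args(["--inpath", "/tmp/kv1.txt"])
print(args.inpath, args.table)  # -> /tmp/kv1.txt src
```

The parsed values can then be dropped into the LOAD DATA statement with .format as shown above.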

