Support Questions
Find answers, ask questions, and share your expertise

Create a table from pyspark code on top of parquet file


New Contributor

I am writing data out in Parquet format using peopleDF.write.parquet("people.parquet") in PySpark code. I can see that _common_metadata, _metadata, and a .gz.parquet file are generated. Now, from the same code, I want to create a Hive table on top of this Parquet file so that I can query it later. How can I do that?

1 REPLY

Re: Create a table from pyspark code on top of parquet file

# Read the CSV with a header row, letting Spark infer column types
iris = spark.read.csv("/tmp/iris.csv", header=True, inferSchema=True)
iris.printSchema()

Result:

root 
|-- sepalLength: double (nullable = true)
|-- sepalWidth: double (nullable = true)
|-- petalLength: double (nullable = true)
|-- petalWidth: double (nullable = true)
|-- species: string (nullable = true)

Write the Parquet file ...

iris.write.parquet("/tmp/iris.parquet")

... and create a Hive table on top of it:

spark.sql("""
CREATE EXTERNAL TABLE iris_p (
    sepalLength double,
    sepalWidth double,
    petalLength double,
    petalWidth double,
    species string
)
STORED AS PARQUET
LOCATION "/tmp/iris.parquet"
""")