Support Questions
Find answers, ask questions, and share your expertise

Create a table from pyspark code on top of parquet file


New Contributor

I am writing data out in Parquet format using peopleDF.write.parquet("people.parquet") in PySpark code. I can see that _common_metadata, _metadata, and a .gz.parquet file are generated. Now, from the same code, I want to create a Hive table on top of this Parquet file so that I can query it later. How can I do that?

1 REPLY

Re: Create a table from pyspark code on top of parquet file

# Read the CSV with a header row, letting Spark infer column types
iris = spark.read.csv("/tmp/iris.csv", header=True, inferSchema=True)
iris.printSchema()

Result:

root 
|-- sepalLength: double (nullable = true)
|-- sepalWidth: double (nullable = true)
|-- petalLength: double (nullable = true)
|-- petalWidth: double (nullable = true)
|-- species: string (nullable = true)

Write the Parquet file ...

iris.write.parquet("/tmp/iris.parquet")

... and create a Hive table on top of it:

spark.sql("""
CREATE EXTERNAL TABLE iris_p (
    sepalLength double,
    sepalWidth double,
    petalLength double,
    petalWidth double,
    species string
)
STORED AS PARQUET
LOCATION "/tmp/iris.parquet"
""")