How to get datatype for specific field name from schema attribute of pyspark dataframe (from parquet files)?

I have a folder of parquet files that I'm reading into a pyspark session. How can I inspect / parse the individual schema field types and other info (e.g. for the purpose of comparing schemas between dataframes to see exact type differences)?
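
Concretely, the comparison I'm after would look something like this sketch (df1 and df2 here are two hypothetical dataframes, each read the way shown below):

types1 = {f.name: f.dataType for f in df1.schema.fields}
types2 = {f.name: f.dataType for f in df2.schema.fields}
# fields whose types differ between the two (None = missing from df2)
diffs = {name: (dtype, types2.get(name)) for name, dtype in types1.items() if types2.get(name) != dtype}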

I can see the parquet schema and specific field names with something like...

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("data_debugging").getOrCreate()
df = sparkSession.read.parquet("hdfs://hw.co.local:8020/path/to/parquets")

df.schema  # or df.printSchema()
df.schema.fieldNames()

So I can see the schema:

StructType(List(StructField(SOME_FIELD_001,StringType,true),StructField(SOME_FIELD_002,StringType,true),StructField(SOME_FIELD_003,StringType,true)))
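
I can also loop over df.schema.fields (a list of StructField objects) and dump each field's name, type, and nullability:

for field in df.schema.fields:
    print(field.name, field.dataType, field.nullable)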

but I'm not sure how to get the type or other values for a specific field by name, e.g. something like...

df.schema.getType("SOME_FIELD_001")
# or df.schema.getData("SOME_FIELD_001")  # type: dict

Does anyone know how to do something like this?
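
In case it helps frame an answer: it looks like the schema might support lookup by field name, so perhaps something like the sketch below works, though I haven't verified this against my Spark version:

df.schema["SOME_FIELD_001"].dataType     # the DataType object, e.g. StringType
df.schema["SOME_FIELD_001"].jsonValue()  # dict form of the StructField, if jsonValue() is available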
