StructType schema spark on JSON

Master Collaborator

Hi.

How can I create a schema with two levels of nesting for a JSON file in Spark?

>>> df1.schema
StructType(List(StructField(CAMPO1,StringType,true),StructField(CAMPO2,StringType,true),StructType(List(StructField(VARIABLE,StringType,true),StructField(V1,StringType,true))),true))

This code doesn't work:

schema = StructType([
    StructField("CAMPO1", StringType(), True),
    StructField("CAMPO2", StringType(), True),
    StructField("VARIABLE.V1", StringType(), True)
])

The JSON I have is:

{"CAMPO1":"xxxx","CAMPO2":"xxx","VARIABLE":{"V1":"xxx"}}

Could you please help me?

Many thanks

3 REPLIES


@Roberto Sancho

Your schema structure is close, but it needs a few modifications, like this:

import org.apache.spark.sql.types._ 

val data = sc.parallelize("""{"CAMPO1":"xxxx","CAMPO2":"xxx","VARIABLE":{"V1":"xxx"}}""" :: Nil)

val schema = (new StructType)
    .add("CAMPO1", StringType)
    .add("CAMPO2", StringType)
    .add("VARIABLE", (new StructType)
        .add("V1", StringType))

sqlContext.read.schema(schema).json(data).select("VARIABLE.V1").show()

Please let me know if this works for you. Thanks!
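
For Python readers, the shape of the record can be checked with the standard library alone. This is only a sketch using the sample line from the question; it does not involve Spark, and the variable names are illustrative. It shows that the record has no top-level key named "VARIABLE.V1", so a flat field with that name has nothing to match, while the nested path VARIABLE -> V1 does exist.

```python
import json

# Sample record copied from the question.
line = '{"CAMPO1":"xxxx","CAMPO2":"xxx","VARIABLE":{"V1":"xxx"}}'
record = json.loads(line)

# The top level has three keys; "VARIABLE" holds a nested object.
top_level_keys = sorted(record.keys())

# A dotted name is not a key of the record, so a flat field called
# "VARIABLE.V1" has nothing to match against.
has_flat_key = "VARIABLE.V1" in record

# The value is reachable only by descending one level.
nested_value = record["VARIABLE"]["V1"]

print(top_level_keys)
print(has_flat_key)
print(nested_value)
```

The schema mirrors this shape: one field named VARIABLE whose type is itself a struct containing V1.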

Master Collaborator

I am in a Python environment. I translated the Scala code to Python as shown below, but it doesn't work. Any suggestions, please?

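from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Nested schema: VARIABLE is a struct with its own fields.
schema = StructType([
    StructField("CAMPO1", StringType(), True),
    StructField("CAMPO2", StringType(), True),
    StructField("VARIABLE", StructType([
        StructField("V1", StringType(), True),
        StructField("V2", DoubleType(), True),
        StructField("V3", StringType(), True)]))
])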

df1 = sqlContext.read.json("xxxx.json",schema).select('VARIABLE.V2').show()
+-----------------+
|V2               |
+-----------------+
|             null|
+-----------------+

Master Collaborator

Hi:

I have resolved the problem, but I think there is a bug or something. Let me explain:

With V1 = 11.88, declaring the field as DoubleType or DecimalType doesn't work, but declaring it as StringType does. Could you please confirm that my test is correct?

{"CAMPO1":"xxxx","CAMPO2":"xxx","VARIABLE":{"V1":"11.88"}}

schema = StructType([
    StructField("CAMPO1", StringType(), True),
    StructField("CAMPO2", StringType(), True),
    StructField("VARIABLE", StructType([
        StructField("V1", StringType(), True),
        StructField("V2", StringType(), True),
        StructField("V3", StringType(), True)]))
])

thanks
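
A likely explanation, offered as a sketch rather than a confirmed diagnosis: in the sample record above, the value of V1 is written in quotes, so it is a JSON string and not a JSON number. A schema that declares the field as a numeric type then does not match the kind of value in the file, which may be why the numeric types returned null while StringType worked. The standard-library check below (no Spark involved; the unquoted variant and the variable names are hypothetical) shows the difference between the two forms.

```python
import json

# Record as posted: V1 is quoted, so it is a JSON string.
quoted = json.loads('{"CAMPO1":"xxxx","CAMPO2":"xxx","VARIABLE":{"V1":"11.88"}}')

# Hypothetical variant with V1 unquoted: now it is a JSON number.
unquoted = json.loads('{"CAMPO1":"xxxx","CAMPO2":"xxx","VARIABLE":{"V1":11.88}}')

quoted_type = type(quoted["VARIABLE"]["V1"]).__name__
unquoted_type = type(unquoted["VARIABLE"]["V1"]).__name__

# A string value can still be converted to a number explicitly after parsing.
converted = float(quoted["VARIABLE"]["V1"])

print(quoted_type, unquoted_type, converted)
```

In Spark terms, one option under this assumption is to keep StringType in the schema and convert afterwards, for example by calling cast("double") on the selected column. Whether that fits depends on how the real file is written.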