Support Questions
Find answers, ask questions, and share your expertise


Can SparkSql Write a Flattened JSON Table to a File?

Rising Star

I recently posted an article that reads in JSON and uses Spark to flatten it into a queryable table: https://community.hortonworks.com/articles/43701/json-to-sql-using-spark.html

using this command:

val jsonEvents  = sqlc.read.json(events)

I now want to write the flattened data to a file. Is this possible?

For example, given the input {"account":{"name":"123","type":"retail"}}, the read.json method creates the columns account.name and account.type. How can I get Spark to write that flattened data back out to a file? Desired output: "123,retail".

So far I have tried these options:

jsonEvents.rdd.saveAsTextFile("/events/one")
jsonEvents.write.json("/events/one/example.json")

but each gave a different result, and neither is the flat comma-separated output I want:

[[null,Columbus,Ohio,21000],future,forward,null,null,40.00,456,sell]

{"account":{"city":"Columbus","state":"Ohio","zip":"21000"},"assetClass":"future","contractType":"forward","strikePrice":"40.00","tradeId":"456","transType":"sell"}

The schema produced by read.json is:

root
 |-- account: struct (nullable = true)
 |    |-- accountType: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- zip: string (nullable = true)
 |-- assetClass: string (nullable = true)
 |-- contractType: string (nullable = true)
 |-- price: string (nullable = true)
 |-- stockAttributes: struct (nullable = true)
 |    |-- 52weekHi: string (nullable = true)
 |    |-- 5avg: string (nullable = true)
 |-- strikePrice: string (nullable = true)
 |-- tradeId: string (nullable = true)
 |-- transType: string (nullable = true)

Re: Can SparkSql Write a Flattened JSON Table to a File?

@Kirk Haslbeck

At first glance, it appears what you want is a flat file of the values (not the keys/columns) stored in the events DataFrame. Perhaps not the most direct approach, but consider writing the DataFrame to a Hive table: register it as a temporary table with registerTempTable(), then use a CTAS statement to store the values in a Hive-managed table, which also records the metadata (i.e. column names) in the Hive metastore. For instance:

events.registerTempTable("staging")
sqlContext.sql("CREATE TABLE events STORED AS ORC AS SELECT * FROM staging")

On the other hand, if you want to use the DataFrame API to do this, you might try this:

events.write.saveAsTable(tableName)

Re: Can SparkSql Write a Flattened JSON Table to a File?

Rising Star

@Paul Hargis I tried this, but the Hive table did not end up with flattened columns like the Spark SQL table. See the image below.

[screenshot: Hive table still showing the nested, unflattened columns]

Re: Can SparkSql Write a Flattened JSON Table to a File?

New Contributor

If I'm understanding you correctly, your desired format is a CSV file? There are libraries that handle this if you need things like quote and escape characters (e.g. spark-csv). For simpler cases, you can also map over your events and transform each one into a string of the values you want, separated by commas, as sketched below.
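
A minimal sketch of that map-based approach, assuming a Spark 1.x shell with sqlContext.implicits._ in scope and a hypothetical subset of the columns:

import sqlContext.implicits._

// Select the (flattened) columns of interest, then join each row's
// values with commas; Row.mkString does the joining.
val csvLines = jsonEvents
  .select($"tradeId", $"assetClass", $"account.city")
  .rdd
  .map(row => row.mkString(","))

// Writes one comma-separated line per event, e.g. "456,future,Columbus"
csvLines.saveAsTextFile("/events/one/csv")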

Re: Can SparkSql Write a Flattened JSON Table to a File?

@Kirk Haslbeck

Here's a better answer, using the built-in SQL functions concat() and lit() to create a single String value holding the contents of the selected Row values. For simplicity, I only included two columns here, tradeId and assetClass:

%spark
import org.apache.spark.sql.functions.{concat, lit}

val jsonStrings = jsonEvents.select(concat($"tradeId", lit(","), $"assetClass").alias("concat"))
jsonStrings.show()

Results in the following:

import org.apache.spark.sql.functions.{concat, lit}
jsonStrings: org.apache.spark.sql.DataFrame = [concat: string]
+----------+
|    concat|
+----------+
| 123,stock|
|456,future|
|789,option|
+----------+
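
To actually land this on disk, as the original question asks, one option (assuming Spark 1.6+, where DataFrameWriter.text() exists; it requires a single string column, which jsonStrings has):

// Writes one "tradeId,assetClass" line per row
jsonStrings.write.text("/events/one/flat")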

Re: Can SparkSql Write a Flattened JSON Table to a File?

Rising Star

@Paul Hargis The issue with this approach is that you need to type out all the column names ($"tradeId", etc.), which negates the advantage of using Spark's JSON reader. The reader builds the schema dynamically; hard-coding the column names here throws that away.

Re: Can SparkSql Write a Flattened JSON Table to a File?

Rising Star

So far this is the best answer I can find, but it doesn't fully satisfy my requirement.

jsonEvents.select($"tradeId", $"assetClass", $"transType", $"price", $"strikePrice", $"contractType", $"stockAttributes.*", $"account.*").printSchema

The schema is now flat with minimal coding, but it still required statically typing the field names. (A sketch of deriving the select list from the schema instead follows the output below.)

root
 |-- tradeId: string (nullable = true)
 |-- assetClass: string (nullable = true)
 |-- transType: string (nullable = true)
 |-- price: string (nullable = true)
 |-- strikePrice: string (nullable = true)
 |-- contractType: string (nullable = true)
 |-- 52weekHi: string (nullable = true)
 |-- 5avg: string (nullable = true)
 |-- accountType: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: string (nullable = true)
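
For what it's worth, here is a minimal sketch of deriving that select list from the DataFrame's schema instead of typing it by hand. flattenSchema is a hypothetical helper, not part of Spark's API:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Recursively collect fully-qualified column names (e.g. "account.city"),
// descending into struct fields so nested columns are pulled to the top.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
  schema.fields.flatMap { field =>
    val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case st: StructType => flattenSchema(st, name)
      case _              => Seq(col(name))
    }
  }

// Should produce the same flat schema as the hand-typed select above.
val flatEvents = jsonEvents.select(flattenSchema(jsonEvents.schema): _*)
flatEvents.printSchema()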