Uses of Complex Spark SQL Data Types

New Contributor

I read that Spark SQL has three complex data types: ArrayType, MapType, and StructType. When would you use these? I'm confused because I was taught that SQL tables should never, ever contain arrays/lists in a single cell value, so why does Spark SQL allow ArrayType?

3 REPLIES

New Contributor

Hi,

 

Complex types are generally used to aggregate the characteristics of an object, for example:

Based on: https://impala.apache.org/docs/build/html/topics/impala_struct.html#struct
type:

current_address STRUCT <
    street_address: STRUCT <
        street_number: INT,
        street_name: STRING,
        street_type: STRING>,
    country: STRING,
    postal_code: STRING>

Now the 'current_address' attribute and its members are grouped together.
This is not only organizational; it can also affect the performance of queries against this table.

To retrieve the data, reference the nested fields with dot notation:

SELECT id, name,
       current_address.street_address.street_number,
       current_address.street_address.street_name,
       current_address.street_address.street_type,
       current_address.country,
       current_address.postal_code
FROM struct_demo;

 

Although this example comes from the Apache Impala documentation, the same concept applies to Spark.
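
For what it's worth, here is a rough Spark SQL sketch of the same idea. The inline row (the id, the name, and the address values) is made up purely for illustration and stands in for a real struct_demo table; named_struct is the Spark SQL built-in for building a struct value.

```sql
-- Hypothetical inline data standing in for the struct_demo table.
WITH struct_demo AS (
  SELECT
    1 AS id,
    'Alice' AS name,
    named_struct(
      'street_address', named_struct(
        'street_number', 123,
        'street_name', 'Main',
        'street_type', 'St'),
      'country', 'US',
      'postal_code', '12345') AS current_address
)
-- Nested struct fields are read with dot notation, as in the Impala query.
SELECT id, name,
       current_address.street_address.street_number,
       current_address.street_address.street_name,
       current_address.country,
       current_address.postal_code
FROM struct_demo;
```

Each dotted path comes back as an ordinary column, so the rest of the query does not need to know the value was stored inside a struct.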

Hope this helps.

New Contributor

In general, different tasks call for different degrees of flexibility. I've used complex types on datasets with a variable number of columns, where the first n columns are always the same but the number of remaining columns ranges from 3 to 500. Putting the variable part into a DataFrame with an ArrayType column lets you do any of the usual Spark processing while keeping the data attached to each row. If necessary, later processing steps can explode the array into separate rows, or I can access the complete set at once.
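
A minimal Spark SQL sketch of that pattern, assuming a made-up table called measurements with two fixed columns (id, label) and one array column (extra_values) holding the variable-length part. All of these names and values are hypothetical.

```sql
-- Hypothetical rows: fixed columns plus a variable-length array column.
WITH measurements AS (
  SELECT * FROM VALUES
    (1, 'a', array(10, 20, 30)),
    (2, 'b', array(40, 50, 60, 70, 80))
  AS t(id, label, extra_values)
)
-- explode() emits one output row per array element;
-- size() reports how many elements the original row carried.
SELECT id,
       label,
       size(extra_values) AS n_extra,
       explode(extra_values) AS value
FROM measurements;
```

Leaving out the explode call and selecting extra_values directly keeps one row per id with the complete array attached, which is the "access the complete set" case.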

New Contributor

Spark SQL complex data types are used for complex or custom requirements, where you want to apply a schema to unstructured data, or sometimes to semi-structured or structured data as well. You would also use them in custom UDFs, where you apply windowed operations and write your own advanced custom logic. In Spark SQL, you would then explode the complex structure to get DataFrame columns.

 

The use case may vary depending on the requirement, but the underlying concept is the same as in any programming language: handle the data with the data structure whose type is designed for it.
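
As a rough sketch of "apply a schema, then explode", assuming the raw data arrives as JSON strings: the JSON text, the field names (customer, tags), and the CTE names below are all invented for illustration. from_json takes the schema as a DDL string, and this one mixes a StructType field with a MapType field.

```sql
-- Hypothetical raw input: one JSON document per row.
WITH raw_events AS (
  SELECT '{"customer": {"id": 7, "name": "ana"}, "tags": {"plan": "pro", "region": "eu"}}' AS json_line
),
-- Apply a schema: customer becomes a struct, tags becomes a map.
parsed AS (
  SELECT from_json(
           json_line,
           'customer STRUCT<id: INT, name: STRING>, tags MAP<STRING, STRING>'
         ) AS event
  FROM raw_events
)
-- Struct fields use dot notation, map values use [key],
-- and exploding a map yields one row per key/value pair.
SELECT event.customer.id   AS customer_id,
       event.customer.name AS customer_name,
       event.tags['plan']  AS plan,
       tag_key,
       tag_value
FROM parsed
LATERAL VIEW explode(event.tags) t AS tag_key, tag_value;
```

A map fits here because the set of tag keys is not known in advance, whereas the customer fields are fixed and so fit a struct.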