
Uses of Complex Spark SQL Data Types

New Contributor

I read that Spark SQL has three complex data types: ArrayType, MapType, and StructType. When would you use these? I'm confused because I was taught that SQL tables should never, ever contain arrays/lists in a single cell value, so why does Spark SQL allow an ArrayType?

3 REPLIES

New Contributor

Hi,

 

Complex types are generally used to aggregate the characteristics of an object, for example:

Based on: https://impala.apache.org/docs/build/html/topics/impala_struct.html#struct

current_address STRUCT <
  street_address: STRUCT <
    street_number: INT,
    street_name: STRING,
    street_type: STRING
  >,
  country: STRING,
  postal_code: STRING
>

So now the 'current_address' attribute and its members are grouped together.
This is not only an organizational benefit; it also has an impact on the performance of the processes that read this table.

When you want to retrieve the data, it can be done like this:

SELECT id, name,
       current_address.street_address.street_number,
       current_address.street_address.street_name,
       current_address.street_address.street_type,
       current_address.country,
       current_address.postal_code
FROM struct_demo;

 

Although this example refers to Apache Impala, the concept is the same in Spark.
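
Since this thread is about Spark, here is a minimal Scala sketch of the same nested schema built with Spark's StructType. The session setup and the input file name (addresses.json) are hypothetical, but the dotted-path selects mirror the Impala query above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("struct-demo")
  .master("local[*]")
  .getOrCreate()

// Nested schema mirroring the Impala example above
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("current_address", StructType(Seq(
    StructField("street_address", StructType(Seq(
      StructField("street_number", IntegerType),
      StructField("street_name", StringType),
      StructField("street_type", StringType)
    ))),
    StructField("country", StringType),
    StructField("postal_code", StringType)
  )))
))

// "addresses.json" is a hypothetical input file; dotted paths reach into the struct
spark.read.schema(schema).json("addresses.json")
  .select(
    "id", "name",
    "current_address.street_address.street_number",
    "current_address.street_address.street_name",
    "current_address.street_address.street_type",
    "current_address.country",
    "current_address.postal_code")
  .show()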

Hope this helps.

New Contributor

In general, different tasks call for different degrees of flexibility. I've used them on datasets with a variable number of columns, where the first n columns are always the same but the number of remaining columns ranges from 3 to 500. Putting those extra values into an ArrayType column lets you do any of the usual Spark processing while keeping the data attached to its row. Later processing steps can explode the array into separate rows if necessary, or access the complete set.
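
To make that concrete, here is a small Scala sketch (the data and column names are made up): the variable-length tail of each record goes into one ArrayType column, which later steps can either consume whole or explode into rows.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder()
  .appName("array-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: the fixed columns first, the variable-length tail as one array
val df = Seq(
  (1, "row-a", Seq(0.1, 0.2, 0.3)),
  (2, "row-b", Seq(1.5, 2.5))
).toDF("id", "name", "values")  // "values" becomes an ArrayType(DoubleType) column

// Work with the complete array...
df.select(col("id"), col("values")).show(truncate = false)

// ...or explode it into one row per element for downstream steps
df.select(col("id"), explode(col("values")).as("value")).show()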

New Contributor

Spark SQL complex data types are used for complex or custom requirements: when you want to impose a schema on unstructured (or even semi-structured or structured) data, when you write custom UDFs with windowed operations and your own advanced logic, and when you explode a complex structure in Spark SQL to get DataFrame columns.

 

Use cases vary with the requirement, but the underlying concept remains the same as in any programming language: handle each kind of data with the data structure designed for it.
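
As one example of imposing a schema on semi-structured data, here is a minimal Scala sketch (the payloads and field names are invented) that parses JSON strings with from_json and then explodes the nested array into columns:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("schema-on-json")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical semi-structured payloads arriving as plain strings
val raw = Seq(
  """{"user":"alice","tags":["spark","sql"]}""",
  """{"user":"bob","tags":["impala"]}"""
).toDF("payload")

// Impose a schema: a StructType containing an ArrayType
val payloadSchema = StructType(Seq(
  StructField("user", StringType),
  StructField("tags", ArrayType(StringType))
))

raw.select(from_json(col("payload"), payloadSchema).as("parsed"))
  .select(col("parsed.user"), explode(col("parsed.tags")).as("tag"))
  .show()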