Created on 06-05-2020 03:55 AM - last edited on 03-16-2022 10:41 AM by ask_bill_brooks
I read that Spark SQL has three complex data types: ArrayType, MapType, and StructType. When would you use these? I'm confused because I was taught that SQL tables should never, ever contain arrays/lists in a single cell value, so why does Spark SQL allow ArrayType?
Created 06-05-2020 04:49 AM
Hi,
Complex types are generally used to aggregate the characteristics of an object, for example:
Based on the example at https://impala.apache.org/docs/build/html/topics/impala_struct.html#struct, a column can be declared with a nested STRUCT type:
current_address STRUCT <
  street_address: STRUCT <
    street_number: INT,
    street_name: STRING,
    street_type: STRING
  >,
  country: STRING,
  postal_code: STRING
>
So now we have the 'current_address' attribute and its members grouped together.
This is not just an organizational convenience; it can also affect the performance of the queries and processes that touch this table.
When you want to retrieve the data, it can be done like this:
SELECT id, name,
       current_address.street_address.street_number,
       current_address.street_address.street_name,
       current_address.street_address.street_type,
       current_address.country,
       current_address.postal_code
FROM struct_demo;
Although the example above refers to Apache Impala, the same concept applies to Spark.
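For instance, here is a minimal PySpark sketch of the same idea (the schema mirrors the Impala example above; the sample row is just made-up data):

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, LongType)

spark = SparkSession.builder.appName("struct_demo").getOrCreate()

# Same shape as the Impala column: a STRUCT nested inside a STRUCT
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("current_address", StructType([
        StructField("street_address", StructType([
            StructField("street_number", IntegerType()),
            StructField("street_name", StringType()),
            StructField("street_type", StringType()),
        ])),
        StructField("country", StringType()),
        StructField("postal_code", StringType()),
    ])),
])

df = spark.createDataFrame(
    [(1, "John", ((9120, "Grand", "Ave"), "US", "55001"))],
    schema,
)

# Dot notation works just like in the Impala query
df.select(
    "id", "name",
    "current_address.street_address.street_number",
    "current_address.street_address.street_name",
    "current_address.street_address.street_type",
    "current_address.country",
    "current_address.postal_code",
).show()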
Hope this helps.
Created 03-16-2022 01:42 AM
In general, different tasks necessitate varying degrees of flexibility. I've used these types on datasets with a variable number of columns, where the first few columns are always the same but the number of remaining columns ranges from 3 to 500. Placing the variable part into a DataFrame with an ArrayType column lets you do all the usual Spark processing while keeping the data attached to its row. If necessary, later processing steps can explode the array into separate rows, or I can access the complete set at once.
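A minimal PySpark sketch of that pattern (the column names are invented for illustration): the fixed columns stay as plain columns and the variable-length tail lives in a single ArrayType column.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("array_demo").getOrCreate()

# Two fixed columns plus a variable-length tail packed into one array
df = spark.createDataFrame(
    [
        ("sensor-1", "2022-03-16", [0.1, 0.2, 0.3]),
        ("sensor-2", "2022-03-16", [1.5, 1.6, 1.7, 1.8, 1.9]),
    ],
    ["device", "day", "readings"],  # readings is inferred as ArrayType(DoubleType)
)

# Usual Spark processing with the whole set still attached to its row
df.select("device", F.size("readings").alias("n_readings")).show()

# Or explode the array into one row per element when a step needs it
df.select("device", "day", F.explode("readings").alias("reading")).show()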
Created 04-04-2022 07:07 AM
Spark SQL complex data types are used for complex or custom requirements: when you want to provide a schema for your unstructured (or sometimes even semi-structured or structured) data, in custom UDFs where you use windowed operations and write your own advanced logic, and in Spark SQL when you explode a complex structure to get ordinary DataFrame columns, as sketched below.
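A hedged sketch of the "impose a schema on semi-structured data, then explode" part (the file name and field names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, MapType)

spark = SparkSession.builder.appName("schema_demo").getOrCreate()

# Explicit schema imposed on semi-structured JSON (names are hypothetical)
schema = StructType([
    StructField("user", StringType()),
    StructField("tags", ArrayType(StringType())),
    StructField("attrs", MapType(StringType(), StringType())),
])

df = spark.read.schema(schema).json("events.json")  # hypothetical input file

# Explode the complex columns into ordinary DataFrame columns
df.select("user", F.explode("tags").alias("tag")).show()
df.select("user", F.explode("attrs").alias("key", "value")).show()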
The use case varies with the requirement, but the underlying concept remains the same as in any programming language: handle the data with the data structure designed for it.