Created on 06-05-2020 03:55 AM - last edited on 03-16-2022 10:41 AM by ask_bill_brooks
I read that Spark SQL has three complex data types: ArrayType, MapType, and StructType. When would you use these? I'm confused because I was taught that SQL tables should never, ever contain arrays/lists in a single cell value, so why does Spark SQL allow ArrayType?
Created 06-05-2020 04:49 AM
Hi,
Complex types are generally used to aggregate the characteristics of an object, for example:
Based on the example at https://impala.apache.org/docs/build/html/topics/impala_struct.html#struct, a column can be declared with a nested STRUCT type:
current_address STRUCT <
  street_address: STRUCT <
    street_number: INT,
    street_name: STRING,
    street_type: STRING
  >,
  country: STRING,
  postal_code: STRING
>
So now we have the 'current_address' attribute and its members grouped together.
This is not just an organizational convenience; it can also affect the performance of the queries and processes that touch this table.
When you want to retrieve the data, it can be done like this:
SELECT id, name,
       current_address.street_address.street_number,
       current_address.street_address.street_name,
       current_address.street_address.street_type,
       current_address.country,
       current_address.postal_code
FROM struct_demo;
Although the example above refers to Apache Impala, the same concept applies to Spark.
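For instance, here is a minimal PySpark sketch of the same idea (the schema mirrors the Impala example above; the sample row is just made-up data):

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, LongType)

spark = SparkSession.builder.appName("struct_demo").getOrCreate()

# Same shape as the Impala column: a STRUCT nested inside a STRUCT
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("current_address", StructType([
        StructField("street_address", StructType([
            StructField("street_number", IntegerType()),
            StructField("street_name", StringType()),
            StructField("street_type", StringType()),
        ])),
        StructField("country", StringType()),
        StructField("postal_code", StringType()),
    ])),
])

df = spark.createDataFrame(
    [(1, "John", ((9120, "Grand", "Ave"), "US", "55001"))],
    schema,
)

# Dot notation works just like in the Impala query
df.select(
    "id", "name",
    "current_address.street_address.street_number",
    "current_address.street_address.street_name",
    "current_address.street_address.street_type",
    "current_address.country",
    "current_address.postal_code",
).show()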
Hope this helps.
Created 03-16-2022 01:42 AM
In general, different tasks necessitate varying degrees of flexibility. I've used these types on datasets with a variable number of columns, where the first few columns are always the same but the number of remaining columns ranges from 3 to 500. Placing the variable part into a DataFrame with an ArrayType column lets you do all the usual Spark processing while keeping the data attached to its row. If necessary, later processing steps can explode the array into separate rows, or I can access the complete set at once.
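A minimal PySpark sketch of that pattern (the column names are invented for illustration): the fixed columns stay as plain columns and the variable-length tail lives in a single ArrayType column.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("array_demo").getOrCreate()

# Two fixed columns plus a variable-length tail packed into one array
df = spark.createDataFrame(
    [
        ("sensor-1", "2022-03-16", [0.1, 0.2, 0.3]),
        ("sensor-2", "2022-03-16", [1.5, 1.6, 1.7, 1.8, 1.9]),
    ],
    ["device", "day", "readings"],  # readings is inferred as ArrayType(DoubleType)
)

# Usual Spark processing with the whole set still attached to its row
df.select("device", F.size("readings").alias("n_readings")).show()

# Or explode the array into one row per element when a step needs it
df.select("device", "day", F.explode("readings").alias("reading")).show()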
Created 04-04-2022 07:07 AM
Spark SQL complex data types are used for complex or custom requirements: when you want to provide a schema for your unstructured (or sometimes even semi-structured or structured) data, in custom UDFs where you use windowed operations and write your own advanced logic, and in Spark SQL when you explode a complex structure to get ordinary DataFrame columns, as sketched below.
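A hedged sketch of the "impose a schema on semi-structured data, then explode" part (the file name and field names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, MapType)

spark = SparkSession.builder.appName("schema_demo").getOrCreate()

# Explicit schema imposed on semi-structured JSON (names are hypothetical)
schema = StructType([
    StructField("user", StringType()),
    StructField("tags", ArrayType(StringType())),
    StructField("attrs", MapType(StringType(), StringType())),
])

df = spark.read.schema(schema).json("events.json")  # hypothetical input file

# Explode the complex columns into ordinary DataFrame columns
df.select("user", F.explode("tags").alias("tag")).show()
df.select("user", F.explode("attrs").alias("key", "value")).show()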
The use case varies with the requirement, but the underlying concept remains the same as in any programming language: handle the data with the data structure designed for it.