12-12-2024
12:36 AM
Issue with different schema merging in Iceberg tables from Spark

When appending a DataFrame from Spark to an Iceberg table like this:

df.writeTo("tablename").append()

the following error appears if the schemas of the DataFrame and the table differ:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'catalog.tablename':
- Cannot find data for output column 'column3'
- Cannot find data for output column 'column4'

This happens because, by default, Iceberg does not accept writes with a schema that differs from the table schema.

Problem solution

First, add the following table property to the Iceberg table so that it accepts a different schema:

ALTER TABLE tablename SET TBLPROPERTIES (
  'write.spark.accept-any-schema'='true'
)

Then add the mergeSchema option, set to true, to the append command in Spark:

df.writeTo("tablename").option("mergeSchema", "true").append()

With these changes Iceberg accepts the write and merges the schemas without any issue. To read more about this behaviour, see the official documentation: https://iceberg.apache.org/docs/1.6.0/spark-writes/#schema-merge
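For reference, here is a minimal end-to-end sketch of both steps in Scala. It assumes a SparkSession already configured with an Iceberg catalog (called catalog here) and an existing table catalog.db.tablename; the catalog, database, table and column names are placeholders, not part of the original example.

import org.apache.spark.sql.SparkSession

object IcebergSchemaMergeExample {
  def main(args: Array[String]): Unit = {
    // Assumes the Iceberg runtime and spark.sql.catalog.catalog are already
    // configured for this session; all identifiers below are placeholders.
    val spark = SparkSession.builder()
      .appName("iceberg-schema-merge")
      .getOrCreate()
    import spark.implicits._

    // Step 1: let the table accept writes whose schema differs from its own.
    spark.sql("ALTER TABLE catalog.db.tablename SET TBLPROPERTIES ('write.spark.accept-any-schema'='true')")

    // A DataFrame that carries an extra column (column3) not yet present in the table.
    val df = Seq((1, "a", "extra")).toDF("column1", "column2", "column3")

    // Step 2: append with mergeSchema so Iceberg adds the new column to the table schema.
    df.writeTo("catalog.db.tablename")
      .option("mergeSchema", "true")
      .append()
  }
}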
12-11-2024
02:17 AM
Why was this article created?

The main reason for listing all the legacy configurations for Spark 3 is to ease the migration from Spark 2 to Spark 3. The main concern is that, because of upstream changes, Spark 3 behaves differently than Spark 2, so the same code can produce different results after the migration. Each entry below gives the property name, its default value, and its meaning; two short examples of applying these settings follow the list.

Legacy configurations in Spark 3

Property name: spark.sql.catalog.your_catalog_name.pushDownAggregate, spark.sql.catalog.your_catalog_name.pushDownLimit, spark.sql.catalog.your_catalog_name.pushDownOffset
Default: true
Meaning: Since Spark 3.5, the JDBC options related to DS V2 pushdown are true by default. These options include pushDownAggregate, pushDownLimit, pushDownOffset and pushDownTableSample. To restore the legacy behavior, set them to false, e.g. set spark.sql.catalog.your_catalog_name.pushDownAggregate to false.

Property name: spark.sql.legacy.negativeIndexInArrayInsert
Default: false
Meaning: Since Spark 3.5, the array_insert function is 1-based for negative indexes. It inserts the new element at the end of the input array for index -1. To restore the previous behavior, set spark.sql.legacy.negativeIndexInArrayInsert to true.

Property name: spark.sql.legacy.avro.allowIncompatibleSchema
Default: false
Meaning: Since Spark 3.5, Avro will throw AnalysisException when reading Interval types as Date or Timestamp types, or reading Decimal types with lower precision. To restore the legacy behavior, set spark.sql.legacy.avro.allowIncompatibleSchema to true.

Property name: spark.sql.legacy.v1IdentifierNoCatalog
Default: false
Meaning: Since Spark 3.4, v1 database, table, permanent view and function identifiers include 'spark_catalog' as the catalog name if the database is defined, e.g. a table identifier becomes spark_catalog.default.t. To restore the legacy behavior, set spark.sql.legacy.v1IdentifierNoCatalog to true.

Property name: spark.sql.legacy.skipTypeValidationOnAlterPartition
Default: false
Meaning: Since Spark 3.4, Spark validates the partition spec in ALTER PARTITION to follow the behavior of spark.sql.storeAssignmentPolicy, which may cause an exception if type conversion fails, e.g. ALTER TABLE .. ADD PARTITION(p='a') if column p is int type. To restore the legacy behavior, set spark.sql.legacy.skipTypeValidationOnAlterPartition to true.

Property name: spark.sql.orc.enableNestedColumnVectorizedReader, spark.sql.parquet.enableNestedColumnVectorizedReader
Default: true
Meaning: Since Spark 3.4, vectorized readers are enabled by default for nested data types (array, map and struct). To restore the legacy behavior, set spark.sql.orc.enableNestedColumnVectorizedReader and spark.sql.parquet.enableNestedColumnVectorizedReader to false.

Property name: spark.sql.optimizer.runtime.bloomFilter.enabled
Default: true
Meaning: Since Spark 3.4, bloom filter joins are enabled by default. To restore the legacy behavior, set spark.sql.optimizer.runtime.bloomFilter.enabled to false.

Property name: spark.sql.parquet.inferTimestampNTZ.enabled
Default: true
Meaning: Since Spark 3.4, when inferring the schema of external Parquet files, INT64 timestamps with the annotation isAdjustedToUTC=false are inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set spark.sql.parquet.inferTimestampNTZ.enabled to false.

Property name: spark.sql.legacy.allowNonEmptyLocationInCTAS
Default: false
Meaning: Since Spark 3.4, the behavior of CREATE TABLE AS SELECT ... changes from OVERWRITE to APPEND when spark.sql.legacy.allowNonEmptyLocationInCTAS is set to true. Users are recommended to avoid CTAS with a non-empty table location.

Property name: spark.sql.legacy.histogramNumericPropagateInputType
Default: false
Meaning: Since Spark 3.3, the histogram_numeric function in Spark SQL returns an output type of an array of structs (x, y), where the type of the 'x' field in the return value is propagated from the input values consumed in the aggregate function. In Spark 3.2 or earlier, 'x' always had double type. Optionally, use the configuration spark.sql.legacy.histogramNumericPropagateInputType since Spark 3.3 to revert to the previous behavior.

Property name: spark.sql.legacy.lpadRpadAlwaysReturnString
Default: false
Meaning: Since Spark 3.3, the functions lpad and rpad have been overloaded to support byte sequences. When the first argument is a byte sequence, the optional padding pattern must also be a byte sequence and the result is a BINARY value. The default padding pattern in this case is the zero byte. To restore the legacy behavior of always returning string types, set spark.sql.legacy.lpadRpadAlwaysReturnString to true.

Property name: spark.sql.legacy.respectNullabilityInTextDatasetConversion
Default: false
Meaning: Since Spark 3.3, Spark turns a non-nullable schema into nullable for the APIs DataFrameReader.schema(schema: StructType).json(jsonDataset: Dataset[String]) and DataFrameReader.schema(schema: StructType).csv(csvDataset: Dataset[String]) when the schema is specified by the user and contains non-nullable fields. To restore the legacy behavior of respecting the nullability, set spark.sql.legacy.respectNullabilityInTextDatasetConversion to true.

Property name: spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv
Default: false
Meaning: Since Spark 3.3, nulls are written as empty strings in the CSV data source by default. In Spark 3.2 or earlier, nulls were written as quoted empty strings, "". To restore the previous behavior, set nullValue to "", or set the configuration spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv to true.

Property name: spark.sql.legacy.notReserveProperties
Default: false
Meaning: Since Spark 3.3, the table property external becomes reserved. Certain commands will fail if you specify the external property, such as CREATE TABLE ... TBLPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES. In Spark 3.2 and earlier, the table property external is silently ignored. You can set spark.sql.legacy.notReserveProperties to true to restore the old behavior.

Property name: spark.sql.legacy.groupingIdWithAppendedUserGroupBy
Default: false
Meaning: Since Spark 3.3.1 and 3.2.3, for SELECT ... GROUP BY a GROUPING SETS (b)-style SQL statements, grouping__id returns different values than in Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It is computed based on the user-given group-by expressions plus the grouping set columns. To restore the behavior before 3.3.1 and 3.2.3, you can set spark.sql.legacy.groupingIdWithAppendedUserGroupBy. For details, see SPARK-40218 and SPARK-40562.

Property name: spark.sql.legacy.keepCommandOutputSchema
Default: false
Meaning: In Spark 3.2, the output schema of SHOW TABLES becomes namespace: string, tableName: string, isTemporary: boolean. In Spark 3.1 or earlier, the namespace field was named database for the builtin catalog, and there is no isTemporary field for v2 catalogs. In Spark 3.2, the output schema of SHOW TABLE EXTENDED becomes namespace: string, tableName: string, isTemporary: boolean, information: string. In Spark 3.1 or earlier, the namespace field was named database for the builtin catalog, and there is no change for the v2 catalogs. In Spark 3.2, the output schema of SHOW TBLPROPERTIES becomes key: string, value: string whether you specify the table property key or not. In Spark 3.1 and earlier, the output schema of SHOW TBLPROPERTIES is value: string when you specify the table property key. In Spark 3.2, the output schema of DESCRIBE NAMESPACE becomes info_name: string, info_value: string. In Spark 3.1 or earlier, the info_name field was named database_description_item and the info_value field was named database_description_value for the builtin catalog. To restore the old schema with the builtin catalog, you can set spark.sql.legacy.keepCommandOutputSchema to true.

Property name: spark.sql.legacy.allowStarWithSingleTableIdentifierInCount
Default: false
Meaning: In Spark 3.2, the usage of count(tblName.*) is blocked to avoid producing ambiguous results, because count(*) and count(tblName.*) output different results if there are any null values. To restore the behavior before Spark 3.2, you can set spark.sql.legacy.allowStarWithSingleTableIdentifierInCount to true.

Property name: spark.sql.legacy.interval.enabled
Default: false
Meaning: In Spark 3.2, date subtraction expressions such as date1 - date2 return values of DayTimeIntervalType. In Spark 3.1 and earlier, the returned type is CalendarIntervalType. In Spark 3.2, timestamp subtraction expressions such as timestamp '2021-03-31 23:48:00' - timestamp '2021-01-01 00:00:00' return values of DayTimeIntervalType. In Spark 3.1 and earlier, the type of the same expression is CalendarIntervalType. In Spark 3.2, unit-to-unit interval literals like INTERVAL '1-1' YEAR TO MONTH and unit list interval literals like INTERVAL '3' DAYS '1' HOUR are converted to ANSI interval types: YearMonthIntervalType or DayTimeIntervalType. In Spark 3.1 and earlier, such interval literals are converted to CalendarIntervalType. In Spark 3.2, unit list interval literals cannot mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND). For example, INTERVAL 1 month 1 hour is invalid in Spark 3.2. In Spark 3.1 and earlier, there is no such limitation and the literal returns a value of CalendarIntervalType. To restore the behavior before Spark 3.2, you can set spark.sql.legacy.interval.enabled to true.

Property name: spark.sql.legacy.notReserveProperties
Default: false
Meaning: In Spark 3.2, the CREATE TABLE .. LIKE .. command cannot use reserved properties. You need their specific clauses to specify them, for example, CREATE TABLE test1 LIKE test LOCATION 'some path'. You can set spark.sql.legacy.notReserveProperties to true to ignore the ParseException; in this case, these properties will be silently removed, for example: TBLPROPERTIES('owner'='yao') will have no effect. In Spark version 3.1 and below, the reserved properties can be used in the CREATE TABLE .. LIKE .. command but have no side effects, for example, TBLPROPERTIES('location'='/tmp') does not change the location of the table but only creates a headless property, just like 'a'='b'.

Property name: spark.sql.legacy.allowAutoGeneratedAliasForView
Default: false
Meaning: In Spark 3.2, create/alter view will fail if the input query output columns contain an auto-generated alias. This is necessary to make sure the query output column names are stable across different Spark versions. To restore the behavior before Spark 3.2, set spark.sql.legacy.allowAutoGeneratedAliasForView to true.

Property name: spark.sql.legacy.statisticalAggregate
Default: false
Meaning: In Spark 3.1, statistical aggregation functions, including std, stddev, stddev_samp, variance, var_samp, skewness, kurtosis, covar_samp and corr, return NULL instead of Double.NaN when a divide-by-zero occurs during expression evaluation, for example when stddev_samp is applied to a single-element set. In Spark version 3.0 and earlier, they return Double.NaN in such cases. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.statisticalAggregate to true.

Property name: spark.sql.legacy.integerGroupingId
Default: false
Meaning: In Spark 3.1, grouping_id() returns long values. In Spark version 3.0 and earlier, this function returns int values. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.integerGroupingId to true.

Property name: spark.sql.legacy.castComplexTypesToString.enabled
Default: false
Meaning: In Spark 3.1, structs and maps are wrapped in {} brackets when casting them to strings. For instance, the show() action and the CAST expression use such brackets. In Spark 3.0 and earlier, [] brackets are used for the same purpose. In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" when casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.castComplexTypesToString.enabled to true.

Property name: spark.sql.legacy.pathOptionBehavior.enabled
Default: false
Meaning: In Spark 3.1, the path option cannot coexist with path parameter(s) passed to the following methods: DataFrameReader.load(), DataFrameWriter.save(), DataStreamReader.load(), or DataStreamWriter.start(). In addition, the paths option cannot coexist for DataFrameReader.load(). For example, spark.read.format("csv").option("path", "/tmp").load("/tmp2") or spark.read.option("path", "/tmp").csv("/tmp2") will throw org.apache.spark.sql.AnalysisException. In Spark version 3.0 and below, the path option is overwritten if one path parameter is passed to the above methods; the path option is added to the overall paths if multiple path parameters are passed to DataFrameReader.load(). To restore the behavior before Spark 3.1, you can set spark.sql.legacy.pathOptionBehavior.enabled to true.

Property name: In Spark 3.1: spark.sql.legacy.parquet.int96RebaseModeInRead and spark.sql.legacy.parquet.int96RebaseModeInWrite; from Spark 3.4: spark.sql.parquet.int96RebaseModeInRead and spark.sql.parquet.int96RebaseModeInWrite
Default: LEGACY
Meaning: In Spark 3.1, loading and saving of timestamps from/to Parquet files fails if the timestamps are before 1900-01-01 00:00:00Z and are loaded (saved) as the INT96 type. In Spark 3.0, the actions don't fail but might lead to shifting of the input timestamps due to rebasing from/to the Julian to/from the Proleptic Gregorian calendar. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.parquet.int96RebaseModeInRead and/or spark.sql.legacy.parquet.int96RebaseModeInWrite to LEGACY. Other config options: when CORRECTED, Spark does not rebase and writes the timestamps as-is; when EXCEPTION, Spark fails the write if it sees ancient timestamps that are ambiguous between the two calendars.

Property name: spark.sql.legacy.useCurrentConfigsForView
Default: false
Meaning: In Spark 3.1, creating or altering a permanent view will capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.useCurrentConfigsForView to true.

Property name: spark.sql.legacy.storeAnalyzedPlanForView
Default: false
Meaning: In Spark 3.1, temporary views have the same behavior as permanent views, i.e. they capture and store runtime SQL configs, SQL text, catalog and namespace. The captured view properties will be applied during the parsing and analysis phases of the view resolution. In Spark 3.1, a temporary view created via CACHE TABLE ... AS SELECT also has the same behavior as a permanent view. In particular, when the temporary view is dropped, Spark invalidates all its cache dependents, as well as the cache for the temporary view itself. This is different from Spark 3.0 and below, which only does the latter. To restore the behavior before Spark 3.1, you can set spark.sql.legacy.storeAnalyzedPlanForView to true.

Property name: spark.sql.legacy.charVarcharAsString
Default: false
Meaning: Since Spark 3.1, CHAR/CHARACTER and VARCHAR types are supported in the table schema. Table scan/insertion will respect the char/varchar semantics. If char/varchar is used in places other than the table schema, an exception will be thrown (CAST is an exception that simply treats char/varchar as string, like before). To restore the behavior before Spark 3.1, which treats them as STRING types and ignores a length parameter, e.g. CHAR(4), you can set spark.sql.legacy.charVarcharAsString to true.

Property name: spark.sql.legacy.parseNullPartitionSpecAsStringLiteral
Default: false
Meaning: In Spark 3.0.2, PARTITION(col=null) is always parsed as a null literal in the partition spec. In Spark 3.0.1 or earlier, it is parsed as a string literal of its text representation, e.g. the string "null", if the partition column is string type. To restore the legacy behavior, you can set spark.sql.legacy.parseNullPartitionSpecAsStringLiteral to true.

Property name: spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue
Default: false
Meaning: In Spark 2.4 and below, Dataset.groupByKey results in a grouped dataset whose key attribute is wrongly named "value" if the key is a non-struct type, for example int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries unexpected. For example, the schema of ds.groupByKey(...).count() is (value, count). Since Spark 3.0, the grouping attribute is named "key". The old behavior is preserved under the newly added configuration spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue with a default value of false.

Property name: spark.sql.legacy.doLooseUpcast
Default: false
Meaning: When turning a Dataset into another Dataset, Spark will upcast the fields in the original Dataset to the type of the corresponding fields in the target Dataset. In version 2.4 and earlier, this upcast is not very strict, e.g. Seq("str").toDS.as[Int] fails, but Seq("str").toDS.as[Boolean] works and throws an NPE during execution. In Spark 3.0, the upcast is stricter and turning String into something else is not allowed, i.e. Seq("str").toDS.as[Boolean] fails during analysis. To restore the behavior before Spark 3.0, set spark.sql.legacy.doLooseUpcast to true.

Property name: spark.sql.storeAssignmentPolicy
Default: ANSI
Meaning: In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per the ANSI SQL standard. Certain unreasonable type conversions, such as converting string to int and double to boolean, are disallowed. A runtime exception is thrown if the value is out of range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are a valid Cast. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of byte type, the result is 1. The behavior is controlled by the option spark.sql.storeAssignmentPolicy, with a default value of "ANSI". Setting the option to "Legacy" restores the previous behavior.

Property name: spark.sql.legacy.setCommandRejectsSparkCoreConfs
Default: true
Meaning: In Spark 2.4 and below, the SET command works without any warning even if the specified key is for SparkConf entries; it has no effect because the command does not update SparkConf, but the behavior might confuse users. In 3.0, the command fails if a SparkConf key is used. You can disable this check by setting spark.sql.legacy.setCommandRejectsSparkCoreConfs to false.

Property name: spark.sql.legacy.addSingleFileInAddFile
Default: false
Meaning: In Spark 3.0, you can use ADD FILE to add file directories as well. Earlier you could add only single files using this command. To restore the behavior of earlier versions, set spark.sql.legacy.addSingleFileInAddFile to true.

Property name: spark.sql.legacy.allowHashOnMapType
Default: false
Meaning: In Spark 3.0, an analysis exception is thrown when hash expressions are applied to elements of MapType. To restore the behavior before Spark 3.0, set spark.sql.legacy.allowHashOnMapType to true.

Property name: spark.sql.legacy.createEmptyCollectionUsingStringType
Default: false
Meaning: In Spark 3.0, when the array/map function is called without any parameters, it returns an empty collection with NullType as the element type. In Spark version 2.4 and below, it returns an empty collection with StringType as the element type. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createEmptyCollectionUsingStringType to true.

Property name: spark.sql.legacy.allowUntypedScalaUDF
Default: false
Meaning: In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. It is recommended to remove the return type parameter to automatically switch to a typed Scala UDF, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with a primitive-type argument, the returned UDF returns null if the input value is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, with val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and returns 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

Property name: spark.sql.legacy.followThreeValuedLogicInArrayExists
Default: true
Meaning: In Spark 3.0, the higher-order function exists follows three-valued boolean logic, that is, if the predicate returns any nulls and no true is obtained, then exists returns null instead of false. For example, exists(array(1, null, 3), x -> x % 2 == 0) is null. The previous behavior can be restored by setting spark.sql.legacy.followThreeValuedLogicInArrayExists to false.

Property name: spark.sql.legacy.exponentLiteralAsDecimal.enabled
Default: false
Meaning: In Spark 3.0, numbers written in scientific notation (for example, 1E2) are parsed as Double. In Spark version 2.4 and below, they are parsed as Decimal. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.exponentLiteralAsDecimal.enabled to true.

Property name: spark.sql.legacy.fromDayTimeString.enabled
Default: false
Meaning: In Spark 3.0, day-time interval strings are converted to intervals with respect to the from and to bounds. If an input string does not match the pattern defined by the specified bounds, a ParseException is thrown. For example, interval '2 10:20' hour to minute raises the exception because the expected format is [+|-]h[h]:[m]m. In Spark version 2.4, the from bound was not taken into account, and the to bound was used to truncate the resulting interval. For instance, the day-time interval string from the example above is converted to interval 10 hours 20 minutes. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.fromDayTimeString.enabled to true.

Property name: spark.sql.legacy.allowNegativeScaleOfDecimal
Default: false
Meaning: In Spark 3.0, negative scale of decimal is not allowed by default; for example, the data type of a literal like 1E10BD is DecimalType(11, 0). In Spark version 2.4 and below, it was DecimalType(2, -9). To restore the behavior before Spark 3.0, you can set spark.sql.legacy.allowNegativeScaleOfDecimal to true.

Property name: spark.sql.legacy.ctePrecedencePolicy
Default: EXCEPTION
Meaning: In Spark 3.0, spark.sql.legacy.ctePrecedencePolicy is introduced to control the behavior for name conflicts in nested WITH clauses. With the default value EXCEPTION, Spark throws an AnalysisException and forces users to choose the specific substitution order they want. If set to CORRECTED (which is recommended), inner CTE definitions take precedence over outer definitions. For example, with the config set to CORRECTED, WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2 returns 2, while setting it to LEGACY gives 1, which is the behavior in version 2.4 and below.

Property name: spark.sql.legacy.typeCoercion.datetimeToString.enabled
Default: false
Meaning: In Spark 3.0, Spark casts String to Date/Timestamp in binary comparisons with dates/timestamps. The previous behavior of casting Date/Timestamp to String can be restored by setting spark.sql.legacy.typeCoercion.datetimeToString.enabled to true.
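The per-catalog pushdown options in the first entry are JDBC catalog options rather than ordinary spark.sql.legacy.* flags, so they are set on the catalog definition. Below is a minimal sketch under that assumption, using a DataSource V2 JDBC catalog registered under the placeholder name your_catalog_name with a hypothetical PostgreSQL connection; the URL, driver, schema and table names are placeholders, not part of the original article.

import org.apache.spark.sql.SparkSession

object JdbcPushdownLegacyExample {
  def main(args: Array[String]): Unit = {
    // Placeholder catalog name and connection details; replace with real values.
    val spark = SparkSession.builder()
      .appName("restore-pre-3.5-jdbc-pushdown")
      .config("spark.sql.catalog.your_catalog_name",
        "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
      .config("spark.sql.catalog.your_catalog_name.url", "jdbc:postgresql://dbhost:5432/dbname")
      .config("spark.sql.catalog.your_catalog_name.driver", "org.postgresql.Driver")
      // Restore the pre-Spark-3.5 behavior: do not push aggregates, limits or
      // offsets down to the JDBC source.
      .config("spark.sql.catalog.your_catalog_name.pushDownAggregate", "false")
      .config("spark.sql.catalog.your_catalog_name.pushDownLimit", "false")
      .config("spark.sql.catalog.your_catalog_name.pushDownOffset", "false")
      .getOrCreate()

    // The aggregate below is now computed by Spark instead of the database.
    spark.sql("SELECT count(*) FROM your_catalog_name.public.some_table").show()
  }
}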
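The remaining properties are ordinary SQL configurations, so they can be restored per session or per job. The sketch below is only an illustration of the mechanism, using spark.sql.legacy.interval.enabled as the example flag; the same pattern applies to the other spark.sql.legacy.* properties listed above.

import org.apache.spark.sql.SparkSession

object LegacyFlagExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark3-legacy-flag-example")
      // Option 1: set the flag when the session is built (equivalent to
      // spark-submit --conf spark.sql.legacy.interval.enabled=true).
      .config("spark.sql.legacy.interval.enabled", "true")
      .getOrCreate()

    // Option 2: change it at runtime for the current session.
    spark.conf.set("spark.sql.legacy.interval.enabled", "true")

    // Option 3: use the SQL SET command, e.g. from a notebook cell or beeline.
    spark.sql("SET spark.sql.legacy.interval.enabled=true")

    // With the flag enabled, date subtraction yields CalendarIntervalType
    // (the Spark 3.1-and-earlier behavior) instead of DayTimeIntervalType.
    spark.sql("SELECT date'2021-03-31' - date'2021-01-01' AS diff").printSchema()
  }
}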