Created on 10-05-2016 11:38 AM - edited 09-16-2022 08:42 AM
We are using Spark SQL with the Parquet data format, and Avro as the schema format. We are trying to use "aliases" on field names and are running into issues when using the alias name in a SELECT.
Sample schema, where each field has both a name and an alias:
{ "namespace": "com.test.profile", "type": "record", "name": "profile", "fields": [ {"name": "ID", "type": "string"}, {"name": “F1", "type": ["null","int"], "default": "null", "aliases": [“F1_ALIAS"]}, {"name": “F2", "type": ["null","int"], "default": "null", "aliases": [“F2_ALIAS"]} ] }
Code for SELECT:
val profile = sqlContext.read.parquet("/user/test/parquet_files/*")
profile.registerTempTable("profile")
val features = sqlContext.sql("SELECT F1_ALIAS from profile")
It will throw the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`F1_ALIAS`' given input columns: [ID, F1, F2]
Any suggestions for this use case?
On a side note, what characters are allowed in aliases? e.g. is "!" allowed?
Thank you in advance!
Created 10-11-2016 02:00 PM
Hi Msun,
Regarding your questions:
> What characters are allowed in aliases? e.g. is "!" allowed?
Avro aliases follow the same rules as names[1]; essentially, a name:
1) Must start with [A-Za-z_]
2) May subsequently contain only [A-Za-z0-9_]
So a character like "!" is not allowed in a name or an alias.
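For illustration only, here is a minimal Scala check of those rules (the pattern and helper below are hypothetical, not part of Avro or Spark):

// Hypothetical helper that mirrors the Avro name rules quoted above.
val avroNamePattern = "^[A-Za-z_][A-Za-z0-9_]*$".r
def isValidAvroName(s: String): Boolean = avroNamePattern.pattern.matcher(s).matches()

isValidAvroName("F1_ALIAS")   // true
isValidAvroName("F1_ALIAS!")  // false: "!" is not in [A-Za-z0-9_]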
> Is there any way to get the alias information once it's loaded into a DataFrame? There's a "printSchema"
> API that lets you print the schema names, but there's no counterpart for printing the aliases. Is
> it possible to get the mapping from name to aliases from the DF?
Per our CDH 5.7 documentation[2], the spark-avro library strips all doc, aliases, and other fields[3] when data is loaded into Spark, so the alias-to-name mapping is not available from the DataFrame.
To work around this, we recommend querying by the original field name rather than the alias, since the Avro aliases are stripped during loading; a sketch follows.
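As a rough sketch of that workaround (using the same placeholder path and table name as the question, and the Spark 1.x sqlContext API shown above):

// Load the Parquet data and register it, as in the original snippet.
val profile = sqlContext.read.parquet("/user/test/parquet_files/*")
profile.registerTempTable("profile")

// Query by the original field name, since the alias F1_ALIAS is not known to Spark.
val features = sqlContext.sql("SELECT F1 FROM profile")

// If downstream code expects the alias, rename the column after the fact.
val featuresAliased = features.withColumnRenamed("F1", "F1_ALIAS")
featuresAliased.printSchema()  // shows F1_ALIAS; the Avro alias metadata itself is gone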
References:
[1] http://avro.apache.org/docs/1.8.1/spec.html#names
[2] https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_avro.html
[3] Avro to Spark SQL Conversion (from the spark-avro documentation):
The spark-avro library supports conversion for all Avro data types.
The spark-avro library supports simple union types, but not complex union types.
All doc, aliases, and other fields are stripped when they are loaded into Spark.