Support Questions

Find answers, ask questions, and share your expertise

Spark-sql fails to use "SELECT" on Aliases on Parquet files (as defined in Avro schema)

New Contributor

We are using Spark SQL with the Parquet data format, and Avro as the schema format. We are trying to use "aliases" on field names and are running into issues when referencing an alias in a SELECT.

 

Sample schema, where fields F1 and F2 each have both a name and an alias:

 

{ "namespace": "com.test.profile",
  "type": "record",
  "name": "profile",
  "fields": [
    {"name": "ID", "type": "string"},
    {"name": "F1", "type": ["null", "int"], "default": null, "aliases": ["F1_ALIAS"]},
    {"name": "F2", "type": ["null", "int"], "default": null, "aliases": ["F2_ALIAS"]}
  ]
}

 
Code for SELECT:

 

val profile = sqlContext.read.parquet("/user/test/parquet_files/*")
profile.registerTempTable("profile")
val features = sqlContext.sql("SELECT F1_ALIAS FROM profile")

 

It will throw the following exception:

 

org.apache.spark.sql.AnalysisException: cannot resolve '`F1_ALIAS`' given input columns: [ID, F1, F2]

 

Any suggestions for this use case? 

 

On a side note, what characters are allowed in aliases? e.g. is "!" allowed?

 

Thank you in advance!

ACCEPTED SOLUTION

Expert Contributor

Hi Msun,

Regarding your questions:

> What characters are allowed in aliases? e.g. is "!" allowed?
Avro aliases follow the same rules as names[1], essentially:
1) Must start with [A-Za-z_]
2) May subsequently contain only [A-Za-z0-9_]
So "!" is not allowed.
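As a quick illustration (not from the original thread; Python used here just for a self-contained sketch), the two rules above boil down to a single regular expression:

```python
import re

# Avro name rule (aliases follow the same rule as names):
# starts with [A-Za-z_], then zero or more of [A-Za-z0-9_].
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_avro_name(name: str) -> bool:
    """Return True if `name` is a legal Avro name/alias."""
    return bool(AVRO_NAME.match(name))

print(is_valid_avro_name("F1_ALIAS"))  # True
print(is_valid_avro_name("field!"))    # False: "!" is not allowed
print(is_valid_avro_name("1field"))    # False: cannot start with a digit
```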


> Is there any way to get the alias information once it's loaded into a DataFrame? There's a "printSchema"
> API that lets you print the schema names, but there's no counterpart for printing the aliases. Is
> it possible to get the name-to-alias mapping from the DataFrame?
Per our CDH 5.7 documentation[2], the spark-avro library strips all doc, aliases, and other fields[3] when they are loaded into Spark.

To work around this, we recommend querying with the original field names rather than the aliases, since the aliases are no longer available after loading.
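Since the aliases are stripped on load, one workaround sketch (hypothetical, not from the thread; plain Python standard library for illustration) is to parse the Avro schema file yourself, build an alias-to-name mapping, and generate a SELECT that re-exposes the original columns under their alias names:

```python
import json

# Sample schema as in the question; in practice you would read this
# from the .avsc file instead of inlining it.
schema_json = """
{ "namespace": "com.test.profile",
  "type": "record",
  "name": "profile",
  "fields": [
    {"name": "ID", "type": "string"},
    {"name": "F1", "type": ["null", "int"], "default": null, "aliases": ["F1_ALIAS"]},
    {"name": "F2", "type": ["null", "int"], "default": null, "aliases": ["F2_ALIAS"]}
  ]
}
"""

def alias_map(schema_str):
    """Map each alias to the real field name declared in the Avro schema."""
    schema = json.loads(schema_str)
    return {alias: f["name"]
            for f in schema["fields"]
            for alias in f.get("aliases", [])}

def select_with_aliases(schema_str, table):
    """Build a SELECT that re-exposes fields under their alias names."""
    schema = json.loads(schema_str)
    cols = []
    for f in schema["fields"]:
        aliases = f.get("aliases", [])
        cols.append(f'{f["name"]} AS {aliases[0]}' if aliases else f["name"])
    return f'SELECT {", ".join(cols)} FROM {table}'

print(alias_map(schema_json))
# {'F1_ALIAS': 'F1', 'F2_ALIAS': 'F2'}
print(select_with_aliases(schema_json, "profile"))
# SELECT ID, F1 AS F1_ALIAS, F2 AS F2_ALIAS FROM profile
```

The generated statement could then be passed to sqlContext.sql(...), so downstream code can keep using the alias names while the Parquet files keep the original column names.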

References:
[1] http://avro.apache.org/docs/1.8.1/spec.html#names
[2] https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_avro.html
[3] Avro to Spark SQL Conversion:
The spark-avro library supports conversion for all Avro data types:

  • boolean -> BooleanType
  • int -> IntegerType
  • long -> LongType
  • float -> FloatType
  • double -> DoubleType
  • bytes -> BinaryType
  • string -> StringType
  • record -> StructType
  • enum -> StringType
  • array -> ArrayType
  • map -> MapType
  • fixed -> BinaryType

The spark-avro library supports the following union types:

  • union(int, long) -> LongType
  • union(float, double) -> DoubleType
  • union(any, null) -> any

The spark-avro library does not support complex union types.

 

All doc, aliases, and other fields are stripped when they are loaded into Spark.

 

