Created on 10-05-2016 11:38 AM - edited 09-16-2022 08:42 AM
We are using Spark SQL with the Parquet data format, and Avro as the schema format. We are trying to use "aliases" on field names and are running into issues when using the alias name in a SELECT.
Sample schema, where each field has both a name and an alias:
{ "namespace": "com.test.profile", "type": "record", "name": "profile", "fields": [ {"name": "ID", "type": "string"}, {"name": “F1", "type": ["null","int"], "default": "null", "aliases": [“F1_ALIAS"]}, {"name": “F2", "type": ["null","int"], "default": "null", "aliases": [“F2_ALIAS"]} ] }
Code for SELECT:
val profile = sqlContext.read.parquet("/user/test/parquet_files/*")
profile.registerTempTable("profile")
val features = sqlContext.sql("SELECT F1_ALIAS from profile")
It will throw the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`F1_ALIAS`' given input columns: [ID, F1, F2]
Any suggestions for this use case?
On a side note, what characters are allowed in aliases? e.g. is "!" allowed?
Thank you in advance!
Created 10-11-2016 02:00 PM
Hi Msun,
Regarding your questions:
> What characters are allowed in aliases? e.g. is "!" allowed?
Avro aliases follow the same rules as names[1]; essentially, a name:
1) Must start with [A-Za-z_]
2) May subsequently contain only [A-Za-z0-9_]
So a character like "!" is not allowed in a name or an alias.
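For illustration only, here is a minimal Scala check of those rules (the pattern and helper below are hypothetical, not part of Avro or Spark):

// Hypothetical helper that mirrors the Avro name rules quoted above.
val avroNamePattern = "^[A-Za-z_][A-Za-z0-9_]*$".r
def isValidAvroName(s: String): Boolean = avroNamePattern.pattern.matcher(s).matches()

isValidAvroName("F1_ALIAS")   // true
isValidAvroName("F1_ALIAS!")  // false: "!" is not in [A-Za-z0-9_]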
> Is there any way to get the alias information once it's loaded into a DataFrame? There's a "printSchema"
> API that lets you print the schema names, but there's no counterpart for printing the aliases. Is
> it possible to get the mapping from name to aliases from the DF?
Per our CDH 5.7 documentation[2], the spark-avro library strips all doc, aliases, and other fields[3] when data is loaded into Spark, so the alias-to-name mapping is not available from the DataFrame.
To work around this, we recommend querying by the original field name rather than the alias, since the Avro aliases are stripped during loading; a sketch follows.
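As a rough sketch of that workaround (using the same placeholder path and table name as the question, and the Spark 1.x sqlContext API shown above):

// Load the Parquet data and register it, as in the original snippet.
val profile = sqlContext.read.parquet("/user/test/parquet_files/*")
profile.registerTempTable("profile")

// Query by the original field name, since the alias F1_ALIAS is not known to Spark.
val features = sqlContext.sql("SELECT F1 FROM profile")

// If downstream code expects the alias, rename the column after the fact.
val featuresAliased = features.withColumnRenamed("F1", "F1_ALIAS")
featuresAliased.printSchema()  // shows F1_ALIAS; the Avro alias metadata itself is gone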
References:
[1] http://avro.apache.org/docs/1.8.1/spec.html#names
[2] https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_avro.html
[3] Avro to Spark SQL Conversion (from the spark-avro documentation):
The spark-avro library supports conversion for all Avro data types.
The spark-avro library supports simple union types, but not complex union types.
All doc, aliases, and other fields are stripped when they are loaded into Spark.