In this article, we will learn how to query Hive table data in Spark using regular expressions to match column names.
Assume we have a table with columns named col1, col2, col3, col4, col5, etc. To select the data, we would normally write a query like select col1, col2, col3, col4, col5 from table. Instead of listing every column, we can use a regular expression in the column specification: select `col.*` from table.
In Hive, this feature is supported from Hive 0.13.0 onwards. It is disabled by default; to enable it, run set hive.support.quoted.identifiers=none.
In Spark, this feature is supported from Spark 2.3.0 onwards. It is disabled by default; to enable it, run set spark.sql.parser.quotedRegexColumnNames=true.
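Note that the Spark property can also be enabled when launching spark-shell, or at runtime on the session; a minimal sketch, equivalent to the SET statement demonstrated later in this article:
spark-shell --conf spark.sql.parser.quotedRegexColumnNames=true
scala> // same setting applied at runtime (session-scoped)
scala> spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")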
beeline -u jdbc:hive2://host:port
> create database if not exists regex_test;
> create external table if not exists regex_test.regex_test_tbl (col1 int, col2 string, col3 float, `timestamp` timestamp) stored as PARQUET;
> insert into regex_test.regex_test_tbl values (1, 'Ranga', 23000, cast('1988-01-06 00:59:59.345' as timestamp));
> insert into regex_test.regex_test_tbl values (2, 'Nishanth', 38000, cast('2018-05-29 17:32:59.345' as timestamp));
> insert into regex_test.regex_test_tbl values (3, 'Raja', 18000, cast('2067-05-29 17:32:59.345' as timestamp));
> SET hive.cli.print.header=true;
> select * from regex_test.regex_test_tbl;
+----------------------+----------------------+----------------------+---------------------------+
| regex_test_tbl.col1 | regex_test_tbl.col2 | regex_test_tbl.col3 | regex_test_tbl.timestamp |
+----------------------+----------------------+----------------------+---------------------------+
| 1 | Ranga | 23000.0 | 1988-01-06 00:59:59.345 |
| 2 | Nishanth | 38000.0 | 2018-05-29 17:32:59.345 |
| 3 | Raja | 18000.0 | 2067-05-29 17:32:59.345 |
+----------------------+----------------------+----------------------+---------------------------+
> select `col.*` from regex_test.regex_test_tbl;
FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'col.*': (possible column names are: col1, col2, col3, timestamp)
> set hive.support.quoted.identifiers=none;
> select `col.*` from regex_test.regex_test_tbl;
+----------------------+----------------------+----------------------+
| regex_test_tbl.col1 | regex_test_tbl.col2 | regex_test_tbl.col3 |
+----------------------+----------------------+----------------------+
| 1 | Ranga | 23000.0 |
| 2 | Nishanth | 38000.0 |
| 3 | Raja | 18000.0 |
+----------------------+----------------------+----------------------+
> select `timestamp` from regex_test.regex_test_tbl;
+--------------------------+
| timestamp |
+--------------------------+
| 1988-01-06 00:59:59.345 |
| 2018-05-29 17:32:59.345 |
| 2067-05-29 17:32:59.345 |
+--------------------------+
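The regex column specification can also exclude columns. A sketch based on the REGEX Column Specification section of the Hive documentation: the possessive pattern `(col1)?+.+` should match every column except col1.
> select `(col1)?+.+` from regex_test.regex_test_tbl;  -- every column except col1
This should return only the col2, col3, and timestamp columns for all three rows.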
# spark-shell
scala> spark.sql("select * from regex_test.regex_test_tbl").show(truncate=false)
+----+--------+-------+-----------------------+
|col1|col2 |col3 |timestamp |
+----+--------+-------+-----------------------+
|2 |Nishanth|38000.0|2018-05-29 17:32:59.345|
|1 |Ranga |23000.0|1988-01-06 00:59:59.345|
|3 |Raja |18000.0|2067-05-29 17:32:59.345|
+----+--------+-------+-----------------------+
scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: cannot resolve '`col.*`' given input columns: [spark_catalog.regex_test.regex_test_tbl.col1, spark_catalog.regex_test.regex_test_tbl.col2, spark_catalog.regex_test.regex_test_tbl.col3, spark_catalog.regex_test.regex_test_tbl.timestamp]; line 1 pos 7;
'Project ['`col.*`]
+- SubqueryAlias spark_catalog.regex_test.regex_test_tbl
+- Relation[col1#203,col2#204,col3#205,timestamp#206] parquet
scala> spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show(false)
+---------------------------------------+-----+
|key |value|
+---------------------------------------+-----+
|spark.sql.parser.quotedRegexColumnNames|true |
+---------------------------------------+-----+
scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
+----+--------+-------+
|col1|col2 |col3 |
+----+--------+-------+
|2 |Nishanth|38000.0|
|1 |Ranga |23000.0|
|3 |Raja |18000.0|
+----+--------+-------+
scala> spark.sql("select `timestamp` from regex_test.regex_test_tbl").show(false)
+-----------------------+
|timestamp |
+-----------------------+
|2018-05-29 17:32:59.345|
|1988-01-06 00:59:59.345|
|2067-05-29 17:32:59.345|
+-----------------------+
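The same regex selection is also available through the DataFrame API via Dataset.colRegex (likewise added in Spark 2.3.0), which interprets a backquoted column name as a regular expression. A minimal sketch using the table above:
scala> val df = spark.table("regex_test.regex_test_tbl")
scala> df.select(df.colRegex("`col.*`")).show(false) // expected to select col1, col2, and col3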
Note: Currently in Spark, there is a limitation when selecting data with an alias name:
scala> spark.sql("select `col1` as `col` from regex_test.regex_test_tbl").show(false)
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'alias';
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
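The error appears because, with spark.sql.parser.quotedRegexColumnNames=true, the backquoted `col1` in the alias expression is itself parsed as a regex, and a regex expansion cannot be aliased. A possible workaround, worth verifying in your environment, is to leave the column name unquoted in the alias expression:
scala> spark.sql("select col1 as col from regex_test.regex_test_tbl").show(false) // unquoted col1 is not treated as a regex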
Thanks for visiting this article, and happy learning!