Introduction

In this article, we will learn how to query Hive table data by selecting columns with regular expressions, in both Hive and Spark.

Assume we have a table with columns named col1, col2, col3, col4, col5, and so on. To select the data, we would normally write select col1, col2, col3, col4, col5 from the table. Instead of listing every column, we can use a regular expression in the select list, such as select `col.*` from table; the backticked pattern is matched against the table's column names, and every matching column is returned.

In Hive, this feature is supported from Hive 0.13.0 onwards. It is disabled by default; to enable it, we need to run set hive.support.quoted.identifiers=none.

In Spark, this feature is supported from Spark 2.3.0 onwards. It is also disabled by default; to enable it, we need to run set spark.sql.parser.quotedRegexColumnNames=true.
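
Before walking through the detailed steps, here is the whole pattern of use in one place. This is a minimal sketch; some_db.some_table is a placeholder for a table whose columns include col1, col2, and so on:

    -- Hive (0.13.0 onwards): backticked names are parsed as regular expressions
    set hive.support.quoted.identifiers=none;
    select `col.*` from some_db.some_table;

    -- Spark SQL (2.3.0 onwards): same idea, different configuration key
    SET spark.sql.parser.quotedRegexColumnNames=true;
    select `col.*` from some_db.some_table;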

Steps

Hive

  1. Create the Hive table and insert the data:
    beeline -u jdbc:hive2://host:port
    > create database if not exists regex_test;
    > create external table if not exists regex_test.regex_test_tbl (col1 int, col2 string, col3 float, `timestamp` timestamp) stored as PARQUET;
    
    > insert into regex_test.regex_test_tbl values (1, 'Ranga', 23000, cast('1988-01-06 00:59:59.345' as timestamp));
    > insert into regex_test.regex_test_tbl values (2, 'Nishanth', 38000, cast('2018-05-29 17:32:59.345' as timestamp));
    > insert into regex_test.regex_test_tbl values (3, 'Raja', 18000, cast('2067-05-29 17:32:59.345' as timestamp));
  2. Select the regex_test.regex_test_tbl table data:
    > SET hive.cli.print.header=true;
    > select * from regex_test.regex_test_tbl;
    +----------------------+----------------------+----------------------+---------------------------+
    | regex_test_tbl.col1  | regex_test_tbl.col2  | regex_test_tbl.col3  | regex_test_tbl.timestamp  |
    +----------------------+----------------------+----------------------+---------------------------+
    | 1                    | Ranga                | 23000.0              | 1988-01-06 00:59:59.345   |
    | 2                    | Nishanth             | 38000.0              | 2018-05-29 17:32:59.345   |
    | 3                    | Raja                 | 18000.0              | 2067-05-29 17:32:59.345   |
    +----------------------+----------------------+----------------------+---------------------------+
  3. Without setting hive.support.quoted.identifiers=none, try to run the query using a regular expression. The following error is raised:
    > select `col.*` from regex_test.regex_test_tbl;
    FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'col.*': (possible column names are: col1, col2, col3, timestamp)
  4. Now set hive.support.quoted.identifiers=none and execute the above query (a more selective pattern is sketched after this list):
    > set hive.support.quoted.identifiers=none;
    
    > select `col.*` from regex_test.regex_test_tbl;
    
    +----------------------+----------------------+----------------------+
    | regex_test_tbl.col1  | regex_test_tbl.col2  | regex_test_tbl.col3  |
    +----------------------+----------------------+----------------------+
    | 1                    | Ranga                | 23000.0              |
    | 2                    | Nishanth             | 38000.0              |
    | 3                    | Raja                 | 18000.0              |
    +----------------------+----------------------+----------------------+
  5. We can also select a specific column by name. With quoted identifiers set to none, the backticked name is itself treated as a regular expression, which here matches only the timestamp column:
    > select `timestamp` from regex_test.regex_test_tbl;
    
    +--------------------------+
    |        timestamp         |
    +--------------------------+
    | 1988-01-06 00:59:59.345  |
    | 2018-05-29 17:32:59.345  |
    | 2067-05-29 17:32:59.345  |
    +--------------------------+
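
The pattern does not have to match every column with a shared prefix. As a sketch under the same session settings: with hive.support.quoted.identifiers=none, the backticked string is parsed as a Java regular expression and matched against each column name, so alternation can pick an explicit subset of columns (output omitted here):

    -- sketch: with quoted identifiers set to none, any Java regular
    -- expression works; alternation picks just col1 and col3
    > select `(col1|col3)` from regex_test.regex_test_tbl;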

Spark

  1. Launch spark-shell and execute the following query to select the regex_test.regex_test_tbl table data:
    # spark-shell
    scala> spark.sql("select * from regex_test.regex_test_tbl").show(truncate=false)
    +----+--------+-------+-----------------------+
    |col1|col2    |col3   |timestamp              |
    +----+--------+-------+-----------------------+
    |2   |Nishanth|38000.0|2018-05-29 17:32:59.345|
    |1   |Ranga   |23000.0|1988-01-06 00:59:59.345|
    |3   |Raja    |18000.0|2067-05-29 17:32:59.345|
    +----+--------+-------+-----------------------+
  2. Without setting spark.sql.parser.quotedRegexColumnNames=true, try to run the query using a regular expression. We will get the following error:
    scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
    org.apache.spark.sql.AnalysisException: cannot resolve '`col.*`' given input columns: [spark_catalog.regex_test.regex_test_tbl.col1, spark_catalog.regex_test.regex_test_tbl.col2, spark_catalog.regex_test.regex_test_tbl.col3, spark_catalog.regex_test.regex_test_tbl.timestamp]; line 1 pos 7;
    'Project ['`col.*`]
    +- SubqueryAlias spark_catalog.regex_test.regex_test_tbl
       +- Relation[col1#203,col2#204,col3#205,timestamp#206] parquet
  3. Now set spark.sql.parser.quotedRegexColumnNames=true and execute the above query:
    scala> spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show(false)
    +---------------------------------------+-----+
    |key                                    |value|
    +---------------------------------------+-----+
    |spark.sql.parser.quotedRegexColumnNames|true |
    +---------------------------------------+-----+
    scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
    +----+--------+-------+
    |col1|col2    |col3   |
    +----+--------+-------+
    |2   |Nishanth|38000.0|
    |1   |Ranga   |23000.0|
    |3   |Raja    |18000.0|
    +----+--------+-------+
  4. As in Hive, we can also select a specific column by name; the backticked identifier is treated as a regular expression that here matches only the timestamp column:
    scala> spark.sql("select `timestamp` from regex_test.regex_test_tbl").show(false)
    +-----------------------+
    |timestamp              |
    +-----------------------+
    |2018-05-29 17:32:59.345|
    |1988-01-06 00:59:59.345|
    |2067-05-29 17:32:59.345|
    +-----------------------+
     Note: Currently in Spark there is a limitation: when spark.sql.parser.quotedRegexColumnNames=true, a backquoted column reference cannot be given an alias (a workaround is sketched after this list).
    scala> spark.sql("select `col1` as `col` from regex_test.regex_test_tbl").show(false)
    org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'alias';
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
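
A simple workaround, as a sketch: leave the identifiers unquoted. With spark.sql.parser.quotedRegexColumnNames=true, only backquoted names are parsed as regular expressions, so a plain column reference can still be aliased normally (run the statement via spark.sql in the shell above):

    -- workaround sketch: unquoted identifiers are never parsed as regular
    -- expressions, so a plain column reference can still be aliased
    select col1 as col from regex_test.regex_test_tbl;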

Thanks for visiting this article and happy learning!
