Community Articles

Find and share helpful community-sourced technical articles.
Labels (2)
avatar
Master Collaborator

Introduction

In this article, we will learn how to query the Hive tables data by using column names with regular expressions in Spark.

Assume we have a table with column names like col1, col2, col3, col4, col5, etc. If we want to select the data, we will use queries like select col1, col2, col3, col4, col5 from the table. Instead of specifying col1, col2, col3, col4, col5, we can use regular expressions while selecting columns like select `col.*` from table.

From Hive, this feature is supported from Hive 0.13.0 onwards. By default, this feature is disabled, and in order to enable it, we need to use set hive.support.quoted.identifiers=none.

From Spark, this feature is supported from Spark 2.3.0 onwards. By default this feature is disabled, and in order to enable it, we need use set spark.sql.parser.quotedRegexColumnNames=true.

Steps

Hive

  1. Create the Hive table and insert the data:
    beeline -u jdbc:hive2://host:port
    > create database if not exists regex_test;
    > create external table if not exists regex_test.regex_test_tbl (col1 int, col2 string, col3 float, `timestamp` timestamp) stored as PARQUET;
    
    > insert into regex_test.regex_test_tbl values (1, 'Ranga', 23000, cast('1988-01-06 00:59:59.345' as timestamp));
    > insert into regex_test.regex_test_tbl values (2, 'Nishanth', 38000, cast('2018-05-29 17:32:59.345' as timestamp));
    > insert into regex_test.regex_test_tbl values (3, 'Raja', 18000, cast('2067-05-29 17:32:59.345' as timestamp));
  2. Select the regex_test.regex_test_tbl table data.
    > SET hive.cli.print.header=true;
    > select * from regex_test.regex_test_tbl;
    +----------------------+----------------------+----------------------+---------------------------+
    | regex_test_tbl.col1  | regex_test_tbl.col2  | regex_test_tbl.col3  | regex_test_tbl.timestamp  |
    +----------------------+----------------------+----------------------+---------------------------+
    | 1                    | Ranga                | 23000.0              | 1988-01-06 00:59:59.345   |
    | 2                    | Nishanth             | 38000.0              | 2018-05-29 17:32:59.345   |
    | 3                    | Raja                 | 18000.0              | 2067-05-29 17:32:59.345   |
    +----------------------+----------------------+----------------------+---------------------------+
  3. Without setting set hive.support.quoted.identifiers=none; try to run the query using regular expressions. The following error is noticed:
    > select `col.*` from regex_test.regex_test_tbl;
    FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'col.*': (possible column names are: col1, col2, col3, timestamp)
  4. Now set the hive.support.quoted.identifiers=none and execute the above query.
    > set hive.support.quoted.identifiers=none;
    
    > select `col.*` from regex_test.regex_test_tbl;
    
    +----------------------+----------------------+----------------------+
    | regex_test_tbl.col1  | regex_test_tbl.col2  | regex_test_tbl.col3  |
    +----------------------+----------------------+----------------------+
    | 1                    | Ranga                | 23000.0              |
    | 2                    | Nishanth             | 38000.0              |
    | 3                    | Raja                 | 18000.0              |
    +----------------------+----------------------+----------------------+
  5. We can also select a specific column name while querying the data.
    > select `timestamp` from regex_test.regex_test_tbl;
    
    +--------------------------+
    |        timestamp         |
    +--------------------------+
    | 1988-01-06 00:59:59.345  |
    | 2018-05-29 17:32:59.345  |
    | 2067-05-29 17:32:59.345  |
    +--------------------------+

 

 

Spark

  1. Launch the spark shell and execute the following query to select the regex_test.regex_test_tbl table data.
    # spark-shell
    scala> spark.sql("select * from regex_test.regex_test_tbl").show(truncate=false)
    +----+--------+-------+-----------------------+
    |col1|col2    |col3   |timestamp              |
    +----+--------+-------+-----------------------+
    |2   |Nishanth|38000.0|2018-05-29 17:32:59.345|
    |1   |Ranga   |23000.0|1988-01-06 00:59:59.345|
    |3   |Raja    |18000.0|2067-05-29 17:32:59.345|
    +----+--------+-------+-----------------------+
  2. Without setting spark.sql.parser.quotedRegexColumnNames=true, try to run the query using regular expressions. We will get the following error:
    scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
    org.apache.spark.sql.AnalysisException: cannot resolve '`col.*`' given input columns: [spark_catalog.regex_test.regex_test_tbl.col1, spark_catalog.regex_test.regex_test_tbl.col2, spark_catalog.regex_test.regex_test_tbl.col3, spark_catalog.regex_test.regex_test_tbl.timestamp]; line 1 pos 7;
    'Project ['`col.*`]
    +- SubqueryAlias spark_catalog.regex_test.regex_test_tbl
       +- Relation[col1#203,col2#204,col3#205,timestamp#206] parquet
  3. Now set the spark.sql.parser.quotedRegexColumnNames=true and execute the above query.
    scala> spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show(false)
    +---------------------------------------+-----+
    |key                                    |value|
    +---------------------------------------+-----+
    |spark.sql.parser.quotedRegexColumnNames|true |
    +---------------------------------------+-----+
    scala> spark.sql("select `col.*` from regex_test.regex_test_tbl").show(false)
    +----+--------+-------+
    |col1|col2    |col3   |
    +----+--------+-------+
    |2   |Nishanth|38000.0|
    |1   |Ranga   |23000.0|
    |3   |Raja    |18000.0|
    +----+--------+-------+
  4. We can also select a specific column name while querying the data.
    scala> spark.sql("select `timestamp` from regex_test.regex_test_tbl").show(false)
    +-----------------------+
    |timestamp              |
    +-----------------------+
    |2018-05-29 17:32:59.345|
    |1988-01-06 00:59:59.345|
    |2067-05-29 17:32:59.345|
    +-----------------------+
     Note: Currently in Spark, there is a limitation while selecting the data with the alias name.
    scala> spark.sql("select `col1` as `col` from regex_test.regex_test_tbl").show(false)
    org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'alias';
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)

 

 

 

 

References: 

Thanks for visiting this article and happy learning !!

4,395 Views