JDBC/ODBC: regexp_* functions' behavior depends on unrelated comparisons and database used

avatar
New Contributor

I've found that the regexp_extract and regexp_replace functions behave differently depending on unrelated comparisons in the same query and on the database used.

 

Consider the following script:

create schema test;
create table test.a (text string);
insert into test.a values ("a");

 

The following query behaves inconsistently with the Impala documentation:

select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.a;

[image.png: screenshot of the query result]

 

According to the Impala documentation, a double backslash should be used as the regex escape sequence. However, that doesn't work here (see Col2 in the above result); the regex only matches when a single backslash is used.

 

If we add an unrelated comparison to the query, this behaviour changes:

select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.a;

[image.png: screenshot of the query result]

 

Now a double backslash is required for the regex to work correctly. The result is the same if the comparison uses another table column or is moved into the WHERE clause.

 

This strange behavior only occurs when running queries over JDBC/ODBC against a non-default database. Hue and impala-shell work as expected, and JDBC/ODBC queries also work as expected when executed on tables in the default database.

 

I've tested this on CDH5.15.0 and 5.13.1 with JDBC-2.6.3.1004 and ODBC v2.5.37.1014 (32bit) drivers.

 

Is this a bug or am I missing something? Anyone else experiencing the same issue?

1 ACCEPTED SOLUTION

avatar
Contributor

When reproducing the issue, we observed the following:

 

1./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.test;) with the newest connector, through a client, the Impala parser got the following input:

 

SELECT regexp_extract(`test`.`text`,'\\w',0), regexp_extract(`test`.`text`,'\\\\w',0), `test`.`text` FROM `test`.`test`

 

2./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test;), the parser input is:

 

select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test

 

We also ran multiple queries and noticed that when the query contains a " character, the driver passes the statement through as is. When there is no " in the query, the driver rewrites the statement, adding backticks and doubling the backslashes, so the escape sequences end up double-escaped.
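
To make the effect of the doubled backslashes concrete, here is a minimal sketch in plain Java. It is not the driver's or Impala's actual code: driverRewrite and unescapeLiteral are hypothetical helpers that model "the driver doubles every backslash" and "the SQL parser consumes one level of backslash escaping", and java.util.regex stands in for Impala's regex engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleEscapeDemo {
    // Hypothetical model of the driver rewrite: every backslash is doubled.
    static String driverRewrite(String literal) {
        return literal.replace("\\", "\\\\");
    }

    // Simplified model (an assumption) of SQL string-literal unescaping:
    // a backslash followed by any character yields just that character.
    static String unescapeLiteral(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length()) {
                out.append(literal.charAt(++i)); // drop the escape, keep the next char
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Emulates regexp_extract(text, literal, 0) on the literal the parser receives.
    static String extract(String text, String literalAsReceived) {
        Matcher m = Pattern.compile(unescapeLiteral(literalAsReceived)).matcher(text);
        return m.find() ? m.group(0) : "";
    }

    public static void main(String[] args) {
        // "\\w" is the SQL literal '\w'; "\\\\w" is the SQL literal '\\w'.
        for (String literal : new String[] {"\\w", "\\\\w"}) {
            System.out.println("literal " + literal
                + " | passed through: [" + extract("a", literal) + "]"
                + " | rewritten by driver: [" + extract("a", driverRewrite(literal)) + "]");
        }
    }
}
```

Under this model, '\\w' matches "a" only when the statement is passed through unchanged, and '\w' matches only after the rewrite, which is the flip described in the original question.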

 

The above leads us to believe that the problem is probably in the connector, and not in Impala. For now, this can be worked around by including a " character in the query or by enabling the "UseNativeQuery" option. That option ensures that the driver does not transform the queries emitted by an application and runs them as is, as explained on the Simba documentation page [1].

 

[1] https://www.simba.com/products/Impala/doc/ODBC_InstallGuide/win/content/odbc/options/usenativequery....
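
For JDBC, the option can usually be set as a connection property in the URL. The sketch below assumes the driver accepts semicolon-separated Key=Value properties after the host and port, and that UseNativeQuery=1 turns the setting on; please check this against the install guide for your driver version. NativeQueryUrl and its methods are names made up for this example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class NativeQueryUrl {
    // Builds a connection URL with query translation disabled.
    // Assumption: properties are appended as ";Key=Value" after host:port.
    static String build(String host, int port) {
        return "jdbc:impala://" + host + ":" + port + ";UseNativeQuery=1";
    }

    // Opens a connection with the URL above; requires the Impala JDBC
    // driver jar on the classpath and a reachable Impala daemon.
    static Connection connect(String host, int port) throws SQLException {
        return DriverManager.getConnection(build(host, port), "", "");
    }
}
```

With this in place, a statement such as the first test query should reach Impala exactly as written, so the double-backslash form from the Impala documentation is the one to use.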

6 REPLIES

avatar

Thanks for letting us know about this, I'll see what I can do to get it fixed in a future release.

avatar

Hi @Svyat, I tested with Impala JDBC 2.6.4 and it appears to be resolved.

 

I now get this output from my test program with the two different versions of the query:

 

Running query 1: select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.a;
col0= (null=false) col1=a (null=false) col2=a (null=false) 
Running query 2: select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = 'a' from test.a;
col0= (null=false) col1=a (null=false) col2=a (null=false) 

Test code is:

import java.sql.*;

public class JDBCRegex {
   // JDBC driver name and database URL
   static final String JDBC_DRIVER = "com.cloudera.impala.jdbc41.Driver";
   static final String DB_URL = "jdbc:impala://localhost:21050/";

   public static void main(String[] args) {
   Connection conn = null;
   Statement stmt = null;
   try{
      Class.forName(JDBC_DRIVER);
      System.out.println("Connecting to a selected database...");
      conn = DriverManager.getConnection(DB_URL, "", "");
      System.out.println("Connected database successfully...");

      System.out.println("Creating statement...");
      stmt = conn.createStatement();

      String sql = "select regexp_extract(text, '\\w', 0), regexp_extract(text, '\\\\w', 0), text from test.a;";
      ResultSet rs = stmt.executeQuery(sql);
      System.out.println("Running query 1: " + sql);
      while(rs.next()) {
        System.out.println("col0=" + rs.getString(1) + " (null="  + rs.wasNull() + ") " +
                          "col1=" + rs.getString(2) + " (null="  + rs.wasNull() + ") " +
                          "col2=" + rs.getString(3) + " (null="  + rs.wasNull() + ") ");
      }
      rs.close();

      // Add an unrelated comparison expression.
      sql = "select regexp_extract(text, '\\w', 0), regexp_extract(text, '\\\\w', 0), text, text = 'a' from test.a;";
      System.out.println("Running query 2: " + sql);
      rs = stmt.executeQuery(sql);
      while(rs.next()) {
        System.out.println("col0=" + rs.getString(1) + " (null="  + rs.wasNull() + ") " +
                          "col1=" + rs.getString(2) + " (null="  + rs.wasNull() + ") " +
                          "col2=" + rs.getString(3) + " (null="  + rs.wasNull() + ") ");
      }
      rs.close();
   }catch(SQLException se){
      //Handle errors for JDBC
      se.printStackTrace();
   }catch(Exception e){
      //Handle errors for Class.forName
      e.printStackTrace();
   }finally{
      //finally block used to close resources
      try{
         if(stmt!=null)
            stmt.close();
      }catch(SQLException se){
      }// do nothing
      try{
         if(conn!=null)
            conn.close();
      }catch(SQLException se){
         se.printStackTrace();
      }
   }
}
}

I ran from the command line with:

 

javac JDBCRegex.java && CLASSPATH=~/ClouderaImpalaJDBC-2.6.4.1005/ImpalaJDBC41.jar:. time java JDBCRegex

avatar
New Contributor

Thank you for checking @Tim Armstrong!

 

We have also found that the issue does not occur when querying Impala through the Java API, even with older connectors. However, it does occur when you use another client.

 

I checked the 2.6.4 connector and the issue still persists when running the query through SQLWorkbench.

 

Have you tried using a different JDBC client?

avatar

That's interesting - those tools ultimately must go through the Java API but I wonder if they're using different APIs or something. Presumably the bug isn't in the tools themselves.

avatar
New Contributor

I think a similar issue exists with LIKE searches, where using a backslash to escape certain characters leads to wrong output.