Created on 07-05-2018 06:02 AM - edited 09-16-2022 06:25 AM
I've found that the regexp_extract and regexp_replace functions behave differently, depending on comparisons done in the same query or the database used.
Consider the following script:
create schema test; create table test.a (text string); insert into test.a values ("a");
The following query behaves inconsistently with the Impala documentation:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.a;
According to the Impala documentation, double backslashes should be used as a regex escape character. However, it doesn't work here (see Col2 in the above result). Instead, it does work when using a single backslash.
If we add an unrelated comparison to the query, this behaviour changes:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.a;
Now, a double backslash is required for the regex to function correctly. The result is identical, if one uses another table column in the comparison or puts the comparison into the where-clause.
This strange behavior is only present when running queries over JDBC/ODBC on a non-default database. Hue and Impala-Shell work as expected. And JDBC/ODBC-queries work as expected when executed on tables in the default database.
I've tested this on CDH5.15.0 and 5.13.1 with JDBC-2.6.3.1004 and ODBC v2.5.37.1014 (32bit) drivers.
Is this a bug or am I missing something? Anyone else experiencing the same issue?
Created on 09-18-2018 12:16 AM - edited 09-18-2018 01:24 AM
When reproducing the issue, we observed the following -
1./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.test;) with the newest connector, through a client, the Impala parser got the following input:
SELECT regexp_extract(`test`.`text`,'\\w',0), regexp_extract(`test`.`text`,'\\\\w',0), `test`.`text` FROM `test`.`test`
2./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test;), the parser input is:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test
We also ran multiple queries and noticed when there is a " character in the query, the driver passes through the statement as is. When there is no " in the query, the driver will use backticks and backslashes in the statement causing double escaping the characters.
The above leads us to believe that the problem is probably in the connector, and not in Impala. For now, this can be resolved by using a '' in the query or use "UseNativeQuery". This helps in ensuring that the driver does not transform the queries emitted by an application, and runs it as is, as explained on the Simba documentation page [1].
Created on 09-18-2018 12:16 AM - edited 09-18-2018 01:24 AM
When reproducing the issue, we observed the following -
1./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.test;) with the newest connector, through a client, the Impala parser got the following input:
SELECT regexp_extract(`test`.`text`,'\\w',0), regexp_extract(`test`.`text`,'\\\\w',0), `test`.`text` FROM `test`.`test`
2./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test;), the parser input is:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test
We also ran multiple queries and noticed when there is a " character in the query, the driver passes through the statement as is. When there is no " in the query, the driver will use backticks and backslashes in the statement causing double escaping the characters.
The above leads us to believe that the problem is probably in the connector, and not in Impala. For now, this can be resolved by using a '' in the query or use "UseNativeQuery". This helps in ensuring that the driver does not transform the queries emitted by an application, and runs it as is, as explained on the Simba documentation page [1].
Created 09-18-2018 09:33 AM
Thanks for letting us know about this, I'll see what I can do to get it fixed in a future release.
Created 09-19-2018 04:05 PM
Hi @Svyat, I tested with Impala JDBC 2.6.4 and it appears to be resolved.
I now get this output from my test program with the two different version of the query:
Running query 1: select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.a; col0= (null=false) col1=a (null=false) col2=a (null=false) Running query 2: select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = 'a' from test.a; col0= (null=false) col1=a (null=false) col2=a (null=false)
Test code is:
import java.sql.*; public class JDBCRegex { // JDBC driver name and database URL static final String JDBC_DRIVER = "com.cloudera.impala.jdbc41.Driver"; static final String DB_URL = "jdbc:impala://localhost:21050/"; public static void main(String[] args) { Connection conn = null; Statement stmt = null; try{ Class.forName(JDBC_DRIVER); System.out.println("Connecting to a selected database..."); conn = DriverManager.getConnection(DB_URL, "", ""); System.out.println("Connected database successfully..."); System.out.println("Creating statement..."); stmt = conn.createStatement(); String sql = "select regexp_extract(text, '\\w', 0), regexp_extract(text, '\\\\w', 0), text from test.a;"; ResultSet rs = stmt.executeQuery(sql); System.out.println("Running query 1: " + sql); while(rs.next()) { System.out.println("col0=" + rs.getString(1) + " (null=" + rs.wasNull() + ") " + "col1=" + rs.getString(2) + " (null=" + rs.wasNull() + ") " + "col2=" + rs.getString(3) + " (null=" + rs.wasNull() + ") "); } rs.close(); // Add an unrelated comparison expression. sql = "select regexp_extract(text, '\\w', 0), regexp_extract(text, '\\\\w', 0), text, text = 'a' from test.a;"; System.out.println("Running query 2: " + sql); rs = stmt.executeQuery(sql); while(rs.next()) { System.out.println("col0=" + rs.getString(1) + " (null=" + rs.wasNull() + ") " + "col1=" + rs.getString(2) + " (null=" + rs.wasNull() + ") " + "col2=" + rs.getString(3) + " (null=" + rs.wasNull() + ") "); } rs.close(); }catch(SQLException se){ //Handle errors for JDBC se.printStackTrace(); }catch(Exception e){ //Handle errors for Class.forName e.printStackTrace(); }finally{ //finally block used to close resources try{ if(stmt!=null) conn.close(); }catch(SQLException se){ }// do nothing try{ if(conn!=null) conn.close(); }catch(SQLException se){ se.printStackTrace(); } } } }
I ran from the command line with:
javac JDBCRegex.java && CLASSPATH=~/ClouderaImpalaJDBC-2.6.4.1005/ImpalaJDBC41.jar:. time java JDBCRegex
Created on 09-20-2018 04:26 AM - edited 09-20-2018 04:26 AM
Thank you for checking @Tim Armstrong!
We have also found that the issue does not exist when querying Impala through the Java API, even with older connectors. However, it does, when you use another client.
I checked the 2.6.4 connector and the issue still persists when running the query through SQLWorkbench.
Have you tried using a differen JDBC client?
Created 09-20-2018 03:00 PM
That's interesting - those tools ultimately must go through the Java API but I wonder if they're using different APIs or something. Presumably the bug isn't in the tools themselves.
Created 11-08-2019 03:29 AM
I think similar issue exist with a like search where using a backslash to escape certain characters leads to wrong output