- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
JDBC/ODBC: regexp_* functions' behavior depends on unrelated comparisons and database used
- Labels:
-
Apache Impala
Created on ‎07-05-2018 06:02 AM - edited ‎09-16-2022 06:25 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I've found that the regexp_extract and regexp_replace functions behave differently, depending on comparisons done in the same query or the database used.
Consider the following script:
create schema test; create table test.a (text string); insert into test.a values ("a");
The following query behaves inconsistently with the Impala documentation:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.a;
According to the Impala documentation, double backslashes should be used as a regex escape character. However, it doesn't work here (see Col2 in the above result). Instead, it does work when using a single backslash.
If we add an unrelated comparison to the query, this behaviour changes:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.a;
Now, a double backslash is required for the regex to function correctly. The result is identical, if one uses another table column in the comparison or puts the comparison into the where-clause.
This strange behavior is only present when running queries over JDBC/ODBC on a non-default database. Hue and Impala-Shell work as expected. And JDBC/ODBC-queries work as expected when executed on tables in the default database.
I've tested this on CDH5.15.0 and 5.13.1 with JDBC-2.6.3.1004 and ODBC v2.5.37.1014 (32bit) drivers.
Is this a bug or am I missing something? Anyone else experiencing the same issue?
Created on ‎09-18-2018 12:16 AM - edited ‎09-18-2018 01:24 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When reproducing the issue, we observed the following -
1./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.test;) with the newest connector, through a client, the Impala parser got the following input:
SELECT regexp_extract(`test`.`text`,'\\w',0), regexp_extract(`test`.`text`,'\\\\w',0), `test`.`text` FROM `test`.`test`
2./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test;), the parser input is:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test
We also ran multiple queries and noticed when there is a " character in the query, the driver passes through the statement as is. When there is no " in the query, the driver will use backticks and backslashes in the statement causing double escaping the characters.
The above leads us to believe that the problem is probably in the connector, and not in Impala. For now, this can be resolved by using a '' in the query or use "UseNativeQuery". This helps in ensuring that the driver does not transform the queries emitted by an application, and runs it as is, as explained on the Simba documentation page [1].
Created on ‎09-18-2018 12:16 AM - edited ‎09-18-2018 01:24 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
When reproducing the issue, we observed the following -
1./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.test;) with the newest connector, through a client, the Impala parser got the following input:
SELECT regexp_extract(`test`.`text`,'\\w',0), regexp_extract(`test`.`text`,'\\\\w',0), `test`.`text` FROM `test`.`test`
2./ When running the query (select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test;), the parser input is:
select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = "a" from test.test
We also ran multiple queries and noticed when there is a " character in the query, the driver passes through the statement as is. When there is no " in the query, the driver will use backticks and backslashes in the statement causing double escaping the characters.
The above leads us to believe that the problem is probably in the connector, and not in Impala. For now, this can be resolved by using a '' in the query or use "UseNativeQuery". This helps in ensuring that the driver does not transform the queries emitted by an application, and runs it as is, as explained on the Simba documentation page [1].
Created ‎09-18-2018 09:33 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for letting us know about this, I'll see what I can do to get it fixed in a future release.
Created ‎09-19-2018 04:05 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Svyat, I tested with Impala JDBC 2.6.4 and it appears to be resolved.
I now get this output from my test program with the two different version of the query:
Running query 1: select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text from test.a; col0= (null=false) col1=a (null=false) col2=a (null=false) Running query 2: select regexp_extract(text, '\w', 0), regexp_extract(text, '\\w', 0), text, text = 'a' from test.a; col0= (null=false) col1=a (null=false) col2=a (null=false)
Test code is:
import java.sql.*; public class JDBCRegex { // JDBC driver name and database URL static final String JDBC_DRIVER = "com.cloudera.impala.jdbc41.Driver"; static final String DB_URL = "jdbc:impala://localhost:21050/"; public static void main(String[] args) { Connection conn = null; Statement stmt = null; try{ Class.forName(JDBC_DRIVER); System.out.println("Connecting to a selected database..."); conn = DriverManager.getConnection(DB_URL, "", ""); System.out.println("Connected database successfully..."); System.out.println("Creating statement..."); stmt = conn.createStatement(); String sql = "select regexp_extract(text, '\\w', 0), regexp_extract(text, '\\\\w', 0), text from test.a;"; ResultSet rs = stmt.executeQuery(sql); System.out.println("Running query 1: " + sql); while(rs.next()) { System.out.println("col0=" + rs.getString(1) + " (null=" + rs.wasNull() + ") " + "col1=" + rs.getString(2) + " (null=" + rs.wasNull() + ") " + "col2=" + rs.getString(3) + " (null=" + rs.wasNull() + ") "); } rs.close(); // Add an unrelated comparison expression. sql = "select regexp_extract(text, '\\w', 0), regexp_extract(text, '\\\\w', 0), text, text = 'a' from test.a;"; System.out.println("Running query 2: " + sql); rs = stmt.executeQuery(sql); while(rs.next()) { System.out.println("col0=" + rs.getString(1) + " (null=" + rs.wasNull() + ") " + "col1=" + rs.getString(2) + " (null=" + rs.wasNull() + ") " + "col2=" + rs.getString(3) + " (null=" + rs.wasNull() + ") "); } rs.close(); }catch(SQLException se){ //Handle errors for JDBC se.printStackTrace(); }catch(Exception e){ //Handle errors for Class.forName e.printStackTrace(); }finally{ //finally block used to close resources try{ if(stmt!=null) conn.close(); }catch(SQLException se){ }// do nothing try{ if(conn!=null) conn.close(); }catch(SQLException se){ se.printStackTrace(); } } } }
I ran from the command line with:
javac JDBCRegex.java && CLASSPATH=~/ClouderaImpalaJDBC-2.6.4.1005/ImpalaJDBC41.jar:. time java JDBCRegex
Created on ‎09-20-2018 04:26 AM - edited ‎09-20-2018 04:26 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for checking @Tim Armstrong!
We have also found that the issue does not exist when querying Impala through the Java API, even with older connectors. However, it does, when you use another client.
I checked the 2.6.4 connector and the issue still persists when running the query through SQLWorkbench.
Have you tried using a differen JDBC client?
Created ‎09-20-2018 03:00 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That's interesting - those tools ultimately must go through the Java API but I wonder if they're using different APIs or something. Presumably the bug isn't in the tools themselves.
Created ‎11-08-2019 03:29 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think similar issue exist with a like search where using a backslash to escape certain characters leads to wrong output
