Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Impala ODBC performance

avatar
Expert Contributor

Hi,

 we are trying to download a bulk of data from CDH cluster via Windows ODBC Driver for Impala version 2.5.22 to a Windows server.  The ODBC driver works well, but the performance of rows dispatching is really bad - roughly 3M rows/minute.  We checked the possible bottlenecks for this kind of download, but the cluster and also the receiving Windows server were not under load at all, the cpu around 5%, the network cards running on 10Gbit, there are plenty of RAM memory, the target disk where the data is written is RAID-0 SSD with 1GB/s max throughput, so we dont know what component on the trasnfer slows down the records. 

 

We tried to run in multiple parallel threads, what helped a little bit (50% perf increase) but the overall perf is still low..

Also tried to tweak the transfer batch size in ODBC driver, it looks that it doesnt affect the performance at all.

 

The setup is CDH5.3, and Microsoft SQL Server 2014, the Impala is linked via linked server in MS SQL.

 

Any ideas how to increase the transfer speed?

Thanks

 

Tomas

 

 

1 ACCEPTED SOLUTION

avatar
Cloudera Employee
We set up two linked servers from on SQL Server instance to another SQL Server instance. One using our ODBC driver for SQL Server and one using our OLE DB provider for SQL Server. We tested using a table with 500000 rows, 7 columns and 5 different datatypes. We ran the performance tests using both the OPENQUERY syntax and without the OPENQUERY syntax but the performance is quite close between the two syntax.

View solution in original post

8 REPLIES 8

avatar
Contributor

Confirm it.

JDBC drivers are more faster.

 

select * from table1 limit 10000

ODBC driver (linked server to MS SQL 2014): 45 s

JDBC driver: 1 s

avatar
Expert Contributor

Created a case from this issue, hopefully the engineering team will come back with a solution

Tomas

 

avatar
Explorer
Can you try the DataDirect ODBC driver for Impala to help rule out issues at the odbc layer? This is the fastest driver available, so should help point to driver or application depending on results. If storing data in SQL, have you considered SSIS?

https://www.progress.com/products/data-sources/cloudera-impala-odbc-driver-cloudera-impala-jdbc-driv...

avatar
Contributor

Hi SAsInSumit,

 

Will try it today.

 

Thanks

avatar
Explorer

One difference I noticed with DataDirect Impala ODBC driver is that it has a very large default varchar size for string data types to prevent possible data corruption, and SQL Server Linked Server doesnt like the length.  

 

To fix, make sure to set Max Varchar Size = 4000 in the advanced tab.

avatar
Expert Contributor

Hi, we installed 64bit ODBC driver from DataDirect for Impala and tried to establish a connection between SQL Server 2014 (running on Windows Srv 2012R2) and Cloudera Impala. After setting up the ODBC driver, the test connection was ok.

But the linked server is not working, listing tables works, but a simple select statmenet returns this kind of error:

 

OLE DB provider "MSDASQL" for linked server "IMPALA" returned message "Unspecified error".

Msg 7311, Level 16, State 2, Line 1

Cannot obtain the schema rowset "DBSCHEMA_COLUMNS" for OLE DB provider "MSDASQL" for linked server "IMPALA". The provider supports the interface, but returns a failure code when it is used.

 

I also contacted the technical team from Progress Software but no response yet,

 

Any ideas?

avatar
Cloudera Employee
We set up two linked servers from on SQL Server instance to another SQL Server instance. One using our ODBC driver for SQL Server and one using our OLE DB provider for SQL Server. We tested using a table with 500000 rows, 7 columns and 5 different datatypes. We ran the performance tests using both the OPENQUERY syntax and without the OPENQUERY syntax but the performance is quite close between the two syntax.

avatar
Explorer

Thank you mkempanna.  Performance will depend on which SQL queries you run through the Linked Server.  For example, complex queries are typically faster when pushed down to the database using OPENQUERY.  If you just select * from the table, both syntaxes will perform similarly.