Best Practices to access HIVE from IBM SPSS

Contributor

At Daimler, we use IBM SPSS Modeler extensively for data science work, and we access Hive from IBM SPSS through the Hortonworks ODBC connection. Data access performance is poor, and reading data takes a very long time.

Please suggest the most efficient way to access Hive data from IBM SPSS. If someone has already run benchmarks, sharing them would be highly appreciated.

Scenarios:

We used small tables with joins, and IBM SPSS still took a long time.

We tried combining the data on the Hive side and then filtering in IBM SPSS, but the performance was still poor.

Thank you for your suggestions and help.

1 ACCEPTED SOLUTION

Master Guru

There are several options here:

a) Did you enable SQL optimization in SPSS (requires the Modeler Server licence)? Once it is enabled, SPSS can push tasks into the Hive data source. I am not sure whether Hive is a supported data source, but I would assume so; the documentation should say.

https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/sql_overview.htm
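To illustrate why pushing work into the data source helps, here is a minimal sketch. It uses Python's built-in sqlite3 purely as a local stand-in for Hive, and the `sales` table and its columns are made up for the example. It only compares how many rows cross the connection when the filter runs on the client versus inside the database; it says nothing about how SPSS implements its optimization.

```python
import sqlite3

# sqlite3 is only a local stand-in for the Hive data source;
# the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU" if i % 4 == 0 else "US", i) for i in range(1000)],
)

# Filter on the client: every row is transferred, then filtered locally.
all_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_filtered = [row for row in all_rows if row[0] == "EU"]

# Filter in the database: only matching rows are transferred.
pushed_rows = conn.execute(
    "SELECT region, amount FROM sales WHERE region = ?", ("EU",)
).fetchall()

print("rows transferred, client-side filter:", len(all_rows))
print("rows transferred, database-side filter:", len(pushed_rows))
```

Both approaches yield the same result set, but the second one moves only a quarter of the rows in this toy data. Over an ODBC link to a cluster, that difference in transferred volume is usually what dominates.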

b) SPSS also supports a set of UDFs for in-database scoring, but that is not what you want here.

c) Finally, there is the SPSS Analytic Server, which can essentially run most functions as a MapReduce job on the cluster.

ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/1.0/English/IBM_SPSS_Analytic_Server_1_Users_Guide.pdf

Unfortunately, if you have neither the Modeler Server licence nor Analytic Server, there is not much you can do besides manually pushing prefilters into the Hive database or optimizing your SPSS jobs further.
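As a rough sketch of what pushing a prefilter by hand could look like, the helper below builds a query that selects only the needed columns and applies the filters on the Hive side. The function name, table, and columns are hypothetical, and whether your ODBC driver accepts `?` parameter markers is an assumption you would need to verify.

```python
def build_prefiltered_query(table, columns, filters):
    """Return (sql, params) that selects only `columns` from `table`
    and applies every equality filter in the database.

    `table` and the column names are interpolated directly, so they
    must come from trusted code, never from user input.
    """
    select_list = ", ".join(columns)
    sql = f"SELECT {select_list} FROM {table}"
    where = " AND ".join(f"{col} = ?" for col in filters)
    if where:
        sql += f" WHERE {where}"
    return sql, list(filters.values())


# Hypothetical table and columns, for illustration only.
sql, params = build_prefiltered_query(
    "orders",
    ["order_id", "amount"],
    {"country": "DE", "order_year": 2016},
)
print(sql)
print(params)
```

The resulting statement can then be used as the query for the data source instead of reading the whole table, so that only the reduced result crosses the ODBC connection.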
