Created 07-25-2016 08:19 AM
At Daimler, we use IBM SPSS Modeler extensively for data science activities, and we access Hive from IBM SPSS through the Hortonworks ODBC connection. Data access performance is poor: queries take a huge amount of time.
Please suggest the most optimized way to access Hive data from IBM SPSS. If someone has already done a benchmark, sharing it would be highly appreciated.
Scenarios:
We used small tables with joins, and IBM SPSS still took a lot of time.
We tried combining the data on the Hive side and then filtering in IBM SPSS, but the performance was still not good.
Thank you for your suggestions and help.
Created 07-25-2016 01:14 PM
There are several options here:
a) Did you enable SQL optimization in SPSS (requires the Modeler Server licence)? Once it is enabled, SPSS can push tasks down into the Hive data source. I am not sure whether Hive is a supported data source, but I would assume so. You can check the documentation.
https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/sql_overview.htm
b) SPSS also supports a set of UDFs for in-database scoring, but that is not what you want.
c) Finally, there is the SPSS Analytic Server, which can essentially run most functions as a MapReduce job on the cluster.
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/analyticserver/1.0/English/IBM_SPSS_Analytic_Server_1_Users_Guide.pdf
Unfortunately, if you have neither the Modeler Server licence nor Analytic Server, there is not much you can do besides manually pushing prefilters into the Hive database or optimizing your SPSS jobs further.
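To illustrate what manually pushing prefilters into Hive could look like, here is a minimal Python sketch that builds a Hive SELECT with an explicit column list and a WHERE clause, so that Hive does the filtering and less data has to cross the ODBC connection. The helper `build_prefiltered_query` and the table and column names are hypothetical, made up for illustration; this is not an SPSS or Hortonworks API.

```python
def build_prefiltered_query(table, columns, filters):
    """Build a Hive SELECT that keeps only the needed columns and rows.

    Hypothetical helper for illustration: it only assembles a SQL string
    and does not talk to Hive or SPSS.
    """
    if not columns:
        # Listing columns explicitly avoids pulling unused data over ODBC.
        raise ValueError("list the columns explicitly instead of using SELECT *")
    query = "SELECT " + ", ".join(columns) + "\nFROM " + table
    if filters:
        # Each filter is a plain SQL predicate evaluated on the Hive side.
        query += "\nWHERE " + "\n  AND ".join(filters)
    return query


# Hypothetical table, column, and filter values.
query = build_prefiltered_query(
    table="sales.orders",
    columns=["order_id", "customer_id", "amount"],
    filters=["order_date >= '2016-01-01'", "region = 'EU'"],
)
print(query)
```

The resulting statement could then be used wherever you define the Hive source query, or wrapped in a Hive view, so that SPSS only ever reads the reduced result set. The filter strings are inserted as-is, so they should come from trusted input only.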