Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Data ingestion from MSSQL server to HDFS?

New Member

I want to ingest some data from MSSQL into HDFS periodically, for example every week. I only have access to a couple of stored procedures on the SQL Server. Is there any possibility of calling a stored procedure from Sqoop?

1 ACCEPTED SOLUTION

Guru

Unfortunately, when importing, Sqoop cannot invoke stored procedures on the source system -- you will have to implement the processing logic on the Hadoop side. To do this, you have three main choices:

  1. Ingest the raw data into a landing zone and use Pig to transform it (re-implementing the stored-procedure logic) into your target Hive table. Note that landing the data raw is a best practice in Hadoop: there is a good chance you will want this raw data for activities elsewhere (such as reporting or data science), and storage is cheap.
  2. Same as above, but use Hive HPL/SQL, a procedural SQL language for Hive: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
  3. Same as above, but use a third-party tool such as Syncsort's DMX-h: https://community.hortonworks.com/answers/list.html?sort=newest&customPageSize=true
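As a minimal sketch of option 1 (the connection string, credentials, table, and paths below are placeholders, not values from this thread), a weekly job could pair a Sqoop import into a dated landing directory with a Pig script that re-implements the stored-procedure logic:

```shell
# Sketch only: host, database, user, table, and paths are hypothetical.
# Step 1: land the raw table in HDFS with Sqoop (no stored procedure involved).
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.mssql.pw \
  --table orders \
  --target-dir /data/raw/orders/$(date +%Y%m%d) \
  --num-mappers 4

# Step 2: transform the raw data into the target Hive table with Pig.
# transform_orders.pig would contain the re-implemented stored-proc logic.
pig -useHCatalog transform_orders.pig
```

The weekly cadence would typically come from an Oozie coordinator or a cron entry wrapping this script.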

Notes

  • There are advantages to offloading the stored-procedure processing to Hadoop:
    • it typically takes much less time on Hadoop (parallel processing)
    • it frees resources on your source system and thus improves performance on that side
  • When exporting from Hadoop to an RDBMS, you CAN trigger a stored procedure on the RDBMS side.
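On that last note, Sqoop's export tool has a `--call` option that invokes a stored procedure on the RDBMS for each exported record instead of issuing INSERT statements. A hedged sketch, with a placeholder connection string, procedure name, and export directory:

```shell
# Sketch only: connection details and the procedure name are hypothetical.
# Each record under --export-dir becomes one call to the stored procedure,
# with the record's fields passed as the procedure's arguments.
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.mssql.pw \
  --call upsert_order \
  --export-dir /data/curated/orders
```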

If this is what you are looking for, let me know by accepting the answer; otherwise, please follow up with any remaining gaps.


2 REPLIES



+1 on suggestion #1