Data ingestion from MSSQL server to HDFS?

I want to ingest data from MSSQL into HDFS periodically, for example every week. I only have access to a couple of stored procedures on the SQL Server. Is there any possibility of calling a stored procedure from Sqoop?

1 ACCEPTED SOLUTION

Guru

Unfortunately, Sqoop cannot call stored procedures on the source system when importing; you will have to implement the processing logic on the Hadoop side. To do this, you have three main choices:

  1. Ingest the raw data into a landing zone and use Pig to transform it (implementing the stored-proc logic) into your target Hive table; see the sketch after this list. Note that landing the data raw is a best practice in Hadoop: there is a good chance you will want this raw data for activities elsewhere (such as reporting or data science), and storage is cheap.
  2. Same as above, but implement the transformation in Hive HPL/SQL, a procedural SQL language for Hive: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
  3. Same as above, but use a third-party tool such as Syncsort's DMX-h: https://community.hortonworks.com/answers/list.html?sort=newest&customPageSize=true
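
For choice 1, here is a minimal sketch of the landing-zone import. All hostnames, credentials, table names, and paths are hypothetical, and it assumes you can be granted plain SELECT access to the underlying tables (a Sqoop import reads tables or queries, not procedures):

```bash
# Weekly raw-data landing from SQL Server into HDFS (hypothetical names throughout).
# Schedule via cron, e.g.: 0 2 * * 0 /opt/ingest/land_orders.sh
sqoop import \
  --connect "jdbc:sqlserver://mssql-host:1433;databaseName=sales" \
  --username sqoop_user \
  --password-file /user/sqoop/.mssql.password \
  --table Orders \
  --target-dir /data/raw/sales/orders/$(date +%Y-%m-%d) \
  --num-mappers 4

# Then apply the stored-proc logic on the Hadoop side, e.g. with a Pig script:
pig -f transform_orders.pig
```

Writing each run into a dated target directory keeps the weekly snapshots separate, which fits the land-it-raw approach above.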

Notes

  • there are advantages to offloading the stored-proc processing to Hadoop:
    • it typically takes much less time on Hadoop (parallel processing)
    • it frees resources on your source system and thus improves performance on that side
  • when exporting from Hadoop to an RDBMS you CAN trigger a stored procedure on the RDBMS side (see the sketch below)
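
For that export path, Sqoop's --call option invokes a stored procedure for the exported records instead of inserting into a table. A minimal sketch, again with hypothetical names:

```bash
# Export processed HDFS data back to SQL Server, invoking a stored
# procedure per record rather than writing to a table directly.
sqoop export \
  --connect "jdbc:sqlserver://mssql-host:1433;databaseName=sales" \
  --username sqoop_user \
  --password-file /user/sqoop/.mssql.password \
  --call InsertProcessedOrder \
  --export-dir /data/processed/sales/orders
```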

If this is what you are looking for, let me know by accepting the answer; otherwise, please follow up with any remaining gaps.




+1 on suggestion #1