Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Data ingestion from MSSQL server to HDFS?

New Member

I want to ingest some data from MSSQL into HDFS periodically, for example every week. I only have access to a couple of stored procedures on the SQL Server. Is there any possibility of calling a stored procedure from Sqoop?

1 ACCEPTED SOLUTION

Guru

Unfortunately, when importing, Sqoop cannot invoke stored procedures on the source system -- you will have to implement the processing logic on the Hadoop side. To do this, you have three main choices:

  1. Ingest the raw data into a landing zone and use Pig to transform it (re-implementing the stored-procedure logic) into your target Hive table. Note that landing the data raw is a best practice in Hadoop: there is a good chance you will want this raw data for activities elsewhere (such as reporting or data science), and storage is cheap.
  2. Same as above, but use Hive HPL/SQL, a procedural SQL language for Hive: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
  3. Same as above, but use a third-party tool such as Syncsort's DMX-h: https://community.hortonworks.com/answers/list.html?sort=newest&customPageSize=true
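As a minimal sketch of option 1 (the connection string, credentials, table, and paths below are placeholders, not values from this thread), a weekly job could pair a Sqoop import into a dated landing directory with a Pig script that re-implements the stored-procedure logic:

```shell
# Sketch only: host, database, user, table, and paths are hypothetical.
# Step 1: land the raw table in HDFS with Sqoop (no stored procedure involved).
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.mssql.pw \
  --table orders \
  --target-dir /data/raw/orders/$(date +%Y%m%d) \
  --num-mappers 4

# Step 2: transform the raw data into the target Hive table with Pig.
# transform_orders.pig would contain the re-implemented stored-proc logic.
pig -useHCatalog transform_orders.pig
```

The weekly cadence would typically come from an Oozie coordinator or a cron entry wrapping this script.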

Notes

  • There are advantages to offloading the stored-procedure processing to Hadoop:
    • it typically takes much less time on Hadoop (parallel processing)
    • it frees resources on your source system and thus improves performance on that side
  • When exporting from Hadoop to an RDBMS, you CAN trigger a stored procedure on the RDBMS side.
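On that last note, Sqoop's export tool has a `--call` option that invokes a stored procedure on the RDBMS for each exported record instead of issuing INSERT statements. A hedged sketch, with a placeholder connection string, procedure name, and export directory:

```shell
# Sketch only: connection details and the procedure name are hypothetical.
# Each record under --export-dir becomes one call to the stored procedure,
# with the record's fields passed as the procedure's arguments.
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.mssql.pw \
  --call upsert_order \
  --export-dir /data/curated/orders
```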

If this is what you are looking for, let me know by accepting the answer; otherwise, please follow up with any remaining gaps.


2 REPLIES



+1 on suggestion #1