Data ingestion from MSSQL server to HDFS?
Labels: Apache Sqoop
Created 11-23-2016 01:54 PM
I want to ingest some data from MSSQL into HDFS periodically, for example every week. I only have access to a couple of stored procedures on the SQL Server. Is there a way to call a stored procedure from Sqoop?
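For reference, if I had direct table access, this is roughly the weekly job I would schedule (the server, database, table, and paths below are placeholders, not my real setup):

```bash
# Hypothetical weekly ingest: cron entry running every Sunday at 02:00
# 0 2 * * 0 /usr/local/bin/weekly_import.sh

# weekly_import.sh: plain table import into a dated HDFS directory
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.password \
  --table orders \
  --target-dir /data/raw/orders/$(date +%Y-%m-%d) \
  --num-mappers 4
```

But since I can only call stored procedures, I am not sure an import like this works for me.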
Created 11-23-2016 03:54 PM
Unfortunately, Sqoop cannot call stored procedures on the source system when importing -- you will have to implement the processing logic on the Hadoop side. To do this, you have three main choices:
- ingest the raw data into a landing zone and use Pig to transform it (implementing the stored proc logic) into your target Hive table; see the sketch after this list. Note that landing the data raw is a best practice in Hadoop: there is a good chance you will want this raw data for activities elsewhere (like reporting or data science), and storage is cheap.
- same as above, but implement the transformation in Hive HPL/SQL, a procedural SQL language for Hive: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
- same as above, but use a 3rd-party tool like Syncsort's DMX-h: https://community.hortonworks.com/answers/list.html?sort=newest&customPageSize=true
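A minimal sketch of option 1, assuming an illustrative landing path and a hypothetical Pig script (the Pig script is where you would re-implement the stored proc logic; options 2 and 3 simply swap that step for HPL/SQL or DMX-h):

```bash
# Step 1: land the raw rows in HDFS, untouched (connection details are placeholders)
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.password \
  --table orders \
  --target-dir /data/landing/orders/$(date +%Y-%m-%d) \
  --as-textfile

# Step 2: transform the landed data into the target Hive table with Pig
# (transform.pig is a hypothetical script implementing the stored proc logic)
pig -param INPUT=/data/landing/orders/$(date +%Y-%m-%d) transform.pig
```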
Notes
- there are advantages to offloading the stored proc processing to Hadoop:
  - it typically takes much less time on Hadoop (parallel processing)
  - it frees resources on your source system, which improves performance on that side
- when exporting from Hadoop to an RDBMS, you CAN trigger a stored procedure on the RDBMS side (see the sketch below)
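On that last note: Sqoop's export tool accepts --call in place of --table, invoking a stored procedure for each exported record. A minimal sketch (the procedure name, connection details, and paths are placeholders):

```bash
# Export HDFS data back to SQL Server by calling a stored procedure per record
# (insert_order_proc and all connection details are placeholders)
sqoop export \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.password \
  --call insert_order_proc \
  --export-dir /data/processed/orders
```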
If this is what you are looking for, let me know by accepting the answer; else, please follow up with any remaining gaps.
Created 11-24-2016 12:17 AM
+1 on suggestion #1
