Created 11-23-2016 01:57 PM
I want to ingest some data from MSSQL into HDFS periodically for example every week. I only have access to a couple of stored procedures on the sql server. Is there possibility of calling stored procedure in sparksql?
Created 11-23-2016 04:03 PM
Unfortunately, you cannot call a stored procedure from Spark SQL.
See the following for alternatives, all of which involve offloading the stored procedure logic to Hadoop.
Note: there are advantages to offloading the stored proc processing to Hadoop.
Created 11-23-2016 04:45 PM
Well, in terms of offloading the stored procedure, I do not have access to the database server; I have only been given access to those stored procedures so I can run them. Is it wise to run the stored procedure in a Java application and then use, for example, Spark to write the results to HDFS?
Thanks.
Created 11-23-2016 05:00 PM
I am not clear on the access parts here. Typically the offloading is done by rewriting the logic with a Hadoop tool as mentioned. It seems like your choice is between easiest and most scalable. Easiest is to simply process the stored proc wherever/however, as long as the result lands in HDFS. Most scalable is to rewrite the logic with one of the Hadoop alternatives. (Not sure if this is clear ... let me know if not)
Created 11-23-2016 09:30 PM
Thanks. I meant that I don't have access to any tables or views on the database server, so I don't know the logic underneath; I am only able to call the stored procedures.
Created 11-23-2016 10:27 PM
Given that constraint, it looks like your best option is to call it on the source system, dump the results to a table in the source db, and then Sqoop from there.
If you cannot dump to a table, you could call it from a Java program, dump the results to the local file system, and get them into Hadoop using the Linux shell HDFS commands (see the sketch after the links below). The Java program would have to be on an edge node connected to Hadoop.
linux shell hdfs commands: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html
edge node: http://www.dummies.com/programming/big-data/hadoop/edge-nodes-in-hadoop-clusters/
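To make that option concrete, here is a minimal sketch of such a Java program, assuming the SQL Server JDBC driver is on the classpath. The connection URL, credentials, stored procedure name (dbo.usp_weekly_extract), and output path are placeholders to replace with your own:

```java
import java.io.PrintWriter;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

public class StoredProcToLocalFile {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- replace with your SQL Server host, db, and credentials.
        String url = "jdbc:sqlserver://mssql-host:1433;databaseName=mydb";

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             // Hypothetical stored procedure name; only the call syntax matters here.
             CallableStatement stmt = conn.prepareCall("{call dbo.usp_weekly_extract}");
             ResultSet rs = stmt.executeQuery();
             PrintWriter out = new PrintWriter("/tmp/weekly_extract.csv")) {

            ResultSetMetaData meta = rs.getMetaData();
            int cols = meta.getColumnCount();

            // Write each row of the result set as a comma-separated line.
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    if (i > 1) line.append(',');
                    line.append(rs.getString(i));
                }
                out.println(line);
            }
        }
        // From the edge node, the file can then be pushed to HDFS with:
        //   hdfs dfs -put /tmp/weekly_extract.csv /data/weekly_extract/
    }
}
```

Scheduled weekly (for example via cron on the edge node), this would cover the periodic ingestion requirement.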
If the Java program cannot be on an edge node, you would need to FTP the results to the edge node and then put them to Hadoop, or transfer the results straight to Hadoop via the WebHDFS REST API (a rough sketch follows the links below). These last two are least favorable for performance reasons.
java hdfs api: https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html
WebHdfs REST api: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
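For the WebHDFS route, here is a rough sketch of the two-step file create call using plain HttpURLConnection. The NameNode host, default web port 50070, target path, and user.name value are assumptions for illustration; a secured cluster would require proper authentication:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsUpload {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode web address, target HDFS path, and user -- adjust for your cluster.
        String createUrl = "http://namenode-host:50070/webhdfs/v1/data/weekly_extract.csv"
                + "?op=CREATE&user.name=etl&overwrite=true";

        // Step 1: ask the NameNode where to write; it answers with a redirect to a DataNode.
        HttpURLConnection nn = (HttpURLConnection) new URL(createUrl).openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);
        nn.connect();
        String dataNodeLocation = nn.getHeaderField("Location");
        nn.disconnect();

        // Step 2: stream the local file to the DataNode URL returned in the Location header.
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeLocation).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        try (OutputStream out = dn.getOutputStream();
             InputStream in = new FileInputStream("/tmp/weekly_extract.csv")) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        // 201 Created indicates the file was written to HDFS.
        System.out.println("Response code: " + dn.getResponseCode());
    }
}
```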
Created 12-01-2016 02:18 PM
Thanks for your answers. Recently I noticed that NiFi might help with this problem. Can I implement this dataflow in NiFi?
Created 12-01-2016 02:44 PM
That is a good question. I would post this as a separate question to get the full attention of the NiFi experts. I think you could simplify the question and the requirements by stating that you need to use NiFi to call a Java program that calls a stored procedure and puts the results in HDFS. I believe this is possible, but there may be some points to consider.