Data ingestion from MSSQL server to HDFS?

I want to ingest some data from MSSQL into HDFS periodically, for example every week. I only have access to a couple of stored procedures on the SQL Server. Is there a way to call a stored procedure from Spark SQL?

1 ACCEPTED SOLUTION

Guru

Unfortunately, you cannot.

The alternative is to offload the stored procedure logic to Hadoop, i.e. rewrite it with Hadoop-side tools (a rough sketch of this approach follows below).

Note: there are advantages to offloading the stored proc processing to Hadoop:

  • it typically takes much less time on Hadoop (parallel processing)
  • it frees resources on your source system and thus improves performance on that side

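For illustration, here is a minimal sketch of the offload approach using Spark's JDBC data source from Java. It assumes you can read the underlying tables directly (rather than only the stored procedures) and that the procedure's logic is reimplemented as Spark transformations; the connection string, credentials, table name, filter column, and HDFS path are all hypothetical, and the SQL Server JDBC driver plus the Spark libraries must be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OffloadToHadoop {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mssql-offload")
                .getOrCreate();

        // Read the source table over JDBC (hypothetical connection details).
        Dataset<Row> source = spark.read()
                .format("jdbc")
                .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDb")
                .option("dbtable", "dbo.Orders")
                .option("user", "etl_user")
                .option("password", "etl_password")
                .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
                .load();

        // Reimplement the stored procedure's logic as Spark transformations,
        // e.g. keep only the last week of data (hypothetical column name).
        Dataset<Row> weekly = source.filter("order_date >= date_sub(current_date(), 7)");

        // Land the result in HDFS; the job itself can then be scheduled weekly.
        weekly.write().mode("overwrite").parquet("hdfs:///data/mssql/orders_weekly");

        spark.stop();
    }
}
```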

7 REPLIES

Well, in terms of offloading the stored procedure: I do not have access to the database server; I have only been given access to those stored procedures so I can run them. Would it be reasonable to run the stored procedure from a Java application and then use, for example, Spark to write the results to HDFS?

Thanks.

Guru

I am not clear on the access details here. Typically the offloading is done by rewriting the logic with a Hadoop tool, as mentioned. It seems your choice is between easiest and most scalable: easiest is to simply process the stored proc wherever and however you can, as long as the result lands in HDFS; most scalable is to rewrite the logic on Hadoop. (Let me know if this is not clear.)

Thanks. I meant that I don't have access to any tables or views on the database server, so I don't know the logic underneath. I am only able to call the stored procedures.

Guru

Given that constraint, it looks like your best option is to call it on the source system, dump the results to a table in the source database, and then Sqoop the table from there.

If you cannot dump to a table, you could call the stored procedure from a Java program, write the results to the local file system, and move them into Hadoop using the HDFS shell commands. The Java program would have to be on an edge node connected to the Hadoop cluster.

HDFS shell commands: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html

Edge nodes: http://www.dummies.com/programming/big-data/hadoop/edge-nodes-in-hadoop-clusters/

If the Java program cannot be on an edge node, you would need to FTP the results to the edge node and then put them into HDFS, or transfer them straight to Hadoop via the WebHDFS REST API. These last two options are the least favorable for performance reasons.

Java HDFS API: https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html

WebHDFS REST API: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
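As a concrete illustration of the edge-node route, here is a minimal sketch in Java that calls a stored procedure over JDBC and streams the result set straight into HDFS with the FileSystem API, so no intermediate local file or separate put step is needed. The procedure name, connection details, and output path are hypothetical; the SQL Server JDBC driver and the Hadoop client libraries are assumed to be on the classpath, and error handling is kept to a minimum.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProcToHdfs {
    public static void main(String[] args) throws Exception {
        // JDBC connection to SQL Server (hypothetical connection details).
        String jdbcUrl = "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDb";

        // Hadoop configuration is picked up from the edge node's core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/mssql/weekly_extract.csv");   // hypothetical HDFS path

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "etl_user", "etl_password");
             CallableStatement stmt = conn.prepareCall("{call dbo.usp_WeeklyExtract}");   // hypothetical proc
             ResultSet rs = stmt.executeQuery();
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(fs.create(out, true), StandardCharsets.UTF_8))) {

            ResultSetMetaData meta = rs.getMetaData();
            int cols = meta.getColumnCount();

            // Write each row of the result set as a comma-separated line in HDFS.
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    if (i > 1) line.append(',');
                    line.append(rs.getString(i));
                }
                writer.write(line.toString());
                writer.newLine();
            }
        }
        fs.close();
    }
}
```

A variant of the same program could instead write to the local file system and push the file with the HDFS shell commands linked above.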

Thanks for your answers. I recently noticed that NiFi might be able to help with this problem. Can I implement this dataflow in NiFi?

Guru

That is a good question. I would post it as a separate question to get the full attention of the NiFi experts. I think you could simplify the question and the requirements by stating that you need NiFi to call a Java program that calls a stored procedure and puts the results in HDFS. I believe this is possible, but there may be some points to consider.