Can someone provide a rough order of magnitude/level-of-effort for converting or migrating PLSQL language (5,000-10,000 lines of code need to be migrated to Pig and Hive)?
There is also HPL/SQL.
It was incorporated into the Hive 2.0 release (Feb 16th) and thus will be automatically included in the next release of HDP that includes Hive 2.0.
It can be very easily added to an existing HDP install as well.
It is an implementation of stored procs, control flow logic, temporary variables, error checking, on-the-fly SQL syntax conversion to HiveQL, RDBMS SQL syntax compatibility, etc. It brings almost all the functionality of DB2, SQL Server, Oracle stored procs and SQL syntax into Hive. An added bonus is that you can mix & match the syntax of SQL Server T-SQL, Oracle PLSQL and DB2. It is slick.
For example - here is a sample script that runs in HPL/SQL (for real):
create procedure spLoadSalesFact (LoadMonth char(10)) BEGIN SET plhql.onerror = seterror; print 'Beginning execution...'; print '*' || LoadMonth || '*'; /* Fail if invalid Input Parameter set LoadMonth = Upper(LoadMonth); if (LoadMonth not in ('LAST', 'CURRENT')) BEGIN print 'Bad Input Paramter'; RETURN -1; END */ /* Load Prior Month */ IF (LoadMonth = 'LAST') BEGIN print 'Inserting Last Month''s Data...'; insert into table saleslineitem select * from saleslineitem where TransactionDate >= '2011-06-01' and transactionDate < '2011-07-01'; if SQLCODE <> 0 RETURN -2; END /* Load Current Month */ ELSE IF (LoadMonth = 'CURRENT') BEGIN print 'Inserting Current month''s data...'; insert into table saleslineitem select * from saleslineitem where TransactionDate >= '2011-07-01' and transactionDate < '2011-08-01'; if SQLCODE <> 0 RETURN -3; END PRINT 'Execution Completed...'; END /* Stored Proc invocation */ -- CALL spLoadSalesFact ('GARBAGE'); -- CALL spLoadSalesFact ('LAST'); CALL spLoadSalesFact ('CURRENT');
Are there any performance improvements of not creating custom UDF function ? I have currently some text-analytics functions, for which I can either create UDF jar files or use HPLSQL.@bpreachuk Are there any performance issues that you faced on using HPLSQL ?
Hi @tanmoy. I would recommend you go the custom route - create UDF jars for Hive. That way you can use them anywhere you use Hive code. HPL/SQL is still not a GA feature, and in my opinion HPL/SQL use a little bit more time to "bake" and become more robust.