Member since: 03-24-2017
Posts: 28
Kudos Received: 1
Solutions: 0
09-12-2018
12:11 PM
All, I am writing from Hive to an RDBMS (SQL Server) using Spark, and the process runs with great speed. But there is a big issue: each task does not commit until it completes, which fills up the database transaction log and can impact other jobs running against the database. I need some way to commit at a regular interval (every 10,000 rows or so). Can someone please suggest how this can be done? Spark version: 2.2. SQL Server 2016. Thanks, freakabhi
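The direction I am considering is to bypass the DataFrameWriter and commit manually from foreachPartition. A minimal sketch, assuming pyodbc is available on every executor; the connection string, table, and column names below are placeholders:

    import pyodbc  # assumed to be installed on the executors

    COMMIT_INTERVAL = 10000  # commit every 10,000 rows

    def write_partition(rows):
        # Placeholder DSN; substitute real server, database, and credentials.
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=sqlhost;DATABASE=mydb;UID=user;PWD=secret")
        conn.autocommit = False
        cursor = conn.cursor()
        n = 0
        for row in rows:
            cursor.execute(
                "INSERT INTO target_table (col1, col2) VALUES (?, ?)",
                row.col1, row.col2)
            n += 1
            if n % COMMIT_INTERVAL == 0:
                conn.commit()  # release transaction-log space periodically
        conn.commit()  # commit the tail of the partition
        cursor.close()
        conn.close()

    df = spark.sql("SELECT col1, col2 FROM hive_db.source_table")
    df.foreachPartition(write_partition)

Row-at-a-time execute is slow; in practice the inserts would be batched with cursor.executemany, but the periodic-commit pattern is the point here.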
Labels:
- Apache Spark
07-03-2017
01:37 AM
I would like suggestions on the incremental-load strategy to implement for our tables. We have a set of source tables that are refreshed daily in the source system (DB2), and we need to refresh them in the Hive DB as well. The source tables receive new inserts as well as updates to existing records. Which approach would you suggest?
1) Approach 1: use HBase to store the data, since updates are allowed there, and build a Hive external table referring to it. I suspect, though, that joins between such a Hive-HBase table and large ORC Hive tables will suffer; is that the case?
2) Approach 2: use the four-step incremental-update approach suggested by HDP: https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
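To make approach 2 concrete, here is roughly what its reconcile step would look like in PySpark; the table names are placeholders, and I am assuming every row carries an id key and a modified_ts change timestamp:

    from pyspark.sql import Window, functions as F

    base = spark.table("db.base_table")          # current Hive copy
    delta = spark.table("db.incremental_table")  # today's DB2 extract

    # The newest record per key wins; the two schemas are assumed identical.
    w = Window.partitionBy("id").orderBy(F.col("modified_ts").desc())

    reconciled = (base.union(delta)
                  .withColumn("rn", F.row_number().over(w))
                  .where("rn = 1")
                  .drop("rn"))

    reconciled.write.mode("overwrite").saveAsTable("db.reporting_table")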
Labels:
- Apache HBase
- Apache Hive
06-14-2017
10:58 PM
All, I would like suggestions on the correct way to convert very large queries (around 1,000 lines, joining 10+ tables with complicated transforms) into a PySpark program. I also have a question about writing Spark SQL programs: is there a performance difference between
1) SQLContext.sql("select count(*) from (select distinct col1, col2 from table)")
2) using the PySpark API: df.select("col1", "col2").distinct().count()?
I am from a SQL background and we are working on converting existing logic to Hadoop, so SQL is handy for me.
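My understanding is that both routes go through the same Catalyst optimizer, so they should compile to the same physical plan, which explain() can verify. A small sketch with placeholder table and column names:

    sql_df = spark.sql("SELECT DISTINCT col1, col2 FROM my_table")
    api_df = spark.table("my_table").select("col1", "col2").distinct()

    sql_df.explain()  # the two physical plans should be identical
    api_df.explain()

    print(sql_df.count(), api_df.count())  # same answer either way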
Labels:
- Apache Spark
04-29-2017
02:38 AM
All, I am importing 1.2 billion rows from one of our DB2 tables. The table has a composite primary index with at least 5 columns, so I need to specify the --split-by column manually (Sqoop does not support a multi-column split-by). I ran the import with one of the numeric columns from the index as the split key, and it ran for almost 8 hours. I have been advised to try a different split-by key, which raises the question of how to choose the column:
1) Does it have to be numeric, or is varchar fine too?
2) Can it contain nulls?
3) I assume the distribution of values should be as even as possible, but does the number of mappers I choose (--num-mappers) matter as well? Any other criteria to pay attention to?
Thanks, Abhijeet
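For reference, this is the shape of the import I am running, with placeholder host, table, and column names; my understanding is that the split column should ideally be indexed, non-null, and evenly distributed, since Sqoop slices the [min, max] range of that column evenly across the mappers:

    sqoop import \
      --connect jdbc:db2://db2host:50000/MYDB \
      --username myuser \
      --password-file /user/myuser/.db2.password \
      --table BIG_TABLE \
      --split-by EVEN_NUMERIC_COL \
      --num-mappers 16 \
      --target-dir /data/raw/big_table
    # Note: rows whose split column is NULL can be silently skipped,
    # so a NOT NULL column is the safer choice.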
Labels:
- Apache Sqoop
04-28-2017
03:48 AM
All, I am importing data from DB2 using sqoop import. It worked fine for the most part, except for one table whose contents include special characters (Control-M = ^M, i.e. carriage returns). During the sqoop import these characters are treated as newlines, so everything after one lands on the next line in the imported files, corrupting every record that follows a bad one. How can I fix the import? Is there an easy way?
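One fix I am considering is Sqoop's own delimiter handling: --hive-drop-import-delims strips \n, \r, and \01 from string fields during import (--hive-delims-replacement substitutes a string of your choice instead). A sketch with placeholder connection details:

    sqoop import \
      --connect jdbc:db2://db2host:50000/MYDB \
      --username myuser \
      --password-file /user/myuser/.db2.password \
      --table MESSY_TABLE \
      --hive-drop-import-delims \
      --target-dir /data/raw/messy_table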
Labels:
- Apache Sqoop
04-24-2017
10:00 PM
1 Kudo
All, I have a question about sqooping. I am sqooping around 2 TB of data for one table and then need to write an ORC table with it. What is the best way to achieve this?
1) Sqoop all the data into dir1 as text and write HQL to load it into the ORC table (my load script currently fails with a Tez vertex issue).
2) Sqoop the data in chunks, then process and append each chunk into the Hive table (have you done this?).
3) Use sqoop's Hive import to write all the data directly into a Hive ORC table.
Which is the best way?
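For option 3, my understanding is that Sqoop can write straight into an ORC-backed Hive table through its HCatalog integration, skipping the intermediate text copy entirely. A sketch with placeholder names:

    sqoop import \
      --connect jdbc:db2://db2host:50000/MYDB \
      --username myuser \
      --password-file /user/myuser/.db2.password \
      --table BIG_TABLE \
      --hcatalog-database default \
      --hcatalog-table big_table_orc \
      --create-hcatalog-table \
      --hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
      --num-mappers 16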
Labels:
- Apache Sqoop
03-27-2017
07:11 PM
All, if Hive tables are created as ORC with Snappy compression, how important is it to analyze the tables and columns for performance? Also, for ORC tables, do we still need to take care of other performance-enhancement techniques such as sort-merge joins (keeping the data sorted on the join keys), the cost-based optimizer (CBO), and so on? How do these apply to ORC files?
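To make the question concrete, this is the statistics routine I have in mind, since column statistics are what feed the CBO; the server, database, and table names are placeholders:

    beeline -u jdbc:hive2://hiveserver:10000 -e "
      SET hive.cbo.enable=true;
      SET hive.compute.query.using.stats=true;
      SET hive.stats.fetch.column.stats=true;
      ANALYZE TABLE mydb.my_orc_table COMPUTE STATISTICS;
      ANALYZE TABLE mydb.my_orc_table COMPUTE STATISTICS FOR COLUMNS;
    "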
Labels:
- Apache Hive
03-22-2017
01:19 AM
Hi All, I am converting a long-running SQL job into a Hive/Spark SQL based solution. I have two options:
1) Create a DataFrame for each Hive table, register it as a temp table, and replicate the original SQL on Spark:
table1 = sqlContext.sql("select * from table1")
table1.registerTempTable("table1")
...and similarly for all the other tables, then run the replicated SQL on Spark. Pros: faster prototyping.
2) Use the DataFrame API in PySpark, like df.distinct().select()..., which means relatively slower development time.
What are the pros and cons of one versus the other, and how should I choose? Thanks, Abhijeet Rajput
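Side by side, the two styles look like this; since both produce the same optimized plan, I gather the trade-off is mainly readability and maintenance. Table and column names are placeholders:

    # Option 1: register temp tables and keep the original SQL.
    table1 = sqlContext.sql("select * from hive_db.table1")
    table1.registerTempTable("table1")
    result_sql = sqlContext.sql(
        "select key, count(*) as cnt from table1 group by key")

    # Option 2: the same aggregation via the DataFrame API.
    result_api = table1.groupBy("key").count()

    result_sql.explain()  # compare: the physical plans should match
    result_api.explain()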
Labels:
- Apache Spark