Member since: 03-24-2017
Posts: 28
Kudos Received: 1
Solutions: 0
09-12-2018
12:11 PM
All, I am writing from Hive to an RDBMS (SQL Server) using Spark, and the process runs with great speed. But there is a big issue: each task does not commit until it completes, which fills up the database transaction log and can impact other jobs running against the database. I need some way to commit at a regular interval (every 10,000 rows or so). Can someone please suggest how this can be done? Spark version: 2.2. SQL Server 2016. Thanks, freakabhi
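The direction I am considering is to bypass the DataFrameWriter and commit manually from foreachPartition. A minimal sketch, assuming pyodbc is available on every executor; the connection string, table, and column names below are placeholders:

    import pyodbc  # assumed to be installed on the executors

    COMMIT_INTERVAL = 10000  # commit every 10,000 rows

    def write_partition(rows):
        # Placeholder DSN; substitute real server, database, and credentials.
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=sqlhost;DATABASE=mydb;UID=user;PWD=secret")
        conn.autocommit = False
        cursor = conn.cursor()
        n = 0
        for row in rows:
            cursor.execute(
                "INSERT INTO target_table (col1, col2) VALUES (?, ?)",
                row.col1, row.col2)
            n += 1
            if n % COMMIT_INTERVAL == 0:
                conn.commit()  # release transaction-log space periodically
        conn.commit()  # commit the tail of the partition
        cursor.close()
        conn.close()

    df = spark.sql("SELECT col1, col2 FROM hive_db.source_table")
    df.foreachPartition(write_partition)

Row-at-a-time execute is slow; in practice the inserts would be batched with cursor.executemany, but the periodic-commit pattern is the point here.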
Labels:
- Apache Spark
07-03-2017
01:37 AM
I would like suggestions on the incremental-load strategy to implement for our tables. We have a set of source tables that are refreshed daily in the source system (DB2), and we need to refresh them in the Hive DB as well. The source tables receive new inserts as well as updates to existing records. Which approach would you suggest?
1) Approach 1: use HBase to store the data, since updates are allowed there, and build a Hive external table referring to it. I suspect, though, that joins between such a Hive-HBase table and large ORC Hive tables will suffer; is that the case?
2) Approach 2: use the four-step incremental-update approach suggested by HDP: https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
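To make approach 2 concrete, here is roughly what its reconcile step would look like in PySpark; the table names are placeholders, and I am assuming every row carries an id key and a modified_ts change timestamp:

    from pyspark.sql import Window, functions as F

    base = spark.table("db.base_table")          # current Hive copy
    delta = spark.table("db.incremental_table")  # today's DB2 extract

    # The newest record per key wins; the two schemas are assumed identical.
    w = Window.partitionBy("id").orderBy(F.col("modified_ts").desc())

    reconciled = (base.union(delta)
                  .withColumn("rn", F.row_number().over(w))
                  .where("rn = 1")
                  .drop("rn"))

    reconciled.write.mode("overwrite").saveAsTable("db.reporting_table")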
Labels:
- Apache HBase
- Apache Hive
06-14-2017
10:58 PM
All, I would like suggestions on the correct way to convert very large queries (around 1,000 lines, joining 10+ tables with complicated transforms) into a PySpark program. I also have a question about writing Spark SQL programs: is there a performance difference between
1) SQLContext.sql("select count(*) from (select distinct col1, col2 from table)")
2) using the PySpark API: df.select("col1", "col2").distinct().count()?
I am from a SQL background and we are working on converting existing logic to Hadoop, so SQL is handy for me.
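My understanding is that both routes go through the same Catalyst optimizer, so they should compile to the same physical plan, which explain() can verify. A small sketch with placeholder table and column names:

    sql_df = spark.sql("SELECT DISTINCT col1, col2 FROM my_table")
    api_df = spark.table("my_table").select("col1", "col2").distinct()

    sql_df.explain()  # the two physical plans should be identical
    api_df.explain()

    print(sql_df.count(), api_df.count())  # same answer either way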
Labels:
- Apache Spark
04-29-2017
02:38 AM
All, I am importing 1.2 billion rows from one of our DB2 tables. The table has a composite primary index with at least 5 columns, so I need to specify the --split-by column manually (Sqoop does not support a multi-column split-by). I ran the import with one of the numeric columns from the index as the split key, and it ran for almost 8 hours. I have been advised to try a different split-by key, which raises the question of how to choose the column:
1) Does it have to be numeric, or is varchar fine too?
2) Can it contain nulls?
3) I assume the distribution of values should be as even as possible, but does the number of mappers I choose (--num-mappers) matter as well? Any other criteria to pay attention to?
Thanks, Abhijeet
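For reference, this is the shape of the import I am running, with placeholder host, table, and column names; my understanding is that the split column should ideally be indexed, non-null, and evenly distributed, since Sqoop slices the [min, max] range of that column evenly across the mappers:

    sqoop import \
      --connect jdbc:db2://db2host:50000/MYDB \
      --username myuser \
      --password-file /user/myuser/.db2.password \
      --table BIG_TABLE \
      --split-by EVEN_NUMERIC_COL \
      --num-mappers 16 \
      --target-dir /data/raw/big_table
    # Note: rows whose split column is NULL can be silently skipped,
    # so a NOT NULL column is the safer choice.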
Labels:
- Apache Sqoop
04-28-2017
03:48 AM
All, I am importing data from DB2 using sqoop import. It worked fine for the most part, except for one table whose contents include special characters (Control-M = ^M, i.e. carriage returns). During the sqoop import these characters are treated as newlines, so everything after one lands on the next line in the imported files, corrupting every record that follows a bad one. How can I fix the import? Is there an easy way?
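One fix I am considering is Sqoop's own delimiter handling: --hive-drop-import-delims strips \n, \r, and \01 from string fields during import (--hive-delims-replacement substitutes a string of your choice instead). A sketch with placeholder connection details:

    sqoop import \
      --connect jdbc:db2://db2host:50000/MYDB \
      --username myuser \
      --password-file /user/myuser/.db2.password \
      --table MESSY_TABLE \
      --hive-drop-import-delims \
      --target-dir /data/raw/messy_table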
Labels:
- Apache Sqoop
04-24-2017
10:00 PM
1 Kudo
All, I have a question about sqooping. I am sqooping around 2 TB of data for one table and then need to write an ORC table with it. What is the best way to achieve this?
1) Sqoop all the data into dir1 as text and write HQL to load it into the ORC table (my load script currently fails with a Tez vertex issue).
2) Sqoop the data in chunks, then process and append each chunk into the Hive table (have you done this?).
3) Use sqoop's Hive import to write all the data directly into a Hive ORC table.
Which is the best way?
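For option 3, my understanding is that Sqoop can write straight into an ORC-backed Hive table through its HCatalog integration, skipping the intermediate text copy entirely. A sketch with placeholder names:

    sqoop import \
      --connect jdbc:db2://db2host:50000/MYDB \
      --username myuser \
      --password-file /user/myuser/.db2.password \
      --table BIG_TABLE \
      --hcatalog-database default \
      --hcatalog-table big_table_orc \
      --create-hcatalog-table \
      --hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
      --num-mappers 16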
Labels:
- Apache Sqoop
03-27-2017
07:11 PM
All, if Hive tables are created as ORC with Snappy compression, how important is it to analyze the tables and columns for performance? Also, for ORC tables, do we still need to take care of other performance-enhancement techniques such as sort-merge joins (keeping the data sorted on the join keys), the cost-based optimizer (CBO), and so on? How do these apply to ORC files?
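To make the question concrete, this is the statistics routine I have in mind, since column statistics are what feed the CBO; the server, database, and table names are placeholders:

    beeline -u jdbc:hive2://hiveserver:10000 -e "
      SET hive.cbo.enable=true;
      SET hive.compute.query.using.stats=true;
      SET hive.stats.fetch.column.stats=true;
      ANALYZE TABLE mydb.my_orc_table COMPUTE STATISTICS;
      ANALYZE TABLE mydb.my_orc_table COMPUTE STATISTICS FOR COLUMNS;
    "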
Labels:
- Apache Hive
03-22-2017
01:19 AM
Hi All, I am converting a long-running SQL job into a Hive/Spark SQL based solution. I have two options:
1) Create a DataFrame for each Hive table, register it as a temp table, and replicate the original SQL on Spark:
table1 = sqlContext.sql("select * from table1")
table1.registerTempTable("table1")
...and similarly for all the other tables, then run the replicated SQL on Spark. Pros: faster prototyping.
2) Use the DataFrame API in PySpark, like df.distinct().select()..., which means relatively slower development time.
What are the pros and cons of one versus the other, and how should I choose? Thanks, Abhijeet Rajput
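Side by side, the two styles look like this; since both produce the same optimized plan, I gather the trade-off is mainly readability and maintenance. Table and column names are placeholders:

    # Option 1: register temp tables and keep the original SQL.
    table1 = sqlContext.sql("select * from hive_db.table1")
    table1.registerTempTable("table1")
    result_sql = sqlContext.sql(
        "select key, count(*) as cnt from table1 group by key")

    # Option 2: the same aggregation via the DataFrame API.
    result_api = table1.groupBy("key").count()

    result_sql.explain()  # compare: the physical plans should match
    result_api.explain()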
Labels:
- Apache Spark