Member since: 03-24-2017
28 Posts
1 Kudos Received
0 Solutions
01-25-2019
01:19 AM
We have a common pattern across our applications: data prep (Spark SQL), model scoring (PySpark), and writing the scoring output to Hive (Spark). I am looking for a microservice solution for the same pattern.
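For context, a minimal PySpark sketch of the pattern described above, purely illustrative: the table names, feature columns, and the placeholder scoring logic are hypothetical, not part of the original applications.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("scoring-pipeline").enableHiveSupport().getOrCreate()

# Data prep (Spark SQL) -- hypothetical staging table and columns
features = spark.sql(
    "SELECT customer_id, f1, f2, f3 FROM staging.customer_features WHERE load_date = current_date()")

# Model scoring (PySpark) -- score_udf stands in for whatever model is actually applied
score_udf = F.udf(lambda f1, f2, f3: float(f1 + f2 + f3), DoubleType())
scored = features.withColumn("score", score_udf(F.col("f1"), F.col("f2"), F.col("f3")))

# Scoring output to Hive
scored.write.mode("overwrite").saveAsTable("analytics.customer_scores")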
01-22-2019
04:29 AM
All, I am very new to the world of microservices and containers, and I am trying to work out how these concepts could be accommodated with my Spark and Hive applications. Can someone please point me in the right direction to figure this out? Are there examples or blogs showing some sample code and demos? I googled but did not have much luck.
Labels:
- Apache Hive
- Apache Spark
10-15-2018
10:59 PM
All, I want to learn how to use the various Hive settings (set hive.tez..., etc.) for tuning: what each one means, what impact it has, and where I can study them and see the effect of each setting during execution. Is there an extensive list of all of them anywhere? We have been struggling to identify the right settings to use and to simulate production-like issues in a lower environment. Thanks, Freakabhi
Labels:
- Apache Hive
09-12-2018
12:11 PM
All, I am writing from Hive to an RDBMS (SQL Server) using Spark, and the process runs with great speed. But there is a big issue: each task does not commit until it completes, which ties up the transaction log of the database and can impact other running jobs. I need some way to commit at a regular interval (10,000 rows or so). Can someone please suggest how this can be done? Spark version: 2.2. SQL Server 2016. Thanks, freakabhi
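Not an official fix, but a common workaround sketch under a couple of assumptions (the connection details and table names below are placeholders): Spark's JDBC writer commits once per partition, and the batchsize option only controls how many rows go per JDBC round trip, so repartitioning the DataFrame is one way to bound how many rows each transaction holds.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("analytics.scoring_output")       # placeholder: the Hive data to export

rows_per_commit = 10000                             # rough target per transaction
num_parts = max(1, df.count() // rows_per_commit)   # note: count() adds an extra pass

(df.repartition(num_parts)
   .write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")  # placeholder URL
   .option("dbtable", "dbo.scoring_output")                          # placeholder target table
   .option("user", "etl_user")
   .option("password", "***")
   .option("batchsize", 1000)   # rows per executeBatch round trip, not per commit
   .mode("append")
   .save())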
Labels:
- Apache Spark
10-11-2017
12:25 AM
I wanted to understand how to approach troubleshooting Hive query failures. In our environment we get numerous failures such as "Vertex failed" and "Out of memory" errors; is that common? How do I check the HiveServer2 logs, YARN logs, and ResourceManager logs, and what should I look for in them? How do I tell whether the problem is data skew or something else? Are there articles covering these topics? Which Hive settings/parameters mean what and how do they affect execution, and where can I find those details? Also, what tools are good, or commonly used, for analyzing issues at the Hive query or map/reduce level?
Labels:
- Apache Hive
10-11-2017
12:17 AM
I am planning to set up a full-fledged development environment for Spark practice, with IntelliJ / Eclipse installed. I am trying to enable desktop mode on the HDP 2.6 VM, and it keeps failing while trying to add the VNC server; is it possible to achieve this? The error I am getting is shown in the attached screenshot.
Labels:
- Apache Ambari
- Apache Spark
07-03-2017
01:37 AM
I wanted to get suggestions on the incremental-load strategy to implement for our tables. We have a set of source tables that are refreshed daily in the source system (DB2), and we need to refresh them in the Hive database as well; which approach would you suggest? The source tables have new inserts as well as updates to existing records. 1) Approach 1: use HBase to store the data, since updates are allowed, and build a Hive external table over it. I am unsure whether this would hurt queries that join the Hive-HBase table with large ORC Hive tables. 2) Approach 2: use the four-step incremental update approach suggested by HDP (a rough sketch of its reconcile idea follows below)?
https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
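Not part of the original question, but for reference: the four-step strategy in the linked blog reconciles the base data and the incremental extract by keeping only the latest record per key before compacting into a reporting table. The sketch below expresses that reconcile idea in PySpark rather than Hive views; the table names, the key column id, and the modified_ts timestamp column are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# base_table is the current Hive copy; incremental_table is today's extract from DB2.
merged = spark.table("mydb.base_table").union(spark.table("mydb.incremental_table"))

# Reconcile: keep only the newest version of each record, assuming id is the
# business key and modified_ts marks when the row last changed in DB2.
w = Window.partitionBy("id").orderBy(F.col("modified_ts").desc())
reconciled = (merged
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

# Compact: rewrite the reporting copy from the reconciled view of the data.
reconciled.write.mode("overwrite").saveAsTable("mydb.reporting_table")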
Labels:
- Apache HBase
- Apache Hive
07-03-2017
01:36 AM
I wanted to get suggestions on the incremental-load strategy to implement for our tables. We have a set of source tables that are refreshed daily in the source system (DB2), and we need to refresh them in the Hive database as well; which approach would you suggest? Please note: the source tables have new inserts as well as updates to existing records. 1) Approach 1: use HBase to store the data, since updates are allowed, and build a Hive external table over it. I am unsure whether this would hurt queries that join the Hive-HBase table with large ORC Hive tables. 2) Approach 2: use the four-step incremental update approach suggested by HDP? https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
thanks Abhijeet
Labels:
- Apache HBase
- Apache Hive
07-02-2017
06:42 PM
I wanted to get suggestions on the incremental-load strategy to implement for our tables. We have a set of source tables that are refreshed daily in the source system (DB2), and we need to refresh them in the Hive database as well; which approach would you suggest? The source tables have new inserts as well as updates to existing records. 1) Approach 1: use HBase to store the data, since updates are allowed, and build a Hive external table over it. I am unsure whether this would hurt queries that join the Hive-HBase table with large ORC Hive tables. 2) Approach 2: use the four-step incremental update approach suggested by HDP? https://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
06-14-2017
10:58 PM
All, I would like to get suggestions on the correct way to convert very large queries (around 1,000 lines, joining 10+ tables with complicated transforms) to a PySpark program. I also have a question about writing Spark SQL programs: is there a performance difference between 1) SQLContext.sql("select count(*) from (select distinct col1, col2 from table)") and 2) using the PySpark API: df.select("col1", "col2").distinct().count()? I am from a SQL background and we are working on converting existing logic to Hadoop, hence SQL is handy.
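Not from the original post, but a small sketch of how the two styles can be compared, using a placeholder table name: both the SQL string and the DataFrame API go through the same Catalyst optimizer, so comparing the physical plans with explain() is usually the quickest way to see whether they end up equivalent.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("mydb.some_table")   # placeholder table containing col1 and col2

# 1) SQL string form
sql_form = spark.sql(
    "SELECT count(*) AS cnt FROM (SELECT DISTINCT col1, col2 FROM mydb.some_table) t")

# 2) DataFrame API form
api_form = df.select("col1", "col2").distinct().groupBy().count()

# Compare the physical plans; if they match, the performance should too.
sql_form.explain()
api_form.explain()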
Labels:
- Apache Spark
06-14-2017
04:03 PM
All, I would like to get suggestions on the correct way to convert very large queries (around 1,000 lines, joining 10+ tables with complicated transforms) to a PySpark program; also, are there relevant examples for large SQLs? I also have a question about writing Spark SQL programs: is there a performance difference between 1) SQLContext.sql("select count(*) from (select distinct col1, col2 from table)") and 2) using the PySpark API: df.select("col1", "col2").distinct().count()? I am from a SQL background and we are working on converting existing logic to Hadoop, hence SQL is handy.
Labels:
- Spark
04-29-2017
02:38 AM
All, I am importing 1.2 billion rows from a DB2 table. The table has a composite primary index with at least 5 columns, so I need to manually specify the --split-by column (since Sqoop does not support a multi-column split-by). I tried running the Sqoop import with one of the numeric columns from the index, and the import ran for almost 8 hours. I have been advised to try a different split-by key, which raises the question of how to choose the column: 1) does it have to be numeric, or is varchar fine too? 2) can it have nulls? 3) I suppose the distribution of values in the column should be as even as possible, but does it matter how many mappers I choose (--num-mappers ##)? Any other criteria to pay attention to? Thanks, Abhijeet
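Not from the original post, but since Sqoop computes its splits from the MIN and MAX of the --split-by column, it can help to profile a candidate column before another 8-hour run: nulls in the split-by column are a problem because those rows can be skipped, and a skewed value range means a few mappers do most of the work. A rough PySpark sketch, assuming the DB2 JDBC driver is on the classpath and using placeholder connection details, schema, and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Push the profiling query down to DB2 so only one summary row comes back.
profile = (spark.read.format("jdbc")
           .option("url", "jdbc:db2://db2host:50000/MYDB")            # placeholder URL
           .option("user", "etl_user")
           .option("password", "***")
           .option("dbtable",
                   "(SELECT MIN(ACCT_ID) AS min_id, MAX(ACCT_ID) AS max_id, "
                   " COUNT(*) - COUNT(ACCT_ID) AS null_rows, "
                   " COUNT(DISTINCT ACCT_ID) AS distinct_ids "
                   " FROM MYSCHEMA.BIG_TABLE) AS t")                  # placeholder schema/table/column
           .load())

profile.show()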
Labels:
- Apache Sqoop
04-28-2017
07:47 PM
All, I am importing 1.2 billion rows from a DB2 table. The table has a composite primary index with at least 5 columns, so I need to manually specify the --split-by column (since Sqoop does not support a multi-column split-by). I tried running the Sqoop import with one of the numeric columns from the index, and the import ran for almost 8 hours. I have been advised to try a different split-by key, which raises the question of how to choose the column: 1) does it have to be numeric, or is varchar fine too? 2) can it have nulls? 3) I suppose the distribution of values in the column should be as even as possible, but does it matter how many mappers I choose (--num-mappers ##)? Any other criteria to pay attention to? Thanks, Abhijeet
Labels:
- Sqoop
04-28-2017
03:48 AM
All, I am importing data from DB2 using Sqoop import. It worked fine for the most part, except for one table whose contents seem to have special characters (Ctrl-M, i.e. ^M). While sqooping, these characters are treated as newlines, so everything after them ends up on the next line in the imported files, which corrupts every record after the first bad one. I am not sure how to fix the imports; is there an easy way?
Labels:
- Apache Sqoop
04-27-2017
08:52 PM
All, I am importing data from DB2 using Sqoop import. It worked fine for the most part, except for one table whose contents seem to have special characters (Ctrl-M, i.e. ^M). While sqooping, these characters are treated as newlines, so everything after them ends up on the next line in the imported files, which corrupts every record after the first bad one. I am not sure how to fix the imports; is there an easy way?
Labels:
- Sqoop
04-24-2017
10:00 PM
1 Kudo
All, I have a question about sqooping: I am importing around 2 TB of data for one table and then need to write an ORC table with it. What is the best way to achieve this? 1) Sqoop all the data into a directory as text and write HQL to load it into the ORC table (this script fails with a vertex issue). 2) Sqoop the data in chunks, then process and append each chunk into the Hive table (have you done this?). 3) Use Sqoop's Hive import to write all the data directly to the Hive ORC table. Which is the best way?
Tags:
- Data Processing
- orc
- Sqoop
- Upgrade to HDP 2.5.3 : ConcurrentModificationException When Executing Insert Overwrite : Hive
Labels:
- Apache Sqoop
04-24-2017
09:54 PM
All, I am working on creating and loading a very high-volume (450 million row) ORC table, and the process keeps failing with a vertex error. Platform: HDP 2.4. Engine: Tez. Why does this happen, and what is the solution?
03-27-2017
07:11 PM
All, if the Hive tables are created as ORC with Snappy compression, how important is it to analyze the tables and columns for performance? Also, if a table is in ORC, do we still need to take care of other performance-enhancement techniques such as sort-merge joins (keeping data sorted on the keys), CBO, and others? How do these apply to ORC files?
Tags:
- Data Processing
- Hive
Labels:
- Apache Hive
03-22-2017
01:19 AM
Hi All, I am converting a long-running SQL job into a Hive/Spark SQL based solution, and I have two options. 1) Create a DataFrame for each Hive table, replicate the SQL, and run it on Spark: table1 = sqlContext.sql("select * from table1"); table1.registerTempTable("table1"); ... and similarly for all the other tables, then replicate the SQL and run it on Spark. Pros: faster prototyping. 2) Use the DataFrame API in PySpark, e.g. df.distinct().select()..., which means relatively slower development time. What are the pros and cons of one versus the other, and how should I choose? Thanks, Abhijeet Rajput
Labels:
- Apache Spark
03-22-2017
01:12 AM
I have a question regarding creating a common reservoir of files in Hadoop that can later be used for any purpose, whether Spark processing, Pig, Hive, etc. While sqooping data into Hadoop, which file format should we choose, and are there any industry-wide standards? 1) Text delimited (compressed or uncompressed?) 2) Avro 3) Parquet
Labels:
- Apache Sqoop