Member since: 03-24-2017
Posts: 28
Kudos Received: 1
Solutions: 0
09-14-2018
08:51 AM
Hi @Abhijeet Rajput,
Did you try something like this?
dfOrders.write.mode("overwrite").format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://server.westus.cloudapp.azure.com;databaseName=TestDB")
  .option("dbtable", "TestDB.dbo.orders")
  .option("user", "myuser")
  .option("batchsize", "200000")
  .option("password", "MyComplexPassword!001")
  .save()
Thanks
Vikas Srivastava
07-03-2017
04:45 PM
1 Kudo
Hi @Abhijeet Rajput, Prior to HDP 2.6 you'll need to use the solution outlined in #2. HDP 2.6 includes Hive MERGE, so you can now create a staging table and execute a MERGE statement against an ACID-enabled table. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge
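For reference, a minimal sketch of that staging-table MERGE, assuming a transactional (ACID) target table orders and a staging table orders_staging with the same layout (both names are hypothetical):
-- the target table must be transactional (ACID) in HDP 2.6 / Hive 2.x
MERGE INTO orders AS t
USING orders_staging AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.amount, s.status);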
07-02-2017
07:33 PM
@Freakabhi You can consider a few more points before choosing one of the approaches:
1. Number of records: approach 1 is fine for a very large number of records, while approach 2 is OK for fewer records.
2. How do you handle it if something goes wrong? The 4th step in approach 2 deletes the base table and recreates it with new data. Suppose you notice an issue with the data a couple of days later: how do you get the deleted base_table back? If you have an answer to that, go for approach 2.
3. Approach 3: You are choosing approach 1 because HBase supports updates and, in your understanding, Hive does not. That understanding was correct for older Hive versions, but UPDATE is available starting with Hive 0.14 (see the sketch after this list): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
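A minimal sketch of such an ACID update, assuming base_table was created as a bucketed, transactional ORC table (table and column names are hypothetical):
-- requires the table to be created with 'transactional'='true'
UPDATE base_table
SET status = 'CLOSED'
WHERE order_id = 1001;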
06-15-2017
12:31 AM
1 Kudo
Hi @Abhijeet Rajput, In response to handling the huge SQL: Spark does lazy evaluation, which means you can split your code into multiple blocks and write it using multiple DataFrames. Everything is evaluated at the end, using an optimal execution plan that accommodates all of the operations. Example:
var subquery1 = sql("select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2")
subquery1.registerTempTable("res1")
var subquery2 = sql("select c1, c2, c3 from res1 join tbl3 on condition4 and condition5")
...and so on. On the other request, there is no difference between using the DataFrame API and SQL, as the same execution plan is generated for both; you can validate this from the DAG in the Spark UI while the job is executing.
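To confirm the plans really do match without waiting for the Spark UI, here is a minimal sketch, assuming a Spark 2.x shell where spark is the SparkSession (table and column names are hypothetical):
// both forms should produce essentially the same optimized/physical plan
val viaSql = spark.sql("select c1, count(*) as cnt from tbl1 group by c1")
val viaApi = spark.table("tbl1").groupBy("c1").count()
viaSql.explain(true)  // prints parsed, analyzed, optimized, and physical plans
viaApi.explain(true)  // compare the optimized/physical plans with the SQL version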
04-29-2017
03:26 AM
@Abhijeet Rajput A numeric column is preferred, which you are already using. With a numeric key you don't run into case-sensitivity issues (for example, your database sorting records in a case-insensitive way). Do you have a column that is unique but not the primary key? Even distribution is important, because otherwise your Sqoop job can be skewed. The number of mappers definitely matters if you have slots available: more mappers, more parallelism, faster job. See the following link if you haven't already: http://stackoverflow.com/questions/37206232/sqoop-import-composite-primary-key-and-textual-primary-key
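Putting those points together, a minimal sketch of an import that splits on a unique, evenly distributed numeric column (connection string, credentials, table, and column names are all hypothetical):
# --split-by should be unique and evenly distributed; size --num-mappers to the slots you actually have
sqoop import \
  --connect "jdbc:db2://dbhost:50000/TESTDB" \
  --username myuser -P \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --target-dir /data/orders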
04-28-2017
05:37 AM
@Abhijeet Rajput Sqoop should load data in UTF-8 by default. Run db2 get db cfg for db_name and check the value of the database code set. In your mapred-site.xml, can you please try adding the following to mapreduce.map.java.opts: -Ddb2.jcc.charsetDecoderEncoder=3
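A sketch of both checks (database name and connection details are hypothetical); the per-job -D form avoids editing mapred-site.xml, but note that overriding mapreduce.map.java.opts replaces any existing value such as the -Xmx setting:
# check the code set on the DB2 side
db2 get db cfg for TESTDB | grep -i "code set"
# or pass the JCC property for a single Sqoop job (generic -D options go right after the tool name)
sqoop import -Dmapreduce.map.java.opts="-Ddb2.jcc.charsetDecoderEncoder=3" \
  --connect "jdbc:db2://dbhost:50000/TESTDB" --username myuser -P \
  --table orders --target-dir /data/orders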
04-24-2017
10:38 PM
1 Kudo
If the table has primary keys through which you can identify unique records, then make use of those keys to pull chunks of data and load them into Hive. Sqoop always works well with bulk imports, but when the data is very large it's not recommended to import it in one shot. It also depends on your source RDBMS. I have encountered the same issue: I was able to import a 20 TB table from Teradata into Hive and it worked perfectly fine, but when the table size increased to 30 TB I was unable to import it in a single stretch. In such cases I go with multiple chunks, and/or import the table using the primary keys as the split-by column and increase the number of mappers. That should also hold good for your scenario.
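A minimal sketch of that chunking approach, pulling one primary-key range per run into its own directory (connection string, table, column, and key ranges are hypothetical); repeat with the next range until the table is covered:
# chunk 1 of N: restrict the key range with --where, split the chunk across mappers
sqoop import \
  --connect "jdbc:db2://dbhost:50000/TESTDB" \
  --username myuser -P \
  --table big_table \
  --where "id >= 0 AND id < 100000000" \
  --split-by id \
  --num-mappers 16 \
  --target-dir /data/big_table/chunk_000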
03-27-2017
07:24 PM
Hi @Abhijeet Rajput, it is recommended to analyze all tables, ORC included, on a regular basis for performance. Statistics are more valuable on larger tables than on smaller ones. Sorting is not necessary and, in fact, sorting is not allowed on ACID tables. As of HDP 2.5, Hive uses both a rules-based optimizer and a cost-based optimizer (CBO) based on Apache Calcite; enabling the CBO will make the best use of statistics. Also, you may want to take a look at LLAP, which is a Tech Preview in 2.5 and will be GA in 2.6. Hope this helps.
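A minimal sketch of gathering statistics and enabling the CBO (the table name is hypothetical; the settings are standard Hive properties):
-- enable the cost-based optimizer and let it use table/column stats
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
-- gather basic and column-level statistics
ANALYZE TABLE orders COMPUTE STATISTICS;
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS;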
03-22-2017
06:58 PM
5 Kudos
@Abhijeet Rajput, I found an article that compares the performance of RDD, DataFrame, and SQL; it will help you make an informed decision: https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html In summary, you mainly need to analyze your use case (what type of queries you will be running, how big the data set is, etc.). Depending on your use case, you can choose to go with either SQL or the DataFrame API. For example, if your use case involves a lot of group-by and order-by style queries, you should go with Spark SQL instead of the DataFrame API, because Spark SQL executed faster than the DataFrame API for that kind of workload in the article's comparison.