Member since: 03-24-2017
Posts: 28
Kudos Received: 1
Solutions: 0
09-14-2018
08:51 AM
Hi @Abhijeet Rajput,
Did you try something like this?
dfOrders.write.mode("overwrite").format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://server.westus.cloudapp.azure.com;databaseName=TestDB")
  .option("dbtable", "TestDB.dbo.orders")
  .option("user", "myuser")
  .option("batchsize", "200000")
  .option("password", "MyComplexPassword!001")
  .save()
Thanks
Vikas Srivastava
07-03-2017
04:45 PM
1 Kudo
Hi @Abhijeet Rajput, Prior to HDP 2.6 you'll need to use the solution outlined in #2. HDP 2.6 includes Hive MERGE, so you can now create a staging table and execute a MERGE statement against an ACID-enabled table. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge
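For reference, a minimal sketch of that staging-table MERGE, assuming a transactional (ACID) target table orders and a staging table orders_staging with the same layout (both names are hypothetical):
-- the target table must be transactional (ACID) in HDP 2.6 / Hive 2.x
MERGE INTO orders AS t
USING orders_staging AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.amount, s.status);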
07-02-2017
07:33 PM
@Freakabhi You can consider a few more points before choosing one of the approaches:
1. Number of records: approach 1 is fine for a very large number of records, while approach 2 is OK for fewer records.
2. How do you handle it if something goes wrong? The 4th step in approach 2 deletes the base table and recreates it with new data. Suppose you notice an issue with the data a couple of days later: how do you get the deleted base_table back? If you have an answer to that, go for approach 2.
3. Approach 3: You are choosing approach 1 because HBase supports updates and, in your understanding, Hive does not. That understanding was correct for older Hive versions, but UPDATE is available starting with Hive 0.14 (see the sketch after this list): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
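A minimal sketch of such an ACID update, assuming base_table was created as a bucketed, transactional ORC table (table and column names are hypothetical):
-- requires the table to be created with 'transactional'='true'
UPDATE base_table
SET status = 'CLOSED'
WHERE order_id = 1001;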
06-15-2017
12:31 AM
1 Kudo
Hi @Abhijeet Rajput, In response to handling the huge SQL: Spark does lazy evaluation, which means you can split your code into multiple blocks and write it using multiple DataFrames. Everything is evaluated at the end, using an optimal execution plan that accommodates all of the operations. Example:
var subquery1 = sql("select c1, c2, c3 from tbl1 join tbl2 on condition1 and condition2")
subquery1.registerTempTable("res1")
var subquery2 = sql("select c1, c2, c3 from res1 join tbl3 on condition4 and condition5")
...and so on. On the other request, there is no difference between using the DataFrame API and SQL, as the same execution plan is generated for both; you can validate this from the DAG in the Spark UI while the job is executing.
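To confirm the plans really do match without waiting for the Spark UI, here is a minimal sketch, assuming a Spark 2.x shell where spark is the SparkSession (table and column names are hypothetical):
// both forms should produce essentially the same optimized/physical plan
val viaSql = spark.sql("select c1, count(*) as cnt from tbl1 group by c1")
val viaApi = spark.table("tbl1").groupBy("c1").count()
viaSql.explain(true)  // prints parsed, analyzed, optimized, and physical plans
viaApi.explain(true)  // compare the optimized/physical plans with the SQL version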
04-29-2017
03:26 AM
@Abhijeet Rajput A numeric column is preferred, which you are already using. With a numeric key you don't run into case-sensitivity issues (for example, your database sorting records in a case-insensitive way). Do you have a column that is unique but not the primary key? Even distribution is important, because otherwise your Sqoop job can be skewed. The number of mappers definitely matters if you have slots available: more mappers, more parallelism, faster job. See the following link if you haven't already: http://stackoverflow.com/questions/37206232/sqoop-import-composite-primary-key-and-textual-primary-key
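Putting those points together, a minimal sketch of an import that splits on a unique, evenly distributed numeric column (connection string, credentials, table, and column names are all hypothetical):
# --split-by should be unique and evenly distributed; size --num-mappers to the slots you actually have
sqoop import \
  --connect "jdbc:db2://dbhost:50000/TESTDB" \
  --username myuser -P \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --target-dir /data/orders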
04-28-2017
05:37 AM
@Abhijeet Rajput Sqoop should load data in UTF-8 by default. Run db2 get db cfg for db_name and check the value of the database code set. In your mapred-site.xml, can you please try adding the following to mapreduce.map.java.opts: -Ddb2.jcc.charsetDecoderEncoder=3
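A sketch of both checks (database name and connection details are hypothetical); the per-job -D form avoids editing mapred-site.xml, but note that overriding mapreduce.map.java.opts replaces any existing value such as the -Xmx setting:
# check the code set on the DB2 side
db2 get db cfg for TESTDB | grep -i "code set"
# or pass the JCC property for a single Sqoop job (generic -D options go right after the tool name)
sqoop import -Dmapreduce.map.java.opts="-Ddb2.jcc.charsetDecoderEncoder=3" \
  --connect "jdbc:db2://dbhost:50000/TESTDB" --username myuser -P \
  --table orders --target-dir /data/orders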
04-24-2017
10:38 PM
1 Kudo
If the table has primary keys through which you can identify unique records, then make use of those keys to pull chunks of data and load them into Hive. Sqoop always works well with bulk imports, but when the data is very large it's not recommended to import it in one shot. It also depends on your source RDBMS. I have encountered the same issue: I was able to import a 20 TB table from Teradata into Hive and it worked perfectly fine, but when the table size increased to 30 TB I was unable to import it in a single stretch. In such cases I go with multiple chunks, and/or import the table using the primary keys as the split-by column and increase the number of mappers. That should also hold good for your scenario.
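A minimal sketch of that chunking approach, pulling one primary-key range per run into its own directory (connection string, table, column, and key ranges are hypothetical); repeat with the next range until the table is covered:
# chunk 1 of N: restrict the key range with --where, split the chunk across mappers
sqoop import \
  --connect "jdbc:db2://dbhost:50000/TESTDB" \
  --username myuser -P \
  --table big_table \
  --where "id >= 0 AND id < 100000000" \
  --split-by id \
  --num-mappers 16 \
  --target-dir /data/big_table/chunk_000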
03-27-2017
07:24 PM
Hi @Abhijeet Rajput, it is recommended to analyze all tables, ORC included, on a regular basis for performance. Statistics are more valuable on larger tables than on smaller ones. Sorting is not necessary and, in fact, sorting is not allowed on ACID tables. As of HDP 2.5, Hive uses both a rules-based optimizer and a cost-based optimizer (CBO) based on Apache Calcite; enabling the CBO will make the best use of statistics. Also, you may want to take a look at LLAP, which is a Tech Preview in 2.5 and will be GA in 2.6. Hope this helps.
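A minimal sketch of gathering statistics and enabling the CBO (the table name is hypothetical; the settings are standard Hive properties):
-- enable the cost-based optimizer and let it use table/column stats
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
-- gather basic and column-level statistics
ANALYZE TABLE orders COMPUTE STATISTICS;
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS;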
03-22-2017
06:58 PM
5 Kudos
@Abhijeet Rajput, I found an article that compares the performance of RDD, DataFrame, and SQL; it will help you make an informed decision: https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html In summary, you mainly need to analyze your use case (what type of queries you will be running, how big the data set is, etc.). Depending on your use case, you can choose to go with either SQL or the DataFrame API. For example, if your use case involves a lot of group-by and order-by style queries, you should go with Spark SQL instead of the DataFrame API, because Spark SQL executed faster than the DataFrame API for that kind of workload in the article's comparison.