Member since: 04-01-2018
Posts: 17
Kudos Received: 2
Solutions: 0
12-03-2019
11:41 PM
@RahulSoni I think you're a bit quick to dismiss Spark + JDBC. There is actually a solution for the multithreading: Spark will extract the data into different partitions in parallel, just like when you read an HDFS file. You just have to specify the number of partitions in the extracted dataframe and tune the parameters for your job (number of executors, cores per executor, memory per executor). While Sqoop is easier to use out of the box, the fact that it is based on MapReduce means Spark will likely outperform it in some scenarios, and Spark should be your go-to option when you want to save the data as Parquet or ORC (not supported by Sqoop). I haven't tried it yet, but here's someone who seems quite pleased with how Spark worked for them: https://www.all-things-cloud.com/2018/12/spark-jdbc-instead-of-sqoop.html. Sqoop is supposed to support Avro, but when I tried to output Avro files it failed with a low-level error in a Java library. I wasn't too impressed by the performance either, although that could be due to bandwidth problems.
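For anyone who wants to try this, here is a minimal sketch of a parallel JDBC extract with Spark (Scala). The connection URL, table name, column and partition bounds are placeholders you would replace with your own; the point is only to show how the partitioning options drive the parallelism.

```scala
// Minimal sketch: parallel JDBC extract with Spark, written to Parquet.
// URL, credentials, table and bounds below are placeholders, not real values.
import org.apache.spark.sql.SparkSession

object JdbcExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-extract")
      .getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales") // placeholder URL
      .option("dbtable", "public.orders")                    // placeholder table
      .option("user", "etl_user")
      .option("password", "********")
      // Parallel extraction: Spark opens one JDBC connection per partition
      // and splits the partitionColumn range across numPartitions readers.
      .option("partitionColumn", "order_id")                 // numeric/date column to split on
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .option("fetchsize", "10000")                          // rows fetched per round trip
      .load()

    // Write as Parquet (not supported by Sqoop); ORC works the same way.
    df.write.mode("overwrite").parquet("/data/raw/orders_parquet")

    spark.stop()
  }
}
```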
09-03-2019
01:28 PM
Hi @nshawa, I am getting the following error on the PutHiveStreaming processor after running the template you provided: Any idea how to fix this?
04-10-2019
04:15 AM
1 Kudo
Assuming you want to access the data via Spark, the main question is how it should be stored. Drill is not supported by Cloudera for this, but Hive tables and Kudu are. So it boils down to whether you want to store the data in Hive or in Kudu, as Spark can work with both. If you want to insert your data record by record, or run interactive queries in Impala, then Kudu is likely the best choice. If you want to insert and process your data in bulk, then Hive tables are usually the better fit. A rough sketch of what each path looks like from Spark follows below.
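Here is a minimal sketch of accessing both options from Spark (Scala). The table names, database, and Kudu master address are placeholders, and the Kudu read assumes the kudu-spark integration package is on the classpath.

```scala
// Minimal sketch: reading/writing Hive tables vs. reading a Kudu table from Spark.
// All table names, the database and the Kudu master address are placeholders.
import org.apache.spark.sql.SparkSession

object StorageAccess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-vs-kudu")
      .enableHiveSupport()   // required for Hive table access
      .getOrCreate()

    // Bulk path: query a Hive table and write the result back as a Hive table.
    val salesByRegion = spark.sql(
      "SELECT region, SUM(amount) AS total FROM db.sales GROUP BY region")
    salesByRegion.write.mode("overwrite").saveAsTable("db.sales_by_region")

    // Record-oriented / Impala-interactive path: read a Kudu table
    // (assumes the kudu-spark connector, which exposes the "kudu" data source).
    val events = spark.read
      .format("kudu")
      .option("kudu.master", "kudu-master:7051")   // placeholder master address
      .option("kudu.table", "impala::db.events")   // placeholder table name
      .load()
    events.show(10)

    spark.stop()
  }
}
```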
07-22-2018
05:02 PM
> How many vCores allocated for Tasks within the Executors?

Tasks run inside pre-allocated executors and do not cause further allocations to occur. Read on below to understand the relationship between tasks and executors from a resource and concurrency viewpoint:

"""
Every Spark executor in an application has the same fixed number of cores and same fixed heap size. The number of cores can be specified with the --executor-cores flag when invoking spark-submit, spark-shell, and pyspark from the command line, or by setting the spark.executor.cores property in the spark-defaults.conf file or on a SparkConf object. Similarly, the heap size can be controlled with the --executor-memory flag or the spark.executor.memory property. The cores property controls the number of concurrent tasks an executor can run. --executor-cores 5 means that each executor can run a maximum of five tasks at the same time.
"""

Read more at http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
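To make that concrete, here is a minimal sketch of the two equivalent ways to set those executor resources; the numbers are purely illustrative, not recommendations.

```scala
// Minimal sketch: sizing executors via Spark properties.
// Equivalent spark-submit flags:
//   spark-submit --executor-cores 5 --executor-memory 8g --num-executors 10 app.jar
// The values below are illustrative placeholders.
import org.apache.spark.sql.SparkSession

object ExecutorSizing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-sizing")
      .config("spark.executor.cores", "5")     // up to 5 concurrent tasks per executor
      .config("spark.executor.memory", "8g")   // heap size per executor
      .config("spark.executor.instances", "10")
      .getOrCreate()

    // With 10 executors and 5 cores each, at most 50 tasks run concurrently;
    // tasks are scheduled onto these slots, no new executors are allocated per task.
    println(spark.sparkContext.defaultParallelism)

    spark.stop()
  }
}
```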
04-19-2018
06:49 PM
Thanks a lot! Finally, Sqoop... 🙂