Member since: 04-01-2018
Posts: 17
Kudos Received: 2
Solutions: 0
12-03-2019
11:41 PM
@RahulSoni I think you're a bit quick to dismiss Spark + JDBC. There is actually a solution for the multithreading: Spark will extract the data into different partitions in parallel, just like when you read an HDFS file. You just have to specify the number of partitions in the extracted dataframe and tune the parameters for your job (number of executors, cores per executor, memory per executor). While Sqoop is easier to use out of the box, the fact that it is based on MapReduce means Spark will likely outperform it in some scenarios, and Spark should be your go-to option when you want to save the data as Parquet or ORC (not supported by Sqoop). I haven't tried it yet, but here's someone who seems quite pleased with how Spark worked for them: https://www.all-things-cloud.com/2018/12/spark-jdbc-instead-of-sqoop.html. Sqoop is supposed to support Avro, but when I tried to output Avro files it failed with a low-level error in a Java library. I wasn't too impressed by the performance either, although that could be due to bandwidth problems.
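For anyone who wants to try this, here is a minimal sketch of a parallel JDBC extract with Spark (Scala). The connection URL, table name, column and partition bounds are placeholders you would replace with your own; the point is only to show how the partitioning options drive the parallelism.

```scala
// Minimal sketch: parallel JDBC extract with Spark, written to Parquet.
// URL, credentials, table and bounds below are placeholders, not real values.
import org.apache.spark.sql.SparkSession

object JdbcExtract {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-extract")
      .getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales") // placeholder URL
      .option("dbtable", "public.orders")                    // placeholder table
      .option("user", "etl_user")
      .option("password", "********")
      // Parallel extraction: Spark opens one JDBC connection per partition
      // and splits the partitionColumn range across numPartitions readers.
      .option("partitionColumn", "order_id")                 // numeric/date column to split on
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .option("fetchsize", "10000")                          // rows fetched per round trip
      .load()

    // Write as Parquet (not supported by Sqoop); ORC works the same way.
    df.write.mode("overwrite").parquet("/data/raw/orders_parquet")

    spark.stop()
  }
}
```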
09-03-2019
01:28 PM
Hi @nshawa, I am getting the following error on the PutHiveStreaming processor after running the template you provided: Any idea how to fix this?
04-10-2019
04:15 AM
1 Kudo
Assuming you want to access the data via Spark, the main question is how it should be stored. Drill is not supported by Cloudera for this, but Hive tables and Kudu are. So it boils down to whether you want to store the data in Hive or in Kudu, as Spark can work with both. If you want to insert your data record by record, or run interactive queries in Impala, then Kudu is likely the best choice. If you want to insert and process your data in bulk, then Hive tables are usually the better fit. A rough sketch of what each path looks like from Spark follows below.
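Here is a minimal sketch of accessing both options from Spark (Scala). The table names, database, and Kudu master address are placeholders, and the Kudu read assumes the kudu-spark integration package is on the classpath.

```scala
// Minimal sketch: reading/writing Hive tables vs. reading a Kudu table from Spark.
// All table names, the database and the Kudu master address are placeholders.
import org.apache.spark.sql.SparkSession

object StorageAccess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-vs-kudu")
      .enableHiveSupport()   // required for Hive table access
      .getOrCreate()

    // Bulk path: query a Hive table and write the result back as a Hive table.
    val salesByRegion = spark.sql(
      "SELECT region, SUM(amount) AS total FROM db.sales GROUP BY region")
    salesByRegion.write.mode("overwrite").saveAsTable("db.sales_by_region")

    // Record-oriented / Impala-interactive path: read a Kudu table
    // (assumes the kudu-spark connector, which exposes the "kudu" data source).
    val events = spark.read
      .format("kudu")
      .option("kudu.master", "kudu-master:7051")   // placeholder master address
      .option("kudu.table", "impala::db.events")   // placeholder table name
      .load()
    events.show(10)

    spark.stop()
  }
}
```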
07-22-2018
05:02 PM
> How many vCores allocated for Tasks within the Executors?

Tasks run inside pre-allocated executors and do not cause further allocations to occur. Read on below to understand the relationship between tasks and executors from a resource and concurrency viewpoint:

"""
Every Spark executor in an application has the same fixed number of cores and same fixed heap size. The number of cores can be specified with the --executor-cores flag when invoking spark-submit, spark-shell, and pyspark from the command line, or by setting the spark.executor.cores property in the spark-defaults.conf file or on a SparkConf object. Similarly, the heap size can be controlled with the --executor-memory flag or the spark.executor.memory property. The cores property controls the number of concurrent tasks an executor can run. --executor-cores 5 means that each executor can run a maximum of five tasks at the same time.
"""

Read more at http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
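To make that concrete, here is a minimal sketch of the two equivalent ways to set those executor resources; the numbers are purely illustrative, not recommendations.

```scala
// Minimal sketch: sizing executors via Spark properties.
// Equivalent spark-submit flags:
//   spark-submit --executor-cores 5 --executor-memory 8g --num-executors 10 app.jar
// The values below are illustrative placeholders.
import org.apache.spark.sql.SparkSession

object ExecutorSizing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-sizing")
      .config("spark.executor.cores", "5")     // up to 5 concurrent tasks per executor
      .config("spark.executor.memory", "8g")   // heap size per executor
      .config("spark.executor.instances", "10")
      .getOrCreate()

    // With 10 executors and 5 cores each, at most 50 tasks run concurrently;
    // tasks are scheduled onto these slots, no new executors are allocated per task.
    println(spark.sparkContext.defaultParallelism)

    spark.stop()
  }
}
```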
04-19-2018
06:49 PM
Thanks a lot! Finally, Sqoop... 🙂