Member since: 01-27-2022
Posts: 9
Kudos Received: 2
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
| 2850 | 03-08-2022 08:39 AM
01-02-2023
08:17 AM
We have a full ACID Hive managed table that we need to access from a Spark ETL job. We used the documentation provided to connect with the Hive Warehouse Connector -> https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehousecon...

In addition to using the Hive Warehouse Connector to access the ACID tables, what Spark execution mode differences are there between the JDBC HWC connector with HiveWarehouseSession and a plain SparkContext without the HWC connector? We don't see any information in the Spark UI / Spark History Server, and the query takes roughly 3x longer than a similar query from SQLContext against a non-ACID managed table.

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
df = hive.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# NO DAG in Spark History Server, slower, uses more memory

__________________________

The same pattern using SQLContext:

from pyspark.sql import SQLContext

sqlSparkContext = SQLContext(spark.sparkContext)
df = sqlSparkContext.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# SHOWS DAG in Spark UI / Spark History Server, faster

Can someone please explain the differences, apart from Hive table access: where the HiveWarehouseSession Spark code gets executed, the engines in play, optimization, memory usage, etc., versus Spark code using SQLContext? Does "spark.sql.hive.hwc.execution.mode"=spark change the Spark/MapReduce execution?
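To see where each query is actually planned, one option is to compare the physical plans of the two DataFrames. The sketch below is illustrative only: it assumes a CDP/HDP cluster where the HWC jar and its configuration (HiveServer2 JDBC URL, etc.) are already on the Spark classpath, it reuses the "incidents" table from the snippets above, and the exact plan output and the effect of "spark.sql.hive.hwc.execution.mode" depend on the HWC version in use.

# Minimal sketch, assuming an HWC-enabled PySpark session and the
# "incidents" table from the post above (not a verified configuration).
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-vs-native").getOrCreate()

# HWC path: the query may be delegated to HiveServer2 (Tez/LLAP), so only
# the final fetch is visible to Spark and no full DAG shows up in the UI.
hwc_df = HiveWarehouseSession.session(spark).build().sql("select * from incidents LIMIT 100")
hwc_df.explain(True)      # plan shows the HWC data source / JDBC-style read

# Native path: parsed, optimized and executed by Spark itself, so the
# complete DAG appears in the Spark UI / History Server.
native_df = spark.sql("select * from incidents LIMIT 100")
native_df.explain(True)   # plan shows a normal Spark table scan

Comparing the two plans usually makes it clear whether the slow path is Spark work or a remote Hive execution that Spark is only waiting on.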
12-28-2022
04:40 AM
Can someone explain what is different about these two Spark execution engines below?

Environment: CDP Private Cloud cluster, Spark version 2

We have a full ACID Hive managed table that we need to access from a Spark ETL job. We used the documentation provided to connect with the Hive Warehouse Connector -> https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html

In addition to using the Hive Warehouse Connector to access the ACID tables, what execution differences are there between the two submissions? We don't see any DAG in the Spark History Server, and the query takes roughly 3x longer than a similar query from SQLContext against a non-ACID managed table.

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
df = hive.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# NO DAG in Spark History Server, slower, uses more memory

__________________________

The same pattern using SQLContext:

from pyspark.sql import SQLContext

sqlSparkContext = SQLContext(spark.sparkContext)
df = sqlSparkContext.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# SHOWS DAG in Spark History Server, faster

Can someone please explain the differences, apart from Hive table access: where the HiveWarehouseSession Spark code gets executed, the engines in play, optimization, memory usage, etc., versus Spark code using SQLContext? I suspect
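As a side note for the non-HWC comparison (an assumption on my part, since the environment above states Spark 2): SQLContext is the legacy entry point, and the SparkSession already created for the job exposes the same catalog and native execution path directly, so the comparison can be written without constructing a separate SQLContext.

# Sketch only: assumes an existing SparkSession named "spark" with Hive
# support enabled, as in the snippets above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("select * from incidents LIMIT 100")   # same native plan as the SQLContext variant
df.show(10)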
Labels:
- Apache Hive
- Apache Spark
- Apache Tez
05-06-2022
04:39 PM
Thanks for that! That's helpful.
05-06-2022
02:58 AM
Trying to clarify my original post as requested. The intent was indeed to say that we're moving work between the CDP public cloud and the CDP private cloud experiences. That is one of the significant advantages of the Cloudera offering: a hybrid cloud solution that offers similar experiences between on-prem (private cloud) and public cloud. Glad you pointed out that statement in my post above, which could be confusing, since my question is really about the dependencies between CDP Private Cloud Base and the CDP private cloud experiences. Hope that makes it clear.
05-05-2022
11:22 AM
Hi, I'm trying to set up the data engineering experiences (ECS) on top of a CDP Base installation. These are licensed installations, so we first installed and configured a base cluster and have been working on it. We want to set up the data experiences so we can move some work between the CDP public cloud and the CDP private cloud experiences, since the control planes are similar.

The goal is to work from the CDP private ECS-enabled cluster and not use the base cluster unless the ECS/experiences cluster needs components from it. I'm trying to comb through the documentation to find the relationship between CDP Base and a CDP ECS-enabled experiences cluster. What roles/manager/services from the base cluster are being used by the experiences cluster? I understand a base "core" (manager/runtime etc.) is needed, but I'm not sure I understand the dependencies between the two. The documentation says to start with a base cluster and install ECS, but there are many flavors of base templates and custom roles one can set up. What is the minimum needed here?

Ultimately, we just want to use the elastic services from the CDP private ECS experiences cluster if everything can be served from there, and not waste resources on the base. Could someone explain this please, or reference a document/blog with the explanation? Thanks!
03-08-2022
08:39 AM
2 Kudos
Thanks for all the input. We reinstalled Ubuntu 20.04 and started fresh. It failed again, but this time the log showed that curl was missing on the machine. We manually installed curl and basically monitored the logs in /var/tmp on the nodes to push along any small issues. Finally we got the cluster installed. However, it still had some health monitoring issues and inconsistencies when installing services. In the end, we dropped Ubuntu and went with CentOS 7. We had better luck with that; the errors were more manageable. For anyone going through this: I think the Ubuntu 20.04 cluster scripts/error messages are a bit immature, so go with RHEL if you can. Also, take any health warning reported by Cloudera Manager seriously, like "swappiness". They are only "Warnings", but addressing them ahead of time will save you a ton of time.
01-28-2022
06:20 AM
This is the first cluster I'm adding. I'm installing per this reference: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/installation/topics/cdpdc-trial-installation.html So far I have Cloudera Manager set up, and I'm trying to follow the next set of instructions to add a cluster.
01-27-2022
09:18 AM
The repo has a mostly empty folder with a blank /ubuntu-focal/cloudera.manager.list
01-27-2022
07:02 AM
Cloudera Manager looks fine. Ubuntu 20.04, bare metal host. No SELinux, no firewall.

The agent log says:

BEGIN sudo dpkg -l openjdk8 | grep -E '^ii[[:space:]]*openjdk8[[:space:]]*'
dpkg-query: no packages found matching openjdk8
END (1)

BEGIN sudo apt-cache show openjdk8
E: No packages found

___________

If you skip JDK installation in the wizard, it fails with:

BEGIN sudo dpkg -l cloudera-manager-agent | grep -E '^ii[[:space:]]*cloudera-manager-agent[[:space:]]*'
dpkg-query: no packages found matching cloudera-manager-agent

Any clues?
Labels:
- Cloudera Manager