Member since: 01-27-2022
Posts: 9
Kudos Received: 2
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
| 2850 | 03-08-2022 08:39 AM
01-02-2023
08:17 AM
We have a full ACID Hive managed table that we need to access from a Spark ETL job. We used the documentation provided to connect with the Hive Warehouse Connector -> https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehousecon...

In addition to using the Hive Warehouse Connector to access the ACID tables, what Spark execution mode differences are there between the JDBC HWC connector with HiveWarehouseSession and a plain SparkContext without the HWC connector? We don't see any information in the Spark UI / Spark History Server, and the query takes roughly 3x longer than a similar query from SQLContext against a non-ACID managed table.

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
df = hive.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# NO DAG in Spark History Server, slower, uses more memory

__________________________

The same pattern using SQLContext:

from pyspark.sql import SQLContext

sqlSparkContext = SQLContext(spark.sparkContext)
df = sqlSparkContext.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# SHOWS DAG in Spark UI / Spark History Server, faster

Can someone please explain the differences, apart from Hive table access: where the HiveWarehouseSession Spark code gets executed, the engines in play, optimization, memory usage, etc., versus Spark code using SQLContext? Does "spark.sql.hive.hwc.execution.mode"=spark change the Spark/MapReduce execution?
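To see where each query is actually planned, one option is to compare the physical plans of the two DataFrames. The sketch below is illustrative only: it assumes a CDP/HDP cluster where the HWC jar and its configuration (HiveServer2 JDBC URL, etc.) are already on the Spark classpath, it reuses the "incidents" table from the snippets above, and the exact plan output and the effect of "spark.sql.hive.hwc.execution.mode" depend on the HWC version in use.

# Minimal sketch, assuming an HWC-enabled PySpark session and the
# "incidents" table from the post above (not a verified configuration).
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-vs-native").getOrCreate()

# HWC path: the query may be delegated to HiveServer2 (Tez/LLAP), so only
# the final fetch is visible to Spark and no full DAG shows up in the UI.
hwc_df = HiveWarehouseSession.session(spark).build().sql("select * from incidents LIMIT 100")
hwc_df.explain(True)      # plan shows the HWC data source / JDBC-style read

# Native path: parsed, optimized and executed by Spark itself, so the
# complete DAG appears in the Spark UI / History Server.
native_df = spark.sql("select * from incidents LIMIT 100")
native_df.explain(True)   # plan shows a normal Spark table scan

Comparing the two plans usually makes it clear whether the slow path is Spark work or a remote Hive execution that Spark is only waiting on.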
12-28-2022
04:40 AM
Can someone explain what is different about these two Spark execution engines below?

Environment: CDP Private Cloud cluster, Spark version 2

We have a full ACID Hive managed table that we need to access from a Spark ETL job. We used the documentation provided to connect with the Hive Warehouse Connector -> https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html

In addition to using the Hive Warehouse Connector to access the ACID tables, what execution differences are there between the two submissions? We don't see any DAG in the Spark History Server, and the query takes roughly 3x longer than a similar query from SQLContext against a non-ACID managed table.

from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()
df = hive.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# NO DAG in Spark History Server, slower, uses more memory

__________________________

The same pattern using SQLContext:

from pyspark.sql import SQLContext

sqlSparkContext = SQLContext(spark.sparkContext)
df = sqlSparkContext.sql("select * from incidents LIMIT 100")
...
df.show(10)
# additional spark transformation code..
# SHOWS DAG in Spark History Server, faster

Can someone please explain the differences, apart from Hive table access: where the HiveWarehouseSession Spark code gets executed, the engines in play, optimization, memory usage, etc., versus Spark code using SQLContext? I suspect
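As a side note for the non-HWC comparison (an assumption on my part, since the environment above states Spark 2): SQLContext is the legacy entry point, and the SparkSession already created for the job exposes the same catalog and native execution path directly, so the comparison can be written without constructing a separate SQLContext.

# Sketch only: assumes an existing SparkSession named "spark" with Hive
# support enabled, as in the snippets above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("select * from incidents LIMIT 100")   # same native plan as the SQLContext variant
df.show(10)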
Labels:
- Apache Hive
- Apache Spark
- Apache Tez
05-06-2022
04:39 PM
Thanks for that! That's helpful.
05-06-2022
02:58 AM
Trying to clarify my original post as requested. The intent was indeed to say that we're moving work between the CDP public cloud and the CDP private cloud experiences. That is one of the significant advantages of the Cloudera offering: a hybrid cloud solution that offers similar experiences between on-prem (private cloud) and public cloud. Glad you pointed out that statement in my post above, which could be confusing, since my question is really about the dependencies between CDP Private Cloud Base and the CDP private cloud experiences. Hope that makes it clear.
05-05-2022
11:22 AM
Hi, I'm trying to set up the data engineering experiences (ECS) on top of a CDP Base installation. These are licensed installations, so we first installed and configured a base cluster and have been working on it. We want to set up the data experiences so we can move some work between the CDP public cloud and the CDP private cloud experiences, since the control planes are similar.

The goal is to work from the CDP private ECS-enabled cluster and not use the base cluster unless the ECS/experiences cluster needs components from it. I'm trying to comb through the documentation to find the relationship between CDP Base and a CDP ECS-enabled experiences cluster. What roles/manager/services from the base cluster are being used by the experiences cluster? I understand a base "core" (manager/runtime etc.) is needed, but I'm not sure I understand the dependencies between the two. The documentation says to start with a base cluster and install ECS, but there are many flavors of base templates and custom roles one can set up. What is the minimum needed here?

Ultimately, we just want to use the elastic services from the CDP private ECS experiences cluster if everything can be served from there, and not waste resources on the base. Could someone explain this please, or reference a document/blog with the explanation? Thanks!
03-08-2022
08:39 AM
2 Kudos
Thanks for all the input. We reinstalled Ubuntu 20.04 and started fresh. It failed again, but this time the log showed that curl was missing on the machine. We manually installed curl and basically monitored the logs in /var/tmp on the nodes to push along any small issues. Finally we got the cluster installed. However, it still had some health monitoring issues and inconsistencies when installing services. In the end, we dropped Ubuntu and went with CentOS 7. We had better luck with that; the errors were more manageable. For anyone going through this: I think the Ubuntu 20.04 cluster scripts/error messages are a bit immature, so go with RHEL if you can. Also, take any health warning reported by Cloudera Manager seriously, like "swappiness". They are only "Warnings", but addressing them ahead of time will save you a ton of time.
01-28-2022
06:20 AM
This is the first cluster I'm adding. I'm installing per this reference: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/installation/topics/cdpdc-trial-installation.html So far I have Cloudera Manager set up, and I'm trying to follow the next set of instructions to add a cluster.
01-27-2022
09:18 AM
The repo has a mostly empty folder with a blank /ubuntu-focal/cloudera.manager.list
01-27-2022
07:02 AM
Cloudera Manager looks fine. Ubuntu 20.04, bare metal host. No SELinux, no firewall.

The agent log says:

BEGIN sudo dpkg -l openjdk8 | grep -E '^ii[[:space:]]*openjdk8[[:space:]]*'
dpkg-query: no packages found matching openjdk8
END (1)

BEGIN sudo apt-cache show openjdk8
E: No packages found

___________

If you skip JDK installation in the wizard, it fails with:

BEGIN sudo dpkg -l cloudera-manager-agent | grep -E '^ii[[:space:]]*cloudera-manager-agent[[:space:]]*'
dpkg-query: no packages found matching cloudera-manager-agent

Any clues?
Labels:
- Cloudera Manager