Created 06-28-2022 07:43 AM
How do you write a DataFrame (pandas or Spark) with Python from CML (Workbench or Notebook) to a Cloudera Data Hub cluster?
Is it possible without pyspark?
Hi @data_diver. To start with, in CDP Public Cloud, you write all your data to the cloud storage service for your platform (such as S3 or ADLS). After doing that, you can read it from a data hub cluster.
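For the first step with a plain pandas DataFrame (no PySpark involved), writing to storage can be a one-liner. A minimal sketch follows; the `s3://` bucket path in the comment is a hypothetical placeholder and requires the `s3fs` package, so a local path is used so the snippet runs anywhere:

```python
import pandas as pd

# Toy DataFrame standing in for your data
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Against CDP Public Cloud you would point this at your environment's
# cloud storage, e.g. (hypothetical bucket, requires the s3fs package):
#   df.to_csv("s3://your-bucket/data/table.csv", index=False)
# A local path is used here so the snippet runs anywhere:
df.to_csv("/tmp/table.csv", index=False)
```

Once the file is in the environment's cloud storage, any Data Hub service with access to that storage location can read it.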
Regarding your question about writing a DataFrame from Python, I want to start by clarifying a couple of points. You want to write a DataFrame, which is a Spark object, from Python, but without using PySpark, which is the framework that allows Python to interact with Spark objects such as DataFrames. Is all that correct?
Perhaps you can start by giving us a bit of context. Why do you want to write a DataFrame without using PySpark? How will the DataFrame object exist in your Python program without PySpark in the first place? Any context you can provide for your use case would be helpful.
Hi @data_diver ,
Based on your statement "Now my intention is to copy/write a table from CML python into that space that I can see from datahubcluster. The table or dataframe formats that I use are pandas or spark.", the only missing piece of information is the sink and format in which you want to write the table, i.e.:
The use case, as I understand it:
1) CML Spark/PySpark/Python/Pandas ->
2) Create a "table" Dataframe ->
3) Write the Dataframe down to your Datahub space ->
4) Questions at this point would be:
4.1) Write the dataframe as a Managed table into Hive? Then use the HiveWarehouseConnector (HWC from now on).
4.2) Write the dataframe as an External table into Hive? Then use the HWC.
4.3) Write the dataframe directly to a Datahub HDFS path? The HDFS service in a Datahub is not meant for this purpose, so this is unsupported.
4.4) Write the dataframe directly to a filesystem? Then this would not involve the Datahub at all; you can simply call df.write.option("path", "/some/path").saveAsTable("t"), where "path" points to your ADLS storage container.
4.5) Write to Hive, without HWC? You'll need to configure CML so that Spark loads the Hive metastore endpoints from your Datahub. I haven't tried this and am not sure of its support status, but it should be doable: as long as both experiences (CML and DH) are in the same environment, they should be reachable over the network. This means collecting the hive-site.xml file from your Datahub and making the Hive Metastore URIs from that hive-site.xml available to your CML Spark session somehow. This method should let you write directly to an EXTERNAL table in DH Hive, i.e. by adding the configuration to your SparkSession:
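The SparkSession configuration that option 4.5 refers to could be sketched as follows. This is an untested sketch: the metastore URI and storage path are hypothetical placeholders that you would replace with the real values from your Datahub's hive-site.xml.

```python
from pyspark.sql import SparkSession

# All hostnames and paths below are hypothetical placeholders; copy the
# real values from the hive-site.xml of your Datahub cluster.
spark = (
    SparkSession.builder
    .appName("cml-write-to-datahub-hive")
    # hive.metastore.uris taken from the Datahub's hive-site.xml
    .config("spark.hadoop.hive.metastore.uris",
            "thrift://your-datahub-master0.example.com:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Write the DataFrame as an EXTERNAL table backed by cloud storage
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
(df.write
   .mode("overwrite")
   .option("path", "abfs://data@youraccount.dfs.core.windows.net/tables/t")
   .saveAsTable("default.t"))
```

Because "path" is set explicitly, Spark creates the table as EXTERNAL, with the data living in the ADLS container rather than in the managed warehouse.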
Please feel free to clarify the details of your use case, and in particular the need for a "Spark connector", e.g.:
- Is there any issue with using the HWC?
- Are you getting any specific errors while connecting from CML to DH using Spark?
- Which client will you be using in DH to query the table?
- Any other you can think of...
The more details you can provide, the better we'll be able to help you.
@data_diver Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks