
Write a Python DataFrame (pandas, Spark) from CML (Workbench or Notebook) to a Cloudera Data Hub cluster

New Contributor

How do I write a DataFrame (pandas or Spark) from Python in CML (Workbench or Notebook) to a Cloudera Data Hub cluster?

 

Is it possible without PySpark?

 

Thanks!

4 REPLIES

Expert Contributor

Hi @data_diver. To start with, in CDP Public Cloud, you write all your data to the cloud storage service for your platform (such as S3 or ADLS). After doing that, you can read it from a data hub cluster.
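One concrete illustration of that first point: a plain pandas DataFrame can be written to ADLS from Python without Spark at all, because pandas writers accept fsspec-style URLs. A minimal sketch, assuming the adlfs package is installed in the CML session; the account, container, and credential below are placeholders:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# pandas resolves abfs:// URLs through fsspec/adlfs, so no PySpark is involved.
# Account, container, and key are placeholders for your own ADLS Gen2 setup.
df.to_parquet(
    "abfs://my-container@mystorageaccount.dfs.core.windows.net/data/example.parquet",
    storage_options={"account_key": "<storage-account-key>"},
)

Once the file is in ADLS, a Data Hub cluster with access to the same container can read it.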

 

Regarding your question about writing a DataFrame from Python, I want to start by clarifying a couple of points. You want to write a DataFrame, which is a Spark object, from Python, but without using PySpark, which is the framework that allows Python to interact with Spark objects such as DataFrames. Is all that correct?

 

Perhaps you can start by giving us a bit of context. Why do you want to write a DataFrame without using PySpark? How will the DataFrame object exist in your Python program without PySpark in the first place? Any context you can provide for your use case would be helpful. 

New Contributor
Hello bbreak,

Thank you for getting back to me. Much appreciated.
Your summary partially captures my question.

My problem description in other words:
I work in an organisation that uses CDP as an Azure service.
Within CDP I can use Data Hub clusters and Cloudera Machine Learning (CML).
Due to the way our data engineers configured CML, I can use the Hive connector, but not the Spark connector. What it would connect to might be a storage space like ADLS, as you taught me; I have no clue about that part. The essence is that the Spark connector is not there yet.
So much for the premises.
Now my intention is to copy/write a table from Python in CML into that space that I can see from the Data Hub cluster.
The table or DataFrame formats that I use are pandas or Spark.

My question:
How do I copy/write CML Python DataFrames in pandas or Spark format into that space that I can see from the Data Hub cluster?

Thank you in advance!

Rising Star

Hi @data_diver ,

 

From "Now my intention is to copy/write a table from CML python into that space that I can see from datahubcluster. The table or dataframe formats that I use are pandas or spark.", the only missing piece of information here would be, in which sink and format are you wanting to write the table? i.e.:

Use case from what I understand :
1) CML Spark/PySpark/Python/Pandas ->

2) Create a "table" Dataframe ->

3) Write the Dataframe down to your Datahub space ->

4) Questions at this point would be:

4.1) Write the DataFrame as a managed table into Hive? Then use the HiveWarehouseConnector (HWC from now on); see the sketches after this list.

4.2) Write the DataFrame as an external table into Hive? Then use the HWC as well.

4.3) Write the DataFrame directly to a Datahub HDFS path? The HDFS service in a Datahub is not meant for this purpose, so this is unsupported.

4.4) Write the DataFrame directly to a filesystem? Then this would not be related to a Datahub; you'd just df.write.option("path", "/some/path").saveAsTable("t"), where "path" points into your ADLS storage container (expanded in the sketches after this list).

4.5) Write to Hive without HWC? You'll need to configure CML so that Spark picks up the Hive endpoints from your Datahub. I haven't tried this, nor am I sure about its support status, but it should be doable, and as long as both experiences (CML and DH) are in the same environment, they should be reachable from each other network-wise. This means collecting the hive-site.xml file from your Datahub and somehow making the Hive metastore URIs from those configs available to your CML Spark session. This method should let you write directly to an EXTERNAL table in DH Hive, i.e. by adding to your SparkSession:

 

.config("spark.datasource.hive.warehouse.metastoreUri", "[Hive_Metastore_Uris]")

 

Please feel free to clarify the details on the use case, and hopefully on the need for a "Spark connector", i.e.:

- Is there any issue with using the HWC?

- Are you getting any specific errors while connecting from CML to DH using Spark?

- Which client will you be using in DH to query the table?

- Anything else you can think of...

 

The more details you can provide, the better we'll be able to help you.

 

Community Manager

@data_diver Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution; it will make it easier for others to find the answer in the future. Thanks!


Regards,

Diana Torres,
Community Moderator

