Community Articles

Find and share helpful community-sourced technical articles.
Celebrating as our community reaches 100,000 members! Thank you!

Have you stumbled upon existing articles on only reading from a managed Hive table using R within CML, but you really want to read/write to an external table (not a Hive Managed table)?  Then you've come to the right place!  


There are complexities to reading/writing to a Hive-managed table since ACID is involved, as the HWC piece solves.  In 2023, if you're looking for ACID capabilities then using Iceberg table format instead of Hive-managed tables is recommended anyways. The Iceberg table format is fully supported throughout CDP.  


Let's get on with the steps to easily read/write using R to an external table within CML!  Again, without the need for HWC.  


Step 1: Created two tables (external) and inserted records

*In this case, I used Cloudera Data Warehouse with an Impala Virtual Warehouse


create database cloudera_r;

create external table cloudera_r.people_table (id int, name string) location 's3a://ryancicak/people_table';

insert into cloudera_r.people_table values (0,'ryan'), (1, 'paul'), (2, 'oleksandr'), (3, 'oliver'), (4,'jd');

create external table cloudera_r.people_new_table (id int, name string) location 's3a://ryancicak/people_new_table';

insert into cloudera_r.people_new_table values (5,'danielle'), (6,'nilesh');


Step 2: Modifying the code in the article above (stripping out HWC) - I'm able to run the code (which I'll include a snippet below so you can copy)

  • Since the Hive Metastore contains the S3 location from Impala in Step 1, we do not need to specify the S3 location for Spark to read/write.
  • We need the spark.yarn.access.hadoopFileSystem to contain the S3 location that our table is - so Spark has access to read/write.  If you make a modification on spark.yarn.access.hadoopFileSystem - it's best to start a new session.



#Run this once

#Add library

#Add spark_config()
config <- spark_config()

#Add configuration

#Connect with spark
sc <- spark_connect(config = config)

#Change database if necessary
spark_session(sc) %>% invoke("sql", "USE cloudera_r")

peopleframe <- tbl(sc, sql("select * from cloudera_r.people_new_table"))

#spark_write_csv(peopleframe,"s3a://ryancicak/testing_people.csv", header=TRUE)
spark_write_table(peopleframe, "cloudera_r.people_table", mode = "append")

#Read and show table
intDf1 <- sparklyr::spark_read_table(sc, 'people_table')

peopleframe <- tbl(sc, sql("select * from cloudera_r.people_table"))


Notice that the first print table is showing the two rows, Danielle and Nilesh.  The second table print is showing the table people_table after we wrote the two new rows, where instead of five initial records, we show seven (since we appended).
  • I do print the same table twice - just showing two different ways to read from a table (I prefer the SQL option).
  • If you do not choose the correct bucket name (the S3 bucket where you're reading/writing tables) - you'll receive an error.  In my case, I receive a RAZ 401 error (since I have RAZ enabled).  If I change my bucket after running one in my session, I'll still get the error.  I need to stop and then start a new session to take in the correct bucket. 
  • I can add multiple S3 locations separated with commas.
Notice that I changed my hadoopFileSystems to "differents3bucket" and receive the access error.  Even if I change back to "ryancicak", I still receive an access error until I start a new session.  
@pauldefusco added some more examples to his GitHub repo that may be helpful: SparklyR_CML_Example.
Shout-out to @pauldefusco@ozarate@aakulov, and @jagadeesan for helping with these steps!

Cloudera making hard stuff easy again!   Great article Ryan!!