Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Generating unique Ids for hive table using Scala-Spark code

Generating unique Ids for hive table using Scala-Spark code

New Contributor

Hi,

Hi, I have created one dataframe in Spark 1.6  by reading data from MySql Database. In that dataframe there is ID column which is null while loading  from rdbms .Now I would like to insert this Dataframe into Hive table but ID column must be populated with some sequence number(0,1,...n). How can I achieve this in Scala program. I Hive 1.x  hence can't  take benefit  of HIve2.x.

3 REPLIES 3

Re: Generating unique Ids for hive table using Scala-Spark code

Expert Contributor

 

SQL provides function "rand" for random number generation.

In general, we've seen clients using df.na.fill() to replace Null strings. See if that helps. 

 

scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|Co  |Place|
|null| a1  |
|null| a2  |
+----+-----+

scala> val newDF= df.na.fill(1.0, Seq("col1"))

scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
| Co |Place|
| 1  | a1  |
| 1  | a2  |
+----+-----+

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions

 

Re: Generating unique Ids for hive table using Scala-Spark code

New Contributor

Thank you  for the reply. But instrea of constant value my requirement is to pouplate it  with uique number. 1,2,3....n

Highlighted

Re: Generating unique Ids for hive table using Scala-Spark code

Expert Contributor

Sure. One way I can think of achieving this is by creating a UDF using random and calling the udf within withColumn using coalesce. See below: 

 

scala> df1.show()
+----+--------+----+
|  id|    name| age|
+----+--------+----+
|1201|  satish|39  |
|1202| krishna|null| <<
|1203|   amith|47  |
|1204|   javed|null| <<
|1205|  prudvi|null| <<
+----+--------+----+

scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())

scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))

scala> df2.show()
+----+--------+---+
|  id|    name|age|
+----+--------+---+
|1201|  satish| 39|
|1202| krishna| 2 | <<
|1203|   amith| 47|
|1204|   javed| 9 | <<
|1205|  prudvi| 7 | <<
+----+--------+---+