Support Questions

navbhatt · ‎11-16-2017

Hi,

Hi, I have created one dataframe in Spark 1.6 by reading data from MySql Database. In that dataframe there is ID column which is null while loading from rdbms .Now I would like to insert this Dataframe into Hive table but ID column must be populated with some sequence number(0,1,...n). How can I achieve this in Scala program. I Hive 1.x hence can't take benefit of HIve2.x.

AutoIN · ‎11-17-2017

SQL provides function "rand" for random number generation.

In general, we've seen clients using df.na.fill() to replace Null strings. See if that helps.

scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|Co  |Place|
|null| a1  |
|null| a2  |
+----+-----+

scala> val newDF= df.na.fill(1.0, Seq("col1"))

scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
| Co |Place|
| 1  | a1  |
| 1  | a2  |
+----+-----+

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions

navbhatt · ‎11-19-2017

Thank you for the reply. But instrea of constant value my requirement is to pouplate it with uique number. 1,2,3....n

AutoIN · ‎11-25-2017

Sure. One way I can think of achieving this is by creating a UDF using random and calling the udf within withColumn using coalesce. See below:

scala> df1.show()
+----+--------+----+
|  id|    name| age|
+----+--------+----+
|1201|  satish|39  |
|1202| krishna|null| <<
|1203|   amith|47  |
|1204|   javed|null| <<
|1205|  prudvi|null| <<
+----+--------+----+

scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())

scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))

scala> df2.show()
+----+--------+---+
|  id|    name|age|
+----+--------+---+
|1201|  satish| 39|
|1202| krishna| 2 | <<
|1203|   amith| 47|
|1204|   javed| 9 | <<
|1205|  prudvi| 7 | <<
+----+--------+---+

Cloudera Community

Support Questions

Generating unique Ids for hive table using Scala-Spark code