Created on 11-16-2017 08:38 PM - edited 09-16-2022 05:32 AM
Hi, I have created a DataFrame in Spark 1.6 by reading data from a MySQL database. The DataFrame has an ID column that is null when loaded from the RDBMS. I would now like to insert this DataFrame into a Hive table, but the ID column must be populated with a sequence number (0, 1, ..., n). How can I achieve this in a Scala program? I am on Hive 1.x, so I can't take advantage of Hive 2.x features.
Created 11-17-2017 05:21 AM
Spark SQL provides the "rand" function for random number generation.
In general, we've seen clients use df.na.fill() to replace null values. See if that helps.
scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|Co |Place|
|null| a1 |
|null| a2 |
+----+-----+
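scala> // col1 holds strings, so the fill value must be a String; a numeric fill value is ignored for non-numeric columns
scala> val newDF = df.na.fill("1", Seq("col1"))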
scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
| Co |Place|
| 1 | a1 |
| 1 | a2 |
+----+-----+
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
Created 11-19-2017 08:24 PM
Thank you for the reply. But instead of a constant value, my requirement is to populate it with a unique number: 1, 2, 3, ..., n.
Created on 11-25-2017 10:40 PM - edited 11-26-2017 07:27 AM
Sure. One way I can think of to achieve this is to create a UDF around scala.util.Random and call it within withColumn using coalesce, so that only the null values are replaced. See below:
scala> df1.show()
+----+--------+----+
| id| name| age|
+----+--------+----+
|1201| satish|39 |
|1202| krishna|null| <<
|1203| amith|47 |
|1204| javed|null| <<
|1205| prudvi|null| <<
+----+--------+----+
scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())
scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))
scala> df2.show()
+----+--------+---+
| id| name|age|
+----+--------+---+
|1201| satish| 39|
|1202| krishna| 2 | <<
|1203| amith| 47|
|1204| javed| 9 | <<
|1205| prudvi| 7 | <<
+----+--------+---+
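Note that random values can repeat, so the UDF above does not guarantee unique numbers. If the IDs need to be a gap-free 0, 1, ..., n sequence as in the original question, one option that works on Spark 1.6 is RDD.zipWithIndex. The following is only a minimal sketch under some assumptions: withSequentialId is a hypothetical helper name, the column to overwrite is assumed to be called "ID" and to exist in the DataFrame, and the numbering follows the current partition order of the data rather than any business ordering.

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: replaces the column `idCol` with 0, 1, ..., n-1.
// Assumes `idCol` exists in `df`; its old (null) values are discarded.
def withSequentialId(sqlContext: SQLContext, df: DataFrame, idCol: String = "ID"): DataFrame = {
  // Remove the all-null ID column before generating a new one.
  val rest = df.drop(idCol)

  // zipWithIndex pairs every row with a consecutive Long index starting at 0.
  // The index follows partition order, then the order of rows within each partition.
  val indexedRows = rest.rdd.zipWithIndex.map {
    case (row, idx) => Row.fromSeq(idx +: row.toSeq)
  }

  // New schema: the generated ID column first, then the remaining columns.
  val schema = StructType(StructField(idCol, LongType, nullable = false) +: rest.schema.fields)
  val withId = sqlContext.createDataFrame(indexedRows, schema)

  // Restore the original column order, which matters when inserting into a Hive table.
  withId.select(df.columns.map(withId(_)): _*)
}
```

Calling withSequentialId(sqlContext, df) returns a DataFrame with the same columns as df, ready to be written to Hive the usual way. If the numbers only need to be unique and not consecutive, the built-in monotonically_increasing_id() function in org.apache.spark.sql.functions avoids the round trip through an RDD, but it leaves large gaps between partitions.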