Created on 11-16-2017 08:38 PM - edited 09-16-2022 05:32 AM
Hi,
I have created a DataFrame in Spark 1.6 by reading data from a MySQL database. The DataFrame has an ID column that is null when loaded from the RDBMS. Now I would like to insert this DataFrame into a Hive table, but the ID column must be populated with a sequence number (0, 1, ... n). How can I achieve this in a Scala program? I am on Hive 1.x, so I can't take advantage of Hive 2.x features.
Created 11-17-2017 05:21 AM
SQL provides the "rand" function for random number generation.
In general, we've seen clients use df.na.fill() to replace null values with a constant. See if that helps.
scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|  Co|Place|
|null|   a1|
|null|   a2|
+----+-----+
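scala> // col1 is a string column, so fill with a string; a numeric fill value is ignored for non-numeric columns
scala> val newDF = df.na.fill("1", Seq("col1"))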
scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
|  Co|Place|
|   1|   a1|
|   1|   a2|
+----+-----+
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
Created 11-19-2017 08:24 PM
Thank you for the reply. But instead of a constant value, my requirement is to populate it with a unique number: 1, 2, 3, ... n.
Created on 11-25-2017 10:40 PM - edited 11-26-2017 07:27 AM
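Sure. One way I can think of is to create a UDF that uses a random number generator and call it inside withColumn together with coalesce, so that only the null values are replaced. Note that random values are not guaranteed to be unique or sequential. See below: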
scala> df1.show()
+----+--------+----+
|  id|    name| age|
+----+--------+----+
|1201|  satish|  39|
|1202| krishna|null| <<
|1203|   amith|  47|
|1204|   javed|null| <<
|1205|  prudvi|null| <<
+----+--------+----+
scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())
scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))
scala> df2.show()
+----+--------+---+
|  id|    name|age|
+----+--------+---+
|1201|  satish| 39|
|1202| krishna|  2| <<
|1203|   amith| 47|
|1204|   javed|  9| <<
|1205|  prudvi|  7| <<
+----+--------+---+
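If the IDs need to be unique and consecutive (0, 1, ... n) rather than random, one possible approach is to go through the underlying RDD and use zipWithIndex. The following is only a minimal sketch for Spark 1.6: the helper name withSequentialId and its parameters are made up for illustration, and it assumes the ID column should become a non-null Long.

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper (not part of Spark): overwrites column `idCol`
// with a 0-based, consecutive row index.
def withSequentialId(sqlContext: SQLContext, df: DataFrame, idCol: String): DataFrame = {
  // Position of the id column in the existing schema.
  val idPos = df.schema.fieldIndex(idCol)

  // Same schema as before, except the id column is typed as a non-null Long.
  val newSchema = StructType(
    df.schema.fields.updated(idPos, StructField(idCol, LongType, nullable = false)))

  // zipWithIndex pairs every row with a unique index 0, 1, ... n-1;
  // the index then replaces whatever value the id column held.
  val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq.updated(idPos, idx))
  }

  sqlContext.createDataFrame(indexedRows, newSchema)
}
```

In the spark-shell this could be called as withSequentialId(sqlContext, df, "ID"), and the result written to Hive as usual. The index follows the partition order of the DataFrame, so if a specific numbering order matters, sort the DataFrame before calling the helper. zipWithIndex also runs an extra Spark job when the RDD has more than one partition.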