Created on 11-16-2017 08:38 PM - edited 09-16-2022 05:32 AM
Hi,
I have created a DataFrame in Spark 1.6 by reading data from a MySQL database. The DataFrame has an ID column that is null when loaded from the RDBMS. I would now like to insert this DataFrame into a Hive table, but the ID column must be populated with a sequence number (0, 1, ..., n). How can I achieve this in a Scala program? I am on Hive 1.x, so I can't take advantage of Hive 2.x features.
Created 11-17-2017 05:21 AM
Spark SQL provides the "rand" function for random number generation.
In general, we've seen clients use df.na.fill() to replace null values. See if that helps.
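scala> df.show()
+----+-----+
|col1| col2|
+----+-----+
|  Co|Place|
|null|   a1|
|null|   a2|
+----+-----+

// col1 is a string column, so the fill value must be a string;
// a numeric fill value such as 1.0 is ignored for non-numeric columns.
scala> val newDF = df.na.fill("1", Seq("col1"))
scala> newDF.show()
+----+-----+
|col1| col2|
+----+-----+
|  Co|Place|
|   1|   a1|
|   1|   a2|
+----+-----+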
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
Created 11-19-2017 08:24 PM
Thank you for the reply. But instead of a constant value, my requirement is to populate it with a unique number: 1, 2, 3, ..., n.
Created on 11-25-2017 10:40 PM - edited 11-26-2017 07:27 AM
Sure. One way I can think of achieving this is by creating a UDF using random and calling the udf within withColumn using coalesce. See below:
scala> df1.show()
+----+--------+----+
|  id|    name| age|
+----+--------+----+
|1201|  satish|  39|
|1202| krishna|null| <<
|1203|   amith|  47|
|1204|   javed|null| <<
|1205|  prudvi|null| <<
+----+--------+----+

scala> val arr = udf(() => scala.util.Random.nextInt(10).toString())
scala> val df2 = df1.withColumn("age", coalesce(df1("age"), arr()))
scala> df2.show()
+----+--------+---+
|  id|    name|age|
+----+--------+---+
|1201|  satish| 39|
|1202| krishna|  2| <<
|1203|   amith| 47|
|1204|   javed|  9| <<
|1205|  prudvi|  7| <<
+----+--------+---+
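Note that a random value is not guaranteed to be unique, so it may not satisfy the 1, 2, 3, ..., n requirement. One possible alternative is to go through the underlying RDD and use zipWithIndex, which assigns each row a consecutive index starting at 0. The following is only a minimal sketch for Spark 1.6, not a tested solution for your data: the object name SequentialIdSketch, the helper withSequentialId, and the sample rows are all hypothetical, and the helper places the new ID column first, which may not match your Hive table's column order.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical helper object, for illustration only.
object SequentialIdSketch {

  // Drops the existing (null) idCol if present and prepends a new
  // non-null Long column holding a consecutive index 0, 1, ..., n-1.
  def withSequentialId(df: DataFrame, idCol: String): DataFrame = {
    val base = if (df.columns.contains(idCol)) df.drop(idCol) else df

    // zipWithIndex pairs every row with its position in the RDD.
    // Use (idx + 1) instead of idx to start the sequence at 1.
    val indexedRows = base.rdd.zipWithIndex.map {
      case (row, idx) => Row.fromSeq(idx +: row.toSeq)
    }

    val schema = StructType(
      StructField(idCol, LongType, nullable = false) +: base.schema.fields)

    df.sqlContext.createDataFrame(indexedRows, schema)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("seq-id-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Assumed sample data standing in for the MySQL load: id is null everywhere.
    val inputSchema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)))
    val rows = sc.parallelize(Seq(
      Row(null, "satish"), Row(null, "krishna"), Row(null, "amith")))
    val df = sqlContext.createDataFrame(rows, inputSchema)

    withSequentialId(df, "id").show()
    sc.stop()
  }
}
```

To write the result to Hive, the DataFrame would need to come from a HiveContext rather than a plain SQLContext, and since insertInto matches columns by position, you would likely need a select(...) to reorder the columns to match the target table first.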