Support Questions

Anitauzanna · ‎10-18-2021

Hi, I have such a problem that I have a csv file that I put in data frame and everything is okay, where field is empty it puts null in data frame, but when I want to insert that into Hive table nulls are getting changed to "?".

tarak271 · ‎01-03-2022

Hi @Anitauzanna
I have tried below use case as per the problem description, but I could not see '?' in hive table

[root@c1448-node1 ~]# hdfs dfs -cat /tmp/data.csv
1,a,b
2,,c
3,d,
[root@c1448-node1 ~]#

Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.7.7.1.7.0-551
/_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)

scala> val csv_df = spark.read.csv("/tmp/data.csv");
csv_df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala> csv_df.show
+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
| 1| a| b|
| 2|null| c|
| 3| d|null|
+---+----+----+


scala> csv_df.write.mode("overwrite").saveAsTable("null_test");
22/01/03 08:22:28 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
Hive Session ID = 623fe696-c08a-4bad-b15a-12b6c7db1c52

scala> val null_df=spark.sql("select * from null_test");
null_df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala> null_df.show
+---+----+----+
|_c0| _c1| _c2|
+---+----+----+
| 1| a| b|
| 2|null| c|
| 3| d|null|
+---+----+----+


scala>

[root@c1448-node1 ~]#
[root@c1448-node1 ~]# beeline -e "select * from null_test;"
+----------------+----------------+----------------+
| null_test._c0 | null_test._c1 | null_test._c2 |
+----------------+----------------+----------------+
| 1 | a | b |
| 2 | NULL | c |
| 3 | d | NULL |
+----------------+----------------+----------------+
3 rows selected (1.344 seconds)

I guess there could be some unsupported characters in your text file, we might need to change encoding to get them correctly visible

Please share the sample data along with the Hive table schema & spark commands tried so far to help troubleshoot the issue further

Cloudera Community

Support Questions

Inserting nulls from dataframe to hive table