DataFrame is hashed, but inserting it into a table makes it unhashed

# Create a DataFrame from the source table
clkstrm_brand = spark.sql("""
    select * from 10016_aa_clkstrm_na_lz_db.gdf0r12_adobe_clickstr_brand
    where df0r12_pevar15_site_x = 'example.com'
""")

# Create a UDF and hash the column with SHA-256
import hashlib
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def hash_func(df0r12_ip_x_2_hashed):
    sha_value = hashlib.sha256(df0r12_ip_x_2_hashed.encode()).hexdigest()
    return sha_value

spark_udf = udf(hash_func, StringType())
data = clkstrm_brand.withColumn('df0r12_ip_x_2_hashed', spark_udf('df0r12_ip_x_2'))
data = data.drop('df0r12_ip_x_2')
#data.select(F.col('df0r12_ip_x_2_hashed')).show(10,truncate=False)


At this point the DataFrame is hashed.
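As a sanity check outside Spark, the hashing step can be sketched in plain Python. This is only an illustration: `hash_ip` is a hypothetical name, not part of the original code, and the `None` handling is an assumption (the `hash_func` above would raise on a null `df0r12_ip_x_2`). The sample address is made up.

```python
import hashlib
from typing import Optional


def hash_ip(value: Optional[str]) -> Optional[str]:
    """Return the SHA-256 hex digest of value, or None for a null input."""
    # Assumption: null inputs should stay null rather than raise AttributeError.
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# A SHA-256 hex digest is always 64 hexadecimal characters long,
# so a short value such as "::3175731469" cannot be the output of this function.
digest = hash_ip("203.0.113.7")
print(len(digest))
```

If useful, it could be wrapped with `udf(hash_ip, StringType())` in the same way as `hash_func`.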



# Write the DataFrame into the target table
sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
data.write.mode("overwrite").insertInto("gotd_dataops.clickstr_brand_hashed")

# Read the table back and inspect the hashed column
dff = spark.table("gotd_dataops.clickstr_brand_hashed")
dff.select(F.col('df0r12_ip_x_2_hashed')).show(50)


+--------------------+
|df0r12_ip_x_2_hashed|
+--------------------+
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|
|        ::3175731469|


But the values read back from the table are not hashed!
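One thing that may be worth checking, offered as a guess and not a confirmed diagnosis: `DataFrameWriter.insertInto` ignores column names and resolves columns by position. Because `df0r12_ip_x_2_hashed` is added with `withColumn`, it sits last in `data`, so it may be written into a different column of `gotd_dataops.clickstr_brand_hashed` while the table's `df0r12_ip_x_2_hashed` column receives some other field. The plain-Python sketch below only illustrates that kind of positional mismatch. The column lists, `other_col`, and the helper `align_columns` are hypothetical, since the real table schema is not shown here.

```python
from typing import List


def align_columns(df_columns: List[str], table_columns: List[str]) -> List[str]:
    """Return the table's column order after checking the DataFrame has them all."""
    missing = [c for c in table_columns if c not in df_columns]
    if missing:
        raise ValueError(f"DataFrame is missing table columns: {missing}")
    # Select exactly the table's columns, in the table's order.
    return list(table_columns)


# Hypothetical schemas: the hashed column is last in the DataFrame
# but first in the target table.
df_cols = ["df0r12_pevar15_site_x", "other_col", "df0r12_ip_x_2_hashed"]
table_cols = ["df0r12_ip_x_2_hashed", "df0r12_pevar15_site_x", "other_col"]

# Position-based resolution pairs columns by index, not by name.
by_position = dict(zip(table_cols, df_cols))
print(by_position["df0r12_ip_x_2_hashed"])

# Reordering by the table's column list removes the mismatch.
ordered = align_columns(df_cols, table_cols)
print(ordered == table_cols)
```

In PySpark the equivalent reordering would be `data.select(*spark.table("gotd_dataops.clickstr_brand_hashed").columns)` before calling `insertInto`, assuming every table column name exists in `data`.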