Member since: 11-23-2021
Posts: 13
Kudos Received: 0
Solutions: 0
08-30-2022
10:05 PM
It is an internal (managed) table. The data in Hive is fine: I can select/update/delete it through OPENQUERY from SQL Server, and I can query it from DBeaver.
08-30-2022
06:45 PM
Hello @RangaReddy, I am running on Hortonworks, and the Hive table is in ORC format. What do you mean by Hive catalog or in-memory catalog?
07-20-2022
12:33 AM
Hello all,
I cannot read data from a Hive ORC table into a dataframe. If anyone knows what is wrong, could you help me fix it? Below is my script:
from pyspark import SparkContext, SparkConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext, SQLContext

spark = SparkSession.builder.appName("Testing....").enableHiveSupport().getOrCreate()
hive_context = HiveContext(spark)
sqlContext = SQLContext(spark)

df_pgw = hive_context.sql("select * from orc_table")

Console output:
Hive Session ID = 79c9e6c0-1649-41dc-9aea-493c0f62d046
22/07/20 11:50:52 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
22/07/20 11:50:56 WARN HiveMetastoreCatalog: Unable to infer schema for table orc_table from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
df_pgw.show()
=> no data is returned
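A minimal diagnostic sketch (not part of the original post): SparkSession replaces the deprecated HiveContext in Spark 2.x, and the warehouse path below is only an assumption, so take the real one from DESCRIBE FORMATTED.

# Sketch only: check what the metastore reports for the table, then try reading
# the ORC files directly from the table's LOCATION to see whether Spark can
# read any rows at all.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-check").enableHiveSupport().getOrCreate()

spark.sql("describe formatted orc_table").show(100, truncate=False)

# Replace this path with the LOCATION reported above (the path here is an assumption)
df_files = spark.read.orc("/warehouse/tablespace/managed/hive/orc_table")
df_files.show(5)

If the direct file read returns rows while the metastore-based read does not, the issue is likely on the table-definition side (for example a transactional/ACID managed table, which Spark cannot read directly without extra tooling).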
Thanks,
Labels:
- Apache Hive
- Apache Spark
07-14-2022
03:36 AM
Hello everyone, I am new to Spark processing. I need your help with a problem when transforming data from flat files into a Hive ORC table. Below is my process flow using pyspark:
1 - Use pyspark to load the flat files into a dataframe
2 - Transform the data in the dataframe and insert it into a Hive table (Parquet)
3 - Insert the data from the Hive table (Parquet) into the ORC table
Steps 1 and 2 are fast, but step 3 is too slow because it uses a lot of memory, and sometimes it gets stuck and cannot continue. Please advise and recommend a better flow. Thanks.
Here is the sample code:

-- loading.py
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.types import ArrayType, DoubleType, BooleanType
from pyspark.sql.functions import input_file_name, col, array_contains

spark = SparkSession.builder.appName("testing..").enableHiveSupport().getOrCreate()

df_schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4", StringType(), True),
    StructField("col5", StringType(), True),
    StructField("filename", StringType(), True),
    StructField("YEARKEY", StringType(), True),
    StructField("MONTHKEY", StringType(), True),
    StructField("DAYKEY", StringType(), True)
])

dsCSV = spark.read.format("csv").options(header='False', delimiter=';').schema(df_schema).load("/user/test/processing/data/out").withColumn("filename", input_file_name())
dsCSV.registerTempTable("cdr_data")

df_insert = spark.sql("select * from cdr_data")
df_insert.write.option("compression", "snappy").mode('append').format('parquet').partitionBy("yearkey", "monthkey", "daykey").saveAsTable('landing.test_loading')

dsCSV.unpersist()
dsCSV.unpersist(True)
df_insert.unpersist()
df_insert.unpersist(True)

-- cdr_hivesql.sh
v_history_records="insert into staging.test_loading select * from landing.test_loading"
echo "====================>>>`date +%Y%m%d%H%M%S`<<<====================="
echo ""
echo $v_history_records
hive -e "$v_history_records;"

Note:
-- landing.test_loading (Parquet format)
-- staging.test_loading (ORC format)
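For reference, a minimal sketch of one possible alternative (an assumption, not a tested recommendation): have Spark write the ORC table directly and skip both the intermediate Parquet table and the separate hive -e insert. Table, path, and column names are taken from the post above; whether staging.test_loading can be appended to as a Spark-managed ORC table is an assumption.

# Sketch only: load the CSV files and append straight into the ORC table,
# removing step 3 (the hive -e insert) entirely.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("direct-orc-load").enableHiveSupport().getOrCreate()

# Same columns as df_schema in loading.py above
df_schema = StructType([StructField(c, StringType(), True) for c in
    ["col1", "col2", "col3", "col4", "col5", "filename", "YEARKEY", "MONTHKEY", "DAYKEY"]])

df = (spark.read.format("csv")
      .options(header='False', delimiter=';')
      .schema(df_schema)
      .load("/user/test/processing/data/out")
      .withColumn("filename", input_file_name()))

(df.write
   .mode('append')
   .format('orc')                                   # write ORC directly instead of Parquet
   .option("compression", "snappy")
   .partitionBy("yearkey", "monthkey", "daykey")
   .saveAsTable('staging.test_loading'))            # assumes Spark may manage/append to this table

If the existing staging.test_loading is a Hive-managed ACID table, Spark may not be able to append to it directly; in that case writing ORC files to an external location and adding the partitions in Hive would be the usual workaround.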
Labels:
- Apache Ambari
- Apache Hive
- Apache Spark
04-06-2022
06:32 PM
Hello @steven-matison, could you provide the value of each of those properties?
04-05-2022
02:54 AM
Hello Team,
I have the scenario below and need your help deciding which processors to use in Apache NiFi.
I want to pull files from an FTP server in near real time and put them into HDFS, but after pulling the files I want to move them to another path on the same FTP server, or rename them (e.g. to a .tmp extension).
Please help me to design this data flow.
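A minimal flow sketch, assuming NiFi's standard FTP processors (processor and property names from memory, so please verify them in your NiFi version):

ListFTP -> FetchFTP -> PutHDFS

with FetchFTP's Completion Strategy set to Move File and its Move Destination Directory pointing at the archive path on the same FTP server (or Delete File if the source copies are not needed). ListFTP keeps state about already-listed files, so new files are picked up in near real time without re-pulling old ones.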
Thanks,
Regards,
Labels:
- Apache NiFi
- HDFS
01-22-2022
11:44 PM
Hi, do you mean it exists in the Hive CDH source?
01-20-2022
07:57 PM
Hello everyone, I want to use the Hive class library "org.apache.hadoop.hive.ql.udf". Could you tell me where I can download it? Regards,
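A hedged pointer rather than a definite answer: as far as I know, the org.apache.hadoop.hive.ql.udf classes ship inside the hive-exec jar, so there is no separate download. A quick way to locate the jar on a cluster node (the path below is an assumption based on the HDP layout; adjust for your distribution):

# Sketch only: look for the hive-exec jar that bundles org.apache.hadoop.hive.ql.udf
ls /usr/hdp/current/hive-client/lib/hive-exec-*.jar
# Alternatively, pull org.apache.hive:hive-exec from Maven Central for development.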
Labels:
- Apache Hadoop
- Apache Hive