Member since: 12-05-2017
Posts: 10
Kudos Received: 0
Solutions: 0
04-24-2018
03:59 AM
Thanks for your prompt response. Regarding your response to Query 1:
a) Can the conversion from TEXT => PARQUET be done using Impala alone, without using a Hive-specific SerDe?
b) How can I create data files in HDFS directly in Parquet format?
Regards.
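For a), something like the Impala-only CTAS below is what I have in mind (just a sketch; my_text_table and my_parquet_table are placeholder names, and it assumes the text-format table is already registered in the metastore):

-- Sketch: convert an existing text-format table to Parquet entirely in Impala.
CREATE TABLE my_parquet_table
STORED AS PARQUET
AS SELECT * FROM my_text_table;

-- Optionally refresh statistics for the new table.
COMPUTE STATS my_parquet_table;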
04-24-2018
12:07 AM
After loading CSV data into HDFS, I have seen the approach below: create a Hive external table (text file format), then create an Impala internal table (Parquet format) like the Hive external table. And it works.
My questions:
1) Why should one go this roundabout way of creating a Hive table and then an Impala table from it? Why can't we directly create an external Impala table in Parquet format (see the sketch after the sample tables below for what I mean)?
2) Is there any issue with sticking to external tables only (without any internal tables), given that my data is always bulk loaded directly into HDFS?
3) When should one use from_unixtime(unix_timestamp(as_of_date,"dd-MMM-yy"),'yyyy-MM-dd') and store the date as a string, versus storing the date as a TIMESTAMP in Impala?
-- sample external table defined below
CREATE EXTERNAL TABLE my_external_table (
  Col1       string,
  as_of_date string,
  Col3       string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
STORED AS TEXTFILE
LOCATION '/data/my_data_files'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- sample internal table defined below
CREATE TABLE my_internal_table LIKE my_external_table STORED AS PARQUET;

INSERT INTO TABLE my_internal_table
SELECT Col1,
       from_unixtime(unix_timestamp(as_of_date, 'dd-MMM-yy'), 'yyyy-MM-dd'),
       Col3
FROM my_external_table;
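For question 1), this is roughly what I would like to write directly, if it is supported (just a sketch; it assumes the files under the given path are already in Parquet format, and the table name and path are placeholders):

-- Sketch only: an external Parquet table defined directly in Impala.
CREATE EXTERNAL TABLE my_direct_parquet_table (
  Col1       string,
  as_of_date timestamp,
  Col3       string
)
STORED AS PARQUET
LOCATION '/data/my_parquet_files';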
Labels:
- Apache Hive
- Apache Impala
- HDFS
12-22-2017
03:43 AM
Can someone please help me solve this issue? It is blocking our progress.
12-10-2017
01:09 PM
I am getting the error below when using Pandas DataFrames inside PySpark transformation code. When I use Pandas DataFrames anywhere outside a PySpark transformation, they work without any problem.

Error:
ImportError: No module named indexes.base
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    ...

The error points to the line where I call the RDD.map() transformation. Sample code below:

from pyspark.context import SparkContext
import pandas

CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])
temp_vars = pandas.DataFrame(columns=['A', 'B', 'C'])

def processPeriods(period):
    global accum
    accum += 1
    temp_vars['prepay_probability'] = 0.000008
    temp_vars['CPR'] = 100 * (1 - (1 - temp_vars['prepay_probability']) ** 12)
    # return (100 * (1 - 0.000008) ** 12)
    return temp_vars['CPR']

nr_periods = 5
sc = SparkContext.getOrCreate()
periodListRDD = sc.parallelize(range(1, nr_periods))
accum = sc.accumulator(0)
rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()
print "rdd_list = ", rdd_list
CPR_loans.append(rdd_list)

Please suggest how I can make this work? Thanks a lot.
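For what it is worth, the variant below is the kind of rewrite I have been experimenting with; it keeps pandas out of the map() entirely and builds the DataFrame on the driver from plain tuples (just a sketch with placeholder values from the code above, and I am not sure whether it addresses the root cause of the ImportError):

from pyspark.context import SparkContext
import pandas

def process_period(period):
    # Placeholder probability taken from the original code above.
    prepay_probability = 0.000008
    cpr = 100 * (1 - (1 - prepay_probability) ** 12)
    return (period, cpr)

sc = SparkContext.getOrCreate()
nr_periods = 5
# Only plain Python tuples cross the driver/executor boundary here.
results = sc.parallelize(range(1, nr_periods)).map(process_period).collect()
print(results)

# Build the pandas DataFrame on the driver only, from the collected tuples.
CPR_loans = pandas.DataFrame(results, columns=["period", "CPR"])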
Labels:
- Apache Spark