Member since: 02-15-2016
Posts: 17
Kudos Received: 4
Solutions: 0
12-16-2016
01:58 PM
Hi Gurus, it may not be a practical question, but I am wondering whether it is possible to load data into a bucketed (non-partitioned) table through INSERT OVERWRITE. I am getting a NullPointerException when I try to do so.

CREATE TABLE my_stg.mytable1 (
employee_id int,
employee_name string,
dept STRING,
country STRING
)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE my_stg.mytable1
SELECT employee_id, employee_name, dept, country FROM my_stg.mytable;

FAILED: NullPointerException null

Thanks, Soumya
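For comparison, here is a minimal sketch of the sequence that normally works on Hive 1.x, using the table names from the post; the warehouse path in the last line is an assumption and depends on your configuration:

-- Bucketed writes must be enabled in the same session before the insert
-- (Hive 2.x enforces bucketing automatically and no longer needs this).
set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE my_stg.mytable1
SELECT employee_id, employee_name, dept, country
FROM my_stg.mytable;

-- Check that 256 bucket files were produced (warehouse path is an assumption):
dfs -ls /apps/hive/warehouse/my_stg.db/mytable1;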
Labels: Apache Hive
12-10-2016
09:55 AM
Thanks. I found the explanation in the Spark SQL programming guide (http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets): "Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."
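In other words, a file like the employees example works once each line is a single self-contained JSON object (a sketch; employees_lines.json is an illustrative file name):

{"firstName":"John", "lastName":"Doe"}
{"firstName":"Anna", "lastName":"Smith"}
{"firstName":"Peter", "lastName":"Jones"}

>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees_lines.json")
>>> df.show()

df.show() should then report only the firstName and lastName columns, with no _corrupt_record.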
12-10-2016
07:18 AM
Hi All, I am trying to read a valid JSON file like the one below through Spark SQL.

{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}

My code is as follows:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
... .builder \
... .appName("Python Spark SQL basic example") \
... .config("spark.some.config.option", "some-value") \
... .getOrCreate()
>>> df = spark.read.json("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> df.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[| null| null|
| null| John| Doe|
| null| Anna| Smith|
| null| Peter| Jones|
| ]}| null| null|
+---------------+---------+--------+
>>> df.createOrReplaceTempView("employees")
>>> sqlDF = spark.sql("SELECT * FROM employees")
>>> sqlDF.show()
+---------------+---------+--------+
|_corrupt_record|firstName|lastName|
+---------------+---------+--------+
| {"employees":[| null| null|
| null| John| Doe|
| null| Anna| Smith|
| null| Peter| Jones|
| ]}| null| null|
+---------------+---------+--------+
As per my understanding, there should be only two columns, firstName and lastName. Is that understanding wrong? Why is _corrupt_record appearing, and how can I avoid it?

Thanks and Regards, Soumya
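One possible workaround on Spark versions without multi-line JSON support is to read the whole file as a single record and parse it manually; a sketch, assuming each JSON file fits in memory on an executor:

>>> import json
>>> from pyspark.sql import Row
>>> # wholeTextFiles yields (path, content) pairs, one per file
>>> raw = spark.sparkContext.wholeTextFiles("/Users/soumyabrata_kole/Documents/spark_test/employees.json")
>>> employees = raw.flatMap(lambda kv: json.loads(kv[1])["employees"])
>>> df = employees.map(lambda e: Row(**e)).toDF()
>>> df.show()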
Labels: Apache Spark
03-01-2016
08:53 AM
2 Kudos
Hi, I was trying to load a file in Pig which contains data like:

{(3),(mary),(19)}
{(1),(john),(18)}
{(2),(joe),(18)}

The following command is failing:

A = LOAD 'data3' AS (B: bag {T: tuple(t1:int), F:tuple(f1:chararray), G:tuple(g1:int)});

What is the correct way to do it? Thanks, Soumya
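For reference, a Pig bag schema admits only a single tuple schema, so the three tuples cannot be typed individually. A minimal sketch that declares one field wide enough for every value (untested against this exact file):

-- One tuple schema covers all tuples in the bag; chararray can hold
-- both the numeric and the string values.
A = LOAD 'data3' AS (B: bag {T: tuple(t1: chararray)});
DUMP A;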
Labels: Apache Pig
02-18-2016
03:31 PM
1 Kudo
Thanks, Neeraj, for your answer. However, I could not find how to enable ACID transactions from the link https://hortonworks.app.box.com/files/0/f/2070270300/1/f_37967540402, and the other links present on that page are not working either. Could you please tell me the steps to enable ACID transactions? Thanks again! Soumya
02-18-2016
03:01 PM
1 Kudo
Hi Experts, I was trying to do insert, update, and delete in a Hive table. Though insert worked for me, update and delete did not. I set the following properties before executing any DDL/DML:

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;

Then I created the following table:

CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC
TBLPROPERTIES ('transactional'='true');

The following insert worked:

INSERT INTO TABLE students
VALUES ('AA', 23, 1.28), ('BB', 32, 2.32);

The following update/delete are failing:

UPDATE students SET gpa = 3.12 WHERE name='AA';
DELETE FROM students WHERE age=32;

Could you please help me understand the issue? The Hive version is as below:

[hdfs@sandbox ~]$ hive --version
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/spark/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/spark/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Hive 1.2.1.2.3.2.0-2950
Subversion git://c66-slave-20176e25-6/grid/0/jenkins/workspace/HDP-2.3-maint-centos6/bigtop/build/hive/rpm/BUILD/hive-1.2.1.2.3.2.0 -r c67988138ca472655a6978f50c7423525b71dc27
Compiled by jenkins on Wed Sep 30 19:07:31 UTC 2015

Thanks, Soumya
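A note for anyone hitting this: per-session set commands are often not sufficient, because on HDP 2.3 the transaction manager and compactor settings generally need to live in hive-site.xml so that both HiveServer2 and the metastore pick them up. A minimal sketch of the relevant properties (standard Hive property names, values matching the post; verify against your own setup):

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>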
Labels: Apache Hive