Member since 05-02-2017

- Posts: 360
- Kudos Received: 65
- Solutions: 22

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 15719 | 02-20-2018 12:33 PM |
|  | 2050 | 02-19-2018 05:12 AM |
|  | 2382 | 12-28-2017 06:13 AM |
|  | 7925 | 09-28-2017 09:25 AM |
|  | 13515 | 09-25-2017 11:19 AM |
			
    
	
		
		
01-30-2018 05:13 AM

Hi, I have a set of questions about Spark that I'm trying to understand, listed below:

- What is the best compression codec that can be used in Spark? In Hadoop we should not use gz compression unless it is cold data, since input splits are of little use with gzip. If we were to choose any of the other codecs (lzo/bzip2/snappy, etc.), based on what parameters should we choose between them?
- Does Spark make use of input splits if the files are compressed?
- How does Spark handle compression compared with MR?
- Does compression increase the amount of data being shuffled?

Thanks in advance!!
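For reference, a minimal sketch (assuming a SparkSession as in spark-shell; the paths and codec values are illustrative assumptions, not recommendations) of the two places a codec choice usually appears in Spark: the internal/shuffle data and the files written out.

```scala
import org.apache.spark.sql.SparkSession

// spark.io.compression.codec controls compression of Spark's internal data
// (shuffle outputs, spills, broadcast blocks), not the input files being read.
val spark = SparkSession.builder()
  .appName("compression-sketch")
  .config("spark.io.compression.codec", "lz4")
  .getOrCreate()

// Output-file compression is chosen per writer; snappy is a common choice for
// columnar formats such as Parquet/ORC. The paths here are placeholders.
val df = spark.read.json("/tmp/input")
df.write.option("compression", "snappy").parquet("/tmp/output")
```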
						
					
Labels: Apache Hadoop, Apache Spark
    
	
		
		
01-05-2018 05:26 AM

This would also work:

```scala
import java.io.File

def getListOfFiles(dir: File): List[File] =
  dir.listFiles.filter(_.isFile).toList

val files = getListOfFiles(new File("/tmp"))
```
						
					
01-05-2018 05:23 AM

@Chaitanya D This is possible with a combination of Unix commands and Spark.

```
hadoop fs -ls /filedirectory/*txt_processed
```

The above command will return the files you need; then pass the result to Spark and process the files as needed. Alternatively, you can capture the listing from within Spark using the command below:

```scala
import sys.process._

val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://filedirectory/*txt_processed").!!
```

Hope it helps!
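If the goal is simply to process the matched files, a minimal sketch (assuming plain text files and a spark-shell session; the path is a placeholder) is to hand the glob pattern straight to Spark's reader rather than shelling out first:

```scala
// Spark's built-in readers accept glob patterns, so the shell listing step can be
// skipped when the files only need to be read. Path and format are assumptions.
val processed = spark.read.textFile("hdfs:///filedirectory/*txt_processed")
processed.show(5, truncate = false)
```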
						
					
01-03-2018 12:05 PM

@Alexandros Biratsis I believe you are not using INSERT OVERWRITE when inserting the incremental records into the target; assuming that, it is odd that the data is being overridden.

For the union part: if you want to avoid the union, you may have to perform a left join between the incremental data and the target to apply the transformations (assuming you are performing SCD type 1), as sketched below. If you just want to append the data, insert the incremental data into the target through multiple queries, one by one. But if you insert the data multiple times, the number of jobs will increase, which is more or less equal to performing the union.

Sorry for the late reply.
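A minimal sketch of the join-based merge described above, assuming a SparkSession with Hive support (as in spark-shell) and hypothetical tables keyed on "id" with a single payload column "value"; a full outer join is used here so that brand-new keys are picked up in the same step:

```scala
import org.apache.spark.sql.functions.{coalesce, col}

// Hypothetical tables: db.target (existing records) and db.incremental (changes).
// Incremental values win over target values, i.e. SCD type 1 semantics.
val target      = spark.table("db.target").alias("t")
val incremental = spark.table("db.incremental").alias("i")

val merged = target
  .join(incremental, col("t.id") === col("i.id"), "full_outer")
  .select(
    coalesce(col("i.id"), col("t.id")).as("id"),
    coalesce(col("i.value"), col("t.value")).as("value")
  )

// Write into a work table rather than straight back into a table used in the join.
merged.write.mode("overwrite").insertInto("db.target_work")
```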
						
					
12-29-2017 06:18 AM (1 Kudo)

Hi @Alexandros Biratsis, I could see the workaround in the link you mentioned. Anyway, let me add a few points on top of it (see the sketch after this list):

- Create a work table.
- Perform a union between the target and the incremental data and insert the result into the newly created work table.
- Assuming you are using only external tables: drop the work table, then re-create the target table pointing to the work table's location, so that you avoid re-loading the target from the work table.

Hope it helps!
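A minimal sketch of those steps through spark.sql with Hive support enabled (database, table, and location names are hypothetical; ALTER TABLE ... SET LOCATION is used here as an equivalent way of repointing the external target at the work table's data, rather than dropping and re-creating it):

```scala
// 1. Load the work table with the union of the current target and the incremental data.
spark.sql("""
  INSERT OVERWRITE TABLE db.target_work
  SELECT * FROM db.target
  UNION ALL
  SELECT * FROM db.incremental
""")

// 2. For external tables, repoint the target at the work table's data instead of
//    copying it back. The location below is a placeholder.
spark.sql("ALTER TABLE db.target SET LOCATION 'hdfs:///data/db/target_work'")
```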
						
					
12-28-2017 06:13 AM

@Sebastien F The background execution of Tez and MR has many similarities; the differences lie in where the data is placed to transform it. Tez uses a DAG to process the data, whereas MR does not. This link would answer your question. Hope it helps!!
						
					
12-27-2017 09:44 AM

@Ashnee Sharma How many executors are in place? Also, are you firing the query in spark-sql directly? What is the size of the table you are fetching? Try increasing the number of partitions manually instead of letting Spark decide it; the number of partitions can be chosen based on the table size that has to be split across the executors (see the sketch below).

Also set spark.driver.maxResultSize, in any of the following ways:

- via SparkConf: `conf.set("spark.driver.maxResultSize", "3g")`
- via spark-defaults.conf: `spark.driver.maxResultSize 3g`
- when calling spark-submit: `--conf spark.driver.maxResultSize=3g`

I believe the above property should work. I can see that you have already increased the driver size; if so, ignore the driver-size change.
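A minimal sketch of increasing the partition count manually (the table name and partition count are hypothetical; a suitable number depends on the table size and the executor cores available):

```scala
// More partitions after shuffles; the default is often too low for large tables.
spark.conf.set("spark.sql.shuffle.partitions", "400")

// Explicitly repartition the input so the work is split across the executors.
val df = spark.table("db.big_table").repartition(400)
df.count()
```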
						
					
12-20-2017 09:44 AM

@Sandeep SIngh No, Hive doesn't maintain any lock history.

```
show locks;
```

The above command shows the user who currently holds a lock on a table in Hive. However, once the lock is released you will no longer be able to see which user had acquired it; no history of locks is recorded, as it is not needed for any computation. Hope it helps!
						
					
12-20-2017 08:46 AM

Hi @Ashnee Sharma, based on the logs I can see that when you run a count query it triggers a MapReduce job, which takes time. Could you run this command and verify that its value is true?

```
set hive.stats.fetch.column.stats;
```

When this property is enabled, the statistics are fetched from the information available in the metastore, so a count query should not trigger any jobs. It should work regardless of whether you are using mr or tez as your execution engine. Hope it helps!!
						
					
12-20-2017 06:09 AM

Got it! Thanks @James Dinkel
						
					