Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22

My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 13344 | 02-20-2018 12:33 PM |
|  | 1500 | 02-19-2018 05:12 AM |
|  | 1859 | 12-28-2017 06:13 AM |
|  | 7135 | 09-28-2017 09:25 AM |
|  | 12163 | 09-25-2017 11:19 AM |
01-30-2018
05:13 AM
Hi, I have a set of questions about compression in Spark that I am trying to understand, listed below:
- What is the best compression codec to use in Spark? In Hadoop we should not use gz compression unless it is cold data, where input splits are of little use. But if we choose another compression format (lzo/bzip2/snappy, etc.), on what parameters should that choice be based?
- Does Spark make use of input splits if the files are compressed?
- How does Spark handle compression compared with MR?
- Does compression increase the amount of data being shuffled?
Thanks in advance!!
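For context, a minimal sketch of where a codec choice shows up in a Spark job (assuming Spark 2.x, a SparkSession named spark, and hypothetical paths; the codec values are only examples):

import org.apache.spark.sql.SparkSession

// Shuffle and spill blocks are compressed with spark.io.compression.codec (lz4, lzf, snappy).
val spark = SparkSession.builder()
  .appName("compression-demo")
  .config("spark.io.compression.codec", "snappy")
  .getOrCreate()

// Output compression is chosen per writer; for Parquet the codec is set via an option.
val df = spark.range(1000).toDF("id")
df.write.option("compression", "snappy").parquet("/tmp/compressed_parquet")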
Labels:
- Apache Hadoop
- Apache Spark
01-05-2018
05:26 AM
This would also work:

import java.io.File

def getListOfFiles(dir: File): List[File] =
  dir.listFiles.filter(_.isFile).toList

// List the plain files directly under /tmp.
val files = getListOfFiles(new File("/tmp"))
01-05-2018
05:23 AM
@Chaitanya D This is possible with a combination of Unix commands and Spark.

hadoop fs -ls /filedirectory/*txt_processed

The above command returns the files you need; pass the result to Spark and process them as required. Alternatively, you can run the listing from within Spark using the command below.

import sys.process._
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://filedirectory/*txt_processed").!!

Hope it helps!
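If you prefer to stay inside the JVM instead of shelling out, a minimal sketch using the Hadoop FileSystem API (the path pattern is the same hypothetical one as above; assumes an existing SparkSession named spark):

import org.apache.hadoop.fs.{FileSystem, Path}

// Glob the processed files directly through the Hadoop client configuration Spark already holds.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val matched = fs.globStatus(new Path("/filedirectory/*txt_processed"))
  .map(_.getPath.toString)   // keep just the fully qualified paths
matched.foreach(println)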
01-03-2018
12:05 PM
@Alexandros Biratsis I believe you are not using INSERT OVERWRITE when inserting the incremental records into the target; assuming that is the case, it is strange that the data is being overwritten. Regarding the union part: if you want to avoid the union, you may have to perform a left join between the incremental data and the target and apply your transformations there (assuming you are doing an SCD Type 1 load); a sketch is shown below. If you only want to append the data, insert the incremental data into the target through multiple queries, one by one. But if you insert the data multiple times, the number of jobs grows, and the cost ends up more or less equal to performing the union. Sorry for the late reply.
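A minimal sketch of that join idea in Spark, here expressed with a full outer join so that brand-new incremental rows are kept as well (the table names, the key column id, and the value column are hypothetical; assumes a SparkSession named spark with Hive support):

import org.apache.spark.sql.functions.coalesce

val target      = spark.table("db.target_table")
val incremental = spark.table("db.incremental_table")

// SCD Type 1: where an incremental row exists its values win, otherwise the target values are kept.
val merged = target
  .join(incremental, target("id") === incremental("id"), "full_outer")
  .select(
    coalesce(incremental("id"), target("id")).as("id"),
    coalesce(incremental("value"), target("value")).as("value")
  )

// Write to a work table first, since the target table is also an input of this job.
merged.write.mode("overwrite").saveAsTable("db.target_work")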
12-29-2017
06:18 AM
1 Kudo
Hi @Alexandros Biratsis I could see the workaround in the link you mentioned. Anyway, let me add a few points on top of it (a sketch follows below):
- Create a work table.
- Perform a union between the target and the incremental data and insert the result into the newly created work table.
- Assuming you are using only external tables: drop the work table (the data files stay in place) and re-create the target table pointing to the work table's location, so that you avoid re-loading the target from the work table.
Hope it helps!
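A minimal sketch of those steps through Spark SQL against Hive (the table names, columns, storage format, and HDFS location are all hypothetical; assumes a SparkSession named spark with Hive support enabled):

// 1. Create an external work table at a hypothetical location.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS db.target_work (id INT, value STRING)
  STORED AS ORC
  LOCATION '/data/db/target_work'
""")

// 2. Union the current target with the incremental data into the work table.
spark.sql("""
  INSERT OVERWRITE TABLE db.target_work
  SELECT id, value FROM (
    SELECT id, value FROM db.target
    UNION ALL
    SELECT id, value FROM db.incremental
  ) u
""")

// 3. Drop both external tables (metadata only, the files stay) and re-create the target
//    over the work table's location, so the target is not reloaded from the work table.
spark.sql("DROP TABLE db.target_work")
spark.sql("DROP TABLE db.target")
spark.sql("""
  CREATE EXTERNAL TABLE db.target (id INT, value STRING)
  STORED AS ORC
  LOCATION '/data/db/target_work'
""")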
12-28-2017
06:13 AM
@Sebastien F The background execution of Tez and MR has many similarities. The difference lies in where the data is placed while it is transformed. Tez uses a DAG to process the data, whereas MR does not. This link should answer your question. Hope it helps!!
12-27-2017
09:44 AM
@Ashnee Sharma How many executors are in place? Also, are you firing the query in spark-sql directly? What is the size of the table you are fetching? Try increasing the number of partitions manually instead of letting Spark decide it; the number of partitions can be chosen based on the table size that has to be split across the executors. Use the property below, set in any one of these ways:

Via SparkConf:
conf.set("spark.driver.maxResultSize", "3g")

Via spark-defaults.conf:
spark.driver.maxResultSize 3g

When calling spark-submit:
--conf spark.driver.maxResultSize=3g

I believe the above property should work. I can see that you have already increased the driver size; if so, ignore the driver-size change.
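A minimal sketch of both suggestions together (the table name and the partition count are hypothetical; assumes you build the session yourself rather than using plain spark-sql):

import org.apache.spark.sql.SparkSession

// Raise the driver's result-size limit, then repartition the fetched table manually.
val spark = SparkSession.builder()
  .appName("partition-tuning")
  .config("spark.driver.maxResultSize", "3g")
  .getOrCreate()

// Pick the partition count from the table size, e.g. roughly one partition per HDFS block.
val df = spark.table("db.big_table").repartition(400)
println(df.rdd.getNumPartitions)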
12-20-2017
09:44 AM
@Sandeep SIngh No, Hive doesn't maintain any lock history.

show locks;

The above command helps you find the user who currently holds a lock on a table in Hive. However, once the lock is released, you will no longer be able to see who had acquired it. No history of locks is recorded, as it is not needed for any computation. Hope it helps!
12-20-2017
08:46 AM
Hi @Ashnee Sharma Based on the logs, I can see that when you run a count query it triggers a MapReduce job, and that takes time. Could you run this command (set hive.stats.fetch.column.stats;) and verify that its value is true? When this property is enabled, the stats are fetched from the statistics stored in the metastore, so a count query will not trigger any jobs. It should work regardless of whether you are using MR or Tez as the execution engine. Hope it helps!!
12-20-2017
06:09 AM
Got it! Thanks @James Dinkel