Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 13410 | 02-20-2018 12:33 PM |
| | 1523 | 02-19-2018 05:12 AM |
| | 1871 | 12-28-2017 06:13 AM |
| | 7169 | 09-28-2017 09:25 AM |
| | 12210 | 09-25-2017 11:19 AM |
07-21-2017
06:07 AM
@Xin Yang
Could you share the DDL of table t1? Use this command to get it: show create table t1;
07-19-2017
07:28 PM
@mqureshi I have tried it, but I get the same result: I'm unable to group based only on genres. In SQL I would simply use select genres, count(*) from table_name group by genres; I'd like to implement the same through PySpark, but I'm stuck here. Any help would be much appreciated.
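A minimal sketch of that GROUP BY with the RDD API, assuming the movies.csv layout from the original question (movieId,title,genres with a header row), a naive comma split (titles must contain no commas), and a pyspark shell where sc already exists:

```python
movies = sc.textFile("hdfs:///data/spark/movies.csv")
header = movies.first()

genre_counts = (movies
    .filter(lambda line: line != header)     # drop the header row
    .map(lambda line: line.split(",")[2])    # keep only the genres column
    .map(lambda genres: (genres, 1))         # pair each genres string with 1
    .reduceByKey(lambda a, b: a + b))        # SQL equivalent: GROUP BY genres

print(genre_counts.collect())                # [(genres_string, count), ...]
```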
07-19-2017
06:25 PM
@mqureshi I don't think that's the issue here. I'm able to perform actions like count(), collect(), and take() over tags.
07-19-2017
06:09 PM
@Varun R Below are a few parameters one would generally use. Of course, other configuration settings can also be modified depending on the logic.

Enable compression in Hive:
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;

Join optimizations:
hive.auto.convert.join
hive.auto.convert.join.noconditionaltask
hive.optimize.bucketmapjoin

Avoid ORDER BY and try to use SORT BY instead.

Vectorized execution:
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.execution.reduce.groupby.enabled=true;

Statistics:
hive.compute.query.using.stats
hive.stats.fetch.partition.stats
hive.stats.fetch.column.stats
hive.stats.autogather
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS id, dept;
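For what it's worth, the statistics piece carries over to Spark SQL as well; a minimal sketch from PySpark, assuming Spark 2.2+ with Hive support and that the employee table from the example above exists:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-stats-demo")
         .enableHiveSupport()   # talk to the Hive metastore
         .getOrCreate())

# Table-level statistics (row count, size in bytes) for the optimizer.
spark.sql("ANALYZE TABLE employee COMPUTE STATISTICS")
# Column-level statistics, used by the cost-based optimizer in Spark 2.2+.
spark.sql("ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS id, dept")
```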
07-19-2017
06:04 PM
I have just started learning PySpark. I have structured data in the format below:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy

I want to get the count of movies based on genres.

movies = sc.textFile("hdfs://data/spark/movies.csv")
moviesgroup = movies.countByValue()

If I use the code above, it groups the data on all the columns. Is there a way to group based on a particular column? In this case the grouping has to be done on genres, and I want the genre and count to be stored in an RDD. Could someone help with this?
Labels:
- Apache Hadoop
- Apache Spark
07-19-2017
05:33 PM
Thanks @mqureshi. That's how I have implemented it. I just wanted to understand why the code above doesn't work.
07-19-2017
05:04 PM
I have a csv file in this format:

tagId,tag
1,007
2,007 (series)
3,18th century
4,1920s
5,1930s

The first line is the header. I'm using the code below to remove the header in PySpark, but it throws an error. Could someone help me with that?

Code:

import csv
tags = sc.textFile("hdfs:///data/spark/genome-tags.csv")
tagsheader = tags.first()
tagsdata = tags.subtract(tagsheader)

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1975, in subtract
    rdd = other.map(lambda x: (x, True))
AttributeError: 'unicode' object has no attribute 'map'
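For context, the traceback arises because first() returns a plain string, while subtract() expects another RDD. A common workaround, sketched here with the same file and variable names, is to filter the header out instead:

```python
tags = sc.textFile("hdfs:///data/spark/genome-tags.csv")
tagsheader = tags.first()  # a string, not an RDD, so subtract() cannot use it
tagsdata = tags.filter(lambda line: line != tagsheader)  # keep non-header rows
```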
Labels:
- Apache Hadoop
- Apache Spark
06-22-2017
06:52 AM
@Smart Solutions If the daily source data is smaller than the block size, then each day's load creates a small file (below the block size), so many small files accumulate over time. Once the data is loaded into the table, the merge property has no effect on it until you load the same table again; it only matters when the source contains many small files at the time the job is triggered. As for the other question: since the first table has 1200 small files and merge.mapredfiles is set to true, the mapper will read and combine as many files as it can whose size is below the block size. So once the MapReduce job completes, it merges as many files as possible and pushes the result into the Hive table. Hope it helps!!
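As an illustrative sketch (the warehouse path is hypothetical, and this reaches the Hadoop FileSystem API through PySpark's private _jsc/_jvm gateways), one way to check how many files in a table directory actually fall below the block size before enabling the merge settings:

```python
# Run inside a pyspark shell, where sc is already defined.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

table_dir = Path("/apps/hive/warehouse/mydb.db/mytable")  # hypothetical path
block_size = fs.getDefaultBlockSize(table_dir)

statuses = fs.listStatus(table_dir)
small = [s for s in statuses if s.isFile() and s.getLen() < block_size]
print("%d of %d files are smaller than the block size"
      % (len(small), len(statuses)))
```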
06-19-2017
06:08 PM
I don't think a window function works better when the data is huge. When you use a windowing function (in particular one without a PARTITION BY clause), all the data gets accumulated in a single reducer, which ends up causing a performance issue. I would suggest going with the distinct/group by approach you mentioned to avoid such issues.
06-15-2017
07:12 AM
2 Kudos
@Simran Kaur Try this: hive -hiveconf dt=current_date -e 'select ${hiveconf:dt};' Note that ${hiveconf:dt} expands to the literal text current_date, which Hive then evaluates as the built-in function. Hope it helps!!