Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 13410 | 02-20-2018 12:33 PM |
| | 1523 | 02-19-2018 05:12 AM |
| | 1871 | 12-28-2017 06:13 AM |
| | 7169 | 09-28-2017 09:25 AM |
| | 12210 | 09-25-2017 11:19 AM |
07-21-2017
06:07 AM
@Xin Yang
Could you share the DDL of table t1? Use this command to get it: show create table t1;
07-19-2017
07:28 PM
@mqureshi I have tried it, but I get the same result: I'm unable to group based only on genres. In SQL I would simply use select genres, count(*) from table_name group by genres; I'd like to implement the same through PySpark, but I'm stuck here. Any help would be much appreciated.
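A minimal sketch of that GROUP BY with the RDD API, assuming the movies.csv layout from the original question (movieId,title,genres with a header row), a naive comma split (titles must contain no commas), and a pyspark shell where sc already exists:

```python
movies = sc.textFile("hdfs:///data/spark/movies.csv")
header = movies.first()

genre_counts = (movies
    .filter(lambda line: line != header)     # drop the header row
    .map(lambda line: line.split(",")[2])    # keep only the genres column
    .map(lambda genres: (genres, 1))         # pair each genres string with 1
    .reduceByKey(lambda a, b: a + b))        # SQL equivalent: GROUP BY genres

print(genre_counts.collect())                # [(genres_string, count), ...]
```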
07-19-2017
06:25 PM
@mqureshi I don't think that's the issue here. I'm able to perform actions like count(), collect(), and take() over tags.
07-19-2017
06:09 PM
@Varun R Below are a few parameters one would generally use. Of course, other configuration settings can also be modified depending on the logic.

Enable compression in Hive:
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;

Join optimizations:
hive.auto.convert.join
hive.auto.convert.join.noconditionaltask
hive.optimize.bucketmapjoin

Avoid ORDER BY and try to use SORT BY instead.

Vectorized execution:
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.execution.reduce.groupby.enabled=true;

Statistics:
hive.compute.query.using.stats
hive.stats.fetch.partition.stats
hive.stats.fetch.column.stats
hive.stats.autogather
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS id, dept;
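For what it's worth, the statistics piece carries over to Spark SQL as well; a minimal sketch from PySpark, assuming Spark 2.2+ with Hive support and that the employee table from the example above exists:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-stats-demo")
         .enableHiveSupport()   # talk to the Hive metastore
         .getOrCreate())

# Table-level statistics (row count, size in bytes) for the optimizer.
spark.sql("ANALYZE TABLE employee COMPUTE STATISTICS")
# Column-level statistics, used by the cost-based optimizer in Spark 2.2+.
spark.sql("ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS id, dept")
```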
07-19-2017
06:04 PM
I have just started learning PySpark. I have structured data in the format below:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy

I want to get the count of movies based on genres.

movies = sc.textFile("hdfs://data/spark/movies.csv")
moviesgroup = movies.countByValue()

If I use the code above, it groups the data on all the columns. Is there a way to group based on a particular column? In this case the grouping has to be done on genres, and I want the genre and count to be stored in an RDD. Could someone help with this?
Labels:
- Apache Hadoop
- Apache Spark
07-19-2017
05:33 PM
Thanks @mqureshi. That's how I have implemented it. I just wanted to understand why the code above doesn't work.
07-19-2017
05:04 PM
I have a csv file in this format:

tagId,tag
1,007
2,007 (series)
3,18th century
4,1920s
5,1930s

The first line is the header. I'm using the code below to remove the header in PySpark, but it throws an error. Could someone help me with that?

Code:

import csv
tags = sc.textFile("hdfs:///data/spark/genome-tags.csv")
tagsheader = tags.first()
tagsdata = tags.subtract(tagsheader)

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1975, in subtract
    rdd = other.map(lambda x: (x, True))
AttributeError: 'unicode' object has no attribute 'map'
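For context, the traceback arises because first() returns a plain string, while subtract() expects another RDD. A common workaround, sketched here with the same file and variable names, is to filter the header out instead:

```python
tags = sc.textFile("hdfs:///data/spark/genome-tags.csv")
tagsheader = tags.first()  # a string, not an RDD, so subtract() cannot use it
tagsdata = tags.filter(lambda line: line != tagsheader)  # keep non-header rows
```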
Labels:
- Apache Hadoop
- Apache Spark
06-22-2017
06:52 AM
@Smart Solutions If the daily source data is smaller than the block size, then each day's load creates a small file (below the block size), so many small files accumulate over time. Once the data is loaded into the table, the merge property has no effect on it until you load the same table again; it only matters when the source contains many small files at the time the job is triggered. As for the other question: since the first table has 1200 small files and merge.mapredfiles is set to true, the mapper will read and combine as many files as it can whose size is below the block size. So once the MapReduce job completes, it merges as many files as possible and pushes the result into the Hive table. Hope it helps!!
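As an illustrative sketch (the warehouse path is hypothetical, and this reaches the Hadoop FileSystem API through PySpark's private _jsc/_jvm gateways), one way to check how many files in a table directory actually fall below the block size before enabling the merge settings:

```python
# Run inside a pyspark shell, where sc is already defined.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

table_dir = Path("/apps/hive/warehouse/mydb.db/mytable")  # hypothetical path
block_size = fs.getDefaultBlockSize(table_dir)

statuses = fs.listStatus(table_dir)
small = [s for s in statuses if s.isFile() and s.getLen() < block_size]
print("%d of %d files are smaller than the block size"
      % (len(small), len(statuses)))
```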
06-19-2017
06:08 PM
I don't think a window function works better when the data is huge. When you use a windowing function (in particular one without a PARTITION BY clause), all the data gets accumulated in a single reducer, which ends up causing a performance issue. I would suggest going with the distinct/group by approach you mentioned to avoid such issues.
06-15-2017
07:12 AM
2 Kudos
@Simran Kaur Try this: hive -hiveconf dt=current_date -e 'select ${hiveconf:dt};' Note that ${hiveconf:dt} expands to the literal text current_date, which Hive then evaluates as the built-in function. Hope it helps!!