Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Using countByValue() for a particular column in pyspark

avatar

I have just started learning pyspark. I have a structured data in the below format.

movieId,title,genres

1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy

2,Jumanji (1995),Adventure|Children|Fantasy

3,Grumpier Old Men (1995),Comedy|Romance

4,Waiting to Exhale (1995),Comedy|Drama|Romance

5,Father of the Bride Part II (1995),Comedy

I wanted to get the count of movies based on genres.

movies = sc.textFile("hdfs://data/spark/movies.csv")

moviesgroup = movies.countByValue()

If I use above code then its grouping the data on all the columns. Is there a way to group based on a particular column. In the above case grouping has to be done based on genre and I wanted genre and count to be stored in the RDD. Could someone help on the same?

3 REPLIES 3

avatar
Super Guru
@Bala Vignesh N V

Have you tried doing groupByKey(), reduceByKey or aggregate()?

avatar

@mqureshi

I have tried it but still the same result. Im unable to group only based on genres.

If it is in a query i would have gone with select genres,count(*) from table_name group by genres. I would like to implement the same through pyspark. But stuck here. Any help would be appreciated much.

avatar
Expert Contributor
@Bala Vignesh N V

Your problem statement can be interpreted in two ways.

The first (and for me more logical) way is that a movie has multiple genres, and you want to count how many movies each genre has:

genres = movies.flatMap(lambda line: line.split(',')[2].split('|'))
genres.countByValue()

We map each lines into multiple output items (genres), that why we use flatMap. First, we split each line by ',' and get the 3rd column, then we split the genres by '|' and omit them.

This gives you:

'Adventure': 2,
'Animation': 1,
'Children': 2,
'Comedy': 4,
'Drama': 1,
'Fantasy': 2,
'Romance': 2

Your 'SQL' query (select genres, count(*)) suggests another approach: if you want to count the combinations of genres, for example movies that are Comedy AND Romance. In that case you can simply use:

genre_combinations = movies.map(lambda line: line.split(',')[2])
genre_combinations.countByValue()

This gives you:

'Adventure|Animation|Children|Comedy|Fantasy': 1,
'Adventure|Children|Fantasy': 1,
'Comedy': 1,
'Comedy|Drama|Romance': 1,
'Comedy|Romance': 1