I have just started learning PySpark. I have structured data in the format below.
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
I wanted to get the count of movies based on genres.
movies = sc.textFile("hdfs://data/spark/movies.csv")
moviesgroup = movies.countByValue()
If I use the above code, it groups the data on all the columns. Is there a way to group based on a particular column? In this case the grouping has to be done based on genre, and I want the genre and count to be stored in an RDD. Could someone help with this?
I have tried it but still get the same result. I'm unable to group based only on genres.
If it were a SQL query I would have gone with select genres, count(*) from table_name group by genres. I would like to implement the same through PySpark, but I'm stuck here. Any help would be much appreciated.
Your problem statement can be interpreted in two ways.
The first (and, to me, more logical) reading is that a movie has multiple genres, and you want to count how many movies each genre appears in:
genres = movies.flatMap(lambda line: line.split(',')[2].split('|'))
genres.countByValue()
We map each line into multiple output items (genres), which is why we use flatMap. First we split each line on ',' and take the 3rd column, then we split the genre string on '|' and emit each genre.
For the four sample rows above, this gives you:
'Adventure': 1, 'Animation': 1, 'Children': 1, 'Comedy': 4, 'Drama': 1, 'Fantasy': 1, 'Romance': 2
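Note that countByValue returns a plain Python dict on the driver, not an RDD; if you need the result kept as an RDD of (genre, count) pairs, you can map each genre to (genre, 1) and use reduceByKey(lambda a, b: a + b) instead. As a quick sanity check, the same split-and-count logic can be reproduced in plain Python over the sample rows (a sketch only — the real job would run the PySpark code above):

```python
from collections import Counter

# The four sample rows from the question
lines = [
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy",
    "3,Grumpier Old Men (1995),Comedy|Romance",
    "4,Waiting to Exhale (1995),Comedy|Drama|Romance",
    "5,Father of the Bride Part II (1995),Comedy",
]

# Split each line on ',', take the 3rd column, then split on '|'
# (titles containing commas would need a real CSV parser)
genre_counts = Counter(
    genre
    for line in lines
    for genre in line.split(',')[2].split('|')
)

print(genre_counts["Comedy"])   # 4
print(genre_counts["Romance"])  # 2
```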
Your SQL query (select genres, count(*) ... group by genres) suggests another reading: you want to count the combinations of genres, for example movies that are Comedy AND Romance. In that case you can simply use:
genre_combinations = movies.map(lambda line: line.split(',')[2])
genre_combinations.countByValue()
For the sample rows, this gives you:
'Adventure|Animation|Children|Comedy|Fantasy': 1, 'Comedy': 1, 'Comedy|Drama|Romance': 1, 'Comedy|Romance': 1
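As a cross-check on this second reading, the exact SQL from the question produces the same combination counts. Here it is run against the sample rows using Python's built-in sqlite3 (no Spark needed; the table name and columns are assumptions for illustration):

```python
import sqlite3

# Sample rows: (movieId, title, genres)
rows = [
    (1, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    (3, "Grumpier Old Men (1995)", "Comedy|Romance"),
    (4, "Waiting to Exhale (1995)", "Comedy|Drama|Romance"),
    (5, "Father of the Bride Part II (1995)", "Comedy"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER, title TEXT, genres TEXT)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)

# The query from the question: group by the raw genre string
counts = dict(conn.execute(
    "SELECT genres, COUNT(*) FROM movies GROUP BY genres"))

print(counts["Comedy|Romance"])  # 1
```

Each of the four genre strings is distinct in this small sample, so every combination has a count of 1 — which is exactly what map + countByValue returns above.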