I am getting count of 2 dataframe in spark 2.2 using spark session.
1st dataframe reads data from hive table which size is 5.2 GB. when i find count it returns in 25s.
2nd dataframe reads data from set of compressed files which size is 3.8 GB. When i find count it takes 12-13mins.
Even i tried repartition on 2nd dataframe. it reduces performance since it does shuffling.
Can anyone please explain the differences . am i missing anything here.
I am using spark session to read both file and hive data.
Lets say my spark session variable is "spark" with hive support enabled .
Reading hive and compressed files table.
Val hivedf=spark.table("table name") Val filedf = spark.read.format("CSV").option("header","true").option("delimiter","|").load(filepath) hivedf.count() filedf.count()
I am a little guessing here, but I believe its possible that the Hive metastore has statistics (i.e. information on the number of records in the partitions), so that the count might actually not read the complete table. The count on the file must read the file in any case.
But still i think 12 min are really long for processing 3.8 GB, even if this is the compressed size.
Is the count the very first action on the data frame? So that Spark only executes all previous statements (i guess reading the file, uncompressing it etc) when running the count?
Hi, yes . Count is the first action. But the next step is union of both hivedf and filedf and finding count. That step aslo takes same time (25s +12 mins)12.25 mins.
Could you please give some suggestions.