Archives of Support Questions (Read Only)

06-13-2017

I was wondering how to do an approximate count of a DataFrame without converting it to an RDD in Spark 1.6.

Is there a hack that makes this possible?

If anyone has a solution, please let me know. Thanks.

jfrazee · ‎07-07-2017

@elliot gimple I know it's not really what you want, but there's an `.rdd` method you can call on a DataFrame in 1.6, so you could just do `df.rdd.countApprox(timeout)` on that (the timeout is in milliseconds). I'd have to look at the DAG more closely, but I think the overhead is just going to be in converting the DataFrame elements to Rows, not in generating the full RDD before `countApprox` is called -- not 100% sure about that, though.
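For reference, here is a minimal PySpark sketch of that suggestion. The helper name `approx_row_count`, the `demo` function, and the timeout values are made up for illustration; the only real API calls it leans on are the DataFrame's `rdd` attribute and `RDD.countApprox(timeout, confidence=0.95)`. The demo assumes a local Spark 1.6-style setup with a `SQLContext`.

```python
def approx_row_count(df, timeout_ms=1000, confidence=0.95):
    """Approximate row count of a DataFrame via its underlying RDD.

    `approx_row_count` is a hypothetical helper, not part of Spark.
    """
    # RDD.countApprox returns whatever estimate is available once
    # `timeout_ms` milliseconds have elapsed, so a short timeout trades
    # accuracy for speed.
    return df.rdd.countApprox(timeout=timeout_ms, confidence=confidence)


def demo():
    # Imported lazily so the helper above can be loaded without Spark installed.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="approx-count-demo")
    sqlContext = SQLContext(sc)

    # Illustrative data only: a single-column DataFrame of one million rows.
    df = sqlContext.range(0, 1000000)
    print(approx_row_count(df, timeout_ms=500))

    sc.stop()
```

Calling `demo()` needs a working Spark installation. If the job finishes inside the timeout, the result should match the exact count; otherwise it is an estimate based on the tasks that completed in time.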

07-18-2017

Thanks, this is what I use, but I wish there were one specifically for DataFrames.

Cloudera Community

Is there a way to do a countApprox for a DataFrame (not an RDD) in Spark 1.6?