Member since: 06-29-2016
Posts: 30
Kudos Received: 3
Solutions: 0
07-13-2016
04:57 AM
I want to concatenate the non-empty values in a column after grouping by some key. E.g., suppose I have the following DataFrame: df.show()
+---+---+----+
| id|num|num2|
+---+---+----+
| 1| 3| 5|
| 2| 3| 4|
| 1| | 2|
| 1| 10| 0|
+---+---+----+
I want to groupBy "id" and concatenate "num" together. Right now, I have this: df.groupBy($"id").agg(concat_ws(DELIM, collect_list($"num")))
This concatenates by key but doesn't exclude empty strings. Is there a way to specify, in the Column argument of concat_ws() or collect_list(), that certain strings should be excluded? Thank you!
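For reference, one workaround I'm considering (untested): map empty strings to null first, since collect_list() drops nulls. This assumes DELIM is my separator constant and the implicits for the $ syntax are imported.

import org.apache.spark.sql.functions._

// Empty strings become null (when() without otherwise() yields null),
// and collect_list() skips nulls, so only non-empty values remain.
// (=!= is the Spark 2.x not-equal operator; on 1.x it's !==.)
val result = df
  .withColumn("num", when($"num" =!= "", $"num"))
  .groupBy($"id")
  .agg(concat_ws(DELIM, collect_list($"num")).as("nums"))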
Labels: Apache Falcon, Apache Spark
07-12-2016
02:31 AM
I'm loading a DataFrame with a timestamp column, and I want to extract the month and year from the values in that column. When specifying a field as TimestampType in the schema, I found that only text in the form "yyyy-MM-dd HH:mm:ss" works without giving an error. Is there a way to specify the format when reading in a CSV file, like "MM/dd/yyyy HH:mm:ss"? If not, and we have to specify the field as StringType, is there a way to convert the format my time is in to the JDBC format? Would this be inefficient compared to just substringing the timestamp as a StringType? Thank you!
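In case it helps frame the question, here is the StringType route I'm imagining, assuming a column named "time" (name hypothetical) and that unix_timestamp() accepts a SimpleDateFormat pattern:

import org.apache.spark.sql.functions._

// Parse the custom format, then cast the epoch seconds back to a timestamp.
val withTs = df.withColumn(
  "ts", unix_timestamp($"time", "MM/dd/yyyy HH:mm:ss").cast("timestamp"))

// Month and year can then be extracted directly.
val result = withTs
  .withColumn("month", month($"ts"))
  .withColumn("year", year($"ts"))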
Labels: Apache Spark
07-05-2016
09:51 PM
Since I'll be working with gigabytes to terabytes of data, I think a native datatype would be best. Thank you!
07-05-2016
09:42 PM
The queries involve:
- grouping by user_uid and getting a count of rows (a rough sketch below)
- joins (all kinds of joins)
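For concreteness, rough sketches of the two query shapes (column name taken from my schema; "other" is a stand-in for the second table):

// Count of rows per user.
val counts = df.groupBy($"user_uid").count()

// Joins of various types, e.g. a left outer join on the user id.
val joined = df.join(other, Seq("user_uid"), "left_outer")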
07-05-2016
09:21 PM
I'm doing some data processing with tabular data and have noticed that a few of my fields are not Strings, but the processing I do does not require the fields to be non-Strings. E.g.:

|-- user_uid: Int (nullable = true)
|-- labelVal: Int (nullable = true)
|-- probability_score: Double (nullable = true)
|-- real_labelVal: Double (nullable = false)

I know stylistically it's better to have each field be the correct type, but from an efficiency standpoint, is it more computationally expensive or worse on storage to keep every field as a String?
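If native types do win out, a minimal sketch of the conversion I'd run, using the field names above (cast() takes a type-name string):

val typed = df
  .withColumn("user_uid", $"user_uid".cast("int"))
  .withColumn("labelVal", $"labelVal".cast("int"))
  .withColumn("probability_score", $"probability_score".cast("double"))
  .withColumn("real_labelVal", $"real_labelVal".cast("double"))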
Labels: Apache Spark
07-02-2016
05:07 PM
Does this also hold for other methods besides JOIN? E.g., I want to do a groupBy. Should I select() before the groupBy()?
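To illustrate what I mean, with hypothetical columns id and amount:

import org.apache.spark.sql.functions._

// Prune to just the columns the aggregation needs before grouping.
val result = df.select($"id", $"amount").groupBy($"id").agg(sum($"amount"))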
07-02-2016
01:47 PM
2 Kudos
Concerning memory usage and efficiency: when joining two large tables that have many columns, but I only want a few columns from each, is it better to select() before or after the join()? My instinct tells me to select() before the join, but other perspectives would be very helpful.
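To make the two orderings concrete (hypothetical tables a and b sharing a join key "id", with payload columns x and y):

// Option 1: prune columns first, then join.
val early = a.select($"id", $"x").join(b.select($"id", $"y"), "id")

// Option 2: join first, then prune.
val late = a.join(b, "id").select($"id", $"x", $"y")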
Labels: Apache Spark
06-30-2016
06:46 PM
I want to pick and choose only a subset of the columns of a dataframe / table, given some Array of Columns. E.g., given a table with columns [a,b,c,d,e], I want to keep [a,c,e]. I see that I could either drop all columns that are not in my Array, or select the columns that are in my Array. My question is: which one is preferred efficiency-wise? Or does it even matter? Thank you.
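For reference, the select() variant I have in mind, given an Array of column names (values hypothetical):

import org.apache.spark.sql.functions.col

val keep = Array("a", "c", "e")
val subset = df.select(keep.map(col): _*)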
Labels: Apache Spark
06-29-2016
07:35 PM
I'm currently trying to join two DataFrames together while retaining the original order of one of the DataFrames.
From http://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order, it seems that (correct me if this is inaccurate because I'm new to Spark) joins do not preserve order because rows are joined / "arrive" at the final dataframe not in a specified order due to the data being in different partitions.
How could one perform a join of two DataFrames while preserving the order of one table?
E.g.,

col1 col2
0    b
1    a

joined with

col2 col3
a    x
b    y

on col2 should give

col1 col2 col3
0    b    y
1    a    x

ordered using the first table. I've heard some things about using coalesce() or repartition(), but I'm not sure. Any suggestions/methods/insights are appreciated.
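One idea I've seen suggested (untested, and df1/df2 stand in for my two tables): tag the left table with an increasing id before the join, then sort by it afterwards.

import org.apache.spark.sql.functions._

// The generated ids are not consecutive, but they grow with partition
// index and row position, so sorting by them should restore the pre-join
// order. (On Spark 1.x the function is monotonicallyIncreasingId().)
val left = df1.withColumn("row_order", monotonically_increasing_id())
val restored = left.join(df2, "col2").orderBy("row_order").drop("row_order")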
Labels: Apache Spark