Member since: 06-29-2016
Posts: 30
Kudos Received: 3
Solutions: 0
07-13-2016
04:57 AM
I want to concatenate the non-empty values in a column after grouping by some key. E.g., suppose I have the following DataFrame: df.show()
+---+---+----+
| id|num|num2|
+---+---+----+
| 1| 3| 5|
| 2| 3| 4|
| 1| | 2|
| 1| 10| 0|
+---+---+----+
I want to groupBy "id" and concatenate "num" together. Right now, I have this: df.groupBy($"id").agg(concat_ws(DELIM, collect_list($"num")))
This concatenates by key but doesn't exclude empty strings. Is there a way to specify, in the Column argument of concat_ws() or collect_list(), that certain strings should be excluded? Thank you!
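For reference, one workaround I'm considering (untested): map empty strings to null first, since collect_list() drops nulls. This assumes DELIM is my separator constant and the implicits for the $ syntax are imported.

import org.apache.spark.sql.functions._

// Empty strings become null (when() without otherwise() yields null),
// and collect_list() skips nulls, so only non-empty values remain.
// (=!= is the Spark 2.x not-equal operator; on 1.x it's !==.)
val result = df
  .withColumn("num", when($"num" =!= "", $"num"))
  .groupBy($"id")
  .agg(concat_ws(DELIM, collect_list($"num")).as("nums"))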
Labels: Apache Falcon, Apache Spark
07-12-2016
02:31 AM
I'm loading a DataFrame with a timestamp column, and I want to extract the month and year from the values in that column. When specifying a field as TimestampType in the schema, I found that only text in the form "yyyy-MM-dd HH:mm:ss" works without giving an error. Is there a way to specify the format when reading in a CSV file, like "MM/dd/yyyy HH:mm:ss"? If not, and we have to specify the field as StringType, is there a way to convert the format my time is in to the JDBC format? Would this be inefficient compared to just substringing the timestamp as a StringType? Thank you!
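In case it helps frame the question, here is the StringType route I'm imagining, assuming a column named "time" (name hypothetical) and that unix_timestamp() accepts a SimpleDateFormat pattern:

import org.apache.spark.sql.functions._

// Parse the custom format, then cast the epoch seconds back to a timestamp.
val withTs = df.withColumn(
  "ts", unix_timestamp($"time", "MM/dd/yyyy HH:mm:ss").cast("timestamp"))

// Month and year can then be extracted directly.
val result = withTs
  .withColumn("month", month($"ts"))
  .withColumn("year", year($"ts"))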
Labels: Apache Spark
07-05-2016
09:51 PM
Since I'll be working with gigabytes to terabytes of data, I think a native datatype would be best. Thank you!
07-05-2016
09:42 PM
The queries involve:
- grouping by user_uid and getting a count of rows (a rough sketch below)
- joins (all kinds of joins)
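For concreteness, rough sketches of the two query shapes (column name taken from my schema; "other" is a stand-in for the second table):

// Count of rows per user.
val counts = df.groupBy($"user_uid").count()

// Joins of various types, e.g. a left outer join on the user id.
val joined = df.join(other, Seq("user_uid"), "left_outer")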
07-05-2016
09:21 PM
I'm doing some data processing with tabular data and have noticed that a few of my fields are not Strings, but the processing I do does not require the fields to be non-Strings. E.g.:

|-- user_uid: Int (nullable = true)
|-- labelVal: Int (nullable = true)
|-- probability_score: Double (nullable = true)
|-- real_labelVal: Double (nullable = false)

I know stylistically it's better to have each field be the correct type, but from an efficiency standpoint, is it more computationally expensive or worse on storage to keep every field as a String?
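If native types do win out, a minimal sketch of the conversion I'd run, using the field names above (cast() takes a type-name string):

val typed = df
  .withColumn("user_uid", $"user_uid".cast("int"))
  .withColumn("labelVal", $"labelVal".cast("int"))
  .withColumn("probability_score", $"probability_score".cast("double"))
  .withColumn("real_labelVal", $"real_labelVal".cast("double"))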
Labels: Apache Spark
07-02-2016
05:07 PM
Does this also hold for other methods besides JOIN? E.g., I want to do a groupBy. Should I select() before the groupBy()?
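To illustrate what I mean, with hypothetical columns id and amount:

import org.apache.spark.sql.functions._

// Prune to just the columns the aggregation needs before grouping.
val result = df.select($"id", $"amount").groupBy($"id").agg(sum($"amount"))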
07-02-2016
01:47 PM
2 Kudos
Concerning memory usage and efficiency: when joining two large tables that have many columns, but I only want a few columns from each, is it better to select() before or after the join()? My instinct tells me to select() before the join, but other perspectives would be very helpful.
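To make the two orderings concrete (hypothetical tables a and b sharing a join key "id", with payload columns x and y):

// Option 1: prune columns first, then join.
val early = a.select($"id", $"x").join(b.select($"id", $"y"), "id")

// Option 2: join first, then prune.
val late = a.join(b, "id").select($"id", $"x", $"y")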
Labels: Apache Spark
06-30-2016
06:46 PM
I want to pick and choose only a subset of the columns of a dataframe / table, given some Array of Columns. E.g., given a table with columns [a,b,c,d,e], I want to keep [a,c,e]. I see that I could either drop all columns that are not in my Array, or select the columns that are in my Array. My question is: which one is preferred efficiency-wise? Or does it even matter? Thank you.
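For reference, the select() variant I have in mind, given an Array of column names (values hypothetical):

import org.apache.spark.sql.functions.col

val keep = Array("a", "c", "e")
val subset = df.select(keep.map(col): _*)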
Labels: Apache Spark
06-29-2016
07:35 PM
I'm currently trying to join two DataFrames together while retaining the original order of one of the DataFrames.
From http://stackoverflow.com/questions/29284095/which-operations-preserve-rdd-order, it seems that (correct me if this is inaccurate because I'm new to Spark) joins do not preserve order because rows are joined / "arrive" at the final dataframe not in a specified order due to the data being in different partitions.
How could one perform a join of two DataFrames while preserving the order of one table?
E.g.,

col1 col2
0    b
1    a

joined with

col2 col3
a    x
b    y

on col2 should give

col1 col2 col3
0    b    y
1    a    x

ordered using the first table. I've heard some things about using coalesce() or repartition(), but I'm not sure. Any suggestions/methods/insights are appreciated.
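One idea I've seen suggested (untested, and df1/df2 stand in for my two tables): tag the left table with an increasing id before the join, then sort by it afterwards.

import org.apache.spark.sql.functions._

// The generated ids are not consecutive, but they grow with partition
// index and row position, so sorting by them should restore the pre-join
// order. (On Spark 1.x the function is monotonicallyIncreasingId().)
val left = df1.withColumn("row_order", monotonically_increasing_id())
val restored = left.join(df2, "col2").orderBy("row_order").drop("row_order")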
Labels: Apache Spark