How to convert RDD[List[String]] to Dataframe in Scala

New Contributor

Hello, how do I convert the RDD[List[String]] below to a DataFrame in Scala?

List(Div, Date, HomeTeam, AwayTeam, FTHG, FTAG, FTR, HTHG, HTAG, HTR, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY, HR, AR)
List(D1, 09/08/13, Bayern Munich, M'gladbach, 3, 1, H, 2, 1, H, 26, 11, 9, 4, 12, 11, 6, 4, 1, 3, 0, 0)
List(D1, 10/08/13, Augsburg, Dortmund, 0, 4, A, 0, 1, A, 11, 13, 4, 9, 18, 12, 5, 6, 2, 0, 0, 0)
List(D1, 10/08/13, Braunschweig, Werder Bremen, 0, 1, A, 0, 0, D, 13, 12, 3, 4, 10, 18, 2, 7, 1, 0, 0, 0)
List(D1, 10/08/13, Hannover, Wolfsburg, 2, 0, H, 1, 0, H, 20, 15, 8, 4, 28, 11, 7, 2, 4, 0, 0, 2)
List(D1, 10/08/13, Hertha, Ein Frankfurt, 6, 1, H, 2, 1, H, 16, 10, 9, 4, 19, 18, 5, 4, 0, 2, 0, 0)

2 REPLIES

Re: How to convert RDD[List[String]] to Dataframe in Scala

If your data is read from a CSV file like

Di,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR
D1,09/08/13,"Bayern Munich","M'gladbach",3,1,H,2,1,H,26,11,9,4,12,11,6,4,1,3,0,0
D1,10/08/13,Augsburg,Dortmund,0,4,A,0,1,A,11,13,4,9,18,12,5,6,2,0,0,0
D1,10/08/13,Braunschweig,"Werder Bremen",0,1,A,0,0,D,13,12,3,4,10,18,2,7,1,0,0,0
D1,10/08/13,Hannover,Wolfsburg,2,0,H,1,0,H,20,15,8,4,28,11,7,2,4,0,0,2
D1,10/08/13,Hertha,"Ein Frankfurt",6,1,H,2,1,H,16,10,9,4,19,18,5,4,0,2,0,0

I would recommend using spark-csv (https://github.com/databricks/spark-csv). Start the shell with the package, for example:

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

and then

val liga = sqlContext.read
                     .format("com.databricks.spark.csv")
                     .option("header", "true")
                     .option("inferSchema", "true")
                     .load("/tmp/liga.csv")
liga.show()

spark-csv handles quoted values in the CSV file and, with inferSchema enabled, infers the column types automatically. The result is:

liga: org.apache.spark.sql.DataFrame = [Di: string, Date: string, HomeTeam: string, AwayTeam: string, FTHG: int, FTAG: int, FTR: string, HTHG: int, HTAG: int, HTR: string, HS: int, AS: int, HST: int, AST: int, HF: int, AF: int, HC: int, AC: int, HY: int, AY: int, HR: int, AR: int]
+---+--------+-------------+--------------+----+----+---+----+----+---+---+---+---+ ...
| Di|    Date|     HomeTeam|      AwayTeam|FTHG|FTAG|FTR|HTHG|HTAG|HTR| HS| AS|HST| ...
+---+--------+-------------+--------------+----+----+---+----+----+---+---+---+---+ ...
| D1|09/08/13|Bayern Munich|   M'gladbach |   3|   1|  H|   2|   1|  H| 26| 11|  9| ...
| D1|10/08/13|     Augsburg|     Dortmund |   0|   4|  A|   0|   1|  A| 11| 13|  4| ...
| D1|10/08/13| Braunschweig| Werder Bremen|   0|   1|  A|   0|   0|  D| 13| 12|  3| ...
| D1|10/08/13|     Hannover|    Wolfsburg |   2|   0|  H|   1|   0|  H| 20| 15|  8| ...
| D1|10/08/13|       Hertha|Ein Frankfurt |   6|   1|  H|   2|   1|  H| 16| 10|  9| ...
+---+--------+-------------+--------------+----+----+---+----+----+---+---+---+---+ ...
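
If you are on Spark 2.0 or later, the CSV reader is built in, so the external package should not be needed. Here is a minimal sketch, assuming a spark-shell in local mode where `spark` (the SparkSession) is predefined. The shortened sample data and the temp file are made up for illustration; point the reader at your own file instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical sample: a shortened version of the CSV above, written to a temp file.
val sample = Seq(
  "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG",
  "D1,09/08/13,\"Bayern Munich\",\"M'gladbach\",3,1",
  "D1,10/08/13,Augsburg,Dortmund,0,4"
).mkString("\n")
val path = Files.createTempFile("liga", ".csv")
Files.write(path, sample.getBytes(StandardCharsets.UTF_8))

// Same options as with spark-csv, but using the built-in reader.
val reader = spark.read.option("header", "true").option("inferSchema", "true")
val liga = reader.csv(path.toUri.toString)  // file:// URI so the local file is used

liga.printSchema()
liga.show()
```

The chain is kept on single lines so it can be pasted into the shell without :paste mode.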

Re: How to convert RDD[List[String]] to Dataframe in Scala

Contributor

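When you read the file through sqlContext, the result is already a DataFrame.

If you read it through the SparkContext (sc) instead, you only get an RDD, so you have to call .toDF() to convert the RDD into a DataFrame. Note that .toDF() requires import sqlContext.implicits._, which spark-shell does for you.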

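import sqlContext.implicits._  // needed for .toDF(); spark-shell imports it automatically

val csv = sc.textFile("/tmp/liga.csv")  // RDD[String], one element per line

csv.toDF()  // DataFrame with a single string column holding the whole line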
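
For the RDD[List[String]] from the original question, one way to get one column per field is to build the schema from the header row and convert each remaining list into a Row. This is only a sketch under a few assumptions: it targets a Spark 1.x spark-shell (sc and sqlContext predefined), the first element of the RDD is the header, every column is kept as a string, and the small `rdd` below is a stand-in for your real data.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Stand-in for the RDD from the question (first four columns, two data rows).
val rdd: RDD[List[String]] = sc.parallelize(Seq(
  List("Div", "Date", "HomeTeam", "AwayTeam"),
  List("D1", "09/08/13", "Bayern Munich", "M'gladbach"),
  List("D1", "10/08/13", "Augsburg", "Dortmund")
))

// The first element holds the column names.
val header = rdd.first()

// Declare every column as a nullable string.
val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))

// Drop the header row and turn each remaining List[String] into a Row.
val rows = rdd.filter(_ != header).map(fields => Row.fromSeq(fields))

val df = sqlContext.createDataFrame(rows, schema)
df.show()
```

Because all columns come out as strings here, numeric fields such as FTHG would still need a cast afterwards, for example df("FTHG").cast("int").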