Member since: 08-22-2018 · Posts: 4 · Kudos Received: 0 · Solutions: 0
09-11-2018
02:14 PM
Thanks Felix. The issue was one record with an embedded comma in it. Instead of splitting each record into columns programmatically, I used the Databricks spark-csv package:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0

from pyspark.sql import *
from pyspark.sql.types import *

schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])

df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
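For anyone hitting the same issue: the spark-csv reader respects quoted fields, so a comma inside a quoted value no longer splits the row. A minimal sketch of the same load with the reader options spelled out (the quote and escape values shown are the package defaults; the mode setting is an assumption about how you may want malformed rows handled):

# sketch: same load as above, with spark-csv options made explicit
df1 = (sqlContext.read
    .format('com.databricks.spark.csv')
    .schema(schema1)
    .option('quote', '"')              # fields wrapped in this character may contain commas
    .option('escape', '\\')            # escape character inside quoted fields
    .option('mode', 'DROPMALFORMED')   # drop rows that do not match the schema instead of failing
    .load('/user/maria_dev/spark_data/products.csv'))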
09-10-2018
02:19 AM
I'm running pyspark-sql code on the Hortonworks sandbox.

18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3

# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map(lambda x: x.split(","))
df1 = sqlContext.createDataFrame(rdd2, ["id", "cat_id", "name", "desc", "price", "url"])
df1.printSchema()
root
|-- id: string (nullable = true)
|-- cat_id: string (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: string (nullable = true)
|-- url: string (nullable = true)
df1.show()
+---+------+--------------------+----+------+--------------------+
| id|cat_id| name|desc| price| url|
+---+------+--------------------+----+------+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 2| 2|Under Armour Men'...| |129.99|http://images.acm...|
| 3| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 4| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 5| 2|Riddell Youth Rev...| |199.99|http://images.acm...|
# When I try to get a count, I get the following error.
df1.count()
Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.
# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql("select id, name, desc from products_tab order by name, id").show()
I see that column desc is null; I'm not sure whether a null column needs to be handled differently when creating the DataFrame or when calling methods on it. The same error occurs when running the SQL query. The SQL error seems to come from the order by clause: if I remove order by, the query runs successfully.

Please let me know if you need more info. I'd appreciate an answer on how to handle this error.
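My guess is that show() only evaluates the first few rows, while count() and order by scan the whole file, which would explain why only they fail. A quick sketch to check that (assuming the same file path and six-column layout as above), looking for raw lines that do not split into exactly six fields:

# diagnostic sketch: find lines that do not split into exactly 6 fields
rdd1 = sc.textFile("/user/maria_dev/spark_data/products.csv")
bad_rows = rdd1.filter(lambda line: len(line.split(",")) != 6)
print(bad_rows.count())         # number of malformed lines
for line in bad_rows.take(5):   # inspect a few offenders
    print(line)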
08-23-2018
05:51 PM
Currently the Spark exam is not available, and the certification page says it would be available by August 22nd. I'm writing this post on August 23rd and I don't see any updates on the certification page.

===
The HDPCD: Apache Spark Exam
NOTICE: This exam is currently undergoing maintenance. It is available for purchase, however, exams cannot be scheduled until an estimated date of August 22, 2018. Thank you for your patience.
===

It's August 23rd today; can you at least update the page with a future date? I'm not sure Hortonworks takes certification seriously. You can see many threads/posts about the horrible experiences people have gone through. Thanks
08-22-2018
04:45 PM
Hi, I was going through the HDPCD Spark certification objectives, and each objective link leads to the latest Spark documentation, whereas the description of the exam environment says the exam will be based on Spark 1.6. The latest available sandbox lets me use Spark 2.0. Would I be able to use Spark 2 in the exam? Did anyone use 2.0 in the exam? Thanks