
pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema


New Contributor

I'm running PySpark SQL code on the Hortonworks sandbox.

18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3

# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map(lambda x: x.split(","))
df1 = sqlContext.createDataFrame(rdd2, ["id", "cat_id", "name", "desc", "price", "url"])
df1.printSchema()
root
 |-- id: string (nullable = true)
 |-- cat_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- desc: string (nullable = true)
 |-- price: string (nullable = true)
 |-- url: string (nullable = true)
df1.show() 
+---+------+--------------------+----+------+--------------------+
| id|cat_id|                name|desc| price|                 url|
+---+------+--------------------+----+------+--------------------+
|  1|     2|Quest Q64 10 FT. ...|    | 59.98|http://images.acm...|
|  2|     2|Under Armour Men'...|    |129.99|http://images.acm...|
|  3|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  4|     2|Under Armour Men'...|    | 89.99|http://images.acm...|
|  5|     2|Riddell Youth Rev...|    |199.99|http://images.acm...|
# When I try to get counts I get the following error.
df1.count()
Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.
# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql("select id, name, desc from products_tab order by name, id").show()

I see that the desc column is empty; I'm not sure whether an empty column needs to be handled differently when creating the DataFrame or when calling methods on it.

The same error occurs when running the SQL query. The SQL error seems to come from the "order by" clause: if I remove it, the query runs successfully. Please let me know if you need more info; I'd appreciate advice on how to handle this error.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

@Harshad M Perhaps the issue is data related. Showing 10 rows works fine, which means that when Spark has to go over all the rows it fails at some point because the data may not be properly formatted. Could you check whether the underlying data has any additional commas or other problems?
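A quick way to confirm this is to count how many fields each line splits into before building the DataFrame; any line that doesn't yield exactly 6 values is a suspect. This sketch uses made-up sample lines purely for illustration:

```python
# A naive split(",") breaks when a quoted field contains a comma.
lines = [
    '1,2,Quest Q64 10 FT.,,59.98,http://images.acm/p1.jpg',
    '685,31,"TaylorMade, SLDR Driver",,299.99,http://images.acm/p685.jpg',
]

EXPECTED_FIELDS = 6

# Collect (line index, field count) for every line that splits badly.
bad = [(i, len(line.split(","))) for i, line in enumerate(lines)
       if len(line.split(",")) != EXPECTED_FIELDS]
print(bad)  # the quoted product name splits into 7 values, matching the error
```

On the RDD itself the same check would be `rdd1.filter(lambda x: len(x.split(",")) != 6).take(5)`, which surfaces the offending records directly.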




Re: pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

@Harshad M

Good to hear you found the issue with the record. Please remember to log in and mark the answer as accepted if it helped you in any way. Thanks!


Re: pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

New Contributor

Thanks Felix. The issue was one record that had an embedded comma in it. Instead of programmatically splitting each record into columns, I used the Databricks CSV package by launching:

pyspark --packages com.databricks:spark-csv_2.10:1.4.0

from pyspark.sql import *
from pyspark.sql.types import *

schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])

df1 = sqlContext.read.format('com.databricks.spark.csv') \
    .schema(schema1) \
    .load('/user/maria_dev/spark_data/products.csv')
df1.show()
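The package fixes the problem because a quote-aware CSV parser keeps an embedded comma inside its quoted field instead of splitting on it. Python's standard csv module shows the same behavior (the sample line is invented for illustration):

```python
import csv

line = '685,31,"TaylorMade, SLDR Driver",,299.99,http://images.acm/p685.jpg'

naive = line.split(",")            # splits inside the quotes -> 7 values
parsed = next(csv.reader([line]))  # quote-aware parse -> 6 values

print(len(naive), len(parsed))
```

On Spark 2.x and later the CSV reader is built in, so no external package is needed: `spark.read.csv(path, schema=schema1)` handles quoted fields the same way.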
