Created 10-06-2016 09:17 AM
Hi:
Why is the read.json method in sqlContext so slow? I am trying to read an 8 GB file from HDFS:
df1 = sqlContext.read.json("hdfs://xxxx:8020/tmp/file.json")
thanks
Created 10-07-2016 11:33 AM
As far as I understand, sqlContext infers the schema automatically, so it has to read the entire data set to determine the data types.
That might be the reason for the slowness.
Thanks!
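To illustrate why inference implies a full read, here is a toy sketch in plain Python. It is not Spark's actual implementation; the function name infer_schema and the sample records are made up for illustration. The point is only that a field can first appear in the last record, so the types cannot be known without visiting every line.

```python
import json


def infer_schema(lines):
    """Toy schema inference: collect the observed type names for each field.

    Every record has to be parsed, because any record may introduce
    a new field or a new type for an existing field.
    """
    schema = {}
    records_read = 0
    for line in lines:
        records_read += 1
        for key, value in json.loads(line).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema, records_read


# Hypothetical sample data: "field3" only shows up in the second record.
sample = [
    '{"field2": "row1"}',
    '{"field2": "row2", "field3": {"field5": [1, 2]}}',
]

schema, records_read = infer_schema(sample)
print(schema, records_read)
```

When the schema is supplied up front, this extra pass over the data is unnecessary.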
Created 10-07-2016 11:42 AM
Hi:
You were right. I used this:
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("field2", StringType()),
...     StructField("field3",
...                 StructType([StructField("field5", ArrayType(IntegerType()))]))
... ])
>>> df3 = sqlContext.jsonRDD(json, schema)
>>> df3.first()
Row(field2=u'row1', field3=Row(field5=None))
Many thanks