Created 10-06-2016 09:17 AM
Hi:
Why is the read.json method in sqlContext so slow? I am trying to read an 8 GB file from HDFS:
df1 = sqlContext.read.json("hdfs://xxxx:8020/tmp/file.json")
Thanks
Created 10-07-2016 11:33 AM
As far as I understand, sqlContext infers the schema automatically when none is given, so it has to read the entire data set to determine the data types.
That might be the reason for the slowness.
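To show why inference needs a full pass, here is a toy sketch in plain Python. It is not Spark's implementation: the function names (type_name, merge_types, infer_schema), the type labels, and the sample records are all made up for illustration. The point is only that a field's type can still change on the last record, so every record has to be read before the schema is known.

```python
import json


def type_name(value):
    """Map a parsed JSON value to a simple type label (illustrative labels only)."""
    if value is None:
        return "null"
    # bool must be checked before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "struct"


def merge_types(a, b):
    """Pick one type label that can hold values of both a and b (toy rules)."""
    if a == b or b == "null":
        return a
    if a == "null":
        return b
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # assumption for this sketch: conflicting types fall back to string


def infer_schema(lines):
    """Scan every JSON line and return (field -> type label, number of records read)."""
    schema = {}
    records_read = 0
    for line in lines:
        records_read += 1
        for field, value in json.loads(line).items():
            schema[field] = merge_types(schema.get(field, "null"), type_name(value))
    return schema, records_read


# Hypothetical sample: "count" looks like a long until the very last record.
sample = [
    '{"field2": "row1", "count": 1}',
    '{"field2": "row2", "count": 2}',
    '{"field2": "row3", "count": 2.5}',
]
schema, records_read = infer_schema(sample)
print(schema, records_read)
```

With an explicit schema, this scan is unnecessary because the types are already known before any record is read.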
Thanks!
Created 10-07-2016 11:42 AM
Hi:
You were right. I used this:
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("field2", StringType()),
...     StructField("field3",
...                 StructType([StructField("field5", ArrayType(IntegerType()))]))
... ])
>>> df3 = sqlContext.jsonRDD(json, schema)
>>> df3.first()
Row(field2=u'row1', field3=Row(field5=None))
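For anyone who reads the file by path, as in the original question, the same schema can also be handed to read.json through its schema argument. The snippet below is only a minimal sketch: read_json_with_schema is a hypothetical helper name, and it assumes it is given a live sqlContext plus a StructType like the one built above.

```python
def read_json_with_schema(sql_context, path, schema):
    """Load JSON from `path` with a caller-supplied schema instead of inferring one.

    `sql_context` is expected to be a pyspark SQLContext and `schema` a
    pyspark.sql.types.StructType; both are assumptions of this sketch.
    """
    # DataFrameReader.json accepts an optional schema argument
    return sql_context.read.json(path, schema=schema)


# Usage inside a pyspark shell, reusing the `schema` object defined above:
#   df1 = read_json_with_schema(sqlContext, "hdfs://xxxx:8020/tmp/file.json", schema)
```

Note that jsonRDD is deprecated since Spark 1.4 in favor of read.json, so the path-based form may be the safer choice on newer versions.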
Many thanks