read json with sqlcontext very slow

Master Collaborator

Hi:

Why is the read.json method in sqlContext so slow? I am trying to read an 8 GB file from HDFS:

df1 = sqlContext.read.json("hdfs://xxxx:8020/tmp/file.json")

thanks

1 ACCEPTED SOLUTION

Contributor

As I understand it, sqlContext infers the schema automatically, so it has to read the entire dataset to determine the data types.

That may be the reason for the slowness.

Thanks!
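To illustrate why inference implies a full pass, here is a minimal pure-Python sketch. It is not Spark's actual implementation; the function name `infer_fields` and the sample records are hypothetical. It only shows that a field can first appear in a late record, so every record has to be inspected before the schema is known.

```python
import json


def infer_fields(lines):
    """Collect each top-level field name and the Python type names seen for it."""
    seen = {}
    for line in lines:  # inference has to visit every record
        record = json.loads(line)
        for key, value in record.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen


# Hypothetical sample: "field3" only shows up in the second record,
# so stopping after the first record would produce an incomplete schema.
sample = [
    '{"field2": "row1"}',
    '{"field2": "row2", "field3": {"field5": [1, 2]}}',
]
fields = infer_fields(sample)
```

On an 8 GB file this kind of scan is an extra read of the whole input before any query runs, which is consistent with the slowness described in the question.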

Master Collaborator

Hi:

You were right. I used this:

>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("field2", StringType()),
...     StructField("field3",
...                 StructType([StructField("field5", ArrayType(IntegerType()))]))
... ])
>>> # `json` must be an RDD of JSON strings; its definition is not shown in this post
>>> df3 = sqlContext.jsonRDD(json, schema)
>>> df3.first()
Row(field2=u'row1', field3=Row(field5=None))

Many thanks