Support Questions

pacosoplas · ‎10-06-2016

Hi:

Why is the read.json method in sqlContext so slow? I'm trying to read an 8 GB file from HDFS:

df1 = sqlContext.read.json("hdfs://xxxx:8020/tmp/file.json")

thanks

senthilkumarP · ‎10-07-2016

As I understand it, sqlContext infers the schema automatically, so it has to read the entire dataset to determine the data types.

That might be the reason for the slowness.

Thanks!

pacosoplas · ‎10-07-2016

Hi:

You were right. I used this:

>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("field2", StringType()),
...     StructField("field3",
...                 StructType([StructField("field5", ArrayType(IntegerType()))]))
... ])
>>> df3 = sqlContext.jsonRDD(json, schema)
>>> df3.first()
Row(field2=u'row1', field3=Row(field5=None))

Many thanks
