Reading JSON with sqlContext is very slow
- Labels: Apache Spark
Created ‎10-06-2016 09:17 AM
Hi:
Why is the read.json method in sqlContext so slow? I am trying to read an 8 GB file from HDFS:
df1 = sqlContext.read.json("hdfs://xxxx:8020/tmp/file.json")
Thanks
Created ‎10-07-2016 11:33 AM
As I understand it, sqlContext infers the schema automatically, so it has to read the entire dataset to determine the data types.
That may be the reason for the slowness.
Thanks!
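To illustrate why inference forces a full pass over the data, here is a minimal pure-Python sketch. It is a toy and not Spark's actual inference algorithm; the `infer_schema` helper and the `sample` records are hypothetical and exist only for this illustration. The point is that a field's type, or even its existence, may only become known at a late record, so no record can be skipped.

```python
import json


def infer_schema(lines):
    """Scan every JSON line and collect the Python type names seen per field."""
    schema = {}
    for line in lines:  # full pass: every record has to be parsed
        record = json.loads(line)
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


# Hypothetical sample data: the last record changes the type of "id"
# and introduces a field that earlier records do not have.
sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b"}',
    '{"id": "3", "extra": true}',
]

print(infer_schema(sample[:2]))  # schema seen after only the first two records
print(infer_schema(sample))      # schema seen after reading everything
```

Looking at only the first two records would give an incomplete picture, which is why an inferring reader cannot stop early, and why the cost grows with the size of the file.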
Created ‎10-07-2016 11:42 AM
Hi:
You were right. I passed an explicit schema, like this:
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("field2", StringType()),
...     StructField("field3",
...         StructType([StructField("field5", ArrayType(IntegerType()))]))
... ])
>>> df3 = sqlContext.jsonRDD(json, schema)
>>> df3.first()
Row(field2=u'row1', field3=Row(field5=None))
Many thanks
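The same explicit-schema idea can probably be applied to the original read.json call as well, since the DataFrameReader exposes a schema(...) method. The sketch below assumes a PySpark shell where `sqlContext` already exists. The helper names `example_schema` and `read_json_with_schema` are hypothetical, and the field names are simply the ones from the post above, not necessarily those of the real 8 GB file.

```python
def example_schema():
    """Build the schema used in the post above (requires PySpark)."""
    # Imported inside the function so that the reader helper below
    # does not itself depend on these type classes.
    from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                                   StructField, StructType)
    return StructType([
        StructField("field2", StringType()),
        StructField("field3", StructType([
            StructField("field5", ArrayType(IntegerType())),
        ])),
    ])


def read_json_with_schema(sql_context, path, schema):
    """Read JSON from `path` using a fixed schema instead of inferring one."""
    # schema(...) fixes the schema up front; json(...) then loads the data.
    return sql_context.read.schema(schema).json(path)


# Usage in a PySpark shell (assumes `sqlContext` is defined there):
# df1 = read_json_with_schema(
#     sqlContext, "hdfs://xxxx:8020/tmp/file.json", example_schema())
```

Fields that are absent from the supplied schema are not read into the DataFrame, so the schema should list every field that is needed later.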
