I would like to analyze a large dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Kafka), but it is very slow. Here is what I do:
1. The data is downloaded from here.
2. val data = spark.read.json(path) crashes. The data is stored in HDFS.
3. val rdd = sc.textFile(path) followed by rdd.count() also crashes.
4. rdd.take(10) and other small actions like it work fine.
5. It was not possible to unzip the file, so I read data.json.gz directly (see the sketch after this list).
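Here is a minimal, self-contained sketch of the attempts above, assuming a spark-shell session (so spark and sc already exist) and a hypothetical HDFS path; the real path differs:

```scala
// Hypothetical location of the compressed file in HDFS
val path = "hdfs:///data/data.json.gz"

// Step 2: crashes when run against the full file
val data = spark.read.json(path)

// Step 3: building the RDD is fine, but counting the whole file crashes
val rdd = sc.textFile(path)
rdd.count()

// Step 4: works, since only a handful of records are pulled back to the driver
rdd.take(10)
```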
Any suggestions? How can I read it with the JSON reader?
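From what I understand, a single .gz file is not splittable, so the whole file is read by one task, and spark.read.json without a schema first scans the data just to infer one. Below is a rough sketch of what I was thinking of trying, with placeholder field names (the real schema is different); would something along these lines be the right direction?

```scala
import org.apache.spark.sql.types._

// Placeholder schema: the actual fields of the JSON records differ.
// Supplying a schema up front should skip the full-file schema-inference pass.
val schema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)

val df = spark.read
  .schema(schema)
  .json("hdfs:///data/data.json.gz")  // hypothetical path
  .repartition(500)                   // spread the data across the cluster after the single-task .gz read
```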
Thanks