<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Reading/analysing a JSON file of about 1 TB in a Spark/HDInsight Kafka cluster in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284059#M210988</link>
    <description>&lt;P&gt;I would like to analyse a large JSON dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Kafka), but it is very slow. Here is what I do:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. The data is downloaded from &lt;A href="https://dumps.wikimedia.org/wikidatawiki/entities/" target="_self"&gt;here&lt;/A&gt; and stored in HDFS.&lt;/P&gt;
&lt;P&gt;2. val data = spark.read.json(path) crashes.&lt;/P&gt;
&lt;P&gt;3. val rdd = sc.textFile(path) followed by rdd.count() also crashes.&lt;/P&gt;
&lt;P&gt;4. rdd.take(10) and similar actions work fine.&lt;/P&gt;
&lt;P&gt;5. It was not possible to unzip the file, so I read data.json.gz directly.&lt;/P&gt;
&lt;P&gt;Any suggestions? How can I read it with the JSON reader?&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Tue, 26 Nov 2019 13:55:59 GMT</pubDate>
    <dc:creator>Maryam</dc:creator>
    <dc:date>2019-11-26T13:55:59Z</dc:date>
    <item>
      <title>Reading/analysing a JSON file of about 1 TB in a Spark/HDInsight Kafka cluster</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284059#M210988</link>
      <description>&lt;P&gt;I would like to analyse a large JSON dataset (0.9 TB after unzipping) on a cluster with 14 nodes and 39 cores (Azure HDInsight/Kafka), but it is very slow. Here is what I do:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1. The data is downloaded from &lt;A href="https://dumps.wikimedia.org/wikidatawiki/entities/" target="_self"&gt;here&lt;/A&gt; and stored in HDFS.&lt;/P&gt;
&lt;P&gt;2. val data = spark.read.json(path) crashes.&lt;/P&gt;
&lt;P&gt;3. val rdd = sc.textFile(path) followed by rdd.count() also crashes.&lt;/P&gt;
&lt;P&gt;4. rdd.take(10) and similar actions work fine.&lt;/P&gt;
&lt;P&gt;5. It was not possible to unzip the file, so I read data.json.gz directly.&lt;/P&gt;
&lt;P&gt;Any suggestions? How can I read it with the JSON reader?&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;</description>
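One possible explanation, sketched in the same Scala/Spark style as the question (the HDFS paths and the partition count below are illustrative assumptions, not from the post): a single .json.gz file is not splittable, so Spark must stream the entire compressed file through one task. That is consistent with take(10) succeeding (it only reads the first few records) while count() and spark.read.json, which must traverse all 0.9 TB on one executor, appear to crash or hang. One common workaround is to read the gzipped file once as plain text, repartition, and write an uncompressed, splittable copy that later jobs can parse in parallel:

```scala
// Sketch only: paths and the partition count are assumed for illustration.
// A single gzip file is not splittable, so this first read runs as ONE task.
// It streams line by line, so it is slow but should not exhaust memory.
val lines = spark.sparkContext.textFile("hdfs:///data/latest-all.json.gz")

// Rewrite the dump uncompressed across many part-files so that future
// jobs can read it with many parallel tasks.
lines.repartition(512).saveAsTextFile("hdfs:///data/latest-all.json")

// Subsequent runs parse the splittable copy in parallel.
val data = spark.read.json("hdfs:///data/latest-all.json")
```

An alternative, if storing the dump uncompressed is too costly, is to recompress with a splittable codec such as bzip2, which Hadoop input formats can split across tasks.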
      <pubDate>Tue, 26 Nov 2019 13:55:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284059#M210988</guid>
      <dc:creator>Maryam</dc:creator>
      <dc:date>2019-11-26T13:55:59Z</dc:date>
    </item>
    <item>
      <title>Re: Reading/analysing a JSON file of about 1 TB in a Spark/HDInsight Kafka cluster</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284076#M210995</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/70154"&gt;@Maryam&lt;/a&gt;&amp;nbsp;&lt;SPAN&gt;While we welcome your question, you would be much more likely to obtain a useful answer if you posted it to &lt;A href="https://social.msdn.microsoft.com/forums/azure/en-us/home?forum=hdinsight" target="_blank" rel="noopener nofollow"&gt;the appropriate forum for Microsoft Azure HDInsight&lt;/A&gt;.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Nov 2019 15:02:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Reading-analysing-Json-file-with-about-1TB-size-in-Spark/m-p/284076#M210995</guid>
      <dc:creator>ask_bill_brooks</dc:creator>
      <dc:date>2019-11-26T15:02:50Z</dc:date>
    </item>
  </channel>
</rss>

