
How to read multiple gzipped files from S3 into dataframe

Contributor

My data is in JSON format, gzipped, and stored on S3.

I want to read that data into a DataFrame.

I tried the streaming options below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}
import sys.process._

val tSchema = new StructType().add("log_type", StringType)
val tDF = spark.readStream.option("compression", "gzip").schema(tSchema).load("s3a://S3_dir/")
tDF.writeStream.outputMode("Append").format("console").start()

I got this exception:

s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]

How can I fix this? How can I read these gzipped JSON files into a DataFrame?
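
For context, the "expected magic number" message usually indicates that Spark fell back to its default Parquet source because no format was specified on the reader. A minimal sketch of specifying the JSON source explicitly (assuming the files are newline-delimited JSON; the gzip codec is normally inferred from the .gz file extension, so no compression option should be needed on read) might look like this:

import org.apache.spark.sql.types.{StructType, StringType}

val tSchema = new StructType().add("log_type", StringType)

// Read the gzipped JSON files as a streaming source; without .format("json"),
// Spark would try its default source (Parquet) and fail as shown above.
val tDF = spark.readStream
  .format("json")
  .schema(tSchema)
  .load("s3a://S3_dir/")

tDF.writeStream
  .outputMode("append")
  .format("console")
  .start()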

1 REPLY

New Contributor

spark
  .session()
  .read()
  .option("encoding", "UTF-8")
  .option("delimiter", "^")
  .option("mode", "PERMISSIVE")
  .schema(SCHEMA_STORE.getIPDRschema())
  .csv(JavaConverters.collectionAsScalaIterableConverter(_files_to_process)
      .asScala()
      .toSeq())
  .withColumn("filename", org.apache.spark.sql.functions.input_file_name())
  .dropDuplicates();

This is written in Java; please convert it to Scala. I hope this works 🙂
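
A rough Scala equivalent of the snippet above might be the following sketch (untested; SCHEMA_STORE.getIPDRschema() and the filesToProcess sequence of input paths are assumed to exist, as in the Java version):

import org.apache.spark.sql.functions.input_file_name

// Read the "^"-delimited files as CSV, tag each row with its source file,
// and drop duplicate rows, mirroring the Java chain above.
val df = spark.read
  .option("encoding", "UTF-8")
  .option("delimiter", "^")
  .option("mode", "PERMISSIVE")
  .schema(SCHEMA_STORE.getIPDRschema())
  .csv(filesToProcess: _*)
  .withColumn("filename", input_file_name())
  .dropDuplicates()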