
How to read multiple gzipped files from S3 into dataframe

Contributor

My data is in JSON format, gzipped, and stored on S3.

I want to read that data into a DataFrame.

I tried some streaming options, as below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}
import sys.process._
val tSchema = new StructType().add("log_type", StringType)
val tDF = spark.readStream.option("compression","gzip").schema(tSchema).load("s3a://S3_dir/")
tDF.writeStream.outputMode("Append").format("console").start()

I got this exception:

s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]

How can I fix this and read the data into a DataFrame?
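The expected magic number [80, 65, 82, 49] is ASCII for "PAR1", the Parquet footer marker, so the exception suggests the stream fell back to Spark's default Parquet source because no input format was specified on readStream. A minimal sketch of one possible fix, assuming the files really are gzipped JSON lines under s3a://S3_dir/ (the JSON source decompresses .gz files based on their extension):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StringType}

val spark = SparkSession.builder().getOrCreate()

val tSchema = new StructType().add("log_type", StringType)

// format("json") keeps Spark from defaulting to the Parquet source;
// a schema is required for file-based streaming sources
val tDF = spark.readStream
  .format("json")
  .schema(tSchema)
  .load("s3a://S3_dir/")

tDF.writeStream
  .outputMode("append")
  .format("console")
  .start()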

1 REPLY

New Contributor

spark.session()
     .read()
     .option("encoding", "UTF-8")
     .option("delimiter", "^")
     .option("mode", "PERMISSIVE")
     .schema(SCHEMA_STORE.getIPDRschema())
     .csv(JavaConverters.collectionAsScalaIterableConverter(_files_to_process)
                        .asScala()
                        .toSeq())
     .withColumn("filename", org.apache.spark.sql.functions.input_file_name())
     .dropDuplicates();

 

This is written in Java; please convert it to Scala. I hope this will work 🙂
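For reference, a rough Scala sketch of the same chain. SCHEMA_STORE.getIPDRschema() and _files_to_process come from the reply's own code, so hypothetical stand-ins are used here; in Scala the JavaConverters step is unnecessary because the paths can be passed as varargs:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.{StructType, StringType}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical stand-ins for SCHEMA_STORE.getIPDRschema() and _files_to_process above
val ipdrSchema = new StructType().add("log_type", StringType)
val filesToProcess: Seq[String] = Seq("s3a://S3_dir/")

val df = spark.read
  .option("encoding", "UTF-8")
  .option("delimiter", "^")
  .option("mode", "PERMISSIVE")
  .schema(ipdrSchema)
  .csv(filesToProcess: _*)   // varargs replaces the JavaConverters conversion
  .withColumn("filename", input_file_name())
  .dropDuplicates()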