How to read multiple gzipped files from S3 into dataframe
Labels: Apache Spark
Created ‎10-07-2022 11:17 PM
My data is in JSON format, gzipped, and stored on S3. I want to read that data into a dataframe.
I tried some streaming options, as below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}
import sys.process._

val tSchema = new StructType().add("log_type", StringType)

val tDF = spark.readStream
  .option("compression", "gzip")
  .schema(tSchema)
  .load("s3a://S3_dir/")

tDF.writeStream
  .outputMode("Append")
  .format("console")
  .start()
I got this exception:
s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]
How can I fix this? How can I read these files into a dataframe?
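For context: when no format is given, load() falls back to Spark's default source, Parquet, which is why the reader checks for the Parquet magic number and rejects the gzipped JSON. The following is only a minimal sketch of how the same stream could be declared as a JSON source, assuming the files are newline-delimited JSON and keep their .gz extension so the compression codec is inferred from the file name; the path and one-column schema are just the placeholders from the snippet above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("GzippedJsonStream").getOrCreate()

// The streaming reader requires the schema up front
val tSchema = new StructType().add("log_type", StringType)

// Declaring the format avoids the Parquet default; .gz files are
// decompressed automatically based on their extension
val tDF = spark.readStream
  .format("json")
  .schema(tSchema)
  .load("s3a://S3_dir/")

tDF.writeStream
  .outputMode("append")
  .format("console")
  .start()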
Created ‎10-09-2022 07:22 PM
// Read the files into one dataframe, tag each row with its source
// file name, and drop duplicate rows
spark
    .session()
    .read()
    .option("encoding", "UTF-8")
    .option("delimiter", "^")
    .option("mode", "PERMISSIVE")   // tolerate malformed rows instead of failing the read
    .schema(SCHEMA_STORE.getIPDRschema())
    .csv(
        JavaConverters.collectionAsScalaIterableConverter(_files_to_process)
            .asScala()
            .toSeq())
    .withColumn("filename", org.apache.spark.sql.functions.input_file_name())
    .dropDuplicates();
This is written in Java; please convert it to Scala. Hope this will work 🙂
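In case it helps, a rough Scala equivalent of the Java snippet above could look like the sketch below. The schema and file list here are placeholders standing in for SCHEMA_STORE.getIPDRschema() and _files_to_process, and the JavaConverters conversion goes away because Scala can pass a Seq straight into csv's varargs.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("ReadDelimitedFiles").getOrCreate()

// Placeholders: substitute the real IPDR schema and the real list of files
val ipdrSchema = new StructType().add("log_type", StringType)
val filesToProcess: Seq[String] = Seq("s3a://S3_dir/file1.gz", "s3a://S3_dir/file2.gz")

val df = spark.read
  .option("encoding", "UTF-8")
  .option("delimiter", "^")
  .option("mode", "PERMISSIVE")
  .schema(ipdrSchema)
  .csv(filesToProcess: _*)                    // varargs, no JavaConverters needed
  .withColumn("filename", input_file_name())  // record which file each row came from
  .dropDuplicates()

df.show(false)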
