
How to read multiple gzipped files from S3 into dataframe

Contributor

My data is in JSON format, gzipped, and stored on S3.

I want to read that data into a DataFrame.

I tried some streaming options, as below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}
import sys.process._
val tSchema = new StructType().add("log_type", StringType)
val tDF = spark.readStream.option("compression","gzip").schema(tSchema).load("s3a://S3_dir/")
tDF.writeStream.outputMode("Append").format("console").start()

I got this exception:

s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]

How can I fix this and read the data into a DataFrame?
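The expected magic number [80, 65, 82, 49] is ASCII for "PAR1", the Parquet footer marker, so the exception suggests the stream fell back to Spark's default Parquet source because no input format was specified on readStream. A minimal sketch of one possible fix, assuming the files really are gzipped JSON lines under s3a://S3_dir/ (the JSON source decompresses .gz files based on their extension):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StringType}

val spark = SparkSession.builder().getOrCreate()

val tSchema = new StructType().add("log_type", StringType)

// format("json") keeps Spark from defaulting to the Parquet source;
// a schema is required for file-based streaming sources
val tDF = spark.readStream
  .format("json")
  .schema(tSchema)
  .load("s3a://S3_dir/")

tDF.writeStream
  .outputMode("append")
  .format("console")
  .start()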

1 REPLY

New Contributor

spark.session()
     .read()
     .option("encoding", "UTF-8")
     .option("delimiter", "^")
     .option("mode", "PERMISSIVE")
     .schema(SCHEMA_STORE.getIPDRschema())
     .csv(JavaConverters.collectionAsScalaIterableConverter(_files_to_process)
                        .asScala()
                        .toSeq())
     .withColumn("filename", org.apache.spark.sql.functions.input_file_name())
     .dropDuplicates();

 

This is written in Java; please convert it to Scala. I hope this will work 🙂
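For reference, a rough Scala sketch of the same chain. SCHEMA_STORE.getIPDRschema() and _files_to_process come from the reply's own code, so hypothetical stand-ins are used here; in Scala the JavaConverters step is unnecessary because the paths can be passed as varargs:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.{StructType, StringType}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical stand-ins for SCHEMA_STORE.getIPDRschema() and _files_to_process above
val ipdrSchema = new StructType().add("log_type", StringType)
val filesToProcess: Seq[String] = Seq("s3a://S3_dir/")

val df = spark.read
  .option("encoding", "UTF-8")
  .option("delimiter", "^")
  .option("mode", "PERMISSIVE")
  .schema(ipdrSchema)
  .csv(filesToProcess: _*)   // varargs replaces the JavaConverters conversion
  .withColumn("filename", input_file_name())
  .dropDuplicates()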