Member since: 10-21-2018
Posts: 14
Kudos Received: 1
Solutions: 0
10-07-2022
11:17 PM
My data are in JSON format, gzipped, and stored on S3. I tried to read them with Structured Streaming as below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}

val tSchema = new StructType().add("log_type", StringType)
val tDF = spark.readStream.option("compression", "gzip").schema(tSchema).load("s3a://S3_dir/")
tDF.writeStream.outputMode("Append").format("console").start()

This throws an exception:

s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]

How can I fix this and read these gzipped JSON files?
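For reference: `readStream.load()` with no declared source falls back to Spark's default format, Parquet, which is why the reader complains about a missing Parquet magic number. A minimal sketch of the likely fix is to declare the JSON source explicitly (this is a sketch, not the poster's confirmed solution; it assumes the S3 objects carry a `.gz` extension, which Spark's text-based sources decompress automatically, so no reader-side compression option is needed):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StringType}

val spark = SparkSession.builder().appName("ReadGzJson").getOrCreate()

val tSchema = new StructType().add("log_type", StringType)

// Declare the source format explicitly; without .format("json"),
// readStream assumes Parquet and fails on gzipped JSON files.
val tDF = spark.readStream
  .format("json")
  .schema(tSchema)
  .load("s3a://S3_dir/")

tDF.writeStream
  .outputMode("append")
  .format("console")
  .start()
```

Note that `option("compression", "gzip")` is a writer option; on the read side, codecs are detected from the file extension.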
Labels: Apache Spark
08-29-2021
10:45 PM
I have two separate Hadoop clusters: a Cloudera cluster and an Apache Hadoop cluster. I found that an Impala query runs faster on the Cloudera cluster, whereas the same query runs slower on the Apache Hadoop cluster. During query execution, the query spends a significant amount of time in the analysis and planning phases compared to the Cloudera cluster. I tuned the Apache cluster's heap size configuration and tried to keep the same properties and values as in the Cloudera cluster. What else do I need to double-check, or do I need to configure some other services or settings? Please suggest. The same machine hardware configuration and the same instance types were used in both clusters. Versions used on the Cloudera side: CDH 6.3.2, impalad 3.2.0. Versions used on the Apache side: Hadoop 3.0.0, Impala 3.4.0.
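One common cause of long planning times (an assumption here, since the question shows no query profile) is missing table and column statistics: without them Impala's planner has to guess row counts when costing joins. Stats can be collected and checked per table, e.g. (`my_db.my_table` is a placeholder name for illustration):

```sql
-- Collect table and column statistics so the planner can cost joins.
COMPUTE STATS my_db.my_table;

-- Verify that row counts and stats are populated (no -1 values).
SHOW TABLE STATS my_db.my_table;
SHOW COLUMN STATS my_db.my_table;
```

Running `PROFILE;` in impala-shell right after the query also breaks down planning versus execution time, which is worth comparing between the two clusters.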
Labels: Apache Impala
04-24-2020
10:57 AM
1 Kudo
Hi lwang, As suggested, I disabled the 'Hive Metastore Canary Health Test' and also reduced the heap size from 5 GiB to 2 GiB. For the last 14 hours we have not noticed any alerts from the Service Monitor. Thanks,
04-23-2020
08:04 AM
Hi lwang, I noticed that we have only 285 entries in the Service Monitor (found via Cloudera Management Service Monitored Entities). I recently increased the heap size to 5 GiB but still received the alert: The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 4,991M. JVM maximum available heap size: 5,120M. Percentage of maximum heap: 97.48%. Critical threshold: 95.00%.
04-22-2020
08:46 PM
Thanks lwang, I increased the JVM heap size to 5 GiB; let's see how it works. Version: Cloudera Express 6.3.0 (#1281944 built by jenkins on 20190719-0609 git: 5b793e9c9cb3f40b3912044aac00abb635183191) Java VM Name: Java HotSpot(TM) 64-Bit Server VM Java Version: 1.8.0_181
04-22-2020
07:25 AM
I am new to CDH cluster setup. I have CDH 6.3.2 with HA enabled, a cluster of 3+5 nodes (3 masters and 5 data nodes). For the last 2 days we have received an alert from SERVICE_MONITOR_HEAP_SIZE: The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 2,001M. JVM maximum available heap size: 2,048M. Percentage of maximum heap: 97.71%. Critical threshold: 95.00%. So I increased the heap size to 3.0 GiB, but we still received the alert below: The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 3,004M. JVM maximum available heap size: 3,072M. Percentage of maximum heap: 97.79%. Critical threshold: 95.00%. How can I estimate the required heap size? How can I fix this issue? Please assist me step by step. Thank you
Labels: Cloudera Manager
01-06-2020
03:06 AM
I saved a sample query in the Hue UI (Impala editor), then tried to find the record in the MySQL database 'hue', table 'beeswax_savedquery'. However, the tables 'beeswax_savedquery' and 'beeswax_queryhistory' are empty, whereas other tables do store the expected information; e.g., table 'auth_user' contains all information about users. My question: where are those Hue queries stored (in MySQL, or somewhere in HDFS)? I am using CDH 6.3.2 with Impala.
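A likely explanation (an assumption, since the post doesn't confirm the Hue version's schema): in Hue 4, which ships with CDH 6, saved queries moved from the legacy beeswax tables to Hue's document model, so they should appear in the `desktop_document2` table of the same MySQL database, e.g.:

```sql
-- In Hue 4's document model, saved editor queries are rows in
-- desktop_document2; the type column identifies the editor used.
SELECT id, name, type, last_modified
FROM desktop_document2
WHERE type LIKE 'query-%';
```

If rows show up there, the queries are stored in MySQL, not in HDFS.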
Labels: Apache Impala
10-22-2018
10:58 AM
Thanks a lot, this works for me.
10-21-2018
10:52 PM
We have a situation where the whole cluster was installed and managed by CM6/CDH6: 1 machine for CM and 4 other machines for CDH. The embedded DB is not used; MySQL is deployed as the external DB. It ran well, but then the CM machine crashed due to hardware failure. Is there a way to replace the hardware, reinstall the same version of CM, and add the existing hosts (datanodes) back to the same cluster? If there is a way to reinstall the CM machine after it crashes and add host machines to an existing cluster previously installed/managed by the same version of CM, that would be sufficient for us. I tried to add the existing hosts (datanodes), but installation stopped with the message below at Cluster Installation -> Install Parcels: Src file /opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel does not exist Any suggestions? Am I doing this the right way, or is there another correct way to achieve this?
Labels: Cloudera Manager