Member since: 10-21-2018
Posts: 14
Kudos Received: 1
Solutions: 0
10-07-2022
11:17 PM
My data are in JSON format, gzipped, and stored on S3. I tried to read them with Structured Streaming as below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}

val tSchema = new StructType().add("log_type", StringType)
val tDF = spark.readStream.option("compression", "gzip").schema(tSchema).load("s3a://S3_dir/")
tDF.writeStream.outputMode("Append").format("console").start()

This throws an exception:

s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]

How can I fix this and read these gzipped JSON files?
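For reference: `readStream.load()` with no declared source falls back to Spark's default format, Parquet, which is why the reader complains about a missing Parquet magic number. A minimal sketch of the likely fix is to declare the JSON source explicitly (this is a sketch, not the poster's confirmed solution; it assumes the S3 objects carry a `.gz` extension, which Spark's text-based sources decompress automatically, so no reader-side compression option is needed):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StringType}

val spark = SparkSession.builder().appName("ReadGzJson").getOrCreate()

val tSchema = new StructType().add("log_type", StringType)

// Declare the source format explicitly; without .format("json"),
// readStream assumes Parquet and fails on gzipped JSON files.
val tDF = spark.readStream
  .format("json")
  .schema(tSchema)
  .load("s3a://S3_dir/")

tDF.writeStream
  .outputMode("append")
  .format("console")
  .start()
```

Note that `option("compression", "gzip")` is a writer option; on the read side, codecs are detected from the file extension.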
Labels: Apache Spark
08-29-2021
10:45 PM
I have two separate Hadoop clusters: a Cloudera cluster and an Apache Hadoop cluster. I found that an Impala query runs faster on the Cloudera cluster, whereas the same query runs slower on the Apache Hadoop cluster. During query execution, the query spends a significant amount of time in the analysis and planning phases compared to the Cloudera cluster. I tuned the Apache cluster's heap size configuration and tried to keep the same properties and values as in the Cloudera cluster. What else do I need to double-check, or do I need to configure some other services or settings? Please suggest. The same machine hardware configuration and the same instance types were used in both clusters. Versions used on the Cloudera side: CDH 6.3.2, impalad 3.2.0. Versions used on the Apache side: Hadoop 3.0.0, Impala 3.4.0.
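One common cause of long planning times (an assumption here, since the question shows no query profile) is missing table and column statistics: without them Impala's planner has to guess row counts when costing joins. Stats can be collected and checked per table, e.g. (`my_db.my_table` is a placeholder name for illustration):

```sql
-- Collect table and column statistics so the planner can cost joins.
COMPUTE STATS my_db.my_table;

-- Verify that row counts and stats are populated (no -1 values).
SHOW TABLE STATS my_db.my_table;
SHOW COLUMN STATS my_db.my_table;
```

Running `PROFILE;` in impala-shell right after the query also breaks down planning versus execution time, which is worth comparing between the two clusters.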
Labels: Apache Impala
04-24-2020
10:57 AM
1 Kudo
Hi lwang, As suggested, I disabled the 'Hive Metastore Canary Health Test' and also reduced the heap size from 5 GiB to 2 GiB. For the last 14 hours we have not noticed any alerts from the Service Monitor. Thanks,
04-23-2020
08:04 AM
Hi lwang, I noticed that we have only 285 entries in the Service Monitor (found via Cloudera Management Service Monitored Entities). I recently increased the heap size to 5 GiB but still received the alert: The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 4,991M. JVM maximum available heap size: 5,120M. Percentage of maximum heap: 97.48%. Critical threshold: 95.00%.
04-22-2020
08:46 PM
Thanks lwang, I increased the JVM heap size to 5 GiB; let's see how it works. Version: Cloudera Express 6.3.0 (#1281944 built by jenkins on 20190719-0609 git: 5b793e9c9cb3f40b3912044aac00abb635183191) Java VM Name: Java HotSpot(TM) 64-Bit Server VM Java Version: 1.8.0_181
04-22-2020
07:25 AM
I am new to CDH cluster setup. I have CDH 6.3.2 with HA enabled, a cluster of 3+5 nodes (3 masters and 5 data nodes). For the last 2 days we have received an alert from SERVICE_MONITOR_HEAP_SIZE: The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 2,001M. JVM maximum available heap size: 2,048M. Percentage of maximum heap: 97.71%. Critical threshold: 95.00%. So I increased the heap size to 3.0 GiB, but we still received the alert below: The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 3,004M. JVM maximum available heap size: 3,072M. Percentage of maximum heap: 97.79%. Critical threshold: 95.00%. How can I estimate the required heap size? How can I fix this issue? Please assist me step by step. Thank you
Labels: Cloudera Manager
01-06-2020
03:06 AM
I saved a sample query in the Hue UI (Impala editor), then tried to find the record in the MySQL database 'hue', table 'beeswax_savedquery'. However, the tables 'beeswax_savedquery' and 'beeswax_queryhistory' are empty, whereas other tables do store the expected information; e.g., table 'auth_user' contains all information about users. My question: where are those Hue queries stored (in MySQL, or somewhere in HDFS)? I am using CDH 6.3.2 with Impala.
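A likely explanation (an assumption, since the post doesn't confirm the Hue version's schema): in Hue 4, which ships with CDH 6, saved queries moved from the legacy beeswax tables to Hue's document model, so they should appear in the `desktop_document2` table of the same MySQL database, e.g.:

```sql
-- In Hue 4's document model, saved editor queries are rows in
-- desktop_document2; the type column identifies the editor used.
SELECT id, name, type, last_modified
FROM desktop_document2
WHERE type LIKE 'query-%';
```

If rows show up there, the queries are stored in MySQL, not in HDFS.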
Labels: Apache Impala
10-22-2018
10:58 AM
Thanks a lot, this works for me.
10-21-2018
10:52 PM
We have a situation where the whole cluster was installed and managed by CM6/CDH6: 1 machine for CM and 4 other machines for CDH. The embedded DB is not used; MySQL is deployed as the external DB. It ran well, but then the CM machine crashed due to hardware failure. Is there a way to replace the hardware, reinstall the same version of CM, and add the existing hosts (datanodes) back to the same cluster? If there is a way to reinstall the CM machine after it crashes and add host machines to an existing cluster previously installed/managed by the same version of CM, that would be sufficient for us. I tried to add the existing hosts (datanodes), but installation stopped with the message below at Cluster Installation -> Install Parcels: Src file /opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel does not exist Any suggestions? Am I doing this the right way, or is there another correct way to achieve this?
Labels: Cloudera Manager