Member since 10-21-2018
14 Posts
1 Kudos Received
0 Solutions

10-07-2022 11:17 PM
My data is in JSON format, gzipped, and stored on S3. I want to read it, so I tried the streaming options below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}
import sys.process._
val tSchema = new StructType().add("log_type", StringType)
val tDF = spark.readStream.option("compression","gzip").schema(tSchema).load("s3a://S3_dir/")
tDF.writeStream.outputMode("Append").format("console").start()

I got this exception:

s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]

How can I fix this? How can I read 
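A likely cause of the error above: `load` is called without a `format`, so Spark falls back to its default data source, which is Parquet unless `spark.sql.sources.default` has been changed, and the gzipped JSON files are opened as Parquet. Below is a minimal sketch of a possible fix, not a tested solution for this cluster. It assumes the objects have a `.gz` extension (as far as I know, Spark chooses the decompression codec from the file extension when reading text-based formats) and that each line holds one JSON object. The object name `ReadGzippedJsonStream` is hypothetical, and `s3a://S3_dir/` is the placeholder path from the post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical object name; in spark-shell the body of main can be pasted directly.
object ReadGzippedJsonStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadGzippedJsonStream")
      .getOrCreate()

    // Same single-column schema as in the post.
    val tSchema = new StructType().add("log_type", StringType)

    // Name the source format explicitly so Spark does not default to Parquet.
    // Assumption: files end in .gz, so no compression option is set on read.
    val tDF = spark.readStream
      .format("json")
      .schema(tSchema)
      .load("s3a://S3_dir/")

    // Print each micro-batch to the console and keep the query running.
    val query = tDF.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

If the files do not carry a `.gz` extension, this sketch would probably not decompress them, and renaming the objects or reading them another way would be needed.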
						
					
Labels: Apache Spark

08-29-2021 10:45 PM
I have two separate Hadoop clusters: a Cloudera Hadoop cluster and an Apache Hadoop cluster. An Impala query runs faster on the Cloudera cluster, whereas the same query runs slower on the Apache Hadoop cluster. During query execution, I found that the query spends a significant amount of time in the analysis and planning phase compared to the Cloudera cluster. I tuned the heap size configuration on the Apache cluster and tried to keep the same properties and values as on the Cloudera cluster.

What else do I need to double-check? Do I need to configure any other services or settings? Please suggest.

The same machine hardware configuration and the same instances were used in both clusters.
Versions used in Cloudera: CDH 6.3.2, impalad version 3.2.0
Versions used in Apache: Hadoop 3.0.0, Impala 3.4.0
						
					
Labels: Apache Impala

04-24-2020 10:57 AM
1 Kudo
Hi lwang, as suggested, I disabled the 'Hive Metastore Canary Health Test' and also reduced the heap size from 5 GiB to 2 GiB. For the last 14 hours we have not seen any alerts from Service Monitor. Thanks,
						
					

04-23-2020 08:04 AM
Hi lwang, I noticed that we have only 285 entries in Service Monitor (found under Cloudera Management Service Monitored Entities). I recently increased the heap size to 5 GiB but still received the alert:

The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 4,991M. JVM maximum available heap size: 5,120M. Percentage of maximum heap: 97.48%. Critical threshold: 95.00%.
						
					

04-22-2020 08:46 PM
Thanks lwang, I increased the JVM heap size to 5 GiB. Let's see how it works.

Version: Cloudera Express 6.3.0 (#1281944 built by jenkins on 20190719-0609 git: 5b793e9c9cb3f40b3912044aac00abb635183191)
Java VM Name: Java HotSpot(TM) 64-Bit Server VM
Java Version: 1.8.0_181
						
					

04-22-2020 07:25 AM
I am new to CDH cluster setup. I have CDH 6.3.2 with HA enabled, on a 3+5 node cluster (3 masters and 5 data nodes). For the last 2 days we have been receiving a SERVICE_MONITOR_HEAP_SIZE alert:

The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 2,001M. JVM maximum available heap size: 2,048M. Percentage of maximum heap: 97.71%. Critical threshold: 95.00%.

So I increased the heap size to 3.0 GiB, but we still received the alert below:

The health test result for SERVICE_MONITOR_HEAP_SIZE has become bad: Heap used: 3,004M. JVM maximum available heap size: 3,072M. Percentage of maximum heap: 97.79%. Critical threshold: 95.00%.

How can I estimate the heap size? How can I fix this issue? Please assist me step by step. Thank you
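The percentage in each alert above is the heap used divided by the maximum heap. Here is a small sketch of that arithmetic, checked against the 95% critical threshold quoted in the alerts. The names `HeapUsage` and `percentOfMax` are hypothetical, and the code only reproduces the alert's figure; it does not estimate how much heap Service Monitor actually needs.

```scala
// Hypothetical helper that reproduces the "Percentage of maximum heap" figure.
object HeapUsage {
  // Share of the maximum heap that is in use, as a percentage.
  def percentOfMax(usedMiB: Double, maxMiB: Double): Double =
    usedMiB / maxMiB * 100.0

  def main(args: Array[String]): Unit = {
    val criticalThreshold = 95.0 // from the alert text
    // (heap used, maximum heap) in MiB, taken from the two alerts in the post.
    val readings = Seq((2001.0, 2048.0), (3004.0, 3072.0))
    for ((used, limit) <- readings) {
      val pct = percentOfMax(used, limit)
      val critical = pct > criticalThreshold
      println(f"$used%.0f / $limit%.0f MiB -> $pct%.2f%% (critical: $critical)")
    }
  }
}
```

Both readings come out just under 98%, so raising the limit from 2 GiB to 3 GiB left the ratio essentially unchanged: the heap in use grew along with the limit.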
						
					
Labels: Cloudera Manager

01-06-2020 03:06 AM
I saved a sample query in the Hue UI (Impala editor), then tried to find it in the MySQL database 'HUE', in the table 'beeswax_savedquery'. However, the tables 'beeswax_savedquery' and 'beeswax_queryhistory' are empty, whereas other tables hold all the expected information. For example, the table 'auth_user' contains all the information about users.

My question is: where are those Hue queries stored (in MySQL, or somewhere in HDFS)?

I am using CDH 6.3.2 with Impala.
						
					
Labels: Apache Impala

10-22-2018 10:58 AM
							Thanks a lot, this works for me.
						
					

10-21-2018 10:52 PM
We have a situation where the whole cluster was installed and managed by CM6/CDH6: 1 machine for CM and 4 other machines for CDH. The embedded DB is not used; MySQL is deployed as an external DB. It ran well, but then the CM machine crashed due to a hardware failure. Is there a way to replace the hardware, reinstall the same version of CM, and add the existing hosts (datanodes) to the same cluster again?

If there is a way to reinstall the CM machine after it crashes and then add host machines to an existing cluster that was previously installed and managed by the same version of CM, that will be sufficient for us.

I tried to add the existing hosts (datanodes), but the installation stopped with the message below at Cluster Installation -> Install Parcels:

Src file /opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel does not exist

Any suggestions? Am I doing this the right way, or is there another correct way to achieve this?
						
					
Labels: Cloudera Manager