Member since 05-28-2015

47 Posts
28 Kudos Received
7 Solutions

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7253 | 06-20-2016 04:00 PM |
| | 12491 | 01-16-2016 03:15 PM |
| | 13028 | 01-16-2016 05:06 AM |
| | 6367 | 01-14-2016 06:45 PM |
| | 3614 | 01-14-2016 01:56 AM |
			
    
	
		
		
10-25-2018 04:13 PM

1 Kudo

HiveServer now uses a remote instead of an embedded metastore; consequently, Ambari no longer starts the metastore using hive.metastore.uris=' '. You no longer set key=value commands on the command line to configure the Hive Metastore; you configure properties in hive-site.xml. The Hive catalog resides in the Hive Metastore, which is RDBMS-based as it was in earlier releases. With this architecture, Hive can take advantage of RDBMS resources in cloud deployments. So please check or share the hive.server2.* properties in your hive-site.xml.
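For reference, a hive-site.xml wired to a remote metastore typically carries entries along these lines; the host name below is a placeholder for illustration, not a value from your cluster:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- placeholder host: point this at your metastore node -->
    <value>thrift://your-metastore-host:9083</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
</configuration>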
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
10-23-2018 08:16 PM

Make sure you are pointing to the right Spark directory. It looks like your SPARK_HOME might be pointing to "/usr/hdp/2.6.0.11-1/spark" instead of "/usr/hdp/2.6.0.11-1/spark2". For Spark 2 your .bash_profile should look like the below:

export SPARK_HOME="/usr/hdp/2.6.0.11-1/spark2"
export SPARK_MAJOR_VERSION=2
export PYSPARK_PYTHON=python3.5
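After reloading the profile, a quick sanity check along these lines (assuming the HDP 2.6 paths above) confirms the variables took effect:

# reload the profile and confirm the environment now points at Spark 2
source ~/.bash_profile
echo $SPARK_HOME        # should print /usr/hdp/2.6.0.11-1/spark2
spark-submit --version  # should report a Spark 2.x version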
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
06-27-2016 03:43 PM

@mqureshi Thanks for your response. Yes, it is quite a custom requirement. I thought it was better to check with the community whether anyone has implemented this kind of thing. I am trying to use either a Hadoop custom input format or Python UDFs to get this done. There seems to be no straightforward way of doing this in Spark. I cannot use Spark's pivot either, as it supports pivoting on only a single column as of now, right?
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
06-25-2016 08:39 PM

I have a requirement wherein I need to ingest a multiline CSV with semi-structured records, where some rows need to be converted to columns and some rows need to become both rows and columns. Below is what the input CSV file looks like:

a,a1,a11,7/1/2008
b,b1,b11,8:53:00
c,c1,c11,25
d,d1,d11,1
e,e1,e11, ABCDEF
f,f1,f11,
sn1,msg,ref_sn_01,abc
sn2,msg,ref_sn_02,def
sn3,msg,ref_sn_02,ghi
sn4,msg,ref_sn_04,jkl
sn5,msg,ref_sn_05,mno
sn6,msg,ref_sn_06,pqr
sn7,msg,ref_sn_07,stu
sn8,msg,ref_sn_08,vwx
sn9,msg,ref_sn_09,yza
sn9,msg,ref_sn_09,yza
sn10,msg,ref_sn_010,
sn11,msg,ref_sn_011
cp1,ana,pw01,1.1
cp2,ana,pw02,1.1
cp3,ana,pw03,1.1
cp4,ana,pw04,1.1
cp5,ana,pw05,1.1
cp6,ana,pw06,1.1
cp7,ana,pw07,1.1
cp8,ana,pw08,1.1
cp9,ana,pw09,1.1
cp10,ana,pw10,1.1
cp11,ana,pw11,1.1

Below is the expected output:

Please let me know the best way to read this and load it into HDFS/Hive.
				
			
			
			
			
			
			
			
			
			
		
		
			
				
						
- Labels:
  - Apache Hadoop
  - Apache Spark
			
    
	
		
		
06-23-2016 03:31 AM

@Sri Bandaru

Check the hive-site.xml contents; it should look like the below for Spark.

- Add hive-site.xml to the driver classpath so that Spark can read the Hive configuration. Make sure --files comes before your .jar file.
- Add the datanucleus jars using the --jars option when you submit.

The contents of hive-site.xml:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://sandbox.hortonworks.com:9083</value>
  </property>
</configuration>

The sequence of commands:

spark-submit \
  --class <Your.class.name> \
  --master yarn-cluster \
  --num-executors 1 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
  target/YOUR_JAR-1.0.0-SNAPSHOT.jar "show tables" "select * from your_table"
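For completeness, a minimal sketch of what the submitted driver class might look like, assuming the two SQL strings are passed in as program arguments; the object name HiveQueryRunner is hypothetical and stands in for <Your.class.name> above (Spark 1.x HiveContext API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver object standing in for <Your.class.name>;
// it simply runs each SQL string passed on the command line.
object HiveQueryRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiveQueryRunner")
    val sc = new SparkContext(conf)
    // HiveContext reads the hive-site.xml shipped via --files,
    // so hive.metastore.uris points Spark at the remote metastore
    val sqlContext = new HiveContext(sc)
    args.foreach(query => sqlContext.sql(query).show())
    sc.stop()
  }
}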
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
06-20-2016 04:00 PM

@revan

Apache Hive Strengths:

Apache Hive facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides:

- Tools to enable easy data extract/transform/load (ETL)
- A mechanism to impose structure on a variety of data formats (see the HiveQL sketch after these lists)
- Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
- Query execution via MapReduce
- A simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language allows programmers familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
- Extensibility of QL with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs)
- Indexing to provide acceleration, with index types including compaction and bitmap indexes as of 0.10
- Different storage types such as plain text, RCFile, HBase, ORC, and others
- Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution
- Operation on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
- Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by the built-in functions.
- SQL-like queries (HiveQL), which are implicitly converted into MapReduce or Spark jobs

Apache Spark Strengths:

Spark SQL has multiple interesting features:

- It supports multiple file formats such as Parquet, Avro, Text, JSON, and ORC
- It supports data stored in HDFS, Apache HBase, Cassandra, and Amazon S3
- It supports classical Hadoop codecs such as snappy, lzo, and gzip
- It provides security through authentication via the use of a "shared secret" (spark.authenticate=true on YARN, or spark.authenticate.secret on all nodes if not on YARN)
- For encryption, Spark supports SSL for the Akka and HTTP protocols
- It supports UDFs
- It supports concurrent queries and manages the allocation of memory to the jobs (it is possible to specify the storage of an RDD as in-memory only, disk only, or memory and disk)
- It supports caching data in memory using a SchemaRDD columnar format (cacheTable("")) exposing ByteBuffer; it can also use memory-only caching exposing User objects
- It supports nested structures

When to use Spark or Hive:

- Hive is still a great choice when low latency/multiuser support is not a requirement, such as for batch processing/ETL. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI.
- Spark SQL lets Spark users selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis. However, Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance. Performance is the biggest advantage of Spark SQL.
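As a concrete illustration of imposing structure on files already sitting in HDFS, here is a small HiveQL sketch; the table name, columns, and path are hypothetical:

CREATE EXTERNAL TABLE web_logs (
  domain STRING,
  return_code INT,
  request_type STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/web_logs';

-- HiveQL is implicitly compiled into MapReduce (or Spark) jobs
SELECT domain, count(*) AS hits
FROM web_logs
GROUP BY domain;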
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
06-18-2016 12:24 AM

1 Kudo

Since we can pivot on only one column, one way of doing this in one go is to combine the two columns into a new column and use that new column as the pivot column. The output is somewhat close to what you are expecting. Hope this helps.

The input:

Domain,ReturnCode,RequestType
www.google.com,200,GET
www.google.com,300,GET
www.espn.com,200,POST

The code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions.{udf, count}

object pivotDF {
  // Define the application name
  val AppName: String = "pivotDF"

  // Set the logging level
  Logger.getLogger("org.apache").setLevel(Level.ERROR)

  // Define a udf to concatenate two passed-in string values
  val concat = udf( (first: String, second: String) => { first + " " + second } )

  def main(args: Array[String]) {
    // Define the input parameters
    val input_file = "/Users/gangadharkadam/myapps/pivot/src/main/resources/domains.csv"

    // Create the Spark configuration and the Spark context
    println("Initializing the Spark Context...")
    val conf = new SparkConf().setAppName(AppName).setMaster("local")

    // Define the Spark context
    val sc = new SparkContext(conf)

    // Define the SQL context
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Load and parse the domain data into a Spark DataFrame
    val domainDF = sqlContext
      // Define the format
      .read.format("com.databricks.spark.csv")
      // Use the first line as header
      .option("header", "true")
      // Automatically infer the data types and schema
      .option("inferSchema", "true")
      // Load the file
      .load(input_file)

    // Pivot using the concatenated column
    domainDF.withColumn("combColumn", concat($"ReturnCode", $"RequestType"))
      .groupBy("domain").pivot("combColumn").agg(count("*")).show()
  }
}

The output:

| domain | 200 GET | 200 POST | 300 GET |
|---|---|---|---|
| www.espn.com | 0 | 1 | 0 |
| www.google.com | 1 | 0 | 1 |
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
02-28-2016 08:29 PM

2 Kudos

@Mahesh Deshmukh

The Sqoop merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

# Let's create a TEST database in MySQL
create database test;
use test;

# Let's create an employee table
create table emp(empid int not null primary key, empname VARCHAR(20), age int, salary int, city VARCHAR(20), cr_date date);

# Describe the table
mysql> describe emp;
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| empid   | int(11)     | NO   | PRI | NULL    |       |
| empname | varchar(20) | YES  |     | NULL    |       |
| age     | int(11)     | YES  |     | NULL    |       |
| salary  | int(11)     | YES  |     | NULL    |       |
| city    | varchar(20) | YES  |     | NULL    |       |
| cr_date | date        | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+

# Load the employee table
LOAD DATA LOCAL INFILE '/Users/gangadharkadam/Downloads/empdata.csv'
INTO TABLE emp
FIELDS TERMINATED BY ','
ENCLOSED BY '/'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(empid, empname, age, salary, city, @var1)
set cr_date = STR_TO_DATE(@var1, '%m/%d/%Y');

# Import the emp table to HDFS using the below command
sqoop import --connect jdbc:mysql://localhost/TEST --table emp --username hive --password hive --target-dir /sqoop/empdata/

# Update a few records in the TEST.emp table as below
update emp set cr_date='2016-02-28' where empname like "A%";

# Now merge these updated records with the HDFS file using the --merge-key option;
# the merge tool will "flatten" the two datasets into one
sqoop import --connect jdbc:mysql://localhost/test --table emp \
--username hive --password hive --incremental lastmodified --merge-key empid --check-column cr_date \
--target-dir /sqoop/empdata/

Below are some of the updated records with the new cr_date:

750,Anne Ward,57,99496,La Sierpe,2016-02-28
38,Anne Morrison,36,53358,San Pedro Carchá,2016-02-28
718,Adam Ford,56,98340,Arthur’s Town,2016-02-28