Member since 11-17-2015

- 53 Posts
- 32 Kudos Received
- 4 Solutions

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 6823 | 08-30-2016 03:38 PM |
|  | 4377 | 08-09-2016 07:13 PM |
|  | 3617 | 06-14-2016 03:25 PM |
|  | 8693 | 02-26-2016 03:34 PM |

06-30-2017 01:22 PM

@Viswa Can you post your query and the full explain plan? It looks like not all of the output is there, so it's hard for anyone to explain what it is doing. In the meantime, here is a pretty helpful presentation about reading Hive explain plans: https://www.slideshare.net/HadoopSummit/how-to-understand-and-analyze-apache-hive-query-execution-plan-for-performance-debugging

Assuming you're using the new Hive explain plan (hive.explain.user=true), some general quick tips:

- Data flows from the bottom of the explain plan to the top.
- Operators can have multiple children (for example, to do a MAPJOIN you might need to do a MAP and a FILTER).
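
To make the tips concrete, here is a minimal sketch of generating the user-friendly plan; the table names and query are hypothetical placeholders, not taken from the original question:

-- make sure the user-friendly explain plan is enabled (as mentioned above)
SET hive.explain.user=true;

-- print the plan without running the query; read it bottom-up: the
-- TableScan and Filter operators at the bottom feed the join, which feeds
-- the Group By and the final output operators at the top
EXPLAIN
SELECT u.country, COUNT(*) AS hits
FROM web_logs l
JOIN users u ON l.user_id = u.user_id
WHERE l.status = 200
GROUP BY u.country;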
						
					

12-09-2016 10:18 PM

Create some properties in your pom.xml:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <scala.core>2.10</scala.core>
  <spark.version>1.6.1</spark.version>
</properties>

Include spark-hive in your project's dependencies:

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-hive_${scala.core}</artifactId>
   <version>${spark.version}</version>
</dependency>

Then in your code:

import org.apache.spark.sql.SaveMode

// create a new hive context from the spark context
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)

// create the data frame from your existing RDD and write it out as ORC;
// the output will be a directory of ORC files
val df = hiveContext.createDataFrame(rdd)
df.write.mode(SaveMode.Overwrite).format("orc")
  .save("/tmp/myapp.orc/")
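
If you also want to query the output from Hive, one option (my addition, not part of the original answer; the column list is a placeholder you would match to your DataFrame's schema) is to define an external table over the ORC directory:

-- external table pointing at the directory the Spark job wrote
CREATE EXTERNAL TABLE myapp_orc (
  id   INT,
  name STRING
)
STORED AS ORC
LOCATION '/tmp/myapp.orc/';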
 
						
					

09-26-2016 01:56 PM
1 Kudo

I've found two places so far where the operation logs are being used:

- Ambari Hive View -> Logs tab: normally displays output like the number of mappers/reducers. After disabling operation logging, the Logs tab displays nothing.
- Beeline connections: after executing a query, some output is normally displayed. After disabling operation logging, there is no output except the results of the query.

It would be nice to get these back without compromising HiveServer2!
						
					

09-19-2016 02:27 PM

In Hive, there is an option to enable or disable operation logging: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-HiveServer2Logging

What are these logs for?

Some background on why I'm asking: we recently ran into an issue where HiveServer2 was crashing due to a huge number of open operation log files (same issue as https://community.hortonworks.com/questions/48351/hiveserver2-hive-users-nofile-ulimit-above-64000.html). The files all appear to be empty AND they are just being kept open indefinitely. We are going to disable the logs by setting hive.server2.logging.operation.enabled to false, but we want to know the impact of doing so.
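
For reference (my addition, not from the original question): you can check the current value from a Hive/beeline session before changing anything; the property itself is a server-side setting, so as far as I know the actual change goes into hive-site.xml and takes effect after a HiveServer2 restart.

-- prints the current value of the property for this session
SET hive.server2.logging.operation.enabled;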
						
					

Labels: Apache Hive

08-31-2016 10:32 PM

@Pengqing Bao I haven't run into that error yet. Maybe something is up with the schema? If you import the table to HDFS, are you able to load it into your Hive table and query it without issues?
						
					

08-30-2016 03:38 PM
1 Kudo

@SBandaru Like @Frank Lu mentioned, there are some things that are not supported when using TDCH with Sqoop. For example, creating the schema automatically with "sqoop import" doesn't work. Instead, use "sqoop create-hive-table" to create the schema first and then do "sqoop import". For more info see: https://community.hortonworks.com/content/kbentry/53531/importing-data-from-teradata-into-hive.html
						
					

08-29-2016 06:14 AM
6 Kudos

							 
Overview

There are a couple of different options for importing data from Teradata into Hadoop:

- Sqoop and the Teradata JDBC driver (documentation)
- Hortonworks Connector for Teradata (documentation)
- Teradata Connector for Hadoop (TDCH) (download README here or get the PDF doc)

This article will explore examples of each of the three options above, along with some of their pros and cons. The aim is to help you understand when to use each option, keeping the following in mind:

- The goal is to create a schema which is as close to Teradata as possible.
- Hive tables will usually end up as ORC but may need intermediate tables stored as TEXTFILE. The examples below use TEXTFILE.
- I have not tested all possible column types. If you have more information, please let me know.
- Versions tested with: HDP 2.3.2, Sqoop 1.4.6, TDCH 1.4.1 or 1.4.4.
	All of the examples will use the following common environment variables which you would define at the top of a bash script. The variables are plugged in to the various commands below: 
HADOOP_CLASSPATH=$(hcat -classpath)
export HADOOP_CLASSPATH
SQOOP_HOME=/usr/hdp/current/sqoop-client
TDUSER=myuser
#TDPASS=****** <-- use an env var instead?
TDHOST=mytdhost
TDDB=METRICS
TDTABLE=ERD
HIVEDB=thirty_day_tables
HIVETABLE=SQOOP_ERD
JDBCURL=jdbc:teradata://$TDHOST/DATABASE=$TDDB
TDCHJAR=/usr/lib/tdch/1.4/lib/teradata-connector-1.4.4.jar
TDCHJARS=$TDCHJAR,/usr/lib/tdch/1.4/lib/tdgssconfig.jar,/usr/lib/tdch/1.4/lib/terajdbc4.jar
TDWHERE="STATUS IN ('A','B','C') and GROUP = 'USA'"
  
	Sqoop and the Teradata JDBC driver  
	Import schema and data from Teradata to Hive using plain sqoop and JDBC.  
Required components:

- Sqoop (comes with HDP out of the box)
- Teradata JDBC driver (download from Teradata)

	Example command: 
 sqoop import \
-libjars ${LIB_JARS},${TDCHJARS} \
--driver com.teradata.jdbc.TeraDriver \
--connect $JDBCURL \
--table $TDTABLE \
--hive-table ${HIVEDB}.${HIVETABLE} \
--where "${TDWHERE}" \
--username $TDUSER \
--password $TDPASS \
--map-column-java EFF_TSP=String \
--num-mappers 1 \
--map-column-hive EFF_TSP=STRING \
--hive-import
  
Pros

- This only requires the Teradata JDBC driver to be installed, so it is easy to get started with.
- The schema can be created for you automatically (although with issues... see cons).

Cons

- Plain JDBC doesn't have some of the smarts that are built in to TDCH.
- Some fields are not supported and will have to be mapped. For example, TIMESTAMP and DATE fields in Teradata will get mapped to Hive STRING (one way to convert them back when moving to ORC is sketched after this list).
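
Since the stated goal is to land the data in TEXTFILE first and convert to ORC afterwards, and since TIMESTAMP/DATE columns arrive as STRING with this approach, the conversion step might look like the sketch below. This is my own illustration, not part of the original workflow: the target table name and column list are placeholders, and only EFF_TSP and the database/table names come from the variables used in this article.

-- one-time conversion of the TEXTFILE staging table to ORC;
-- EFF_TSP was imported as STRING, so cast it back to TIMESTAMP here
CREATE TABLE thirty_day_tables.sqoop_erd_orc
STORED AS ORC
AS
SELECT
  CAST(eff_tsp AS TIMESTAMP) AS eff_tsp,
  status                     -- ...plus the rest of the columns
FROM thirty_day_tables.sqoop_erd;
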
	Hortonworks Connector for Teradata  
	Import data from Teradata to Hive using sqoop and the Hortonworks Connector for Teradata. The Hortonworks Connector for Teradata wraps TDCH.  
Required components:

- Sqoop (comes with HDP out of the box)
- Hortonworks Connector for Teradata (download addon); comes with version 1.4.1 of TDCH and the Teradata JDBC drivers installed into SQOOP_HOME/lib

If you want the schema created, it is a separate command:

sqoop create-hive-table \
-libjars ${LIB_JARS} \
--connect $JDBCURL \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username $TDUSER \
--password $TDPASS \
--table $TDTABLE \
--map-column-hive EFF_TSP=STRING \
--hive-table ${HIVEDB}.${HIVETABLE}
  
	Example command (requires the schema to have been created in hive already): 
 sqoop import \
-libjars ${LIB_JARS} \
-Dtdch.input.teradata.conditions="${TDWHERE}" \
--connect $JDBCURL \
--connection-manager org.apache.sqoop.teradata.TeradataConnManager \
--username $TDUSER \
--password $TDPASS \
--table $TDTABLE \
--hive-import \
--hive-table ${HIVEDB}.${HIVETABLE} \
--num-mappers 1  

Pros

- Use sqoop but get some of the smarts that come with TDCH.
- Can create the Hive schema using sqoop create-hive-table.

Cons

- Can't create the Hive schema in one step, but it is possible in two steps using create-hive-table and then import.
- Dates and timestamps are converted to STRING automatically or throw an error, which requires using map-column-hive.
- Some Sqoop options are not supported in this mode, like --where and --hive-overwrite. If supported in TDCH, you need to specify the Java property instead, which gets messy sometimes.
- The version of TDCH included is out of date (the newest available is TDCH 1.4.4, and version 1.4.1 is included right now).
	Teradata Connector for Hadoop (TDCH)  
	Import data from Teradata to Hive using TDCH.  
Required components:

- Teradata Connector for Hadoop (TDCH)

	Example command (requires the schema to have been created in hive already): 
 hadoop jar $TDCHJAR com.teradata.connector.common.tool.ConnectorImportTool \
-libjars $LIB_JARS \
-url ${JDBCURL} \
-username $TDUSER \
-password $TDPASS \
-jobtype hive \
-fileformat textfile \
-nummappers 1 \
-sourcetable ${TDTABLE} \
-sourceconditions "${TDWHERE}" \
-targetdatabase ${HIVEDB} \
-targettable ${HIVETABLE}
  
Pros

- TDCH supports a bunch of different methods for getting data in and out of Teradata.

Cons

- Can't use TDCH to create the Hive schema automatically.
						
					

08-26-2016 06:42 PM
1 Kudo

Alternatively, you can use beeline instead of the Hive CLI. Here is an example using beeline to run a file with a parameter:

beeline -u "jdbc:hive2://master01:2181,master02:2181,master03:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -f file.hql --hivevar HDFSDIR=/tmp/folder

Contents of file.hql:

USE myhivedb;
-- a comment
LOAD DATA INPATH '${HDFSDIR}/browser.tsv' OVERWRITE INTO TABLE browser;
-- other queries
						
					

08-26-2016 06:33 PM

@SBandaru We have been struggling with this as well. I documented some of what we've found so far here: Importing data from Teradata into Hive. Even if you get the schema created automatically with Sqoop, the column types are pretty bad (all dates/timestamps get converted to STRING, as an example).
						
					

08-23-2016 04:53 PM

+1, small correction: the HDFS directories will be under "/user", not "/usr":

hdfs dfs -ls /user
						
					