Member since 01-14-2019

144 Posts | 48 Kudos Received | 17 Solutions
My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 1745 | 10-05-2018 01:28 PM |
| | 1361 | 07-23-2018 12:16 PM |
| | 1672 | 07-23-2018 12:13 PM |
| | 7954 | 06-25-2018 03:01 PM |
| | 5920 | 06-20-2018 12:15 PM |
03-27-2018 02:34 PM
@swathi thukkaraju The pipe is a special character in regular expressions, and String.split(String) treats its argument as a regex, so split pipe-delimited strings on the character literal (single quotes) instead:

val df1 = sc.textFile("testfile.txt").map(_.split('|')).map(x => schema(x(0).toString, x(1).toInt, x(2).toString)).toDF()

Alternatively, you can use commas or another separator. See the following StackOverflow post for more detail: https://stackoverflow.com/questions/11284771/scala-string-split-does-not-work
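To make the difference concrete, here is a minimal, self-contained sketch (the Record case class and sample line are assumed for illustration; they stand in for the schema class in the question):

```scala
// Minimal sketch of why splitting on "|" misbehaves: String.split(String)
// treats its argument as a regex, and "|" alone matches the empty string,
// so the line is split into individual characters.
object PipeSplitDemo {
  // Hypothetical record type standing in for the question's schema class.
  case class Record(name: String, age: Int, city: String)

  def main(args: Array[String]): Unit = {
    val line = "alice|30|Austin"

    println(line.split("|").toList)    // regex split: individual characters
    println(line.split('|').toList)    // char split:  List(alice, 30, Austin)
    println(line.split("\\|").toList)  // escaped regex also works

    val fields = line.split('|')
    println(Record(fields(0), fields(1).toInt, fields(2)))
  }
}
```

The same two working forms, split('|') or split("\\|"), apply inside the sc.textFile(...).map(...) pipeline above.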
						
					
02-26-2018 09:32 PM (2 Kudos)
If you'd like to generate some data to test out the HDP/HDF platforms at a larger scale, you can use the following GitHub repository: https://github.com/anarasimham/data-gen

This will allow you to generate two types of data:

- Point-of-sale (POS) transactions, containing data such as transaction amount, timestamp, store ID, employee ID, part SKU, and quantity of product. These are the transactions you make at a store when you check out. For simplicity's sake, this assumes each shopper buys only one product (potentially with a quantity greater than 1).
- Automotive manufacturing parts production records, simulating the completion of parts on an assembly line. Imagine a warehouse completing different components of a car, such as the hood, front bumper, etc., at different points in time, with those parts being tested against heat and vibration thresholds. This data will contain a timestamp of when the part was produced, thresholds for heat and vibration, measured values for heat and vibration, quantity of the produced part, a "short name" identifier for the part, a notes field, and a part location.

Full details of both schemas are documented in the code in the file datagen/datagen.py at the repository above.

The application can generate data and insert it into one of two supported locations:

- Hive
- MySQL

You will need to configure the table by running one of the scripts in the mysql folder after connecting to the target server and database as the appropriate user. Once that is done, copy the inserter/mysql.passwd.template file to inserter/mysql.passwd and edit it to provide the correct details. If you'd like to insert into Hive, do the same with the hive.passwd.template file. After editing, you can execute the generator with the following command:

python main_manf.py 10 mysql

This will insert 10 rows of manufacturing data into the configured MySQL database table.

At this point, you're ready to explore your data in greater detail. Possible next steps include using NiFi to pull the data out of MySQL and push it into Druid for a dashboard-style data lookup workflow, or pushing it into Hive for ad-hoc analyses. These activities are out of scope for this article, but they are suggestions to think about.
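As a quick, hedged follow-up (not part of the repository's documentation), one way to sanity-check the inserted rows is to read the MySQL table back with Spark over JDBC. The host, database, table name, and credentials below are placeholders for whatever your mysql folder script created:

```scala
// Hedged sketch: read the generated rows back from MySQL with Spark for an
// ad-hoc look. Host, database, table, and credentials are placeholders.
import org.apache.spark.sql.SparkSession

object DataGenCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-gen-check")
      .getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://<MYSQL_HOST>:3306/<DATABASE>")
      .option("driver", "com.mysql.jdbc.Driver")   // MySQL JDBC driver must be on the classpath
      .option("dbtable", "<MANUFACTURING_TABLE>")  // table created by the mysql folder script
      .option("user", "<USERNAME>")
      .option("password", "<PASSWORD>")
      .load()

    df.printSchema()     // columns should match datagen/datagen.py
    println(df.count())  // should reflect the number of rows inserted
    spark.stop()
  }
}
```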
						
					
02-26-2018 07:48 PM
Serialization is the process by which data is written to disk or transmitted somewhere. Different applications serialize data in different ways to optimize for a specific outcome, whether that is read or write performance. As the Hive language manual explains, integers and strings are encoded to disk and compressed in different ways, and it lists the rules it uses to do so. For example, variable-width encoding optimizes the space usage of the data because it uses fewer bytes to encode smaller values.

See the following Wikipedia article for more detail: https://en.wikipedia.org/wiki/Serialization
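To illustrate the variable-width idea, here is a small, self-contained sketch of a varint-style encoder (this is only an illustration of the concept, not Hive's actual on-disk format):

```scala
// Illustrative sketch of variable-width integer encoding: 7 data bits per
// byte, with the high bit marking "another byte follows". Smaller values
// therefore occupy fewer bytes than a fixed 8-byte long.
object VarIntDemo {
  def encode(value: Long): Array[Byte] = {
    require(value >= 0, "sketch handles non-negative values only")
    val out = scala.collection.mutable.ArrayBuffer[Byte]()
    var v = value
    while (v >= 0x80) {
      out += ((v & 0x7F) | 0x80).toByte
      v >>>= 7
    }
    out += v.toByte
    out.toArray
  }

  def main(args: Array[String]): Unit = {
    Seq(7L, 300L, 1000000L).foreach { n =>
      println(s"$n encodes to ${encode(n).length} byte(s)")
    }
    // 7 -> 1 byte, 300 -> 2 bytes, 1000000 -> 3 bytes
  }
}
```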
						
					
02-06-2018 07:23 PM
No, there shouldn't be any special HDP policies. Perhaps you are running up against an upper quota on cores, RAM, or disk set by either AWS or your organization?
						
					
02-05-2018 09:21 PM
@Malay Sharma

If you are writing a new file to HDFS and trying to read from it at the same time, your read operation will fail with a 'File does not exist' error message until the file write is complete.

If you are writing to a file via the 'appendToFile' command and try to read it mid-write, your command will wait until the file is updated and then read the new version of it. In the case of tail, it will stream out the entire contents that you are appending instead of only the last few lines.
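For reference, a minimal sketch of the write/append/read operations being discussed, using the Hadoop FileSystem API (the path and cluster configuration are assumed; this is illustrative, not from the original answer):

```scala
// Sketch of creating, appending to, and reading an HDFS file with the
// Hadoop FileSystem API. Until close() completes on a writer, a concurrent
// reader may see "File does not exist" or an incomplete last block.
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()              // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)
    val path = new Path("/tmp/demo/readme.txt") // hypothetical path

    // Write a new file.
    val out = fs.create(path, true)
    out.write("first line\n".getBytes(StandardCharsets.UTF_8))
    out.close()

    // Append to the existing file (appends must be enabled on the cluster).
    val app = fs.append(path)
    app.write("second line\n".getBytes(StandardCharsets.UTF_8))
    app.close()

    // Read it back once the writes are complete.
    val in = fs.open(path)
    println(scala.io.Source.fromInputStream(in).mkString)
    in.close()
  }
}
```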
						
					
02-05-2018 08:36 PM
@Bob Thorman

According to your stack trace, you may not have the requisite permissions to perform this operation. Please check your AWS user permissions and make sure you have enough capacity to allocate the cluster you are requesting.

cloudbreak_1 | 2018-02-01 21:00:13,056 [reactorDispatcher-9] accept:140 DEBUG c.s.c.c.f.Flow2Handler - [owner:a290539e-7056-4492-8831-23d497654084] [type:STACK] [id:7] [name:storm] flow control event arrived: key: SETUPRESULT_ERROR, flowid: 384a1d99-4eba-4e16-ba29-5e71534c852a, payload: CloudPlatformResult{status=FAILED, statusReason='You are not authorized to perform this operation.', errorDetails=com.sequenceiq.cloudbreak.cloud.exception.CloudConnectorException: You are not authorized to perform this operation., request=CloudStackRequest{, cloudStack=CloudStack{groups=[com.sequenceiq.cloudbreak.cloud.model.Group@c94202e, com.sequenceiq.cloudbreak.cloud.model.Group@6491d02d, com.sequenceiq.cloudbreak.cloud.model.Group@43807e4d, com.sequenceiq.cloudbreak.cloud.model.Group@4f5952db, com.sequenceiq.cloudbreak.cloud.model.Group@62e19e4e],
						
					
12-06-2017 05:12 PM
If you're using Ambari 2.5.2, you should be able to install NiFi on the same cluster using the HDF management pack: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_installing-hdf-and-hdp/content/ch_install-mpack.html

Yes, you can automate the job with NiFi; you'll have to create a way to query your SFTP endpoint for incremental changes and then fetch those new files.
						
					
12-06-2017 03:20 PM (3 Kudos)
You will need to use the GetHDFS processor to retrieve the file and then the InvokeHTTP processor to send the data to an HTTP endpoint. Data format shouldn't matter; a binary sequence is being transmitted, so unless you need to parse the data before transmission it can be anything.

If you are dealing with a large file, you may want to split it, as you could run into memory limitations. You will have to split it into manageable chunks before transmission and join them afterwards.
						
					
12-06-2017 03:04 PM (1 Kudo)
If you have an HDF cluster running, you can create a NiFi flow to accomplish this. Otherwise you will need a client to download the file first before importing it into HDFS.
						
					
11-02-2017 05:31 PM (3 Kudos)
Assumptions:

- You have a running HDP cluster with Sqoop installed
- Basic knowledge of Sqoop and its parameters

Ingesting SAP HANA data with Sqoop

To ingest SAP HANA data, all you need is a JDBC driver. To the HDP platform, HANA is just another database: drop the JDBC driver in and you can plug and play.

1. Download the JDBC driver. This driver is not publicly available; it is only available to customers using the SAP HANA product. Find it on their members-only website and download it.

2. Drop the JDBC driver into Sqoop's lib directory. For me, this is located at /usr/hdp/current/sqoop-client/lib

3. Execute a Sqoop import. This command has many variations and many command-line parameters, but the following is one such example:

sqoop import --connect "jdbc:sap://<HANA_SERVER>:30015" --driver com.sap.db.jdbc.Driver --username <YOUR_USERNAME> --password <PASSWORD> --table "<TABLE_NAME>" --target-dir=/path/to/hdfs/dir -m 1 -- --schema "<YOUR_SCHEMA_NAME>"

The '-m 1' argument limits Sqoop to a single mapper, so don't use it if you want parallelism. You'll need to use the --split-by argument and give it a column name to be able to parallelize the import work.

If all goes well, Sqoop should start importing the data into your target directory.

Happy Sqooping!
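As a hedged aside (not part of the original article), the same parallelism idea also applies if you pull the table with Spark over JDBC instead of Sqoop: partitionColumn plays the role of --split-by. Server, credentials, schema, table, and the partition column below are placeholders:

```scala
// Hedged sketch: partitioned JDBC read of a SAP HANA table with Spark.
// The partitionColumn here is analogous to Sqoop's --split-by; the SAP JDBC
// driver jar must be on the Spark classpath. All names are placeholders.
import org.apache.spark.sql.SparkSession

object HanaParallelReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hana-parallel-read")
      .getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sap://<HANA_SERVER>:30015")
      .option("driver", "com.sap.db.jdbc.Driver")
      .option("dbtable", "<YOUR_SCHEMA_NAME>.<TABLE_NAME>")
      .option("user", "<YOUR_USERNAME>")
      .option("password", "<PASSWORD>")
      .option("partitionColumn", "<NUMERIC_ID_COLUMN>")  // like --split-by
      .option("lowerBound", "1")
      .option("upperBound", "1000000")                   // rough min/max of the column
      .option("numPartitions", "4")                      // degree of parallelism
      .load()

    df.write.mode("overwrite").parquet("/path/to/hdfs/dir")  // similar to --target-dir
    spark.stop()
  }
}
```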
						
					