Member since 06-20-2016

488 Posts · 433 Kudos Received · 118 Solutions

        My Accepted Solutions
| Title | Views | Posted | 
|---|---|---|
| | 3604 | 08-25-2017 03:09 PM |
| | 2512 | 08-22-2017 06:52 PM |
| | 4197 | 08-09-2017 01:10 PM |
| | 8977 | 08-04-2017 02:34 PM |
| | 8949 | 08-01-2017 11:35 AM |

Posted 12-19-2016 01:24 PM · 3 Kudos

Sequence files are binary files containing key-value pairs. They can be compressed at the record (key-value pair) level or at the block level. They are typically written and read through the Hadoop Java API, though Sqoop can also output sequence files. Because they are binary, they read and write faster than text-formatted files.

The small-file problem arises because the namenode keeps metadata for every file in memory, so referencing large numbers of small files creates memory overhead. "Large" is relative, but if, for example, you ingest many small files daily, over time you will pay that price in namenode memory. MapReduce also operates on blocks of data: when files hold less than a block, the job spins up more mappers (each with startup overhead) than it would for files spanning a block or more.

Sequence files can solve the small-file problem when used in the following way: write one sequence file holding many key-value pairs, where each key is a unique piece of file metadata (such as the ingest filename, or filename plus timestamp) and each value is the content of the ingested file. You then have a single splittable file holding many ingested files as key-value pairs. If you loaded it into Pig, for example, and grouped by key, each file's content would be its own record. Sequence files are often used in custom-written MapReduce programs.

As with any file-format decision, you need to understand what problem you are solving by choosing a particular format for a particular use case. If you are writing your own MapReduce programs, and especially if you repeatedly ingest many small files (and perhaps also want to process the ingest metadata as well as the contents), sequence files are a good fit. If, on the other hand, you want to load the data into Hive tables (especially where most queries touch only subsets of columns), you are better off landing the small files in HDFS, merging them and converting to ORC, and then deleting the landed small files.
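To make that ingest pattern concrete, here is a minimal sketch using the Hadoop Java API; the class name, command-line arguments, paths, and compression choice are illustrative assumptions, not part of the original answer.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Packs a local directory of small files into one SequenceFile in HDFS:
 * key = ingest filename, value = raw file content.
 */
public class SmallFilePacker {

    public static void main(String[] args) throws IOException {
        File inputDir = new File(args[0]);   // local directory of small ingested files
        Path output = new Path(args[1]);     // e.g. a .seq path in HDFS

        Configuration conf = new Configuration();
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // block-level compression compresses runs of records together
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            for (File f : inputDir.listFiles()) {
                byte[] content = Files.readAllBytes(f.toPath());
                // one key-value pair per ingested small file
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

A downstream MapReduce or Pig job then reads the file back as (filename, content) records, so each original small file is still addressable by its key.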
						
					
Posted 12-19-2016 12:12 PM

@Kumar, as the tests by me and @Devin Pinkston show, it is the actual file content you need to look at (not the UI). So there is no fear of the processor adding a newline.
						
					
Posted 12-18-2016 06:13 PM

Minor side note: you can use TailFile with a polling interval as an alternative to GetFile.
						
					
Posted 12-18-2016 05:36 PM · 1 Kudo

FetchFile does not append a newline character if one is not present in the original file. When I run ListFiles -> FetchFile -> PutFile and view the results in a text editor, the file content is identical (including newline characters visible in the editor) across the following:

- the original file on disk
- the file content downloaded from FetchFile (Provenance/Contents)
- the file content downloaded from PutFile (Provenance/Contents)
- the file put to disk

If you are not getting this behavior, please provide details.
						
					
Posted 12-16-2016 06:55 PM

If you are using Cloudbreak to deploy HDP to Azure, anything you deploy on premises can be deployed identically to Azure (and also to AWS, Google Cloud, or OpenStack). This is the IaaS model, where only the infrastructure is virtualized in the cloud, not what is deployed on it. (The PaaS model additionally abstracts the deployed components as virtualized services and thus is not identical to an on-premises deployment; HDInsight on Azure is PaaS.)

http://hortonworks.com/apache/cloudbreak/
http://hortonworks.com/products/cloud/azure-hdinsight/
						
					
Posted 12-16-2016 06:09 PM

							 What is the largest load (MBs or GBs) you have run your use case on? 
						
					
Posted 12-16-2016 05:38 PM

You should change the Zeppelin port to something besides 8080 (9995 or 9999 are typical). Then localhost:8080 should bring up Ambari. These instructions are from https://zeppelin.apache.org/docs/0.6.0/install/install.html: you can configure Apache Zeppelin with either environment variables in conf/zeppelin-env.sh (conf\zeppelin-env.cmd for Windows) or Java properties in conf/zeppelin-site.xml. If both are defined, the environment variable takes priority.

| zeppelin-env.sh | zeppelin-site.xml |
|---|---|
| ZEPPELIN_PORT | zeppelin.server.port |
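For example, assuming you choose 9995: either set export ZEPPELIN_PORT=9995 in conf/zeppelin-env.sh, or set the zeppelin.server.port property to 9995 in conf/zeppelin-site.xml (pick one, since the environment variable wins if both are set), then restart Zeppelin and browse to localhost:9995.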
						
					
Posted 12-16-2016 01:47 PM

That is really a question of scaling (how many nodes, and how much memory per node, you have) and multitenancy (which other jobs run at the same time, particularly Spark or other memory-intensive jobs). The more nodes and the less memory contention you have, the more data you can process in Spark.
						
					
Posted 12-16-2016 11:13 AM · 2 Kudos

Your first screenshot shows the local file system, not HDFS. To see the files in HDFS from the command line you have to run

hdfs dfs -ls path

If path does not start with /, it is relative to the home directory of the user you are logged in as; in your case that is /user/root in HDFS. Similarly, on the NiFi side, if the path in PutHDFS does not start with /, it puts to HDFS under the nifi user (I think it will be /user/nifi/Nifi). It is a best practice to specify HDFS paths explicitly (i.e., starting with /).

You can use the HDFS explorer UI to navigate the HDFS file system, and you can also use the Ambari Files view (just log into Ambari, go to Views in the upper right, then Files View). See the following links for more:

http://www.dummies.com/programming/big-data/hadoop/hadoop-distributed-file-system-shell-commands/
https://hadoop.apache.org/docs/r2.6.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
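For example, running hdfs dfs -ls Nifi as root resolves to /user/root/Nifi, while the PutHDFS processor running as the nifi user would have written to /user/nifi/Nifi, which you can check with hdfs dfs -ls /user/nifi/Nifi (the relative directory name Nifi here is taken from your processor configuration; substitute whatever value you used).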
						
					
Posted 12-15-2016 09:23 PM

I suggest looking at the merge and saveAsTextFile functions, as in the bottom post here: http://stackoverflow.com/questions/31666361/process-spark-streaming-rdd-and-store-to-single-hdfs-file
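As an illustration of the coalesce-then-save idea, here is a minimal sketch in the Spark 2.x Java API; it is not code from the linked answer, and the socket source, batch interval, and output path are placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

/**
 * Sketch: write each streaming micro-batch as a single file in HDFS by
 * coalescing the batch RDD to one partition before saving.
 */
public class SingleFilePerBatch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("single-file-per-batch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));

        // Placeholder source; replace with your actual input stream.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        lines.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                // One partition -> one part file per micro-batch.
                rdd.coalesce(1)
                   .saveAsTextFile("/data/stream/batch-" + System.currentTimeMillis());
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```

Each micro-batch then lands as a directory containing a single part-00000 file; if you need one file across all batches, a periodic merge step (for example with Hadoop's FileUtil.copyMerge) covers the other half of the approach.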
						
					