Member since 06-20-2016

488 Posts | 433 Kudos Received | 118 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3604 | 08-25-2017 03:09 PM |
| | 2512 | 08-22-2017 06:52 PM |
| | 4196 | 08-09-2017 01:10 PM |
| | 8977 | 08-04-2017 02:34 PM |
| | 8949 | 08-01-2017 11:35 AM |
			
    
	
		
		
10-10-2016 11:16 PM

@Bibhas Burman That is an excellent tutorial for pushing log data to HDFS for historical analysis. If you want to do real-time streaming analysis, here are two links that should be useful:

http://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/ (ignore the Storm part)

https://community.hortonworks.com/articles/44550/horses-for-courses-apache-spark-streaming-and-apac.html (integrate with the Kafka part from the first link)

Since you are getting your feet wet with the technology, definitely put in some time to play around with it and build small projects before working toward your end product. And of course, anytime you have a question along the way, ask the HCC for guidance.
						
					
		
10-10-2016 08:27 PM

NiFi is very good at capturing logs. Why not use each technology where it is best: NiFi to gather log data in real time -> Kafka queue -> Spark Streaming analytics -> Zeppelin for Spark and visualization. You could also fork the NiFi flow through MergeContent to HDFS to keep the data for historical analysis. All of these technologies come out of the box with HDF and HDP.
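As a rough illustration of the Spark Streaming leg of that pipeline, here is a minimal PySpark sketch. The topic name "weblogs" and broker address kafka-host:6667 are placeholders, and the job assumes the spark-streaming-kafka package is on the classpath when submitted:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="NifiKafkaLogStream")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Direct stream: read the topic partitions straight from the Kafka brokers
stream = KafkaUtils.createDirectStream(
    ssc, ["weblogs"], {"metadata.broker.list": "kafka-host:6667"})

# Each record is a (key, value) pair from Kafka; as a toy analytic,
# count the ERROR lines arriving in each batch
errors = stream.map(lambda kv: kv[1]).filter(lambda line: "ERROR" in line)
errors.count().pprint()

ssc.start()
ssc.awaitTermination()
```

From there the same results can be written back out or explored and charted in Zeppelin, as described in the links above.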
						
					
		
10-07-2016 01:19 PM

These should be helpful:

http://stackoverflow.com/questions/32080475/how-to-read-a-zip-containing-multiple-files-in-apache-spark

http://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark
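The approach in the first link boils down to reading the archives as binary blobs and unpacking them on the executors. A rough PySpark sketch of that idea follows; the HDFS path is a placeholder:

```python
import io
import zipfile
from pyspark import SparkContext

sc = SparkContext(appName="ReadZipsExample")

def unzip_lines(name_and_bytes):
    """Yield the text lines of every entry inside one zip archive."""
    name, content = name_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for entry in zf.namelist():
            for line in zf.read(entry).decode("utf-8").splitlines():
                yield line

# binaryFiles returns (path, bytes) pairs -- one per zip file matching the path
lines = sc.binaryFiles("hdfs:///data/zips/*.zip").flatMap(unzip_lines)
print(lines.take(5))
```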
						
					
		
10-07-2016 11:45 AM
1 Kudo

If you are using HDP, all of the tools discussed below are deployed when you install the distribution.

Store your data

Definitely store your data in Hadoop. Spend some time thinking about how you will organize this from a file system perspective. http://hortonworks.com/apache/hdfs/

Sqoop is a fast and effective way to pull your data from relational databases into Hadoop. http://hortonworks.com/apache/sqoop/

Clean your data

You may need to clean or transform the data after it has landed in Hadoop, e.g. trimming leading and trailing whitespace or removing non-ASCII characters. Pig scripts can do this quickly and effectively. If you do have to clean the data, keep the raw data in one zone (HDFS directory) and write the cleaned data to a destination zone. http://hortonworks.com/apache/pig/

Analyze and visualize your data

You most likely will want to use Spark to do your predictive analysis. Spark is deployed with HDP. It is an in-memory processing engine with libraries to easily perform SQL and machine learning/predictive analysis against your data. Being in-memory, analysis of GBs of data is very rapid. These libraries are accessed with Java, Scala, or Python APIs. (There are also streaming and graph capabilities, but it looks like you will not need these for your analysis.) https://hortonworks.com/apache/spark/

Zeppelin is an awesome UI for performing Spark analyses. It is a notebook-style UI -- browser based and composed of separate "paragraphs", which are areas to perform separate steps of your analysis. Each paragraph is loaded with an interpreter. These interpreters allow you to write shell commands directly against the Linux box hosting the Zeppelin server, or to perform your predictive analysis using Spark's SQL and machine learning/predictive libraries. Zeppelin also has easy-to-use visualization capabilities. https://hortonworks.com/apache/zeppelin/

You may want to use Hive to perform complex SQL against your data. Hive is a SQL engine on Hadoop that is very effective at analyzing huge volumes of both structured and unstructured data. (Spark can reach limits on huge data sizes.) For example, you can analyze tweets where fields in the Hive table are JSON strings, or you can do complex joins across multiple tables. Hive is not as fast as Spark, but it is solid against any volume of data and complexity of query. Having said that, Hive performance has increased greatly in the past few years, largely through the Tez engine, the ORC file format, and in-memory LLAP. You can build Hive tables from Spark and analyze from both, or you can build Hive tables through Hive and also analyze them in Spark. http://hortonworks.com/apache/hive/

General

As mentioned, all of the above tools come out of the box with HDP (current version is 2.5). You can run your analysis from either a browser-based UI (Zeppelin, Ambari views) or from the command line on a server in the cluster (you may want to set up a specialized "edge node" to perform analysis from the command line).

Your Approach

It sounds like you are about to launch a very large project. Be sure to start small by working with small samples of your data to learn the technology and to understand how best to design the way you store and analyze the data. You can get a quick start by downloading the sandbox and following tutorials. http://hortonworks.com/products/sandbox/?gclid=CjwKEAjwj92_BRDQ-NuC98SZkWYSJACWmjhlzsGZqc3fexfPwVWKFOOLOUf__SAlbb1JVpafHxq5bxoC3-Hw_wcB
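To make the Spark + Zeppelin piece concrete, here is a tiny sketch of the kind of analysis paragraph described above. It assumes Spark 2.x, and the HDFS path, table name, and columns are made up purely for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SalesExploration").getOrCreate()

# Load the cleaned data from its HDFS zone into a DataFrame
sales = spark.read.csv("hdfs:///data/clean/sales.csv",
                       header=True, inferSchema=True)

# Expose it to SQL, then aggregate -- in Zeppelin the result can be
# rendered as a table or chart with the built-in visualizations
sales.createOrReplaceTempView("sales")
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""").show()
```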
						
					
		
10-05-2016 04:23 PM
3 Kudos

@Jonas Carson Using custom property files should solve your needs perfectly. The bottom line is that you configure your processor with NiFi Expression Language that references the custom property, e.g. ${my.cust.prop.name}. Each environment has its own instance of the custom property file, with the same property names as the files deployed to the other environments but values specific to that environment.

To implement this, open the nifi.properties file and set the field nifi.variable.registry.properties to a comma-delimited list of paths to custom property files. Be sure to make your property names unique if you are using more than one property file in the same environment. Also, they must be unique from system and environment properties.

See the following links for more information:

https://community.hortonworks.com/articles/57304/supporting-custom-properties-for-expression-langua.html

https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Using_Custom_Properties

Note: You can also refer to system variables and environment variables the same way: ${system.variable.name}
						
					
		
10-05-2016 12:49 PM
1 Kudo

Use --outputformat=dsv

The delimiter for this output format is configurable, but the default is a pipe, so the above should be sufficient for your needs. If you want to use something else as the delimiter, add --delimiterForDSV=DELIMITER

For more details, see: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Separated-ValueOutputFormats

(If this is what you are looking for, let me know by accepting the answer. Else, let me know the gaps in the answer.)
						
					
		
10-05-2016 12:11 PM

Command Line

If these three files are in the same directory, run the following from the command line of a server in the cluster. It will merge the files into one file and store it locally:

hdfs dfs -getmerge <hdfsDir> <localFile>

where <hdfsDir> is the directory holding the files on HDFS and <localFile> is the name of the merged file that will be stored locally.

If you are talking about a directory structure that looks like this in HDFS:

myFile.txt/_SUCCESS
myFile.txt/part-m-00000
myFile.txt/part-m-00001

this is the result of a map-reduce job. <hdfsDir> in this case would be myFile.txt. Note that _SUCCESS is a 0-byte file: it has no contents -- it is just a flag to designate that the map-reduce job was a success.

Ambari

Alternatively, you can do this from the File View in Ambari. Open the directory holding the files you want to merge into one, check the files you want to merge, and then click Concatenate from the far-right dropdown. This will download the merged (concatenated) file through your browser.

Note for both approaches: the above works for multiple files in the same directory even if the files are not the result of a map-reduce job (but it is typically used for map-reduce results).

(If this is what you were looking for, please let me know by accepting the answer. Else, let me know the gaps in the answer.)
						
					
		
10-04-2016 11:24 AM
1 Kudo

If you do not have to worry about partitions, it is as you state:

INSERT OVERWRITE TABLE old_data SELECT <statement> FROM new_data;

If you have a partition, you must specify it:

INSERT OVERWRITE TABLE old_data PARTITION (id = <value>) SELECT <statement filtering by id> FROM new_data;

Note that the SELECT statement has to select the same columns, in the same order, as those you are inserting into.

See the following for more color:

https://community.hortonworks.com/questions/28683/overwriting-a-column-in-hive.html

https://community.hortonworks.com/questions/5579/insert-overwrite-of-2-gb-data.html

https://community.hortonworks.com/questions/49967/insert-overwrite-running-too-slow-when-inserting-d.html
						
					
		
10-04-2016 03:05 AM

@Seyma Menjour Glad to hear you are flying through the technology stack with such ease 🙂 BTW, one little trick with Zeppelin is that you can hide either the command or the output. It is a small touch, but hiding the Zeppelin commands after you run them can make story-telling to non-tech folks more direct -- you only see the visualizations.
						
					
		
10-04-2016 02:23 AM
1 Kudo

That tutorial has been completely replaced with visualization by Zeppelin. If you really want to try to use it again, you can find a copy here, but all the links to images are broken: https://github.com/hortonworks/tutorials/blob/f5f97f40807157891c2c9c85e279182d44fdc1ee/tutorials/hortonworks/hello-hdp-an-introduction-to-hadoop/hello-hdp-section-9.md

This is taken from an old GitHub repo: https://github.com/hortonworks/tutorials/tree/f5f97f40807157891c2c9c85e279182d44fdc1ee/tutorials/hortonworks/hello-hdp-an-introduction-to-hadoop

That is the best you will find with this old tutorial. The bottom line, though, is that Zeppelin is an exciting new analytics and visualization tool that is worth investing your time in learning. It is the current and future direction of Big Data analytics and visualization for most data discovery, exploration, and story-telling activities. Check it out: http://hortonworks.com/apache/zeppelin/

(If this answer is what you are looking for, please let me know by accepting the answer.)
						
					