Member since: 01-11-2016

355 Posts | 232 Kudos Received | 74 Solutions

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 9262 | 06-19-2018 08:52 AM |
| | 3911 | 06-13-2018 07:54 AM |
| | 4562 | 06-02-2018 06:27 PM |
| | 5264 | 05-01-2018 12:28 PM |
| | 6810 | 04-24-2018 11:38 AM |
			
    
	
		
		
04-29-2016 08:16 AM

Hello,

If by pseudo mode you mean having a cluster on one machine, then you have 3 options:

- Sandbox: this is the easiest option, since you download and run a VM that contains all the components already installed and configured (here).
- Use Ambari and Vagrant to create several VMs and install a cluster across them. Guide here.
- If you want to install HDP directly on the machine without virtualization, then you need to follow this installation guide. It will walk you through installing Ambari and the Ambari agent on your machine and then installing all the other components on the same machine.
						
					
04-29-2016 07:42 AM - 1 Kudo

You can also use HDFS snapshots to protect data from user errors: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
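As a minimal sketch, here is how a snapshot can be taken programmatically through the Hadoop FileSystem API; the directory and snapshot name below are hypothetical, and an HDFS administrator must first allow snapshots on the directory:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Hypothetical directory; an admin must have run
// `hdfs dfsadmin -allowSnapshot /data/important` beforehand.
val dir = new Path("/data/important")

// Create a read-only, point-in-time view of the directory.
val snapshot = fs.createSnapshot(dir, "before-cleanup")
println(s"Snapshot created at $snapshot")

// Accidentally deleted files can later be copied back from
// /data/important/.snapshot/before-cleanup/...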
						
					
04-28-2016 05:46 AM - 1 Kudo

Hi @Andrew Sears,

Here are the Hive operations that are captured in Atlas 0.6: create database, create table, create view, CTAS, load, import, export, query, alter table rename and alter view rename (source here). The operations that are not supported in v0.6 are not supported in v0.5 either. A new version (v0.7) is available today; it supports more operations in the Hive bridge and introduces new bridges (Sqoop, Falcon and Storm).
						
					
04-27-2016 09:23 PM

@bigdata.neophyte I would not recommend having both physical and virtual nodes in the same cluster. I think it's best to identify your KPIs and choose the best solution. That being said, having VM clusters and physical clusters at the same time can be a good choice for implementing several environments (dev, testing, prod, etc.).
						
					
04-27-2016 09:09 PM - 3 Kudos

Hi @bigdata.neophyte,

Hadoop has been designed to run on commodity hardware. Important concepts such as data locality and horizontal scalability make a physical cluster the first choice for Hadoop today. The pro of this choice is performance; the con is the cost of installing and managing the cluster.

Virtual machines are also used for Hadoop today. VMs backed by central storage (SAN) are not the best choice for performance, since you lose data locality and you have concurrent jobs/tasks accessing the same storage. Some solutions today support dedicating hard disks to VMs, which gives you a good hybrid approach. The pro of VMs is flexibility.

VM Hadoop clusters are usually used for development environments because of that flexibility: it's easy to create and tear down clusters. Physical clusters are usually recommended for production where the application has strong SLAs.

The final choice depends on your use cases, your existing infrastructure, and the resources available to manage your cluster.

I hope this helps.
						
					
04-27-2016 09:01 PM

Hi @Pedro Alves,

You can also use Spark for data cleansing and transformation. The pro is that you use the same tool for data preparation, discovery and analysis/ML.
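As a minimal sketch of what that cleansing step might look like (Spark 2.x DataFrame API assumed; the file path and column names are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cleansing-sketch").getOrCreate()

// Hypothetical input file and columns.
val raw = spark.read.option("header", "true").csv("/data/events.csv")

val cleaned = raw
  .na.drop(Seq("user_id"))                            // drop rows missing a key field
  .withColumn("amount", col("amount").cast("double")) // normalize types
  .dropDuplicates("event_id")                         // remove duplicate events

// The same DataFrame can then feed discovery queries or MLlib pipelines.
cleaned.write.mode("overwrite").parquet("/data/events_clean")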
						
					
04-27-2016 08:49 PM - 1 Kudo

Hi @Roberto Sancho,

You can use Hive or Pig for doing ETL. In HDP, Hive and Pig run on Tez rather than on MapReduce, which gives you much better performance. You can use Spark too, as you stated.
						
					
04-27-2016 08:16 PM - 2 Kudos

							 
Hi @Kirk Haslbeck,

I want to add some information to Paul's excellent answer.

First, tuning ML parameters is one of the hardest tasks of a data scientist, and it's an active research area. In your specific case (LinearRegressionWithSGD), the stepSize is one of the hardest parameters to tune, as stated in the MLlib optimization page here:

Step-size. The parameter γ is the step-size, which in the default implementation is chosen decreasing with the square root of the iteration counter, i.e. γ := s/√t in the t-th iteration, with the input parameter s = stepSize. Note that selecting the best step-size for SGD methods can often be delicate in practice and is a topic of active research.

In a general ML problem, you want to build a data pipeline where you combine several data transformations to clean data and build features, as well as several algorithms, to achieve the best performance. This is a repetitive task where you try several options for each step. You would also like to test several parameter values and choose the best one. For each pipeline, you need to evaluate the combination of algorithms/parameters that you have chosen; for the evaluation you can use things like cross-validation.

Testing these combinations manually can be hard and time consuming. Spark.ml is a package that helps streamline this process. Spark.ml uses concepts such as transformers, estimators and params. The params let you automatically test several values for a parameter and choose the value that gives you the best model. This works by providing a ParamGridBuilder with the different values that you want to consider for each param in your pipeline. For example (note that spark.ml's LinearRegression estimator does not expose SGD's stepSize, so the grid below uses its regParam to illustrate the pattern):

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LinearRegression()
  .setMaxIter(30)
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

Even if your ML problem is simple, I highly recommend looking into the spark.ml library. It can reduce your dev time considerably.

I hope this helps.
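As a follow-up sketch: to actually pick the best value from such a grid, the usual pattern is to hand it to a tuning component such as CrossValidator. This assumes a training DataFrame named training with "features" and "label" columns, reusing the lr and paramGrid built above:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())  // defaults to RMSE on "label"/"prediction"
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Trains one model per fold and parameter combination, keeps the best.
val cvModel = cv.fit(training)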
						
					
04-26-2016 03:59 PM - 3 Kudos

Hi @David Lays,

You have mainly two high-level approaches for data replication:

- Replication in Y (teeing): in this scenario you do the replication at ingestion time, so each new piece of data is stored in both the primary and DR clusters. NiFi is great for this double ingestion. The pro of this method is that you have the data immediately in both clusters. The con is that you only replicate the raw data, not the processing results; if you want the same results on the DR cluster, you need to run the same processing there.
- Replication in L (copying): in this scenario you ingest data into the primary cluster and later copy it to the DR cluster. Tools like DistCp or Falcon can be used to implement this (see the sketch below). The pro is that you can replicate raw data and processing results in the same process. The con is that the DR cluster lags behind in terms of data: the replication is usually scheduled, and if your primary cluster goes down between two runs, you will lose the data generated (ingested or computed) since the last replication.

I hope this helps.
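As a rough illustration of the copying approach, here is a minimal sketch of launching DistCp programmatically through its Hadoop 2.x Java API; the NameNode addresses and paths are hypothetical, and most deployments simply run the hadoop distcp command or schedule it from Falcon/Oozie instead:

import java.util.Collections
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.{DistCp, DistCpOptions}

// Hypothetical NameNode addresses and directory.
val source = new Path("hdfs://primary-nn:8020/data/raw")
val target = new Path("hdfs://dr-nn:8020/data/raw")

// Hadoop 2.x constructor; Hadoop 3 replaces it with DistCpOptions.Builder.
val options = new DistCpOptions(Collections.singletonList(source), target)

// Runs the copy as a MapReduce job on the cluster.
val job = new DistCp(new Configuration(), options).execute()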
						
					