Member since: 01-11-2016
Posts: 355
Kudos Received: 230
Solutions: 74
My Accepted Solutions
Views | Posted |
---|---|
8190 | 06-19-2018 08:52 AM |
3147 | 06-13-2018 07:54 AM |
3574 | 06-02-2018 06:27 PM |
3879 | 05-01-2018 12:28 PM |
5399 | 04-24-2018 11:38 AM |
04-29-2016
08:16 AM
Hello, if by pseudo mode you mean running a cluster on a single machine, then you have 3 options:
Sandbox: this is the easiest way, since you download and run a VM that contains all the components already installed and configured (here).
Ambari and Vagrant: create several VMs and install a cluster on them. Guide here.
Direct installation: if you want to install HDP directly on the machine without virtualization, then you need to follow this installation guide. It will help you install Ambari and the Ambari agent on your machine and then install all the other components on the same machine.
04-29-2016
07:42 AM
1 Kudo
You can also use HDFS snapshots to protect data from user errors: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
04-28-2016
05:46 AM
1 Kudo
Hi @Andrew Sears, here are the Hive operations that are captured in Atlas 0.6: create database, create table, create view, CTAS, load, import, export, query, alter table rename and alter view rename (source here). The operations that are not supported in v0.6 are not supported in v0.5 either. A new version (v0.7) is available today; it supports more operations in the Hive bridge and introduces new bridges (Sqoop, Falcon and Storm).
04-27-2016
09:23 PM
@bigdata.neophyte I would not recommend having both physical and virtual nodes in the same cluster. I think it's best to identify your KPIs and choose the solution that fits them. That being said, running VM clusters and physical clusters side by side can be a good choice for implementing several environments (dev, testing, prod, etc.).
04-27-2016
09:09 PM
3 Kudos
Hi @bigdata.neophyte, Hadoop has been designed to run on commodity hardware. Important concepts such as data locality and horizontal scalability make a physical cluster the first choice for Hadoop today. The pro of this choice is performance. The con is the cost of installing and managing the cluster.
Virtual machines are also used for Hadoop today. VMs backed by central storage (SAN) are not the best choice for performance, since you lose data locality and concurrent jobs/tasks all access the same storage. Some solutions today support dedicating hard disks to VMs, which gives you a good hybrid approach. The pro of VMs is flexibility: it's easy to create and kill clusters, so VM Hadoop clusters are usually used for development environments.
Physical clusters are usually recommended for production, where the application has strong SLAs. The final choice depends on your use cases, your existing infrastructure, and the resources available to manage your cluster. I hope this helps.
04-27-2016
09:01 PM
Hi @Pedro Alves, you can also use Spark for data cleansing and transformation. The advantage is that you use the same tool for data preparation, discovery and analysis/ML.
04-27-2016
08:49 PM
1 Kudo
Hi @Roberto Sancho, you can use Hive or Pig for ETL. In HDP, Hive and Pig run on Tez rather than on MapReduce, which gives you much better performance. You can use Spark too, as you stated.
04-27-2016
08:16 PM
2 Kudos
Hi @Kirk Haslbeck,
I want to add some information to Paul's excellent answer.
First, tuning ML parameters is one of the hardest tasks for a data scientist, and it's an active research area. In your specific case (LinearRegressionWithSGD), stepSize is one of the hardest parameters to tune, as stated in the MLlib optimization page here:
Step-size. The parameter $\gamma$ is the step-size, which in the default implementation is chosen decreasing with the square root of the iteration counter, i.e. $\gamma := \frac{s}{\sqrt{t}}$ in the $t$-th iteration, with the input parameter $s =$ stepSize. Note that selecting the best step-size for SGD methods can often be delicate in practice and is a topic of active research.
In a general ML problem, you want to build a data pipeline where you combine several data transformations to clean data and build features, as well as several algorithms, to achieve the best performance. This is a repetitive task where you try several options for each step. You would also like to test several parameter values and choose the best one. For each pipeline, you need to evaluate the combination of algorithms/parameters that you have chosen. For the evaluation you can use techniques like cross-validation. Testing these combinations manually can be hard and time consuming.
Spark.ml is a package that can help make this process fluent. Spark.ml uses concepts such as transformers, estimators and params. The "params" help you automatically test several values for a parameter and choose the value that gives you the best model. This works by providing a ParamGridBuilder with the different values that you want to consider for each param in your pipeline. Note that LinearRegressionWithSGD belongs to the older spark.mllib API and does not expose params, so it cannot be passed to a ParamGridBuilder. As a sketch, the example below uses the spark.ml LinearRegression estimator instead; it has no stepSize param, so the grid is built over regParam:
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.ParamGridBuilder
// spark.ml estimator whose params can be tuned through a grid
val lr = new LinearRegression()
  .setMaxIter(30)
// candidate values to try for the regularization parameter
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
Even if your ML problem is simple, I highly recommend looking into the Spark.ml library. This can reduce your development time considerably. I hope this helps.
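To make the step-size discussion above more concrete, here is a minimal plain-Scala sketch with no Spark dependency. It is not MLlib's implementation: the object StepSizeSearch and all of its helpers are hypothetical names introduced for illustration, and the model is a toy one-weight regression y = w * x. It applies the decaying step size s / sqrt(t) quoted above, then tries several values of s and keeps the one with the lowest validation error, which is the kind of search a parameter grid automates.

```scala
object StepSizeSearch {
  type Point = (Double, Double) // (feature x, label y)

  // Decaying step size for the t-th iteration (t starts at 1): s / sqrt(t).
  def stepAt(s: Double, t: Int): Double = s / math.sqrt(t.toDouble)

  // Mean squared error of the one-weight model y = w * x.
  def mse(w: Double, data: Seq[Point]): Double =
    data.map { case (x, y) => math.pow(w * x - y, 2) }.sum / data.size

  // Plain SGD starting from w = 0, using one sample per iteration.
  def trainSgd(data: Seq[Point], s: Double, numIterations: Int): Double =
    (1 to numIterations).foldLeft(0.0) { (w, t) =>
      val (x, y) = data((t - 1) % data.size)
      val gradient = 2.0 * (w * x - y) * x
      w - stepAt(s, t) * gradient
    }

  // Train once per candidate and keep the (stepSize, validationError) pair
  // with the lowest validation error. `candidates` must be non-empty.
  def bestStepSize(
      train: Seq[Point],
      valid: Seq[Point],
      candidates: Seq[Double],
      numIterations: Int
  ): (Double, Double) =
    candidates
      .map(s => (s, mse(trainSgd(train, s, numIterations), valid)))
      .reduce((a, b) => if (b._2 < a._2) b else a)
}
```

For example, StepSizeSearch.bestStepSize(train, valid, Seq(0.1, 0.01), 30) returns the selected step size together with its validation error. In a real project you would let spark.ml run this search over a full pipeline rather than writing the loop by hand.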
04-26-2016
03:59 PM
3 Kudos
Hi @David Lays, you have mainly two high-level approaches for data replication:
Replication in Y (teeing): in this scenario you replicate at ingestion time. Each new piece of data is stored in both the primary and DR clusters. NiFi is great for this double ingestion. The pro of this method is that you have data immediately in both clusters. The con is that you have only the raw data and not the processing results. If you want the same results on the DR cluster, you need to run the same processing there.
Replication in L (copying): in this scenario you ingest data at the primary cluster and later copy it to the DR cluster. Tools like DistCp or Falcon can be used to implement this. The pro is that you can replicate raw data and processing results in the same process. The con is that the DR cluster lags behind the primary in terms of data. The replication is usually scheduled, and if your cluster goes down between two runs you will lose the data generated (ingested or computed) since the last replication.
I hope this helps.
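The lag of the copying approach can be illustrated with a small plain-Scala model. This is only a sketch under simplifying assumptions: the object ReplicationSketch, the Record type and the function names are hypothetical, a copy run is assumed to happen instantly at every multiple of the interval, and the model only tracks which records reach the DR cluster (it does not distinguish raw data from processing results).

```scala
object ReplicationSketch {
  // A record ingested or computed on the primary cluster at a given minute.
  final case class Record(id: Int, minute: Int)

  // Teeing: each record is written to both clusters at ingestion time,
  // so the DR cluster holds everything produced up to the failure.
  def drContentTeeing(records: Seq[Record], failureMinute: Int): Seq[Record] =
    records.filter(_.minute <= failureMinute)

  // Copying: a scheduled job copies data every `intervalMinutes`,
  // so the DR cluster only holds records up to the last completed run.
  def drContentCopying(
      records: Seq[Record],
      failureMinute: Int,
      intervalMinutes: Int
  ): Seq[Record] = {
    val lastRun = (failureMinute / intervalMinutes) * intervalMinutes
    records.filter(_.minute <= lastRun)
  }

  // Records that exist on the primary at failure time but never reached
  // the DR cluster under the copying approach.
  def lostWithCopying(
      records: Seq[Record],
      failureMinute: Int,
      intervalMinutes: Int
  ): Int =
    drContentTeeing(records, failureMinute).size -
      drContentCopying(records, failureMinute, intervalMinutes).size
}
```

With one record per minute, an hourly copy and a failure between two runs, lostWithCopying counts the records produced since the last run, which is the window you need to size against your recovery objectives.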