Member since: 01-11-2016
Posts: 355
Kudos Received: 230
Solutions: 74
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 8190 | 06-19-2018 08:52 AM |
| | 3147 | 06-13-2018 07:54 AM |
| | 3574 | 06-02-2018 06:27 PM |
| | 3878 | 05-01-2018 12:28 PM |
| | 5397 | 04-24-2018 11:38 AM |
04-29-2016
08:16 AM
Hello, if by pseudo mode you mean having a cluster on one machine, then you have 3 options:
1. Sandbox: this is the easiest way, since you download and run a VM that contains all the components already installed and configured (here).
2. Use Ambari and Vagrant to create several VMs and install a cluster. Guide here.
3. If you want to install HDP directly on the machine without virtualization, then you need to follow this installation guide. It will help you install Ambari and the Ambari agent on your machine, and then install all the other components on the same machine.
04-29-2016
07:42 AM
1 Kudo
You can also use HDFS snapshots to protect data from user errors: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
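For illustration, here is a minimal sketch of taking a snapshot through the Hadoop FileSystem API; the directory /data/important and the snapshot name are hypothetical, and an HDFS admin must first make the directory snapshottable:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// Hypothetical directory; an admin must first run: hdfs dfsadmin -allowSnapshot /data/important
val dir = new Path("/data/important")

// Take a named snapshot; files can later be recovered from /data/important/.snapshot/before-cleanup
val snapshotPath = fs.createSnapshot(dir, "before-cleanup")
println(s"Snapshot created at: $snapshotPath")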
04-28-2016
05:46 AM
1 Kudo
Hi @Andrew Sears, Here are the Hive operations that are captured in Atlas 0.6: create database, create table, create view, CTAS, load, import, export, query, alter table rename and alter view rename (source here). The operations that are not supported in v0.6 are not supported in v0.5 either. A new version (v0.7) is available today; it supports more operations in the Hive bridge and introduces new bridges (Sqoop, Falcon and Storm).
04-27-2016
09:23 PM
@bigdata.neophyte I would not recommend having both physical and virtual nodes in the same cluster. I think it's best to identify your KPIs and choose the solution that best meets them. That being said, having VM clusters and physical clusters at the same time can be a good choice for implementing several environments (dev, testing, prod, etc.).
04-27-2016
09:09 PM
3 Kudos
Hi @bigdata.neophyte, Hadoop has been designed to run on commodity hardware. Important concepts such as data locality and horizontal scalability make a physical cluster the first choice for Hadoop clusters today. The pro of this choice is performance; the con is the cost of installing and managing the cluster.

Virtual machines are also used for Hadoop today. VMs backed by central storage (SAN) are not the best choice for performance, since you lose data locality and you have concurrent jobs/tasks accessing the same storage. Some solutions today support dedicating hard disks to VMs, which gives you a good hybrid approach. The pro of VMs is flexibility: it's easy to create and kill clusters, which is why VM Hadoop clusters are usually used for development environments. Physical clusters are usually recommended for production, where the application has strong SLAs.

The final choice depends on your use cases, your existing infrastructure, and the resources available to manage your cluster. I hope this helps.
04-27-2016
09:01 PM
Hi @Pedro Alves, you can also use Spark for data cleansing and transformation. The pro is that you use the same tool for data preparation, discovery and analysis/ML.
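For example, here is a minimal cleansing sketch with the DataFrame API (Spark 2.x style); the input path, column names and cleaning rules are made up for the illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("cleansing").getOrCreate()

// Hypothetical raw CSV input
val raw = spark.read.option("header", "true").csv("/data/raw/customers.csv")

val clean = raw
  .dropDuplicates("customer_id")                   // remove duplicate records
  .na.drop(Seq("customer_id"))                     // drop rows missing the key
  .withColumn("email", lower(trim(col("email"))))  // normalize a text column
  .filter(col("age").cast("int").between(0, 120))  // discard implausible values

clean.write.mode("overwrite").parquet("/data/clean/customers")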
04-27-2016
08:49 PM
1 Kudo
Hi @Roberto Sancho, you can use Hive or Pig for ETL. In HDP, Hive and Pig run on Tez rather than on MapReduce, which gives you much better performance. You can also use Spark, as you stated.
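As an illustration of the Spark option, here is a hedged sketch of a small ETL job (Spark 2.x style); the paths, columns and aggregation are invented for the example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("etl").getOrCreate()

// Extract: hypothetical raw JSON events
val events = spark.read.json("/data/raw/events")

// Transform: keep valid events and aggregate per day and type
val daily = events
  .filter(col("event_type").isNotNull)
  .withColumn("event_date", to_date(col("event_ts")))
  .groupBy("event_date", "event_type")
  .agg(count(lit(1)).as("nb_events"))

// Load: write as ORC, partitioned by date, so it can also be queried from Hive on Tez
daily.write.mode("overwrite").partitionBy("event_date").orc("/data/curated/daily_events")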
04-27-2016
08:16 PM
2 Kudos
Hi @Kirk Haslbeck,
I want to add some information to Paul's excellent answer.
First, tuning ML parameters is one of the hardest tasks of a data scientist, and it's an active research area. In your specific case (LinearRegressionWithSGD), stepSize is one of the hardest parameters to tune, as stated in the MLlib optimization page here: "The parameter γ is the step-size, which in the default implementation is chosen decreasing with the square root of the iteration counter, i.e. γ := s/√t in the t-th iteration, with the input parameter s = stepSize. Note that selecting the best step-size for SGD methods can often be delicate in practice and is a topic of active research."

In a general ML problem, you want to build a data pipeline where you combine several data transformations to clean data and build features, as well as several algorithms, to achieve the best performance. This is an iterative task where you try several options for each step. You also want to test several parameter values and choose the best ones. For each pipeline, you need to evaluate the combination of algorithms/parameters you have chosen; for the evaluation you can use things like cross-validation. Testing these combinations manually can be hard and time consuming.

Spark.ml is a package that helps make this process smoother. Spark.ml uses concepts such as transformers, estimators and params. The params let you automatically test several values for a parameter and choose the value that gives you the best model. This works by providing a ParamGridBuilder with the different values that you want to consider for each param in your pipeline. In your case, an example could look like this (using the spark.ml LinearRegression estimator, which exposes Params, rather than the RDD-based LinearRegressionWithSGD, which does not):

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LinearRegression()
  .setMaxIter(30)
// Grid of candidate values; every combination will be evaluated
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

Even if your ML problem is simple, I highly recommend looking into the Spark.ml library. It can reduce your dev time considerably. I hope this helps.
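To show how such a grid is evaluated, here is a minimal follow-up sketch that plugs it into a CrossValidator; the training DataFrame (with "features" and "label" columns) is assumed and not part of the original question:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// training is a hypothetical DataFrame with "features" and "label" columns
val cv = new CrossValidator()
  .setEstimator(lr)                       // the LinearRegression defined above
  .setEstimatorParamMaps(paramGrid)       // the grid of candidate params
  .setEvaluator(new RegressionEvaluator)  // RMSE by default
  .setNumFolds(3)                         // 3-fold cross-validation

val cvModel = cv.fit(training)            // fits and keeps the best param combination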
04-26-2016
03:59 PM
3 Kudos
Hi @David Lays, you have mainly two high-level approaches for data replication:

Replication in Y (teeing): in this scenario you replicate at ingestion time, so each new piece of data is stored in both the primary and the DR cluster. NiFi is great for this double ingestion. The pro of this method is that you have the data immediately in both clusters. The con is that you only have the raw data, not the processing results; if you want the same results on the DR cluster, you need to run the same processing there.

Replication in L (copying): in this scenario you ingest data into the primary cluster and later copy it to the DR cluster. Tools like DistCp or Falcon can be used to implement this. The pro is that you can replicate raw data and processing results in the same process. The con is that the DR cluster lags behind in terms of data: the replication is usually scheduled, and if your primary cluster goes down, you will lose the data generated (ingested or computed) since the last replication.

I hope this helps