Member since: 01-11-2016
Posts: 355
Kudos Received: 228
Solutions: 74
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4537 | 06-19-2018 08:52 AM
 | 1490 | 06-13-2018 07:54 AM
 | 1646 | 06-02-2018 06:27 PM
 | 1465 | 05-01-2018 12:28 PM
 | 2571 | 04-24-2018 11:38 AM
05-07-2016
05:04 PM
1 Kudo
In the documentation page for "Configure Hive and HiveServer2 for Tez" there are two properties that look similar to me:
- tez.queue.name: property to specify which queue will be used for Hive-on-Tez jobs.
- hive.server2.tez.default.queues: a list of comma-separated values corresponding to YARN queues of the same name. When HiveServer2 is launched in Tez mode, this configuration needs to be set for multiple Tez sessions to run in parallel on the cluster.
The only difference I see is that with "hive.server2.tez.default.queues" we can specify several queues, so I guess jobs will be distributed over these queues. Hence, if we need all Hive jobs to run in one queue, we should use "tez.queue.name". Am I missing something here?
... View more
Labels:
- Apache Hive
- Apache YARN
05-07-2016
04:56 PM
1 Kudo
Hi @Veera B. Budhi, Job-by-job approach: one solution to your problem is to specify the queue when you submit your Spark job or when you connect to Hive. When submitting your Spark job, you can specify the queue with --queue, like in this example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue SparkQueue lib/spark-examples*.jar 10
To specify the queue at connection time to HS2:
beeline -u "jdbc:hive2://sandbox.hortonworks.com:10000/default?tez.queue.name=HiveQueue" -n it1 -p it1 -d org.apache.hive.jdbc.HiveDriver
Or you can set the queue after you are connected using set tez.queue.name=HiveQueue;
beeline -u "jdbc:hive2://sandbox.hortonworks.com:10000/default" -n it1 -p it1 -d org.apache.hive.jdbc.HiveDriver
>set tez.queue.name=HiveQueue;
Change the default queue: the second approach is to specify a default queue for Hive or Spark to use. To do it for Spark, set spark.yarn.queue to SparkQueue instead of default in Ambari. To do it for Hive, add tez.queue.name to the custom hiveserver2-site configuration in Ambari. Hope this helps
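As a quick sanity check after submitting, you can confirm from the command line which queue a job actually landed in (a sketch, not specific to your setup):
# list running YARN applications; the output includes the queue each one was submitted to
yarn application -list -appStates RUNNING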
... View more
05-07-2016
02:26 PM
@Premasish Dan What lab exercise are you doing? Is it a tutorial?
... View more
05-07-2016
01:02 AM
4 Kudos
Hi @Sunile Manjee There's no Zeppelin interpreter for Solr; the list of available interpreters is here. You can expose Solr as a Spark RDD and hence access Solr data with the Spark interpreter in Zeppelin. Another approach (that I didn't test) is to use a Solr JDBC connection with the Zeppelin JDBC interpreter. A Jira ticket makes me think that some problems may be encountered.
... View more
05-07-2016
12:47 AM
1 Kudo
Hi @Subhasis Roy Tuples are used to represent complex data types. Tuples are written between parentheses, like in this example:
cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
X = FOREACH A GENERATE t1.t1a,t2.$0;
DUMP X;
(3,4)
(1,3)
(2,9)
In your case, your data is simple and not between parentheses, so you don't need to use a tuple in your schema. Just run this:
A = LOAD '/tmp/test.csv' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray, e:chararray);
DUMP A;
(1201,gopal, manager, 50000, TP)
(1202,manisha, proof reader, 50000, TP)
If you want to access only some fields of your data, use this (here I show only the first 4 fields):
X = FOREACH A GENERATE $0, $1, $2, $3;
DUMP X;
(1201,gopal, manager, 50000)
(1202,manisha, proof reader, 50000)
Does this answer your question?
... View more
05-06-2016
11:56 PM
2 Kudos
@jbarnett
I say not Flume 🙂 Have you tried NiFi? You can have several processors for your app and configure each one of them with a few clicks in the GUI. If you want to re-configure a particular processor, no problem: stop it, right-click, configure it and run it again. If you really want to use Flume, I recommend using a config file per agent as stated in the doc: "Hortonworks recommends that administrators use a separate configuration file for each Flume agent. [...] While it is possible to use one large configuration file that specifies all the Flume components needed by all the agents, this is not typical of most production deployments." Since you have several agents on the same host, Ambari is not an option (a sketch of running each agent with its own config file follows below). Use NiFi !!
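If you do stay on Flume, here's a rough sketch of running several agents on the same host, each with its own config file (agent names and paths are hypothetical):
# one configuration file per agent, one flume-ng process per agent
flume-ng agent --name agent1 --conf /etc/flume/conf --conf-file /etc/flume/conf/agent1.conf &
flume-ng agent --name agent2 --conf /etc/flume/conf --conf-file /etc/flume/conf/agent2.conf &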
... View more
05-06-2016
05:14 PM
4 Kudos
Hi @Indrajit swain, You are hitting the Elasticsearch instance that Atlas runs in the background for its operations. This is why you get an older version of ES when you curl port 9200. To check it, stop your own ES instance and see if something is still listening on port 9200:
netstat -npl | grep 9200
You should still see something listening even when your ES is down. You can see the configuration of the embedded ES in the Atlas configuration in Ambari. When your ES starts and finds its port (9200) already in use, it picks the next available one, so your ES instance will be running on port 9201. You can see it in the startup logs (like in my example):
[2016-05-06 17:09:41,452][INFO ][http ] [Speedball] publish_address {127.0.0.1:9201}, bound_addresses {127.0.0.1:9201}
You can curl the two ports to compare:
[root@sandbox ~]# curl localhost:9200
{
"status" : 200,
"name" : "Gravity",
"version" : {
"number" : "1.2.1",
"build_hash" : "6c95b759f9e7ef0f8e17f77d850da43ce8a4b364",
"build_timestamp" : "2014-06-03T15:02:52Z",
"build_snapshot" : false,
"lucene_version" : "4.8"
},
"tagline" : "You Know, for Search"
}
[root@sandbox ~]# curl localhost:9201
{
"name" : "Speedball",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "2.3.2",
"build_hash" : "b9e4a6acad4008027e4038f6abed7f7dba346f94",
"build_timestamp" : "2016-04-21T16:03:47Z",
"build_snapshot" : false,
"lucene_version" : "5.5.0"
},
"tagline" : "You Know, for Search"
}
You can also change the port of ES to something you want in the yaml file. Hope this helps
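If you prefer to pin your own ES instance to a fixed port instead, here's a quick check of the current setting (a sketch; the path to elasticsearch.yml depends on your install):
# show the current http.port setting of the standalone Elasticsearch (path is an assumption)
grep -n 'http.port' /etc/elasticsearch/elasticsearch.yml
# then set e.g. "http.port: 9210" in that file and restart the instance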
... View more
05-05-2016
05:46 PM
Hi @Revathy Mourouguessane, have you tried this solution ?
... View more
04-30-2016
03:10 PM
4 Kudos
Hi @Rendiyono Wahyu Saputro, What you are trying to build is what we call the Connected Data Platform at Hortonworks. You need to understand that you have two types of workloads/requirements, and you need to use HDF and HDP jointly. ML model construction: the first step towards your goal is to build your machine learning model. This requires processing a lot of historical data (data at rest) to detect patterns related to what you are trying to predict. This phase is called the "training phase". The best tool to do this is HDP, and more specifically Spark. Applying the ML model: once step 1 is completed, you will have a model that you can apply to new data to predict something. In my understanding you want to apply this to real-time data coming from Twitter (data in motion). To get the data in real time and transform it into what the ML model needs, you can use NiFi. Next, NiFi sends the data to Storm or Spark Streaming, which applies the model and gets the prediction. So you will use HDP to construct the model, HDF to get and transform the data, and finally a combination of HDF/HDP to apply the model and make the prediction. To build a web service with NiFi you need to use several processors: one to listen for incoming requests, one or several processors to implement your logic (transformation, extraction, etc.), and one to publish the result. You can check this page that contains several data flow examples. The "Hello_NiFi_Web_Service.xml" template gives an example of how to do it. https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
... View more
04-29-2016
03:35 PM
Hi, Unfortunately I can't do a WebEx with you. Describe your problem here and the community and I will be happy to help you. Also, call support if you have a subscription. Thanks
... View more
04-29-2016
08:16 AM
Hello, If by pseudo mode you mean having a cluster on one machine, then you have 3 options:
- Sandbox: it's the easiest way, since you download and run a VM that contains all the components already installed and configured (here).
- Use Ambari and Vagrant to create several VMs and install a cluster. Guide here.
- If you want to install HDP directly on the machine without virtualization, then you need to follow this installation guide. This will help you install Ambari and the Ambari agent on your machine and then install all other components on the same machine.
... View more
04-29-2016
07:42 AM
1 Kudo
You can also use HDFS snapshots to protect data from user errors: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
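For example, a minimal snapshot workflow could look like this (the directory, snapshot and file names are hypothetical):
# enable snapshots on the directory you want to protect (run as the HDFS admin user)
hdfs dfsadmin -allowSnapshot /data/important
# take a named snapshot before a risky operation
hdfs dfs -createSnapshot /data/important before-cleanup
# list existing snapshots and restore an accidentally deleted file from one of them
hdfs dfs -ls /data/important/.snapshot
hdfs dfs -cp /data/important/.snapshot/before-cleanup/report.csv /data/important/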
... View more
04-28-2016
05:46 AM
1 Kudo
Hi @Andrew Sears, Here are the Hive operations that are captured in Atlas 0.6: create database, create table, create view, CTAS, load, import, export, query, alter table rename and alter view rename (source here). The operations that are not supported in v0.6 are not supported in v0.5 either. A new version (v0.7) is available today; it supports more operations in the Hive bridge and introduces new bridges (Sqoop, Falcon and Storm).
... View more
04-27-2016
09:23 PM
@bigdata.neophyte I would not recommend having both physical and virtual nodes in the same cluster. I think it's best to identify your KPIs and choose the best solution. This being said, having VM clusters and physical clusters at the same time can be a good choice for implementing several environments (dev, testing, prod, etc.)
... View more
04-27-2016
09:09 PM
3 Kudos
Hi @bigdata.neophyte, Hadoop has been designed to run on commodity hardware. There are important concepts such as data locality and horizontal scalability that make a physical cluster the first choice for Hadoop clusters today. The pro of this choice is performance; the con is the cost of installing and managing the cluster. Virtual machines are also used for Hadoop today. VMs with central storage (SAN) are not the best choice for performance, since you lose data locality and you have many concurrent jobs/tasks accessing the same storage. Some solutions today support dedicating hard disks to VMs; this way you can have a good hybrid approach. The pro of VMs is flexibility. VM Hadoop clusters are usually used for development environments because they provide flexibility: it's easy to create and kill clusters. Physical clusters are usually recommended for production, where the application has strong SLAs. The final choice depends on your use cases, your existing infrastructure, and the resources available to manage your cluster. I hope this helps.
... View more
04-27-2016
09:01 PM
Hi @Pedro Alves You can also use Spark for data cleansing and transformation. The pro is to use the same tool for data preparation, discovery and analysis/ML.
... View more
04-27-2016
08:49 PM
1 Kudo
Hi @Roberto Sancho, You can use Hive or Pig for doing ETL. In HDP, Hive and Pig run on Tez and not on MapReduce, which gives you much better performance. You can use Spark too, as you stated.
... View more
04-27-2016
08:16 PM
2 Kudos
Hi @Kirk Haslbeck,
I want to add some information to Paul's excellent answer.
First, tuning ML parameters is one of the hardest tasks of a data scientist, and it's an active research area. In your specific case (LinearRegressionWithSGD), the stepSize is one of the hardest parameters to tune, as stated in the MLlib optimization page here: "Step-size. The parameter γ is the step-size, which in the default implementation is chosen decreasing with the square root of the iteration counter, i.e. γ := s/√t in the t-th iteration, with the input parameter s = stepSize. Note that selecting the best step-size for SGD methods can often be delicate in practice and is a topic of active research." In a general ML problem, you want to build a data pipeline where you combine several data transformations to clean data and build features, as well as several algorithms, to achieve the best performance. This is an iterative task where you try several options for each step. You would also like to test several parameters and choose the best ones. For each of your pipelines, you need to evaluate the combination of algorithms/parameters that you have chosen; for the evaluation you can use things like cross-validation. Testing these combinations manually can be hard and time consuming. Spark.ml is a package that can help make this process fluent. Spark.ml uses concepts such as transformers, estimators and params. The params help you automatically test several values for a parameter and choose the value that gives you the best model. This works by providing a ParamGridBuilder with the different values that you want to consider for each param in your pipeline. An example in your case could be:
val lr = new LinearRegressionWithSGD()
.setNumIterations(30)
val paramGrid = new ParamGridBuilder()
.addGrid(lr.stepSize, Array(0.1, 0.01))
.build()
Even if your ML problem is simple, I highly recommend looking into the Spark.ml library. This can reduce your dev time considerably. I hope this helps.
... View more
04-27-2016
03:19 PM
1 Kudo
Hi @JAYA PARASU, My pleasure. This won't work for cases where the directory permissions are different from drwx. To keep only the directories with your approach, you need to grep the lines starting with 'd'. You can do it like this:
hadoop fs -ls /tmp | sed '1d;s/  */ /g' | grep '^d' | cut -d\ -f8
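If instead you only want plain files, the same pipeline with the grep inverted should do it (a sketch along the same lines):
# drop the header, squeeze repeated spaces, exclude directory lines (those starting with 'd'), keep the name field
hadoop fs -ls /tmp | sed '1d;s/  */ /g' | grep -v '^d' | cut -d\ -f8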
... View more
04-27-2016
05:33 AM
3 Kudos
Hi @JAYA PARASU As you can see in the ls documentation page, the command returns this information for a file:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
and this information for a directory:
permissions userid groupid modification_date modification_time dirname
There's no option to limit the output to only file or directory names directly in HDFS. However, you can use sed and cut to manipulate the output and get only the file names (example taken from here):
hadoop fs -ls /tmp | sed '1d;s/  */ /g' | cut -d\ -f8
... View more
04-26-2016
03:59 PM
3 Kudos
Hi @David Lays You have mainly two high-level approaches for data replication:
- Replication in Y (teeing): in this scenario you do the replication at ingestion time. Each new piece of data is stored in both the primary and the DR cluster; NiFi is great for this double ingestion. The pro of this method is that you have the data immediately in both clusters. The con is that you only have the raw data and not the processing results: if you want the same results on the DR cluster, you need to run the same processing there.
- Replication in L (copying): in this scenario you ingest data into the primary cluster and later copy it to the DR cluster. Tools like DistCp or Falcon can be used to implement this (see the DistCp sketch below). The pro is that you can replicate raw data and processing results in the same process. The con is that the DR cluster lags behind in terms of data: the replication is usually scheduled, and if your cluster goes down in between, you will lose the data generated (ingested or computed) since the last replication.
I hope this helps
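A minimal DistCp sketch for the copying approach (the NameNode hostnames and paths are hypothetical, adjust them to your clusters):
# incrementally mirror a directory from the primary cluster to the DR cluster
# -update copies only new/changed files, -delete removes files that no longer exist on the source
hadoop distcp -update -delete hdfs://nn-primary:8020/data/warehouse hdfs://nn-dr:8020/data/warehouse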
... View more
04-25-2016
11:43 PM
1 Kudo
Hi @Revathy Mourouguessane, You can use IsEmpty to check if A1 is empty or not. Try something like this:
grouped = COGROUP ..... ;
filtered = FILTER grouped BY not IsEmpty($2);
DUMP filtered;
Here's an example that shows how this works for something similar:
cat > owners.csv
adam,cat
adam,dog
alex,fish
david,horse
alice,cat
steve,dog
cat > pets.csv
nemo,fish
fido,dog
rex,dog
paws,cat
wiskers,cat
owners = LOAD 'owners.csv' USING PigStorage(',') AS (owner:chararray,animal:chararray);
pets = LOAD 'pets.csv' USING PigStorage(',') AS (name:chararray,animal:chararray);
grouped = COGROUP owners BY animal, pets by animal;
filtered = FILTER grouped BY not IsEmpty($2);
DUMP grouped;
(cat,{(alice,cat),(adam,cat)},{(wiskers,cat),(paws,cat)})
(dog,{(steve,dog),(adam,dog)},{(rex,dog),(fido,dog)})
(horse,{(david,horse)},{})
(fish,{(alex,fish)},{(nemo,fish)})
DUMP filtered;
(cat,{(alice,cat),(adam,cat)},{(wiskers,cat),(paws,cat)})
(dog,{(steve,dog),(adam,dog)},{(rex,dog),(fido,dog)})
(fish,{(alex,fish)},{(nemo,fish)})
... View more
04-22-2016
05:05 PM
Hi @AKILA VEL, Please check this tutorial on how to do a word count with Spark on HDP 2.3: http://fr.hortonworks.com/hadoop-tutorial/a-lap-around-apache-spark/ Section 1 shows how to upgrade Spark to version 1.6; you can ignore it and go directly to section 2. I hope this helps.
... View more
04-21-2016
12:38 PM
Can you please delete this question since it's a duplicate? Thanks
... View more
04-21-2016
12:36 PM
Hi @Klaus Lucas, The VM has Ambari installed and configured, so you should get the Ambari UI on port 8080. Can you check your VM settings (port redirection, network, etc.) and see if you can access Ambari?
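A quick way to check from inside the VM whether the Ambari server itself is up (a sketch):
# check the Ambari server process, then see if anything answers on port 8080 locally
ambari-server status
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080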
... View more
03-29-2016
07:47 PM
4 Kudos
Hi @Vadim, OpenCV is famous for image processing in general. It has several tools for image and face recognition; here is an example of how to do face recognition with OpenCV: tutorial. In terms of integration with Hadoop, there's a framework called HIPI, developed by the University of Virginia, for leveraging HDFS and MapReduce for large-scale image processing. This framework supports OpenCV too. Finally, for image processing on data in motion, you can use HDF with an OpenCV processor like the one published here
... View more
03-16-2016
05:12 PM
Hi @Lubin Lemarchand Try to change the parameter through Ambari: go to HDFS -> Configs and search for dfs.permissions.superusergroup. Ambari stores the configuration in a database, which is the source of truth for the configuration. If you directly modify configuration files that are managed by Ambari, Ambari will overwrite the file and delete your modification at service restart. See this doc
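After the restart, a quick way to confirm the effective value (a sketch):
# print the value the HDFS client actually resolves for this property
hdfs getconf -confKey dfs.permissions.superusergroup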
... View more
03-06-2016
10:21 PM
5 Kudos
@Abha R Panchal What user are you currently logged in as? The user dev_maria doesn't have admin access, so you will not have the Add Service button. To add services, you have to log in as admin. The admin user has been deactivated in the HDP 2.4 sandbox. To activate it, use the following command:
ambari-admin-password-reset
... View more
03-05-2016
03:26 PM
2 Kudos
@Kyle Prins The sandbox gives you an easy way to have a working Hadoop installation in a VM. If you need a multi-node cluster, my advice is to install an HDP cluster yourself. This way, you will understand what has been installed and how it was configured. Use Ambari for the installation, it's straightforward and quick: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.0/bk_Installing_HDP_AMB/content/index.html If you want to have all nodes as VMs on your local machine, you can use Vagrant too. Look at these links to get an idea of how to do it: http://uprush.github.io/hdp/2014/12/29/hdp-cluster-on-your-laptop/ and https://cwiki.apache.org/confluence/display/AMBARI/Quick+Start+Guide
... View more