Member since: 01-11-2016
Posts: 355
Kudos Received: 230
Solutions: 74
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 8190 | 06-19-2018 08:52 AM
 | 3147 | 06-13-2018 07:54 AM
 | 3574 | 06-02-2018 06:27 PM
 | 3878 | 05-01-2018 12:28 PM
 | 5397 | 04-24-2018 11:38 AM
05-07-2016
05:04 PM
1 Kudo
On the documentation page for "Configure Hive and HiveServer2 for Tez" there are two properties that look similar to me:
tez.queue.name: property to specify which queue will be used for Hive-on-Tez jobs.
hive.server2.tez.default.queues: a list of comma-separated values corresponding to YARN queues of the same name. When HiveServer2 is launched in Tez mode, this configuration needs to be set for multiple Tez sessions to run in parallel on the cluster.
The only difference I see is that with "hive.server2.tez.default.queues" we can specify several queues, so I guess jobs will be distributed over these queues. Hence, if we need all Hive jobs to run in one queue, we should use "tez.queue.name". Am I missing something here?
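For instance, this is how I picture the two settings (the queue names below are just made up for illustration):
# send every Hive-on-Tez job to a single queue
tez.queue.name=hive
# let HiveServer2 spread its parallel Tez sessions over several queues
hive.server2.tez.default.queues=hive1,hive2,hive3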
Labels:
- Apache Hive
- Apache YARN
05-07-2016
04:56 PM
1 Kudo
Hi @Veera B. Budhi,
Job-by-job approach: one solution to your problem is to specify the queue to use when you submit your Spark job or when you connect to Hive.
When submitting your Spark job, you can specify the queue with --queue, as in this example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue SparkQueue lib/spark-examples*.jar 10
To specify the queue at connection time to HiveServer2:
beeline -u "jdbc:hive2://sandbox.hortonworks.com:10000/default?tez.queue.name=HiveQueue" -n it1 -p it1 -d org.apache.hive.jdbc.HiveDriver
Or you can set the queue after you are connected, using set tez.queue.name=HiveQueue;
beeline -u "jdbc:hive2://sandbox.hortonworks.com:10000/default" -n it1 -p it1 -d org.apache.hive.jdbc.HiveDriver
> set tez.queue.name=HiveQueue;
Change the default queue: the second approach is to specify a default queue for Hive or Spark to use.
To do it for Spark, set spark.yarn.queue to SparkQueue instead of default in Ambari.
To do it for Hive, add tez.queue.name to the custom hiveserver2-site configuration in Ambari.
Hope this helps.
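To make that concrete, the resulting default-queue settings would end up looking roughly like this (SparkQueue and HiveQueue are just the example names used above):
# Spark configuration in Ambari
spark.yarn.queue=SparkQueue
# Custom hiveserver2-site in Ambari
tez.queue.name=HiveQueue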
05-07-2016
02:26 PM
@Premasish Dan What lab exercise are you doing? Is it a tutorial?
05-07-2016
01:02 AM
4 Kudos
Hi @Sunile Manjee There's no Zeppelin interpreter for Solr. The list of available interpreters is here. You can turn Solr data into a Spark RDD and hence access it with the Spark interpreter in Zeppelin. Another approach (that I didn't test) is to use a Solr JDBC connection and the Zeppelin JDBC interpreter. A Jira ticket makes me think that some problems may be encountered.
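As a rough, untested sketch of that JDBC approach (the host, port, and collection names below are placeholders, not something I verified), the Zeppelin JDBC interpreter would roughly need:
# hypothetical Zeppelin JDBC interpreter settings for Solr
default.driver=org.apache.solr.client.solrj.io.sql.DriverImpl
default.url=jdbc:solr://zk-host:2181?collection=my_collection
# plus the solr-solrj artifact added as an interpreter dependency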
05-07-2016
12:47 AM
1 Kudo
Hi @Subhasis Roy Tuples are used to represent complex data types. A tuple is enclosed in parentheses, as in this example:
cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
X = FOREACH A GENERATE t1.t1a,t2.$0;
DUMP X;
(3,4)
(1,3)
(2,9)
In your case, your data is simple and not enclosed in parentheses, so you don't need a tuple in your schema. Just run this:
A = LOAD '/tmp/test.csv' USING PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray, e:chararray);
DUMP A;
(1201,gopal, manager, 50000, TP)
(1202,manisha, proof reader, 50000, TP)
If you want to access only some fields of your data, use this (here I show only the first 4 fields):
X = FOREACH A GENERATE $0, $1, $2, $3;
DUMP X;
(1201,gopal, manager, 50000)
(1202,manisha, proof reader, 50000)
Does this answer your question?
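As a side note, since the LOAD statement above declares field names, the same projection can also be written by name instead of position:
-- equivalent to the positional projection above
X = FOREACH A GENERATE a, b, c, d;
DUMP X;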
05-06-2016
11:56 PM
2 Kudos
@jbarnett
I say not Flume 🙂 Have you tried NiFi? You can have several processors for your app and configure each one of them with a few clicks in the GUI. You want to reconfigure a particular processor? No problem: stop it, right-click, configure it, and run it again. If you really want to use Flume, I recommend using a config file per agent, as stated in the doc: "Hortonworks recommends that administrators use a separate configuration file for each Flume agent. ... While it is possible to use one large configuration file that specifies all the Flume components needed by all the agents, this is not typical of most production deployments." Since you have several agents on the same host, Ambari is not an option. Use NiFi!
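If you do end up staying on Flume, a minimal sketch of the one-config-file-per-agent approach could look like this (agent and file names are only examples):
# one config file per agent, each agent started separately on the same host
flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/agent1.conf --name agent1
flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/agent2.conf --name agent2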
05-06-2016
05:14 PM
4 Kudos
Hi @Indrajit swain, You are hitting the Elasticsearch instance that Atlas is running in the background for its operations. This is why you get an older version of ES when you curl port 9200.
To check it, stop your own ES instance and see if something is still listening on port 9200:
netstat -npl | grep 9200
You should still see something listening even when your ES is down. You can see the configuration of the existing ES in the Atlas configuration in Ambari.
When your ES starts and finds its port (9200) already in use, it picks the next available one, so your ES instance will be running on port 9201. You can see it in the startup logs (like in my example):
[2016-05-06 17:09:41,452][INFO ][http ] [Speedball] publish_address {127.0.0.1:9201}, bound_addresses {127.0.0.1:9201}
You can curl the two ports to check:
[root@sandbox ~]# curl localhost:9200
{
"status" : 200,
"name" : "Gravity",
"version" : {
"number" : "1.2.1",
"build_hash" : "6c95b759f9e7ef0f8e17f77d850da43ce8a4b364",
"build_timestamp" : "2014-06-03T15:02:52Z",
"build_snapshot" : false,
"lucene_version" : "4.8"
},
"tagline" : "You Know, for Search"
}
[root@sandbox ~]# curl localhost:9201
{
"name" : "Speedball",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "2.3.2",
"build_hash" : "b9e4a6acad4008027e4038f6abed7f7dba346f94",
"build_timestamp" : "2016-04-21T16:03:47Z",
"build_snapshot" : false,
"lucene_version" : "5.5.0"
},
"tagline" : "You Know, for Search"
}
You can also change the port of ES to something you want in the yaml file. Hope this helps
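For reference, that setting lives in elasticsearch.yml (the port value below is just an example, pick any free port):
# elasticsearch.yml -- move your own ES off 9200 so it no longer collides with the Atlas-managed instance
http.port: 9400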
05-05-2016
05:46 PM
Hi @Revathy Mourouguessane, have you tried this solution?
04-30-2016
03:10 PM
4 Kudos
Hi @Rendiyono Wahyu Saputro, What you are trying to build is what we call the Connected Data Platform at Hortonworks. You need to understand that you have two types of workloads/requirements, and you need to use HDF and HDP jointly.
ML model construction: the first step towards your goal is to build your machine learning model. This requires processing a lot of historical data (data at rest) to detect patterns related to what you are trying to predict. This phase is called the "training phase". The best tool to do this is HDP, and more specifically Spark.
Applying the ML model: once step 1 is completed, you will have a model that you can apply to new data to predict something. In my understanding, you want to apply this to real-time data coming from Twitter (data in motion). To get the data in real time and transform it into what the ML model needs, you can use NiFi. Next, NiFi sends the data to Storm or Spark Streaming, which applies the model and gets the prediction.
So you will use HDP to construct the model, HDF to get and transform the data, and finally a combination of HDF/HDP to apply the model and make the prediction.
To build a web service with NiFi you need several processors: one to listen to incoming requests, one or several processors to implement your logic (transformation, extraction, etc.), and one to publish the result. You can check this page, which contains several data flow examples; the "Hello_NiFi_Web_Service.xml" template gives an example of how to do it: https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
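As a rough sketch (the processor choices here are my assumption, not necessarily what the template uses), such a web-service flow could be wired as:
HandleHttpRequest (receive the incoming HTTP request)
-> EvaluateJsonPath / other processors (implement your transformation and extraction logic)
-> HandleHttpResponse (return the result to the caller)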
04-29-2016
03:35 PM
Hi, Unfortunately I can't do a WebEx with you. Describe your problem here and the community and I will be happy to help you. Also, contact support if you have a subscription. Thanks