Member since: 09-17-2015
Posts: 102
Kudos Received: 61
Solutions: 18
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 677 | 06-15-2017 11:58 AM
 | 419 | 06-15-2017 09:18 AM
 | 514 | 06-09-2017 10:45 AM
 | 412 | 06-07-2017 03:52 PM
 | 633 | 01-06-2017 09:41 PM
09-22-2020
02:00 AM
2 Kudos
In Cloudera Machine Learning (or CDSW for the on-premises version), projects are backed by git. You might want to use GitHub for your projects, so here is a simple way to do that.
First things first: there are basically two ways of interacting with git/GitHub, HTTPS or SSH. We'll use the latter to make authentication easy. You might also consider SSO or 2FA to enhance security; here we'll focus on the basics.
To make this authentication happen under the hood, copy your SSH key from CML to GitHub.
Find your SSH key in the Settings of CML:
Copy that key and add it in GitHub, under SSH and GPG keys in your github.com settings: Add SSH key.
Put cdsw in the Title and paste your key content in the Key field:
Let's start by creating a new project on github.com:
The important thing here is the access mode we want to use: SSH.
In CML, start a new project with a template:
Open a Terminal window in a new session:
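Optionally, verify first that GitHub accepts your key (a quick sanity check; GitHub should reply with a short greeting and close the connection): cdsw@qp7h1qllrh9dx1hd:~$ ssh -T git@github.com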
Convert the project to a git project: cdsw@qp7h1qllrh9dx1hd:~$ git init
Initialized empty Git repository in /home/cdsw/.git/
Add all files to git: cdsw@qp7h1qllrh9dx1hd:~$ git add .
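If git doesn't already know your identity in this session, set it before committing (substitute your own name and email): cdsw@qp7h1qllrh9dx1hd:~$ git config --global user.name "Your Name"
cdsw@qp7h1qllrh9dx1hd:~$ git config --global user.email "you@example.com"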
Commit the project: cdsw@qp7h1qllrh9dx1hd:~$ git commit -m "initial commit"
[master (root-commit) 5d75525] initial commit
47 files changed, 14086 insertions(+)
create mode 100755 .gitignore
create mode 100644 LICENSE.txt
create mode 100755 als.py
[...]
Add a remote named origin with the URL of the remote repository your local repository will be pushed to: cdsw@qp7h1qllrh9dx1hd:~$ git remote add origin git@github.com:laurentedel/MyProject.git
Rename the current branch to master: cdsw@qp7h1qllrh9dx1hd:~$ git branch -M master
Finally, push the changes (all the files from the first commit) to master on github.com: cdsw@qp7h1qllrh9dx1hd:~$ git push -u origin master
The authenticity of host 'github.com (140.82.113.4)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,140.82.113.4' (RSA) to the list of known hosts.
Counting objects: 56, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (46/46), done.
Writing objects: 100% (56/56), 319.86 KiB | 857.00 KiB/s, done.
Total 56 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), done.
To github.com:laurentedel/MyProject.git
* [new branch] master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
There you go!
From now on we can use the usual git commands. Modify a file: cdsw@qp7h1qllrh9dx1hd:~$ echo "# MyProject" >> README.md
What's our status? cdsw@qp7h1qllrh9dx1hd:~$ git status
On branch master
Your branch is up to date with 'origin/master'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
README.md
nothing added to commit but untracked files present (use "git add" to track)
Commit/push: cdsw@qp7h1qllrh9dx1hd:~$ git add README.md
cdsw@qp7h1qllrh9dx1hd:~$ git commit -m "adding a README"
[master 7008e88] adding a README
1 file changed, 1 insertion(+)
create mode 100644 README.md
cdsw@qp7h1qllrh9dx1hd:~$ git push -u origin master
Warning: Permanently added the RSA host key for IP address '140.82.114.4' to the list of known hosts.
Counting objects: 3, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 290 bytes | 18.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To github.com:laurentedel/MyProject.git
5d75525..7008e88 master -> master
Branch 'master' set up to track remote branch 'master' from 'origin'.
Happy commits!
05-10-2019
12:52 PM
Hi @Alireza, I strongly suggest using a separate ZooKeeper cluster for HDF, as it would be heavily loaded; sharing a ZK cluster between HDP and HDF would definitely be a pain point and a bottleneck.
03-13-2019
12:38 PM
The property dfs.client.retry.policy.enabled should be set to false in a cluster with HA enabled. The reason is that in a NameNode High Availability (NN HA) setup, when one of the NameNodes goes down (NN process stopped), attempts to use HDFS can result in repeating errors and apparent hangs, and running or new jobs that depend on HDFS access will also fail because clients keep talking to the failed NN.
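One way to double-check the value the client configuration on a given node actually carries is to query it with hdfs getconf: $ hdfs getconf -confKey dfs.client.retry.policy.enabled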
12-03-2018
04:28 PM
This article was written against HDP 2.5.3; you may need to adjust some parameters to match your actual version. We'll set the Kafka log level through the Logging MBean with jConsole. The first step is to enable JMX access: in the Kafka configs, add to the kafka-env template export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false"
export JMX_PORT="9999"
To avoid JMX port conflicts like the ones described in https://community.hortonworks.com/articles/73750/kafka-jmx-tool-is-failing-with-port-already-in-use.html, let's modify /usr/hdp/current/kafka/bin/kafka-run-class.sh on all broker nodes. Replace
# JMX port to use
if [ $JMX_PORT ]; then
  KAFKA_JMX_OPTS="$KAFKA_JMX_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
fi
with
# JMX port to use
if [ $ISKAFKASERVER = "true" ]; then
  JMX_REMOTE_PORT=$JMX_PORT
else
  JMX_REMOTE_PORT=$CLIENT_JMX_PORT
fi
if [ $JMX_REMOTE_PORT ]; then
  KAFKA_JMX_OPTS="$KAFKA_JMX_OPTS -Dcom.sun.management.jmxremote.port=$JMX_REMOTE_PORT"
fi
After the brokers have been restarted, let's change the log level with jConsole: $ jconsole <BROKER_FQDN>:<JMX_PORT>
This launches a jConsole window asking whether to retry insecurely; go ahead with that. Go to the MBeans tab, then kafka/kafka.log4jController/Attributes, and double-click on the Value of Loggers to list all Log4j loggers. You can see that the kafka logger (among those listed) is set to INFO. We can check it using the getLogLevel operation, entering kafka as the loggerName. Fortunately, you can also change the value without a restart using the setLogLevel operation, passing DEBUG or TRACE for example.
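As an aside, if you want the change to persist across broker restarts rather than applying it live over JMX, the same kafka logger discussed above can be raised in the kafka-log4j template instead (a sketch assuming the default logger name; this does require a broker restart):
log4j.logger.kafka=DEBUG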
10-17-2017
10:09 AM
2 Kudos
When starting spark-shell, it tries to bind to port 4040 for the Spark UI. If that port is already taken by another active spark-shell session, it then tries to bind to 4041, then 4042, etc. Each time the binding doesn't succeed, a huge WARN stack trace is printed, which can be filtered out: [user@serv hive]$ SPARK_MAJOR_VERSION=2 spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/20 11:49:43 WARN AbstractLifeCycle: FAILED ServerConnector@2d258eff{HTTP/1.1}
{0.0.0.0:4040}: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
at org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
at org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
at org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$newConnector$1(JettyUtils.scala:333)
at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$httpConnect$1(JettyUtils.scala:365)
at org.apache.spark.ui.JettyUtils$$anonfun$7.apply(JettyUtils.scala:368)
at org.apache.spark.ui.JettyUtils
[...]
To filter that stack trace, let's set that class's log4j verbosity to ERROR in /usr/hdp/current/spark2-client/conf/log4j.properties: # Added to avoid stack traces when binding the Spark UI
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
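Alternatively, if you'd rather avoid the port scan altogether, you can point a session at a free port explicitly (spark.ui.port is a standard Spark setting; the port number below is just an example): [user@serv hive]$ SPARK_MAJOR_VERSION=2 spark-shell --conf spark.ui.port=4050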
06-20-2017
01:28 PM
1 Kudo
I adapted two of the queries for Postgres (and added a vacuum): CREATE TEMPORARY TABLE tmp_request_id AS SELECT MAX(request_id) AS request_id FROM request WHERE create_time <= (SELECT (EXTRACT(epoch FROM NOW()) - 2678400) * 1000 as epoch_1_month_ago_times_1000);
CREATE TEMPORARY TABLE tmp_task_id AS SELECT MAX(task_id) AS task_id FROM host_role_command WHERE request_id <= (SELECT request_id FROM tmp_request_id);
CREATE TEMPORARY TABLE tmp_upgrade_ids AS SELECT upgrade_id FROM upgrade WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM execution_command WHERE task_id <= (SELECT task_id FROM tmp_task_id);
DELETE FROM host_role_command WHERE task_id <= (SELECT task_id FROM tmp_task_id);
DELETE FROM role_success_criteria WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM stage WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM topology_logical_task;
DELETE FROM requestresourcefilter WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM requestoperationlevel WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM upgrade_item WHERE upgrade_group_id IN (SELECT upgrade_group_id FROM upgrade_group WHERE upgrade_id IN (SELECT upgrade_id FROM tmp_upgrade_ids));
DELETE FROM upgrade_group WHERE upgrade_id IN (SELECT upgrade_id FROM tmp_upgrade_ids);
DELETE FROM upgrade WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM request WHERE request_id <= (SELECT request_id FROM tmp_request_id);
DELETE FROM topology_host_task;
DELETE FROM topology_host_request;
DELETE FROM topology_logical_request;
DELETE FROM topology_host_info;
DELETE FROM topology_hostgroup;
DELETE FROM topology_request;
DROP TABLE tmp_upgrade_ids;
DROP TABLE tmp_task_id;
DROP TABLE tmp_request_id;
VACUUM FULL VERBOSE ANALYZE;
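Before running a purge like this, it's worth taking a backup of the Ambari database first (a sketch assuming the default ambari database and user names of the embedded Postgres; adjust to your setup): pg_dump -U ambari ambari > /tmp/ambari_backup.sql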
06-15-2017
11:58 AM
1 Kudo
@Guillaume Roger you can set doAs to true, but it's advised not to, for several reasons. With doAs set to false, HS2 is able to share resources, and you're encouraged to use Ranger instead of file-based permissions (SQL permissions are available and are translated to Ranger rules when activated); it's more secure and you can even set permissions at the column level.
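For reference, the setting in question is the following hive-site property (managed through Ambari): hive.server2.enable.doAs=false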
06-15-2017
09:18 AM
@Guillaume Roger a partition is viewed as a new column in the table definition, hence you can't partition by an already existing field. As a side note, a primary key doesn't work in Hive as it does in a standard RDBMS; it's only there for compliance (i.e. you can't deduplicate rows just by adding a PK).
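To illustrate the first point with a hypothetical table, the partition column is declared in the PARTITIONED BY clause and not in the regular column list: CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (sale_date STRING);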
06-14-2017
03:50 PM
Check the Red Hat article here: https://access.redhat.com/solutions/46111
06-13-2017
03:41 PM
You can launch Hive with console logging enabled to get more info: hive --hiveconf hive.root.logger=INFO,console
06-13-2017
03:39 PM
Did you change your fs.defaultFS in the HDFS configuration, by any chance?
06-13-2017
03:09 PM
@Nikkie Thomas you can specify the number of reducers for a query: hive> set mapreduce.job.reduces=1;
06-12-2017
12:55 PM
Ken, you can use the Java action; unfortunately the Sqoop 1.x API is very limited (Sqoop.runTool() is about all there is), but in my experience this works perfectly.
06-12-2017
11:37 AM
Hi @srinivas s, if you're using rack awareness you should be able to get rid of 50 datanodes by decommissioning them without losing any blocks; otherwise you probably will lose some. Rebalance time depends on your network and cluster utilization; you can adjust some parameters to make it faster if necessary, basically hdfs dfsadmin -setBalancerBandwidth <bandwidth (bytes/s)> or within your HDFS params (example): dfs.balance.bandwidthPerSec=100000000
dfs.datanode.max.transfer.threads=16384
dfs.datanode.balance.max.concurrent.moves=500
Please click Accept if you're satisfied with my answer.
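Once decommissioning is done, you can kick off the rebalance itself with the balancer (the threshold here is just an example, meaning each DataNode should end up within 10% of the cluster average utilization): hdfs balancer -threshold 10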
06-12-2017
10:32 AM
1 Kudo
Hi, Sqoop2 isn't supported on HDP; you should use the bundled Sqoop 1.4.
06-12-2017
08:50 AM
no it won't, it's acting like a regular HDFS client (well, it is)
06-12-2017
07:31 AM
Hello Shashi, NiFi writes to its 3 repositories (flow file, content, and provenance) on local nodes only; you can certainly export provenance to Atlas, for example (there's some work around that), but it's not built in.
06-09-2017
11:05 AM
You may be experiencing some network or HDFS issues; could you run an HDFS report to make sure everything's clean on that side? $ hdfs fsck /
06-09-2017
10:50 AM
1 Kudo
Hi, if your labels are tagged as non-exclusive then yes, your queues will be able to access those resources. The documentation is pretty clear about labeling: take a look at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_yarn-resource-management/content/configuring_node_labels.html
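For the record, exclusivity is chosen when the label is created; a sketch with a hypothetical label name: yarn rmadmin -addToClusterNodeLabels "gpu(exclusive=false)"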
06-09-2017
10:45 AM
1 Kudo
Hi, you have a "Hortonworks Sandbox Archive" just below "Hortonworks Sandbox in the Cloud"
06-07-2017
03:52 PM
1 Kudo
job.properties only contains properties to be propagated to your workflows, so the short answer is yes, you can have a single job.properties for your 20 workflows. Bundles and coordinators require other parameters (like start/end dates), so it's better to have a specific job.properties for them.
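For example, a minimal job.properties shared by several workflows could look like the following sketch (hostnames are placeholders to adapt; the workflow path itself can still be passed per job):
nameNode=hdfs://<nameservice>
jobTracker=<resourcemanager_host>:8050
queueName=default
oozie.use.system.libpath=true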
01-18-2017
02:17 PM
On RHEL/CentOS you might encounter an exception when trying to stop or restart Oozie: resource_management.core.exceptions.Fail: Execution of 'cd /var/tmp/oozie && /usr/hdp/current/oozie-server/bin/oozie-stop.sh' returned 1. -bash: line 0: cd: /var/tmp/oozie: No such file or directory
This is most likely caused by the /etc/cron.daily/tmpwatch cron job, which deletes files and directories that have been unmodified for more than 30 days: [root@local ~]# cat /etc/cron.daily/tmpwatch
#! /bin/sh
flags=-umc
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
-x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
-X '/tmp/hsperfdata_*' 10d /tmp
/usr/sbin/tmpwatch "$flags" 30d /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
if [ -d "$d" ]; then
/usr/sbin/tmpwatch "$flags" -f 30d "$d"
fi
done
Just recreate the directory and you're good to go: [root@local ~]# mkdir /var/tmp/oozie
[root@local ~]# chown oozie:hadoop /var/tmp/oozie
[root@local ~]# chmod 755 /var/tmp/oozie
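To keep tmpwatch from removing the directory again, you could also exclude it in /etc/cron.daily/tmpwatch (an untested sketch using tmpwatch's -x flag, the same mechanism the script already uses for the /tmp exclusions): /usr/sbin/tmpwatch "$flags" -x /var/tmp/oozie 30d /var/tmp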
01-06-2017
10:27 PM
You should consider running Hadoop Streaming with your Python mapper and reducer. Take a look at https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.2.2.3_Streaming for an example of such a workflow. First try to execute your streaming job directly, with something like yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/theuser/input.csv -output /user/theuser/out
Then it'll be easier to schedule it with Oozie; worst case scenario, you can do a shell action with that command. Please accept the answer if it answered your question.
01-06-2017
09:41 PM
@justlearning Oozie can't do MapReduce by itself; it's a Hadoop scheduler which launches workflows composed of jobs, which can be MapReduce. Here you want to run a job defined by workflow.xml with parameters in job.properties, so the syntax is oozie job --oozie http://sandbox.hortonworks.com:11000/oozie -config job.properties -run
11-25-2016
08:00 AM
OK, you're talking about preparing the environment. That's probably kept brief simply to not overload the docs, since the repositories are filled with Ubuntu packages (same with SuSE, for example) and the commands are very basic (essentially replacing yum with apt). Hortonworks delivering Ubuntu packages means they explicitly support it (from my field perspective, the large majority of deployments are CentOS/RHEL/SuSE), but they could certainly do better in the documentation.
11-24-2016
05:52 PM
1 Kudo
@rama did you try increasing your mappers' memory? And is your query failing with hive.execution.engine=mr as well?
11-24-2016
05:50 PM
"Checksum failed" can be caused by certain Java versions. Is a manual kinit OK with your keytab? You may also want to check whether your keytabs have expired.
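A quick way to test, with a hypothetical keytab path and principal to adapt: kinit -kt /etc/security/keytabs/myservice.keytab myservice/host.example.com@EXAMPLE.COM
You can also list the keytab entries and their timestamps with: klist -kt /etc/security/keytabs/myservice.keytab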
11-24-2016
05:43 PM
@Anirudh K The docs mention Ubuntu explicitly; take a look at the Ambari installation guide: http://docs.hortonworks.com/HDPDocuments/Ambari-2.4.1.0/bk_ambari-installation/content/download_the_ambari_repo.html Even for a manual install, are you talking about something other than Ambari or HDP?
11-03-2016
10:28 PM
More specifically, it's set in your job.properties: namenode=hdfs://<nameservice>
10-23-2016
10:39 AM
@Pierre Villard I got it working with -D mapred.job.name=mySqoopTest