Member since
05-30-2018
1322
Posts
715
Kudos Received
148
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9929 | 07-20-2016 07:06 PM |
07-08-2016
01:28 AM
@slachterman Good catch. Fixed.
07-06-2016
10:28 PM
5 Kudos
This tutorial shows how to export data from an HBase table into CSV format. We will use the airport data from the American Statistical Association, available here. Assuming you have a sandbox up and running, let's start.

First, ssh into your sandbox and switch to user hdfs:
sudo su - hdfs

Then grab the airport data by issuing a wget:
wget http://stat-computing.org/dataexpo/2009/airports.csv

In my example the file is located at /home/hdfs/airports.csv. The load command further down reads hdfs://sandbox.hortonworks.com:/tmp/airports.csv, so copy the file into HDFS under /tmp first (for example with hdfs dfs -put).

Now let's create an HBase table called "airports" with column family "info". Do this in the hbase shell.

Now that the table is created, let's load it. Exit the hbase shell and, as user hdfs, run the following to load the table:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,info:iata,info:airport,info:city,info:country,info:lat,info:long" airports hdfs://sandbox.hortonworks.com:/tmp/airports.csv

That kicks off a MapReduce job to load the airports table in HBase. Once it is done, you can do a quick verification in the hbase shell by running:
count 'airports'

You should see 3368 records in the table.

Now let's log into the pig shell. We will create a variable called airport_data and load our HBase table into it by issuing:
airport_data = LOAD 'hbase://airports'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:iata,info:airport,info:city,info:country,info:lat,info:long', '-loadKey true')
AS (iata,airport,city,country,lat,long);

Now that we have our data in a variable, let's dump it to HDFS in CSV format by issuing:
store airport_data into 'airportData/export' using PigStorage(',');

So we have dumped the export into the HDFS directory airportData/export. Let's go view it.

And there you go. We have loaded data into an HBase table and exported data from the table using Pig in CSV format. Happy pigging.
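The table creation and the final "view it" step are not spelled out in the text above, so here is a minimal sketch of how they might look from the command line. It assumes the hbase and hdfs clients are on the PATH, that the hbase shell accepts statements on standard input, and that Pig wrote its output as part files under airportData/export. The function names are hypothetical helpers, not part of any tool.

```shell
# Sketch only: assumes the hbase and hdfs clients are on the PATH of the sandbox user.
TABLE="airports"                  # table name used in the article
FAMILY="info"                     # column family used in the article
EXPORT_DIR="airportData/export"   # HDFS directory written by the Pig store

# Build the hbase shell statement that creates the table.
hbase_create_cmd() {
  printf "create '%s', '%s'\n" "$1" "$2"
}

# Create the table by piping the statement into the hbase shell.
create_table() {
  hbase_create_cmd "$TABLE" "$FAMILY" | hbase shell
}

# Print the first ten exported rows from the part files under EXPORT_DIR.
show_export() {
  hdfs dfs -cat "${EXPORT_DIR}/part-*" | head -n 10
}
```

Nothing runs until you call create_table (before the ImportTsv step) or show_export (after the Pig store), so the definitions are safe to paste into a shell session first.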
07-06-2016
08:39 PM
1 Kudo
Follow the instructions here on how to download and import the VM into VirtualBox. Once you have imported the VM, select it and click Settings, then click Network. To assign an IP, select "Bridged Adapter" in the "Attached to" drop-down list. Then, under the Promiscuous Mode option, select "Allow All". Now start your VM. Once the machine is up, verify that you have an IP address. Now you have an IP for your VM. Have fun.
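The same network settings can probably be applied without the GUI. The sketch below uses VBoxManage modifyvm on adapter 1 and assumes the VM is powered off. The VM name and host adapter are placeholders I made up, and set_bridged is a hypothetical helper name.

```shell
# Placeholders: replace with your VM name and your host's network adapter.
VM_NAME="Hortonworks Sandbox"
HOST_ADAPTER="en0"

# Switch adapter 1 of VM $1 to bridged mode on host adapter $2
# and set promiscuous mode to "Allow All".
set_bridged() {
  VBoxManage modifyvm "$1" --nic1 bridged --bridgeadapter1 "$2" --nicpromisc1 allow-all
}
```

Call it as set_bridged "$VM_NAME" "$HOST_ADAPTER"; VBoxManage list bridgedifs shows the adapter names available on your host.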
06-28-2016
10:19 PM
5 Kudos
How to get a docker image up and running which encapsulates a PyCharm IDE integrated with Spark and PyBuilder. The IDE resides in the docker container and is displayed on your laptop/machine. This isolates your development environment, which has Spark integrated.

Why? I am a Spark developer and have spent significant time trying to build an integrated environment. I was spending way too much time on integration before doing what I get paid to do: develop! An isolated environment which is integrated with Spark and a CIT, easily spun up and down, and repeatable is something which would accelerate my efficiency.

Download the latest VirtualBox from here.
To run docker containers or build images, a docker machine is required. Download docker machine from here.
Download XQuartz to display the IDE on your laptop.
View my docker page for information on the docker image here.
Clone my PyCharm GitHub repo. This bootstraps sample code I have built into your docker container during launch. For example, I performed the git clone in my /Users/smanjee/docktest directory:
git clone https://github.com/sunileman/pycharm.git

To start this tutorial, start docker machine in a new terminal. For example, on my laptop the start script is:
/Applications/Docker/Docker*app/Contents/Resources/Scripts/start.sh

Run docker-machine env to check the IP your machine is assigned (informational only).

Pull the image:
docker pull sunileman/pycharm

Build the image:
docker build -t sunileman/pycharm .

Open another terminal and start port forwarding:
socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"

Get your IP address (your own machine's, not the docker machine's).

Run the image:
docker run -it -v /tmp/.X11-unix/:/tmp/.X11-unix/ -v ~/docktest/pycharm/PycharmProjects:/root/PycharmProjects -v ~/docktest/pycharm/.Pycharm40:/root/.PyCharm40 -e DISPLAY=XXX.XX.XX.X:0 --rm sunileman/pycharm

Replace XXX.XX.XX.X with your IP.
Replace ~/docktest/pycharm/PycharmProjects with the path to the PycharmProjects directory you downloaded from my GitHub repo.
Replace ~/docktest/pycharm/.Pycharm40 with the path to the .Pycharm40 directory you downloaded from my GitHub repo.

Click on "I do not have previous versions".
Click on OK.
Click on OPEN to open the project you mounted to the docker container.
Find the PyCharm project to open.
Now the project has been imported.

So you have the project imported into your IDE, which is running within the docker container. To prove the IDE is connected/integrated with Spark, simply run the Python file and you will see that the Spark modules have been imported.
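Since the run command needs three substitutions, a small wrapper can make it less error-prone. This is only a sketch: run_pycharm is a hypothetical helper name, and it assumes the repo was cloned so that PycharmProjects and .Pycharm40 sit directly under the directory you pass in.

```shell
# Launch the PyCharm container.
#   $1 = IP address of your workstation (not the docker machine)
#   $2 = path to the cloned pycharm repo
run_pycharm() {
  host_ip="$1"
  repo_dir="$2"
  docker run -it \
    -v /tmp/.X11-unix/:/tmp/.X11-unix/ \
    -v "${repo_dir}/PycharmProjects:/root/PycharmProjects" \
    -v "${repo_dir}/.Pycharm40:/root/.PyCharm40" \
    -e "DISPLAY=${host_ip}:0" \
    --rm sunileman/pycharm
}
```

For example, run_pycharm 192.168.1.20 ~/docktest/pycharm, where the IP shown is a made-up value standing in for your own.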
05-24-2016
03:53 PM
@Constantin Stanca What do you mean by "Additionally, you need to start the JVM with something like this in order to be able to truly access the JVM remotely"? The JVMs start as they normally do when you use this tool.
05-24-2016
03:51 PM
1 Kudo
@Constantin Stanca This is standard for HBase development. If you can't see what your JVM is doing, you're driving blind. Tuning the flushes for the memstore and the blockcache is vital for performance. Testing GC with G1 vs CMS on the namenode or HBase is vital for performance. For production remote access, yes, you always need clearance. Monitoring the JVM during development is highly useful for the namenode and HBase.
05-21-2016
01:34 AM
12 Kudos
The goal of this article is to provide step-by-step instructions to install jVisualVM to monitor JVMs inside your Hadoop environment. Consult your operations team prior to making any production changes.

Go here to download jVisualVM, a tool to visually monitor JVMs in Hadoop. HBase is a great example where you want to visually analyze JVM health. You may already have this tool on your workstation. Simply go to the command line and type jVisualVM. If it comes up, you're in business. Otherwise, download it.

Once you have the application up and running, install the Visual GC plugin:
Tools -> Available Plugins -> Visual GC

Let's go to the node you want to monitor. For jVisualVM to work, it needs jstatd running on the node. Once jstatd is running, it automatically sends updates about the applications running on the node to jVisualVM. Run jstatd from the command line. If you get the following error:

We need to perform a simple additional configuration. Do the following steps:
Run which jstatd, then cd into the bin location.
Create a policy file jstatd.all.policy in the bin directory (same location as jstatd). We are building a policy that grants jstatd permission to everything. Inside the file add the following:
grant codebase "file:${java.home}/../lib/tools.jar" {
permission java.security.AllPermission;
};
Now run the following command to start jstatd with your policy:
./jstatd -J-Djava.security.policy=jstatd.all.policy &
End of steps for the jstatd error.

Test whether jstatd is running by issuing this command:
jps -l -m -v rmi://localhost
You will see tons of output such as:
Which looks good.

Open jVisualVM (on the command line, type jVisualVM).
Right click on Remote and select Add Remote Host.
Add the hostname or IP address of the node you want to analyze:
Now you are connected. Click on the arrow next to your remote host. This will display all JVMs running on the node:
Now you can start analyzing. Let's click on the hbase data node and click on the Monitor tab:
Click on the Visual GC tab to analyze the different generations inside the JVM.

I hope this article helps you start analyzing JVMs on Hadoop at a deeper level, especially with HBase.
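The jstatd policy steps above can be scripted roughly as follows. This is a sketch, assuming jstatd is on the PATH of the node being monitored. write_jstatd_policy and start_jstatd are hypothetical helper names. The heredoc delimiter is quoted so the shell leaves ${java.home} untouched for the JVM to resolve.

```shell
# Write the permissive jstatd policy from the article to the file named in $1.
write_jstatd_policy() {
  cat > "$1" <<'EOF'
grant codebase "file:${java.home}/../lib/tools.jar" {
    permission java.security.AllPermission;
};
EOF
}

# Start jstatd in the background with the policy file named in $1.
start_jstatd() {
  jstatd -J-Djava.security.policy="$1" &
}
```

A typical sequence would be write_jstatd_policy jstatd.all.policy, then start_jstatd jstatd.all.policy, then the jps check from the article.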
05-20-2016
02:26 AM
Good article. I would like to see which of these support Kerberos and, going further, how one would connect to a kerberized cluster. I find many people struggle with this concept.
04-27-2016
03:05 AM
Falcon by default comes with authorization turned off. To turn it on, set the following through the Ambari Falcon config:
*.falcon.security.authorization.enabled = true
*.falcon.security.authorization.superusergroup = <linux group>

In my example I am using the Linux group "users". The users oozie, ambari-qa, tez, falcon, hue, and guest are in this group. The purpose of the group is to allow only users within it to view, edit, and delete each other's material. Any user outside this group should not have access.

Now log in as user falcon, who is part of the users group:
User falcon has created a cluster "authTest" and a feed "feed1". Let's view it:
Great, so falcon can see the feed and the cluster.

Now let's go in with user hdfs, who is NOT part of the group users:
Logged in as hdfs, this user does a simple search for every cluster/feed entity which exists in Falcon.
The hdfs user's search does not return anything, since the user is not allowed.

Now let's log in with user tez, who IS part of the group users:
User tez does a simple search for every cluster/feed entity which exists in Falcon.
As you can see, tez is able to view what user falcon created, since they are part of the same group. User hdfs was not, since it is not part of that group.
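Before testing in the Falcon UI, it can help to confirm who is actually in the group. The sketch below assumes Falcon resolves group membership from the local Linux accounts on the host where you run it, which may not hold if your cluster maps groups through LDAP or another provider. in_group is a hypothetical helper name.

```shell
# Return success when user $1 belongs to group $2 on this host.
in_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

# Users named in the article, checked against the "users" group.
for u in oozie ambari-qa tez falcon hue guest hdfs; do
  if in_group "$u" users; then
    echo "$u: member of users"
  else
    echo "$u: not in users (or no such account on this host)"
  fi
done
```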