Support Questions

How to access the different nodes where the different Hadoop components are installed or set up?

Explorer

Hello,

I have trained myself on Hadoop; I know how to work with MapReduce, Pig, Hive, Spark, Scala, Sqoop, and so on. However, I have worked with all these components only on my personal system, in a single-node architecture.

Now I need to know how a real-world, live project works. How does a multi-node setup work? If I want to process a CSV file, how do I access Spark, Hive, and the other components when they are installed on different nodes?

I would appreciate detailed documentation, or any article anyone is aware of, that shows the complete steps and process for accessing the different components.

I feel helpless, as nobody in my group or network works on a real-world Hadoop ecosystem.

1 ACCEPTED SOLUTION

Master Mentor

@Hardik Dave

1. The edge node usually has all the client software (Spark client, Hive client, and so on) installed to interact with the cluster (YARN, NameNode, DataNodes, etc.). The client configurations are distributed to the edge node during cluster setup, so that is where you connect for Hive and Spark; for example, when you connect to the Hive database through the JDBC driver, your client uses the local hive-site.xml, which holds the Hive connection configuration.
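As a rough illustration (the host name, port and user below are placeholders, not values from this thread), connecting to HiveServer2 from the edge node with Beeline over JDBC looks something like this:

# Connect to HiveServer2 over JDBC from the edge node using Beeline.
# Take the real host and port from the hive-site.xml distributed to the edge node.
beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -n hardik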

2. HDFS is a fault-tolerant, distributed file system; that is why it is called distributed computing. In a production environment you need at least 3 DataNodes. (3x replication is quite common, but the optimal value depends on the cost of N-way replication, the cost of failure, and the relative probability of failure.)

The reason for having at least 3 DataNodes is to avoid data loss.
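For example (the path is illustrative), you can check and adjust the replication of a file from the edge node; dfs.replication in hdfs-site.xml sets the cluster-wide default:

# The second column of the listing is the file's replication factor
hdfs dfs -ls /user/hardik/input/people.csv
# Raise or lower replication for an existing file and wait for completion
hdfs dfs -setrep -w 3 /user/hardik/input/people.csv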

To launch a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
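For quick interactive tests from the edge node, a Spark shell running against YARN is an alternative to submitting a packaged application (the resource sizes below are illustrative and depend on your cluster and Spark version):

# Interactive Spark session on YARN, with the driver running on the edge node
./bin/spark-shell --master yarn --deploy-mode client \
  --num-executors 2 --executor-memory 2g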


5 REPLIES

Master Mentor

@Hardik Dave

In a real-world situation you connect to the edge node, which has all the client libraries and configs. For example, in a simple 6-node cluster you would have 2 NameNodes (HA setup), 3 DataNodes (replication factor of 3), and 1 edge node where the client software (Hive, Flume, Sqoop, the HDFS client, etc.) is installed; connections to the cluster should be restricted to go through the edge node only. During deployment of the cluster the jar files are copied to the shared lib; if not, this can be done afterwards.
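A quick sanity check from the edge node that the client configs are in place and the cluster is reachable might look like this (the directory paths follow a typical HDP-style layout and may differ in your environment):

# Client configuration directories distributed during cluster setup
ls /etc/hadoop/conf /etc/hive/conf /etc/spark/conf
# Ask the client configuration which NameNodes serve the cluster
hdfs getconf -namenodes
# List the NodeManagers (worker nodes) YARN knows about
yarn node -list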

You should be able to invoke Hive from the edge node.

As the root user on the edge node, switch to the hive (or hdfs) user and start the Hive CLI:

# su - hive

[root@myserver 0]# su - hdfs
[hdfs@myserver ~]$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.5.0.0-817/0/hive-log4j.properties
hive> show databases;
.....
hive> create database olum;
OK
Time taken: 11.821 seconds
hive> use olum;
OK
Time taken: 5.593 seconds
hive>
hive> CREATE TABLE olum (surname string, name string, age INT);
OK
Time taken: 8.024 seconds
hive> INSERT INTO olum VALUES ('James', 'Bond', 22), ('Peter','Welsh', 33);
Query ID = hdfs_20161220082127_de77b9f0-953d-4442-a280-aa93dcc30d9c
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established. 

You should see output similar to this.

Explorer

First of all, thanks Geoffrey for your quick response; I hope I have addressed your name correctly.

Suppose I have one CSV file that I want to process with Spark, submitting the job on YARN, and I need the data to be loaded into Hive tables.

In this case, where would I write my Spark code (I will write the code in Eclipse, but on which machine?), how would I submit it on YARN, and how would I access my Hive tables? Would all the components be distributed, or would Spark and Hive be on the same node?

If they are on the same node, then why do we need the other 3 DataNodes if one edge node can do all the work?


Master Mentor

@Hardik Dave

All the client software should be able to launch a job on the cluster, which you can then monitor through the ResourceManager UI. Hive table data usually lives in HDFS, typically under /apps/hive/warehouse (or /user/hive/warehouse, depending on the distribution), something like this:

hdfs dfs -ls /apps/hive/warehouse/prodtest.db
Found 4 items
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t1
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t2
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t3
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t4
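Besides the ResourceManager UI mentioned above, you can follow a submitted job from the edge node with the YARN CLI (the application id below is a placeholder you would copy from the list output):

# List applications currently running on the cluster
yarn application -list -appStates RUNNING
# Fetch the aggregated logs of a finished application
yarn logs -applicationId <application_id>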

You should be able to create a table on the fly after your processing:

create database if not exists prodtest;
use prodtest;
--no LOCATION
create table t1 (i int);
create EXTERNAL table t2(i int);
create table t3(i int) PARTITIONED by(b int);
create EXTERNAL table t4(i int) PARTITIONED by(b int);
--with LOCATION
create table t5 (i int) LOCATION '/tmp/tables/t5';
create EXTERNAL table t6 (i int) LOCATION '/tmp/tables/t6';

These are just examples.
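For the CSV case specifically, a common pattern is an external table defined over the HDFS directory that holds the uploaded file; the path and column names below are illustrative:

-- External table reading the raw CSV in place (no data is moved)
CREATE EXTERNAL TABLE IF NOT EXISTS people_csv (
  surname STRING,
  name    STRING,
  age     INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hardik/input';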

Explorer

Hi Hdave,

You should pick up one tool at a time; for example, take Hive.

Below are the high-level steps; a combined sketch follows the list.

1. Upload your CSV files from your local system to the HDFS file system. Hint: use the hdfs dfs -put command.

2. Launch Hive. Hint: use Beeline.

3. Create a Hive table matching the CSV columns.

4. Load the CSV file into the table.

5. Query the table from the Hive CLI or Beeline.
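Putting the steps together, a minimal run might look like the sketch below (the file name, paths, and table layout are illustrative, not from the original post):

# 1. Copy the CSV from the local file system into HDFS
hdfs dfs -mkdir -p /user/hardik/input
hdfs dfs -put people.csv /user/hardik/input/

# 2-5. In the Hive CLI or Beeline: create the table, load the file, query it
hive> CREATE TABLE people (surname STRING, name STRING, age INT)
    >   ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/user/hardik/input/people.csv' INTO TABLE people;
hive> SELECT COUNT(*) FROM people;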

Once this is done, please pick up another tool and try the same.

Regards,

Fahim