Created on 07-10-2017 07:06 AM - edited 09-16-2022 04:54 AM
Hello,
I have trained myself on Hadoop and I know how to work with MR, Pig, Hive, Spark, Scala, Sqoop and so on; however, I have only worked with these components on my personal system in a single-node setup.
Now I need to know how a real-time, live project works. How does a multi-node architecture work? If I want to process a CSV file, how do I access Spark, Hive and the other components when they are installed on different nodes?
If anyone has detailed documents, or knows of an article, showing the complete steps and process for accessing the different components, please share.
I feel stuck, as nobody in my group or network works on a real-time Hadoop ecosystem.
Created 07-10-2017 01:03 PM
In a real-world situation you connect to the edge node, which has all the client libraries and configs. For example, in a simple 6-node cluster there would be 2 namenodes (HA setup), 3 datanodes (replication factor of 3) and 1 edge node where the client software (Hive, Flume, Sqoop, HDFS clients, etc.) is installed; connections to the cluster should be restricted to go only through the edge node. During deployment of the cluster the jar files are copied to the shared lib; if not, this can be done afterwards.
You should be able to invoke Hive from the edge node.
As the root user on the edge node:
# su - hive
[root@myserver 0]# su - hdfs
[hdfs@myserver ~]$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.5.0.0-817/0/hive-log4j.properties
hive> show databases;
.....
hive> create database olum;
OK
Time taken: 11.821 seconds
hive> use olum;
OK
Time taken: 5.593 seconds
hive> CREATE TABLE olum (surname string, name string, age INT);
OK
Time taken: 8.024 seconds
hive> INSERT INTO olum VALUES ('James', 'Bond', 22), ('Peter','Welsh', 33);
Query ID = hdfs_20161220082127_de77b9f0-953d-4442-a280-aa93dcc30d9c
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
You should see output similar to the above.
Created 07-10-2017 02:21 PM
First of all, thanks Geoffrey for your quick response; I hope I have addressed your name correctly.
Suppose I have one CSV file that I want to process through Spark, submitting the job on YARN, and I need the data to be loaded into Hive tables.
In this case, where would I write my Spark code (I will write the code in Eclipse, but on which machine?), how would I submit it on YARN, and how would I access my Hive tables? Would all the components be distributed, or would Spark and Hive be on the same node?
If they are on the same node, then why do we need the other 3 data nodes if one edge node can do all the work?
Created 07-11-2017 05:18 PM
1. The edge node usually has all the client software (Spark client, Hive client, etc.) installed to interact with the cluster (YARN, NameNode, DataNodes and so on). The edge node gets the client configs distributed during cluster setup, so when you connect to the Hive database using the JDBC driver, your client uses the local hive-site.xml, which holds the Hive database configuration (see the Beeline sketch below).
2. HDFS is a fault-tolerant file system; that is why it is called distributed. In a production environment you will need at minimum 3 datanodes. (Although 3x replication is quite common, the actual optimal value depends on the cost of N-way replication, the cost of failure, and the relative probability of failure.)
The reason for having at least 3 datanodes is to avoid data loss.
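For example, a minimal sketch of connecting from the edge node, assuming HiveServer2 listens on its default port 10000 (the host name, user and file path below are placeholders):
# connect to HiveServer2 over JDBC using the client configs already on the edge node
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n youruser
# check or change the replication factor of a file already in HDFS
hdfs dfs -setrep -w 3 /user/youruser/input/data.csv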
To launch a Spark application in cluster mode:
$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
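As a concrete (hypothetical) sketch of the workflow you describe: you write and build the Spark application in Eclipse on your own machine, copy the resulting jar to the edge node, and run everything from there. All file, class and table names below are made up for illustration:
# copy the input CSV from the edge node's local disk into HDFS
hdfs dfs -put /home/youruser/input/sales.csv /user/youruser/input/
# submit the jar you built in your IDE to YARN from the edge node
spark-submit \
  --class com.example.CsvToHive \
  --master yarn \
  --deploy-mode cluster \
  /home/youruser/jars/csv-to-hive.jar /user/youruser/input/sales.csv
# once the YARN application finishes, check the resulting Hive table over JDBC
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "SELECT * FROM sales LIMIT 10;"
Inside the application itself (Spark 2.x), you would typically create a SparkSession with enableHiveSupport(), read the CSV with spark.read.csv(...) and write it with .write.saveAsTable(...), so the data ends up in the Hive warehouse on HDFS rather than on any single node.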
Created 07-21-2017 07:48 PM
All client software should be able to launch a job on the cluster to be processed, and you can view it through the ResourceManager (RM) UI. The data for a Hive table usually lives in HDFS, under /user/hive/warehouse/xxx, something like this:
hdfs dfs -ls /apps/hive/warehouse/prodtest.db
Found 4 items
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t1
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t2
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t3
drwxrwxrwx   - hive hdfs          0 2016-10-29 00:54 /apps/hive/warehouse/prodtest.db/t4
You should be able to create a table on the fly after your processing:
create database if not exists prodtest;
use prodtest;
-- no LOCATION
create table t1 (i int);
create EXTERNAL table t2 (i int);
create table t3 (i int) PARTITIONED by (b int);
create EXTERNAL table t4 (i int) PARTITIONED by (b int);
-- with LOCATION
create table t5 (i int) LOCATION '/tmp/tables/t5';
create EXTERNAL table t6 (i int) LOCATION '/tmp/tables/t6';
Just examples
Created 08-04-2017 01:20 PM
Hi Hdave,
You should pick up one tool at a time; for example, take Hive.
Below are the high-level steps (a minimal command sketch follows the list).
1. Upload your CSV files from your local system to HDFS. Hint: use the put command.
2. Launch Hive. Hint: use Beeline.
3. Create a Hive table matching the CSV columns.
4. Load the CSV file into the table.
5. Query the table from the Hive CLI or Beeline.
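A minimal sketch of those five steps, with made-up file, database and table names, and assuming HiveServer2 on its default port:
# 1. upload the CSV from the local filesystem to HDFS
hdfs dfs -mkdir -p /user/youruser/input
hdfs dfs -put /home/youruser/employees.csv /user/youruser/input/
# 2. launch Beeline against HiveServer2 (host name is a placeholder)
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default"
-- 3. create a Hive table matching the CSV columns
CREATE TABLE employees (surname STRING, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- 4. load the CSV file into the table (the file is moved from its HDFS location into the warehouse)
LOAD DATA INPATH '/user/youruser/input/employees.csv' INTO TABLE employees;
-- 5. query it
SELECT * FROM employees LIMIT 10;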
Once this is done, please pick up another tool and try the same.
Regards,
Fahim