Member since: 02-10-2015
Posts: 84
Kudos Received: 2
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 13444 | 06-04-2015 06:09 PM |
| | 7349 | 05-22-2015 06:59 AM |
| | 5997 | 05-13-2015 03:19 PM |
| | 2434 | 05-11-2015 05:22 AM |
05-22-2015
01:11 PM
Also, what should the Spark user's HDFS folder structure look like? So far I have only one HDFS folder: /user/spark/applicationHistory
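For comparison, this is how I'd dump what actually exists under the Spark user's home; the "expected" layout in the comments is an assumption based on a typical CDH 5 setup, not a guarantee:

sudo -u hdfs hdfs dfs -ls -R /user/spark
# Typically expected on CDH 5 (assumption):
#   /user/spark/applicationHistory              <- Spark event logs
#   /user/spark/share/lib/spark-assembly.jar    <- present only after 'Upload Spark Jar'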
05-22-2015
12:46 PM
Should I unset it? CM keeps complaining...
05-22-2015
10:53 AM
Interesting... Somehow, the Spark parameter spark_jar_hdfs_path is set to the HDFS value '/user/spark/share/lib/spark-assmbly.jar' and CM complains about 'Failed parameter validation'! Should I unset it?
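Before unsetting anything, it may be worth checking whether that exact path exists. Note the parameter value above spells 'spark-assmbly.jar'; if that spelling is literal, it would never match an uploaded 'spark-assembly.jar'. A simple check:

hdfs dfs -ls /user/spark/share/lib
# An empty/missing folder, or a filename that doesn't match the parameter
# value byte-for-byte, could explain the validation failure.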
05-22-2015
09:43 AM
Cool! I'll do the same for the SNN's HDFS disk.

<Q1> How does Hadoop know which HDFS folder/file to use: the one(s) on the MASTER or the one(s) on the DATA nodes? Is it the HDFS parameter 'dfs.namenode.edits.dir' that will be set to the directory created on the MASTER? (I guess, based on the RF (Replication Factor), files could be anywhere...) (It will definitely be faster for the MASTER if it has to write to its own local disks...)

<Q2> Should I use RAID-1 for the 2nd 300GB disk (the one that will hold CM's logs) on the MASTER? (I guess I should!)
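On <Q1>, a quick sanity check is to dump the NameNode metadata directories; they resolve to local file:// paths on the master host, not HDFS paths (the example output below is illustrative, not taken from this cluster):

hdfs getconf -confKey dfs.namenode.name.dir
hdfs getconf -confKey dfs.namenode.edits.dir
# Example output (illustrative): file:///dfs/nn
# i.e. the fsimage/edits live on the NameNode's own local disks, while the
# DataNode disks (dfs.datanode.data.dir) hold the replicated blocks.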
05-22-2015
09:14 AM
The Spark Jar Location (HDFS) (spark_jar_hdfs_path) parameter is set to /user/spark/share/lib/spark-assembly.jar. However, the HDFS file /user/spark/share/lib/spark-assembly.jar is NOT there! The only Spark HDFS folder/file that exists is /user/spark/applicationHistory. Although I have run 'Upload Spark Jar' via CM (from the Actions drop-down) successfully (at least that's what CM tells me), when I check the Spark HDFS folders/files, the jar (spark-assembly.jar) is not there!
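If the CM action keeps reporting success without producing the file, here is a manual upload sketch; the local parcel path is an assumption based on a standard CDH parcel layout, so adjust it to your install:

sudo -u spark hdfs dfs -mkdir -p /user/spark/share/lib
# The glob below should match exactly one jar in a parcel install (assumption).
sudo -u spark hdfs dfs -put \
    /opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly*.jar \
    /user/spark/share/lib/spark-assembly.jar
hdfs dfs -ls /user/spark/share/lib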
Labels:
- Apache Spark
05-22-2015
07:49 AM
Wilfred thank you! Some clarifications.

MASTER Node Disk Layout (Total of 4x300GB HDs)
================
-- 2 disks for OS (RAID-1)
-- 1 disk for apps & logs (CM's logs etc...)
-- 1 disk (JBOD) for HDFS (what will be stored here?)

DATA Nodes Disk Layout (Total of 25x300GB HDs)
===============
-- 2 disks for OS (RAID-1)
-- 23 disks (JBODs) for HDFS (see the sketch after this post)
   1. Does it make a difference if the # of disks is even or odd?
   2. Should I go for higher-capacity disks and fewer of them, i.e. 6x1.2TB HDs?

DEFINITELY SPARK ON YARN!!!! The link for the YARN tuning configuration is great!!! Please provide a link for tuning network traffic within the cluster (data movement among nodes in the cluster vs. data ingestion from sources).
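A minimal sketch of how the JBOD mounts end up in HDFS config, assuming mount points named /data/1 ... /data/23 (those names are an assumption):

hdfs getconf -confKey dfs.datanode.data.dir
# Example (illustrative):
#   file:///data/1/dfs/dn,file:///data/2/dfs/dn,...,file:///data/23/dfs/dn
# With the default round-robin volume-choosing policy, HDFS spreads block
# writes across all listed directories, so an odd vs. even disk count makes
# no functional difference.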
05-22-2015
06:59 AM
Thank you for your comment! The issue has been resolved; it had to do with permissions! I had to reset the mode of the /user/history/done folder; /user/history/done_intermediate was fine!
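For anyone landing here later, the fix amounted to something along these lines; the exact mode bits are an assumption on my part, mirroring the sticky-bit pattern already on done_intermediate, so pick what matches your security posture:

# Re-open /user/history/done so the JobHistoryServer can serve the new
# user's jobs; 1777 mirrors done_intermediate's drwxrwxrwt mode (assumption).
sudo -u hdfs hdfs dfs -chmod -R 1777 /user/history/done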
05-20-2015
11:13 AM
That's a good start 🙂 For argument's sake, I am planning on provisioning 1 MASTER node w/ 2 CPUs (Intel E5-2690 v2 @ 3.00GHz, 10 cores each) and 256GB of RAM. Do I turn CPU multi-threading on? (It is actually on by default, which means I get 40 CPU threads.) I will configure 4x300GB disks:
-- 2 disks for OS (RAID-1)
-- 2 disks for apps & logs (RAID-1)
DO I NEED TO CONFIGURE ANY DISKS FOR HDFS ON THE MASTER?
---
For the DATA nodes (3 of them), I plan to have the same CPU/RAM setup as the MASTER. I will configure 25x300GB disks:
-- 2 disks for OS (RAID-1)
-- 2 disks for apps & logs (RAID-1)
-- 21 disks for HDFS (JBODs)
===
Based on the above settings, and the fact that CM and almost all CDH services will be running on the MASTER while DataNodes, Spark Workers, and RegionServers will be running on the DATA nodes, how does it look? Do you have any links/docs to share about the ratio of cores to memory to disks to workload? Also, some useful documentation about configuring YARN's containers would be great! Cheers!
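To make the container question concrete, here is a back-of-the-envelope sizing sketch; the reservation and container sizes below are illustrative assumptions, not CDH defaults:

# Rough YARN container count per DATA node (illustrative numbers only).
TOTAL_MB=$((256 * 1024))     # physical RAM per node
RESERVED_MB=$((32 * 1024))   # OS + DataNode/RegionServer daemons (assumption)
CONTAINER_MB=4096            # per-container allocation (assumption)
echo $(( (TOTAL_MB - RESERVED_MB) / CONTAINER_MB ))   # -> 56 containers/node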
05-19-2015
07:57 AM
I have created a new user (user2), and when running any MapReduce jobs the JobHistoryServer portal (port 19888) does NOT show any! I can see the job from within the ResourceManager Web UI (port 8088), but when I click on the 'History' link (under the Tracking UI column) I get the error: "Not Found: job_xxxxxxxxxxxxx_xxxx". This happens for MapReduce jobs only! When I run Spark jobs (under the same userid, user2) I am able to see the 'History' logs! An existing userid (user1) works fine!

It seems to me that the issue is permissions-related. Here are the HDFS permissions of the userids:

[root@master ~]# hdfs dfs -ls /tmp/logs
Found 5 items
drwxrwxrwt - user1 hadoop 0 2015-04-20 08:48 /tmp/logs/user1
drwxrwxrwt - hdfs supergroup 0 2015-04-09 14:59 /tmp/logs/hdfs
drwxrwxrwx - user2 hadoop 0 2015-05-06 16:23 /tmp/logs/user2
drwxrwxrwt - root hadoop 0 2015-04-09 16:46 /tmp/logs/root

Also, here are the permissions of the HDFS /user/history/done and /user/history/done_intermediate folders:

[root@master ~]# hdfs dfs -ls /user/history
Found 2 items
drwxrwx--- - mapred hadoop 0 2015-05-11 10:32 /user/history/done
drwxrwxrwt - mapred hadoop 0 2015-05-18 14:39 /user/history/done_intermediate
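A diagnostic that helped narrow this down (user2 is the userid from the post above; run as a user that can read the done tree):

# Check whether user2's finished jobs ever landed under /user/history/done,
# and with what ownership/mode.
sudo -u mapred hdfs dfs -ls -R /user/history/done | grep user2
# No matching entries could point at the done folder's restrictive
# drwxrwx--- mode blocking history from being served for the new user.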
05-15-2015
12:16 PM
Looking for Best Practices! With almost all CDH services (and CM) on the Master node, and YARN's NodeManagers, Spark's Workers, HDFS's DataNodes, and HBase's RegionServers on the Data nodes, what type of CPU configuration would be suitable? For instance, should I provision the Master host with 20 cores at 3.00GHz (Intel Xeon CPU E5-2690 v2, Ivy Bridge: 2 CPUs with 10 cores per socket at 3.00GHz)? Should I provision the Data hosts with 24 cores at 2.70GHz (Intel Xeon CPU E5-2697 v2, Ivy Bridge: 2 CPUs with 12 cores per socket, but at 2.70GHz)? Again, looking for the ultimate configuration, optimizing both core count and clock speed...