Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4083 | 10-18-2017 10:19 PM |
| | 4341 | 10-18-2017 09:51 PM |
| | 14840 | 09-21-2017 01:35 PM |
| | 1839 | 08-04-2017 02:00 PM |
| | 2420 | 07-31-2017 03:02 PM |
02-11-2017
10:52 PM
You are very welcome. Please don't hesitate to ask any questions you might have in the future. Good luck.
02-11-2017
10:23 PM
@Shannon Dyck Cloudbreak is for spinning up clusters in the cloud. I would think that might be more expensive for universities, but maybe you already have some agreements in place. Curious why you don't just have students run a Docker container on their own laptops? That would save a lot of resources and make things easier for you. Otherwise, run the Docker containers on lab machines; for remote access, students will have to VPN in and ssh to them.
02-11-2017
07:35 PM
@Ankur Kapoor I used your record and it works. Do you have an InferAvroSchema processor before calling ConvertJSONToAvro? Look at my screenshots: the second one shows the ConvertJSONToAvro processor details, and the third shows the InferAvroSchema details. screen-shot-2017-02-11-at-13434-pm.png screen-shot-2017-02-11-at-13242-pm.png screen-shot-2017-02-11-at-13302-pm.png
02-11-2017
03:36 PM
1 Kudo
@Ankur Kapoor In the last field, after "Vehicle_Speed", there is a space inside the "type" key. Remove that space and it will work.
{ "name" : "Vehicle_Speed"," type" : "integer" } --> wrong
{ "name" : "Vehicle_Speed", "type" : "integer" } --> right
02-10-2017
08:26 PM
@Shannon Dyck
For an environment like you describe, I would recommend using Docker containers. Think about how much memory/CPU each student should get. Basically, set up a container and tune it so it works well for one student. Then it's simply a matter of giving every student one such container to work with. At the end of the semester, destroy the containers. If students want, they can copy their container and take it with them. You reclaim all your resources at the end of the semester and make them available when the new batch comes in. Check this out: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html
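As a rough sketch of the per-student container idea (the image name, resource limits, and SSH port mapping below are placeholders I'm assuming, not something from your environment):
# one resource-capped container per student; repeat with a new name and port for each student
docker run -d --name student01 --memory 4g --cpus 2 -p 2201:22 hadoop-sandbox:latest
# at the end of the semester, reclaim the resources
docker stop student01 && docker rm student01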
02-10-2017
05:06 PM
@Leonardo Araujo Check this link: https://wiki.apache.org/hadoop/HowManyMapsAndReduces Target one map task per block. If the file you are reading has five blocks distributed across three (or four, or five) nodes on five disks, then you should have five mappers, one for each block.
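As an illustrative example (the numbers are mine, not from the question): with a dfs.blocksize of 128 MB, a 600 MB input file is stored as ceil(600 / 128) = 5 blocks, so you would expect roughly 5 map tasks. You can confirm how many blocks a file occupies with (substitute your own path):
hdfs fsck /path/to/file -files -blocks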
02-10-2017
03:37 PM
1 Kudo
@Leonardo Araujo The number of mappers is determined by the split size. Use the --direct-split-size option to specify how much data one mapper will handle, and use --split-by to specify which column to split on. The following is from the Sqoop documentation: When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks. If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices.
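A minimal sketch of such an import (the JDBC URL, credentials, table, and target directory are placeholder assumptions for a MySQL source):
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoop_user -P \
  --table employees \
  --split-by employee_id \
  --num-mappers 4 \
  --target-dir /user/guest/employees
Here --num-mappers 4 asks for four parallel map tasks and --split-by employee_id tells Sqoop which column's range to partition across them.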
02-10-2017
06:25 AM
What's the principal in the keytab sparkjob.keytab? I am pretty sure it's not kadmin. Find out using the following commands on your machine: [root@venice fire-ui]# ktutil
ktutil: read_kt /home/test/sparktest/princpal/sparkjob.keytab
ktutil: list
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
1 1 <will display your principal>
2 1 <will display your principal>
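If the MIT Kerberos client tools are installed, klist can show the same information non-interactively (same keytab path as above):
klist -kt /home/test/sparktest/princpal/sparkjob.keytab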
02-10-2017
01:28 AM
@Shashant Panwar
You can set an HDFS quota. @mbalakrishnan is right: it really doesn't work at the Hive level. Think about it: you set quotas at the directory level. Assume you cap your Hive warehouse directory at 25% of HDFS storage. So what? I'll just create an external table. You restrict that external directory, and I can create another directory and point my external table to it. So here is how you can almost achieve it, with a combination of technology and the policy you implement:
1. Assign HDFS quotas to the directories where users can create tables (the warehouse directory as well as the external-table directories) - a sketch follows below.
2. Keep the combined quota under 25%.
3. Establish an organizational policy that Hive tables must be created only on those directories. If people create tables outside of them, warn them that the data will be deleted.
4. That's it. Enforce your policy.
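A sketch of step 1 using hdfs dfsadmin (the directories and the 10 TB / 2 TB figures are placeholders; note that a space quota is counted against raw, replicated bytes):
# cap the warehouse directory and the agreed external-table directory
hdfs dfsadmin -setSpaceQuota 10t /apps/hive/warehouse
hdfs dfsadmin -setSpaceQuota 2t /data/external_tables
# check usage against the quota
hdfs dfs -count -q -h /apps/hive/warehouse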
02-09-2017
06:15 AM
For your lease exception, what is the value of hbase.regionserver.lease.period? When a client connects to HBase, it gets a lease and needs to report back within this period. If it doesn't, it is considered dead and you run into this exception. One way to avoid this is to increase the lease period, but that just addresses a symptom. The real question is why a client is taking more than 60 seconds (assuming you have the default of 60 seconds set). Check the following link; it is a really good discussion of this issue. http://mail-archives.apache.org/mod_mbox/hbase-user/201209.mbox/%3CCAOcnVr3R-LqtKhFsk8Bhrm-YW2i9O6J6Fhjz2h7q6_sxvwd2yw%40mail.gmail.com%3E Here is the relevant passage from the documentation: In some situations clients that fetch data from a RegionServer get a LeaseException instead of the usual Section 12.5.1, "ScannerTimeoutException or UnknownScannerException". Usually the source of the exception is org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230) (the line number may vary; in your case it is line 221, but it is the exact same exception). It tends to happen in the context of a slow/freezing RegionServer#next call. It can be prevented by having hbase.rpc.timeout > hbase.regionserver.lease.period.
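For example (illustrative values, using the property names from the quoted documentation), you would keep something like the following in hbase-site.xml:
hbase.regionserver.lease.period = 60000    (60 seconds, the default)
hbase.rpc.timeout = 120000                 (kept larger than the lease period)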