Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4083 | 10-18-2017 10:19 PM |
| | 4341 | 10-18-2017 09:51 PM |
| | 14840 | 09-21-2017 01:35 PM |
| | 1839 | 08-04-2017 02:00 PM |
| | 2420 | 07-31-2017 03:02 PM |
02-11-2017
10:52 PM
You are very welcome. Please don't hesitate to ask any questions you might have in the future. Good luck.
02-11-2017
10:23 PM
@Shannon Dyck Cloudbreak is for spinning up clusters in the cloud. I would think that might be more expensive for universities, but maybe you already have some agreements in place. Curious why you don't just have students run a Docker container on their own laptops? That would save a lot of resources and make things easier for you. Otherwise, run the Docker containers on lab machines; for remote access, students will have to VPN in and ssh to them.
02-11-2017
07:35 PM
@Ankur Kapoor I used your record and it works. Do you have an InferAvroSchema processor before calling ConvertJSONToAvro? Look at my screenshots: the second one shows the ConvertJSONToAvro processor details, and the third shows the InferAvroSchema details. screen-shot-2017-02-11-at-13434-pm.png screen-shot-2017-02-11-at-13242-pm.png screen-shot-2017-02-11-at-13302-pm.png
02-11-2017
03:36 PM
1 Kudo
@Ankur Kapoor In the last field, after "Vehicle_Speed", there is a space inside the "type" key. Remove that space and it will work.
{ "name" : "Vehicle_Speed"," type" : "integer" } --> wrong
{ "name" : "Vehicle_Speed", "type" : "integer" } --> right
02-10-2017
08:26 PM
@Shannon Dyck
For an environment like you describe, I would recommend using Docker containers. Think about how much memory/CPU each student should get. Basically, set up a container and tune it so it works well for one student. Then it's simply a matter of giving every student one such container to work with. At the end of the semester, destroy the containers. If students want, they can copy their container and take it with them. You reclaim all your resources at the end of the semester and make them available when the new batch comes in. Check this out: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html
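As a rough sketch of the per-student container idea (the image name, resource limits, and SSH port mapping below are placeholders I'm assuming, not something from your environment):
# one resource-capped container per student; repeat with a new name and port for each student
docker run -d --name student01 --memory 4g --cpus 2 -p 2201:22 hadoop-sandbox:latest
# at the end of the semester, reclaim the resources
docker stop student01 && docker rm student01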
02-10-2017
05:06 PM
@Leonardo Araujo Check this link: https://wiki.apache.org/hadoop/HowManyMapsAndReduces Target one map task per block. If the file you are reading has five blocks distributed across three (or four, or five) nodes on five disks, then you should have five mappers, one for each block.
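As an illustrative example (the numbers are mine, not from the question): with a dfs.blocksize of 128 MB, a 600 MB input file is stored as ceil(600 / 128) = 5 blocks, so you would expect roughly 5 map tasks. You can confirm how many blocks a file occupies with (substitute your own path):
hdfs fsck /path/to/file -files -blocks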
02-10-2017
03:37 PM
1 Kudo
@Leonardo Araujo The number of mappers is determined by the split size. Use the --direct-split-size option to specify how much data one mapper will handle, and use --split-by to specify which column to split on. The following is from the Sqoop documentation: When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks. If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices.
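A minimal sketch of such an import (the JDBC URL, credentials, table, and target directory are placeholder assumptions for a MySQL source):
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username sqoop_user -P \
  --table employees \
  --split-by employee_id \
  --num-mappers 4 \
  --target-dir /user/guest/employees
Here --num-mappers 4 asks for four parallel map tasks and --split-by employee_id tells Sqoop which column's range to partition across them.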
02-10-2017
06:25 AM
What's the principal in the keytab sparkjob.keytab? I am pretty sure it's not kadmin. Find out using the following commands on your machine: [root@venice fire-ui]# ktutil
ktutil: read_kt /home/test/sparktest/princpal/sparkjob.keytab
ktutil: list
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
1 1 <will display your principal>
2 1 <will display your principal>
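If the MIT Kerberos client tools are installed, klist can show the same information non-interactively (same keytab path as above):
klist -kt /home/test/sparktest/princpal/sparkjob.keytab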
02-10-2017
01:28 AM
@Shashant Panwar
You can set an HDFS quota. @mbalakrishnan is right: it really doesn't work at the Hive level. Think about it: you set quotas at the directory level. Assume you cap your Hive warehouse directory at 25% of HDFS storage. So what? I'll just create an external table. You restrict that external directory, and I can create another directory and point my external table to it. So here is how you can almost achieve it, with a combination of technology and the policy you implement:
1. Assign HDFS quotas to the directories where users can create tables (the warehouse directory as well as the external-table directories) - a sketch follows below.
2. Keep the combined quota under 25%.
3. Establish an organizational policy that Hive tables must be created only on those directories. If people create tables outside of them, warn them that the data will be deleted.
4. That's it. Enforce your policy.
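A sketch of step 1 using hdfs dfsadmin (the directories and the 10 TB / 2 TB figures are placeholders; note that a space quota is counted against raw, replicated bytes):
# cap the warehouse directory and the agreed external-table directory
hdfs dfsadmin -setSpaceQuota 10t /apps/hive/warehouse
hdfs dfsadmin -setSpaceQuota 2t /data/external_tables
# check usage against the quota
hdfs dfs -count -q -h /apps/hive/warehouse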
02-09-2017
06:15 AM
For your lease exception, what is the value of hbase.regionserver.lease.period? When a client connects to HBase, it gets a lease and needs to report back within this period. If it doesn't, it is considered dead and you run into this exception. One way to avoid this is to increase the lease period, but that just addresses a symptom. The real question is why a client is taking more than 60 seconds (assuming you have the default of 60 seconds set). Check the following link; it is a really good discussion of this issue. http://mail-archives.apache.org/mod_mbox/hbase-user/201209.mbox/%3CCAOcnVr3R-LqtKhFsk8Bhrm-YW2i9O6J6Fhjz2h7q6_sxvwd2yw%40mail.gmail.com%3E Here is the relevant passage from the documentation: In some situations clients that fetch data from a RegionServer get a LeaseException instead of the usual Section 12.5.1, "ScannerTimeoutException or UnknownScannerException". Usually the source of the exception is org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230) (the line number may vary; in your case it is line 221, but it is the exact same exception). It tends to happen in the context of a slow/freezing RegionServer#next call. It can be prevented by having hbase.rpc.timeout > hbase.regionserver.lease.period.
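For example (illustrative values, using the property names from the quoted documentation), you would keep something like the following in hbase-site.xml:
hbase.regionserver.lease.period = 60000    (60 seconds, the default)
hbase.rpc.timeout = 120000                 (kept larger than the lease period)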