Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4082 | 10-18-2017 10:19 PM |
| | 4339 | 10-18-2017 09:51 PM |
| | 14839 | 09-21-2017 01:35 PM |
| | 1839 | 08-04-2017 02:00 PM |
| | 2419 | 07-31-2017 03:02 PM |
02-17-2017
12:08 AM
See my reply inline. I must tell you that I am not a Docker expert, so I highly recommend this forum for your Docker questions. There are a bunch of good people there who would be able to help you.

1. If I build a server for them, am I essentially hosting their containers (are there any issues with hosting multiple containers using the same image)?

Answer: Yes. Use the same image for everyone. There shouldn't be any issue with that (at least theoretically). Remember, each student, once they start working, will change their initial image depending on how they use it.

2. Or am I hosting images for them to pull and run from whatever machine they are on? (Trying to understand server sizing requirements, which is difficult as I don't think I quite understand Docker, although I've picked up some reading material since I last posted as well.)

Answer: You can do that too, but I recall you had issues with giving students their own image, because then you are helping seventy students, each with a different issue on their own laptop. (Quite frankly, I don't think you would be able to avoid this much even if you have your own servers, because each student working on their own image will likely make the same mistakes as they would on their own laptop and ask you for help.) Having students pull down the image will also be the cheapest option.

This doesn't make you sound like a newbie; I am not much of a Docker guy either, except for the concepts and features of how it works. Check that other forum I talked about.
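To make the "one image, many containers" point concrete, here is a minimal sketch that generates one `docker run` command per student from a single shared image. The image name `course-image`, the memory/CPU limits, and the student count are all assumptions for illustration, not values from the discussion above:

```python
# Sketch: one shared image, one isolated container per student.
# "course-image", the resource limits, and the count of 70 students
# are hypothetical values chosen for the example.

def student_container_cmd(student_id: int, image: str = "course-image") -> str:
    """Build a `docker run` command for one student's container."""
    name = f"student{student_id:02d}"
    # Each container gets its own writable layer on top of the same
    # read-only image, so students cannot step on each other's changes.
    return (f"docker run -d --name {name} "
            f"--memory 512m --cpus 1 {image}")

commands = [student_container_cmd(i) for i in range(1, 71)]
print(commands[0])
# docker run -d --name student01 --memory 512m --cpus 1 course-image
```

Since containers share the image's read-only layers, hosting seventy containers costs roughly one copy of the image plus each student's writable layer, which is what makes the single-server approach feasible.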
02-16-2017
06:25 PM
@joseph c
HBase stores data as bytes. The date you have stored, 15-JAN-14, is definitely less than 30-JUN-2011 lexicographically. It is, however, smaller than 01-JUN-2011 because JAN is less than JUN (at least that's what it seems). Maybe put the two date conditions in brackets. But please remember: what you are doing is not really a date comparison as far as HBase is concerned; it is a string comparison. If you want to do proper date comparisons, you should store timestamps instead of dates as strings, or use a format like "20110630".
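A quick illustration of why this matters. Byte-wise comparison, which is all HBase does, orders these strings differently from the calendar, while a zero-padded "YYYYMMDD" format sorts lexicographically in true date order:

```python
# HBase compares row keys/values byte by byte, so date strings in
# mixed formats sort by their characters, not by the calendar.
d1 = b"15-JAN-14"    # January 15, 2014
d2 = b"30-JUN-2011"  # June 30, 2011
print(d1 < d2)  # True: '1' < '3', so the 2014 date wrongly sorts BEFORE 2011

# Zero-padded YYYYMMDD strings sort in the same order as the dates.
k1 = b"20140115"     # January 15, 2014
k2 = b"20110630"     # June 30, 2011
print(k1 > k2)  # True: 2014 correctly sorts after 2011
```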
02-16-2017
02:55 PM
@Bala Vignesh N V Glad it was helpful. If you think it's the complete answer you were looking for, please accept it.
02-15-2017
12:44 AM
1 Kudo
@ssathish
The hdfs user is usually not allowed to access encryption keys. This ensures that even a Hadoop admin cannot access the encrypted data. Check in your Ranger KMS which users are authorized to access keys and use one of those. You should not use the hdfs user to access encryption keys. https://community.hortonworks.com/content/supportkb/49505/how-to-correctly-setup-the-hdfs-encryption-using-r.html
02-14-2017
03:41 PM
8 Kudos
@Bala Vignesh N V Tez is a DAG (directed acyclic graph) architecture. A typical MapReduce job has the following steps:

1. Read data from file --> first disk access
2. Run mappers
3. Write map output --> second disk access
4. Run shuffle and sort --> read map output, third disk access
5. Write shuffle and sort output --> write sorted data for reducers, fourth disk access
6. Run reducers, which read the sorted data --> fifth disk access
7. Write reducer output --> sixth disk access

Tez works very similarly to Spark (Tez was created by Hortonworks well before Spark):

1. Execute the plan, but with no need to read data from disk up front.
2. Once ready to do some calculations (similar to actions in Spark), get the data from disk, perform all steps, and produce output.

Only one read and one write. Notice the efficiency gained by not going to disk multiple times: intermediate results are stored in memory rather than written to disk. On top of that there is vectorization (processing a batch of rows instead of one row at a time). All this adds up to efficiencies in query time.

Now to answer your question on why Tez queries fail while the same queries execute in MR: this should not happen. There may be the occasional bug, and sometimes people who have worked with Hive on MapReduce for a while know how to make things work there but are not as familiar with Tez. I think Tez queries should not fail any more than MapReduce ones.

I highly recommend skimming quickly over the following slides, especially starting from slide 7. http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey
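The disk-access counts listed above can be tallied as a back-of-the-envelope comparison (nothing more than a restatement of the steps):

```python
# Tally of the disk accesses per job described in the steps above.
MR_STEPS = [
    ("read input",             1),  # step 1
    ("write map output",       1),  # step 3
    ("read for shuffle/sort",  1),  # step 4
    ("write sorted data",      1),  # step 5
    ("read into reducers",     1),  # step 6
    ("write reducer output",   1),  # step 7
]
TEZ_STEPS = [
    ("read input",   1),  # intermediate results stay in memory
    ("write output", 1),
]
print(sum(n for _, n in MR_STEPS))   # 6 disk accesses
print(sum(n for _, n in TEZ_STEPS))  # 2 disk accesses
```

For a query that Hive would otherwise compile into several chained MR jobs, the gap widens further, since each extra MR stage repeats the read/write cycle while Tez keeps the intermediate data in memory.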
02-14-2017
03:11 AM
2 Kudos
I have a question about HBase backup. Say I am running HBase replication. Replication is master-slave, and the slave is the DR site. Now imagine that for some reason a network failure occurs, or the slave cluster dies, while the master keeps running just fine.
Now we bring the slave up after three hours (or however many hours). What's the best way in this case to make sure the slave gets the data for those three hours? I was thinking about copyTable using startTime and endTime, but wanted to confirm whether that's the right/best approach. Also, how much load will copyTable create on the master cluster? I read in the documentation that I should be able to run copyTable from my target cluster (the DR site in this case). Is my understanding correct, and how does this alleviate load? Is it because the MapReduce job runs on the remote cluster and reads data from a remote machine?
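For reference, CopyTable's `--starttime`/`--endtime` options take epoch milliseconds. Here is a sketch of computing the window for a three-hour outage and assembling the command; the table name, outage times, and ZooKeeper peer address are hypothetical, and the `--peer.adr` format should be checked against your HBase version's documentation:

```python
from datetime import datetime, timezone

def copytable_cmd(table: str, outage_start: datetime, outage_end: datetime,
                  peer: str) -> str:
    """Build a CopyTable invocation covering the outage window.

    --starttime/--endtime are epoch milliseconds. The peer address
    format (zk-quorum:port:/hbase) is the usual one, but verify it
    against your HBase version's docs.
    """
    start_ms = int(outage_start.timestamp() * 1000)
    end_ms = int(outage_end.timestamp() * 1000)
    return ("hbase org.apache.hadoop.hbase.mapreduce.CopyTable "
            f"--starttime={start_ms} --endtime={end_ms} "
            f"--peer.adr={peer} {table}")

# Hypothetical three-hour outage window.
start = datetime(2017, 2, 14, 0, 0, tzinfo=timezone.utc)
end = datetime(2017, 2, 14, 3, 0, tzinfo=timezone.utc)
print(copytable_cmd("my_table", start, end, "drzk1,drzk2,drzk3:2181:/hbase"))
```

Leaving a little slack on each side of the window (starting slightly before the outage, ending slightly after recovery) is a common precaution, since re-copying a few cells is harmless while missing edits is not.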
Labels:
- Apache HBase
02-13-2017
12:12 AM
@Ankur Kapoor If you have really big XML files which are not coming in real time, but rather sitting on machines, then I would not use NiFi. NiFi is more for real-time data flow. For a use case where you have large files to import and convert between formats, for example from XML to Avro, I would suggest writing a script where you create a Hive table over your XML data and then use INSERT INTO <avro table> SELECT ... FROM <xml table> to write the data in Avro format. Use the following SerDe: https://github.com/dvasilen/Hive-XML-SerDe. There is a good example of how to use it here: http://stackoverflow.com/questions/41299994/parse-xml-and-store-in-hive-table NiFi would do the job too, but I would not introduce a new tool just for this batch use case.
02-12-2017
03:28 PM
@sudarshan kumar Do you have a combiner? Can you try adding one and see if that helps?
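For context, a combiner pre-aggregates map output locally on each mapper before the shuffle, cutting the data sent across the network. A toy sketch of the idea using word count (plain Python, not the actual MapReduce API; the words and counts are made up):

```python
from collections import Counter

# Map output: one (word, 1) pair per occurrence, per mapper.
mapper_outputs = [
    [("hdfs", 1), ("yarn", 1), ("hdfs", 1)],   # mapper 1
    [("hdfs", 1), ("hdfs", 1), ("yarn", 1)],   # mapper 2
]

def combine(pairs):
    """Combiner: sum counts locally on one mapper before the shuffle."""
    c = Counter()
    for word, n in pairs:
        c[word] += n
    return list(c.items())

combined = [combine(out) for out in mapper_outputs]
# Without a combiner, 6 pairs cross the network; with it, only 4.
shuffled = [pair for out in combined for pair in out]
print(len(shuffled))  # 4

# The reducer sees the same totals either way.
reducer = Counter()
for word, n in shuffled:
    reducer[word] += n
print(sorted(reducer.items()))  # [('hdfs', 4), ('yarn', 2)]
```

Note that a combiner only helps when the reduce function is associative and commutative (sums, counts, max/min), since the framework may run it zero or more times.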
02-12-2017
04:48 AM
@Ankur Kapoor Also attaching the template of my workflow: hcc.xml
02-12-2017
04:47 AM
@Ankur Kapoor I just did the same thing and it still works. Where do you write the output of "AttributesToJson"? I am creating a new flow file because this gives me a clean JSON record. I can see the file being created on my machine. Attachments: entireflow.png, evaluate-xpath.png, attributestojson.png, inferavroschema.png, convertjsontoavro.png