Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3988 | 10-18-2017 10:19 PM
 | 4253 | 10-18-2017 09:51 PM
 | 14624 | 09-21-2017 01:35 PM
 | 1769 | 08-04-2017 02:00 PM
 | 2356 | 07-31-2017 03:02 PM
05-17-2017
02:36 PM
1 Kudo
@Abraham Abraham
Do you mean the maximum size of each file (a table may have multiple files), or the maximum size of the table? For the maximum file size there is not much you can do beyond the block size (each file consists of multiple blocks); HDFS itself does not limit the size of a file. However, you can limit the amount of data in a directory using an HDFS space quota. So assume you have an external table at /user/mytable. You can set the quota for that directory to, for example, 1 TB, which caps the table at 1 TB while still allowing it to contain multiple files.
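Here is a rough sketch of the quota commands (the directory and the 1 TB value are just the example from above; adjust for your cluster):

```bash
# Cap the total data stored under the table's directory at 1 TB
hdfs dfsadmin -setSpaceQuota 1t /user/mytable

# Check the quota and current usage for the directory
hdfs dfs -count -q -h /user/mytable

# Remove the quota again if needed
hdfs dfsadmin -clrSpaceQuota /user/mytable
```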
05-16-2017
04:04 AM
@Kanagha Kumar The value looks right. Please take a look at the following link for more details: https://community.hortonworks.com/questions/53450/how-do-you-change-the-timezone-for-the-hdp-cluster.html
05-15-2017
11:05 PM
@Karan Alang
Assuming you have implemented everything correctly, ask your network team if port 50470 is open. This is a connection issue and not an SSL issue.
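For a quick check from the client side (the hostname below is a placeholder, not from your environment), something like this will tell you whether the port is reachable and whether a TLS handshake completes:

```bash
# Check basic TCP connectivity to the NameNode HTTPS port
nc -zv namenode.example.com 50470

# If the port is open, inspect the TLS handshake and the certificate being presented
openssl s_client -connect namenode.example.com:50470 </dev/null
```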
05-15-2017
10:41 PM
@Vinay Uppala You are likely logging into the VM, but you need to log into the Docker container. Try the following:

ssh -p 2222 root@127.0.0.1

Notice that the port number is 2222, not 22.
05-15-2017
10:16 PM
[Screenshots: screen-shot-2017-05-15-at-51812-pm.png, screen-shot-2017-05-15-at-45432-pm.png]

Step 7: Make sure you import the CA root into the Ambari server by running "ambari-server setup-security".

So your authority is OpenSSL. In that case there is an OpenSSL root certificate that was used to sign the certificate you imported, and you need that root certificate. Let me elaborate on how this works.

Let's say you go to your bank's website (say "www.chase.com"). How do you know it really is chase.com? What if someone has hijacked the connection and rerouted you to their own server, and the site merely looks like chase.com? You proceed to enter your username and password and get an error. You wonder what happened. Next thing you know, the hacker has your user ID and password and can use them on the real website to access your account.

So how do we solve this? You decide that you cannot trust chase.com simply because it says it is chase.com; you want someone you trust to certify that it really is chase.com. So you first decide to trust certain authorities, such as Verisign or Thawte, and they in turn certify that the site you are visiting is indeed chase.com.

At this point you are probably wondering when you ever trusted Verisign, Thawte, or any other authority. Check your browser: under the advanced settings you should see "manage certificates" or something similar, and in there the system root certificates. Most browsers already ship with root certificates for the dominant authorities like Verisign and Thawte. When you visit chase.com, the site presents a certificate (SSL connections only), and your browser effectively says, "hold on, let me check my root certificates and confirm that your certificate was signed by an authority I trust." Once that is verified, the browser lets you proceed, and you usually see a green lock at the top. This all happens behind the scenes. If you visit a website whose certificate was signed by an authority your browser does not trust (which usually happens with internal websites), you get a "this connection is not trusted" error and an option to proceed anyway. The first screenshot shows that chase.com is signed by Verisign, an authority my browser trusts; the second shows all the root certificates installed in my browser.

So, you need to import the root certificate of the OpenSSL authority that signed your certificate. Without it, just like your browser, you will get an error similar to "this connection is not trusted".

a) When I run ambari-server setup-security, I see the options given below ... So do I use option 5, i.e. import the certificate to the truststore?

Yes, you need to import your OpenSSL root certificate into your truststore (see the keytool sketch below). Notice the name: it says you trust these authorities. A truststore is a special type of keystore that stores the root certificates of the authorities you have decided to trust. The browser screenshot I shared is my browser's truststore.

b) Please note: the truststore and keystore were created on nwk6 (where the NameNode is installed), while Ambari is installed on nwk7. Do the keystore and truststore need to be copied onto nwk7, or re-created?

First, this is something I have not done myself, but a certificate issued for nwk6 will not work for nwk7. Think back to when you created the certificate signing request: it asked you a number of questions, including the common name, which is the name of your server. Certificates are issued for the server from which the certificate signing request was created. I think you are trying to secure your NameNode/clients and not Ambari (I may be wrong), so you might need separate certificates for each server. Long story short: a certificate issued for one server will not work for any other server. Hope this helps.
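Independent of the ambari-server setup-security wizard, the raw keytool command to pull a CA root certificate into a JKS truststore looks roughly like this (file names, paths, and the password are placeholders, not values from your cluster):

```bash
# Import the OpenSSL CA root certificate into the truststore
keytool -importcert -noprompt -alias openssl-ca-root \
  -file ca.crt \
  -keystore truststore.jks \
  -storepass changeit

# Confirm the root certificate is now listed
keytool -list -keystore truststore.jks -storepass changeit
```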
05-14-2017
02:17 AM
@Satish S What's the file type? If it's a CSV, you can use CSVExcelStorage. Otherwise, the following should work:

A = LOAD 'file_name' as (line:chararray);
B = FILTER A by $0 > 1;

See the following link for more options: https://community.hortonworks.com/questions/74738/pig-error-error-orgapachepigtoolsgruntgrunt-error.html#comment-74749
05-13-2017
04:13 AM
@Karan Alang Based on your question, let me elaborate on the difference, because you are confusing a few things here.

First, keystore.jks is a keystore file that stores your private/public key pairs. Think of it as a safe box that holds the keys to different secret rooms: the safe box is the keystore, and the keys are stored inside it. You have generated a key called nwk8 and stored it in this keystore file.

"The client key -> /etc/security/clientKeys/keystore.jks is the default entry in file -> /etc/hadoop/2.5.3.0-37/0/ssl-client.xml": I am not sure I understand this, or where /etc/hadoop/2.5.3.0-37/0/ssl-client.xml comes from all of a sudden.

"Have some basic questions (since I don't think I understand this yet) - which .jks file should I use? Is that something I get from the CA? What if I use OpenSSL?" You can use the .jks you created in the step above, or an enterprise keystore if one exists in your organization (described in this link - Hadoop SSL keystore management factory; if you have one, use it). You do not get a keystore from your certificate authority. The certificate authority only gives you signed certificates, and I am fairly sure you will either have an internal certificate authority or use OpenSSL. If you don't have an internal authority, use OpenSSL to set up your own authority and sign your certificate (ask your boss). To get a signed certificate, you first create a certificate signing request, send it to your certificate authority, and they return a signed certificate. That's it; the sketch below shows roughly what that sequence looks like.
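As a rough sketch of that sequence with keytool and OpenSSL (aliases, file names, the password, and the -dname values are placeholders, not taken from your setup):

```bash
# 1) Create a key pair in the keystore
keytool -genkeypair -alias nwk8 -keyalg RSA -keysize 2048 \
  -keystore keystore.jks -storepass changeit -keypass changeit \
  -dname "CN=nwk8.example.com, OU=Hadoop, O=MyOrg, C=US"

# 2) Create a certificate signing request (CSR) for that key
keytool -certreq -alias nwk8 -keystore keystore.jks -storepass changeit -file nwk8.csr

# 3) Sign the CSR with your OpenSSL CA (ca.crt / ca.key are the CA's own files)
openssl x509 -req -in nwk8.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out nwk8.crt -days 365

# 4) Import the CA root first, then the signed certificate, back into the keystore
keytool -importcert -noprompt -alias ca-root -file ca.crt -keystore keystore.jks -storepass changeit
keytool -importcert -alias nwk8 -file nwk8.crt -keystore keystore.jks -storepass changeit
```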
05-12-2017
10:44 PM
1 Kudo
@Shiv Kabra I think there may be some confusion about what NiFi does, and I also think you are making this more complex than it needs to be.

First things first: there is a ReplaceText processor which you *might* be able to use to mask data, by changing the flow file content and replacing values with your masking values. It supports regular expressions (a small example of the kind of pattern you could use is at the end of this answer).

Now, since you are new to NiFi, here is an overview of what NiFi is purpose-built for. NiFi is a data flow management tool. It lets you create a data flow in a few minutes without writing a single line of code. NiFi can ingest data from multiple sources using different protocols, where the data may arrive in different formats, and process it by enriching metadata, converting formats (for example JSON to Avro), filtering records, tracking lineage, moving data securely across data centers (cloud and on-prem), sending it to different destinations, and much more. Companies use NiFi to manage their enterprise data flows. Its rich feature set includes queuing (at each processor level), back pressure, and lineage.

2. Can I pass the tables list as an input parameter to the process? To do what, and with which processor? Check the list of processors here: https://nifi.apache.org/docs.html

3. Can I restart a process in case there is a failure during execution? This is one of NiFi's best features: when a failure occurs, you can replay records, stop the flow at the processor level, make changes, and restart it.

4. Does NiFi have any built-in processor to handle such requests, i.e. masking sensitive information in tables? I think ReplaceText should do what you are looking for. NiFi is also extensible, so you can write your own processor if none of the 200-plus existing ones is enough, and there is an ExecuteScript processor you can use to call external scripts.
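To illustrate only the regex part (this is not NiFi itself), here is a hypothetical masking pattern demonstrated with sed; in ReplaceText, the same kind of pattern would go into the Search Value and Replacement Value properties. The SSN-style column and the sample record are made up for the example:

```bash
# Mask anything that looks like an SSN (e.g. 123-45-6789) in a delimited record.
# sed is used only to show what the pattern does; in NiFi the regex goes into ReplaceText.
echo "john,123-45-6789,NY" | sed -E 's/[0-9]{3}-[0-9]{2}-[0-9]{4}/XXX-XX-XXXX/g'
# Output: john,XXX-XX-XXXX,NY
```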
05-12-2017
04:01 PM
@Bala Vignesh N V You have a file of 300 MB, and both the block size and the max split size are set to 256 MB, while the min split size is 128 MB. In this case the split size will be 256 MB, so yes, you should see 2 mappers (it's just one MapReduce job) and two output files. As for the max size, based on the formula above it would be 256 MB here. Note that so far we have been assuming map-only jobs; if reducers run as well, you should also look at hive.exec.max.created.files, mapred.reduce.tasks, and hive.exec.reducers.bytes.per.reducer. Check the following link, and you will have to experiment a bit to understand the different scenarios: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties As for your merge question, simply use hive.merge.mapfiles (for map-only jobs) or hive.merge.mapredfiles (for MapReduce jobs) to merge small files.
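For reference, here is one way these settings could be passed on the Hive command line (the table name and the exact byte values are placeholders, chosen only to mirror the 128 MB / 256 MB example above):

```bash
# 128 MB = 134217728 bytes, 256 MB = 268435456 bytes
hive \
  --hiveconf mapreduce.input.fileinputformat.split.minsize=134217728 \
  --hiveconf mapreduce.input.fileinputformat.split.maxsize=268435456 \
  --hiveconf hive.merge.mapfiles=true \
  --hiveconf hive.merge.mapredfiles=true \
  -e "SELECT COUNT(*) FROM my_table;"
# Split size = max(128 MB, min(256 MB, 256 MB)) = 256 MB, so a 300 MB file -> 2 splits/mappers
```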
05-12-2017
02:23 PM
2 Kudos
@Bala Vignesh N V
Why is it split into multiple chunks? Does each file represent the block size? Are you importing data using Sqoop? In that case it is based on the --num-mappers (or -m) argument, and no, it does not depend on the block size: the number of files depends on the number of mappers, and the data is split by the primary key unless you specify --split-by on a different column (see the example below). If you are not using Sqoop, then Hive uses mapreduce.input.fileinputformat.split.minsize. Since Hive 0.13, Hive uses org.apache.hadoop.hive.ql.io.CombineHiveInputFormat by default for hive.input.format, which also combines files that are smaller than mapreduce.input.fileinputformat.split.minsize, assuming the data is not on different nodes. Do the block size and mapred split size have anything to do with the file size in Hive? Here is the formula for the split size: max(mapreduce.input.fileinputformat.split.minsize, min(mapreduce.input.fileinputformat.split.maxsize, dfs.block.size)). Based on this formula, you should be able to tell how many files will be generated. Hopefully this answers your other questions too; please let me know if you have a follow-up question.
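As a rough Sqoop sketch (the JDBC URL, table, and column names below are placeholders, not from your environment):

```bash
sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username myuser -P \
  --table customers \
  --split-by customer_id \
  --num-mappers 4 \
  --target-dir /user/hive/warehouse/customers
# 4 mappers produce 4 output files, each covering roughly a quarter of the customer_id range
```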