Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3319 | 10-18-2017 10:19 PM
 | 3673 | 10-18-2017 09:51 PM
 | 13376 | 09-21-2017 01:35 PM
 | 1367 | 08-04-2017 02:00 PM
 | 1794 | 07-31-2017 03:02 PM
07-30-2016
02:47 PM
@sankar rao Note the timestamp when you run the query and open the log from that time. I think you should look at beeswax_server.out as well as error.log. Please share what you see.
07-30-2016
02:11 PM
@jestin ma Is it possible to increase the block size of your data? I know you already have the data, which means some extra work, but if you can increase the block size, your number of tasks will go down, and it won't hurt because more of your data will be sitting together. A simple 256 MB block will reduce your number of tasks to 7500. Maybe increase it further and see the benefits? Find your sweet spot with the block size first and then look at other options. I don't think you should go beyond 512 MB, but you have to find out. What is the data format?
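As a rough sketch of rewriting existing data with a larger block size (the paths and the 256 MB value below are placeholders, adjust them to your data):

# Copy existing files into a new directory with a 256 MB block size
# (dfs.blocksize is in bytes; 268435456 = 256 MB)
hadoop distcp -Ddfs.blocksize=268435456 /data/events /data/events_256m

# Or re-upload a local file with the larger block size
hdfs dfs -D dfs.blocksize=268435456 -put events.csv /data/events_256m/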
07-30-2016
02:01 PM
1 Kudo
@oula.alshiekh@gmail.com alshiekh

MapReduce: When Hadoop was first created, the only way to read and write data from Hadoop was to write a MapReduce job. I highly recommend you read the Google MapReduce paper or go over the slides here. One limitation of MapReduce is that the API was in Java, which means you had to be a Java programmer. That limits the platform significantly. What if you want your analysts, who can only write SQL, to be able to query data in Hadoop? That's where Hive comes in: it was created so people can run SQL on their data in Hadoop. See below for details on Hive.

Hive: A tool that enables you to run SQL on top of your tabular data in Hadoop. Imagine you have a CSV, tab-delimited or similar file in Hadoop. Now you want to read the data in this file. You can of course "cat" the file like in Linux, but what if you want to fetch only a few rows? What if you have hundreds of such files and you want to combine data from them and get results the way you would from a traditional RDBMS? Hive enables you to run SQL on your data in Hadoop. Assume you have a file called "fileA" which has data in this format: col1,col2,col3,...,coln. With Hive you can create a table by specifying the location of this file and read the data using SQL. Notice that, unlike a traditional RDBMS, you already had the file and data before you created the table. After creating the table, you can of course bring in more data by either appending to the same file or creating new files with the same structure in the same directory as "fileA" above. There is a sketch of this after this post.

SQOOP: The name comes from "SQL" (SQ) and "Hadoop" (OOP). When companies started using Hadoop and bringing data from traditional databases into Hadoop, there was a need for a tool that helps import data from databases like Teradata, Netezza, SQL Server, Oracle and so on. Sqoop provides a command line mechanism to import data into and export data from Hadoop to a traditional database. It uses the drivers provided by the database you are importing or exporting data to/from.

HUE: "Hadoop User Experience". This tool provides a nice GUI where you can run Hive queries, or even Sqoop commands. It enables you to save your work and come back later. It is basically a development tool. Think of how Eclipse helps you write Java programs; just like that, HUE helps you write Hadoop scripts, for example Hive or Pig scripts.

Pig: A scripting language to work with your data in Hadoop. Hive enables you to write SQL, but what if you want to write something similar to PL/SQL? You can use Pig for that.

Spark: If you read the MapReduce paper, you'll understand how it works. In short, MapReduce reads data from disk (lots of data, on hundreds of machines in parallel) and then, for the intermediate steps (mapper output, shuffle and sort, and finally the reducer output, which is the output of the whole job), it writes data back to disk. This means you go back to disk up to six times: read data from disk, write mapper output to disk, shuffle/sort reads the mapper output and writes sorted data back to disk, and reducers read the sorted data and write the output back. What Spark gives you is that, unlike MapReduce, once the data is read from disk it stays in memory; the results of all intermediate steps stay in memory. This in-memory processing significantly improves the performance of your jobs over MapReduce.

HBase: (H)adoop Data(Base). Based on Google's Bigtable, it provides low-latency (single to low double digit millisecond), high-throughput read/write access for your data in Hadoop. Thousands of records per second can be read from and written to HBase. It is massively scalable; last I knew, your Facebook Messenger was powered by HBase. In 2010, 350 million users were sending around 15 billion messages per month, all powered by HBase. So if you need a fast, highly scalable and reliable technology for your system, HBase is the tool you are looking for. Also check this page.
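A minimal sketch of the Hive part, assuming a made-up file path, table name and column names:

# fileA holds comma-separated rows, e.g. 1,alice,2016-07-30
# Create a Hive table over the directory that contains fileA, then query it with SQL
hive -e "
CREATE EXTERNAL TABLE my_table (col1 INT, col2 STRING, col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/me/fileA_dir';
SELECT col2, col3 FROM my_table LIMIT 10;"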
07-30-2016
04:45 AM
@sankar rao can you please share the logs from /var/log/hue/*. It could be a simple version issue. What version of HUE beeswax are you using? Does it support the hive version you are using?
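Something like this would help (log file names and package names vary a bit by install, so treat this as a sketch):

# Grab the most recent Hue errors
tail -n 100 /var/log/hue/*.log

# Check the installed Hue and Hive versions (on an rpm-based install)
rpm -qa | grep -i -e hue -e hive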
07-29-2016
06:29 AM
@ARUN I cannot say if HBase is corrupted, but you can try running "hbck" from your HBase install bin directory (the one for AMS). If hbck does find any inconsistency, please follow the guidelines on this page to fix the issues. http://hbase.apache.org/0.94/book/apbs03.html If the details on the above page are not enough, please see Appendix C on this link.
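A rough sketch of running it against the AMS-embedded HBase (the paths below are typical HDP defaults and may differ on your install):

# Point hbck at the AMS HBase configuration and run it
export HBASE_CONF_DIR=/etc/ams-hbase/conf
/usr/lib/ams-hbase/bin/hbase hbck

# Check the summary at the end of the output for the number of inconsistencies detected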
07-29-2016
06:17 AM
@Girish Chaudhari How much memory do you have on your cluster (on each node)? A simple select statement just streams results back to the JDBC client. A count(), however, needs to do real work and requires memory, so I am wondering what is assigned here.
07-29-2016
06:08 AM
@ARUN The Ambari Metrics Collector is built using HBase and Phoenix, and HBase uses ZooKeeper. Is your ZooKeeper running fine? What about the HMaster? Check the following page and its child pages for details. https://cwiki.apache.org/confluence/display/AMBARI/Metrics
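A quick sanity check could look like this (the AMS embedded ZooKeeper usually listens on port 61181, but confirm the port and host for your setup):

# ZooKeeper should answer "imok" if it is healthy
echo ruok | nc localhost 61181

# Confirm the collector and its embedded HBase processes are running
ps -ef | grep -i -e ambari-metrics-collector -e ams-hbase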
07-28-2016
03:45 PM
Before I answer this, you need to understand that everything I say here will not just work perfectly in your environment. You will find some issues in your testing, and then you can share them with us and we'll help you along. That being said, theoretically my answers should help you implement this with minimal issues:

1. During steps 3 and 6, will it get service tickets for the name nodes of both primary_cluster and dr_cluster, or only for dr_cluster's name node since the command is being run on a host that is part of dr_cluster? -> Since the same Kerberos from your AD is used for both clusters, it should get service tickets for both name nodes. You should not run into any issue here. That being said, if you see an error during your testing, please share it here and we'll help fix it.

2. Are the keytab files required anywhere else across the clusters apart from distcp-node? -> No. Only on your destination cluster, where you are running your distcp from.

3. Is there a need to configure hadoop auth_to_local settings? If so, what rules are required? I presume these are required; as a bare minimum, auth_to_local rules for abc-usr and xyz-user are needed on both clusters to translate Kerberos principals to user short names. -> Yes, you will need auth_to_local settings. The rule depends on your principal name; usually you need to strip the Kerberos realm. Please see this link for what rule you should set up (there is also a sketch after this list).

4. Is there any need to configure proxy_user rules in Hadoop? -> Yes, you need this because your files are owned by different users but your distcp will run as the special user that only does distcp. So while you run the job as this user, you should let it impersonate whoever owns the data.

5. As both /abc and /xyz are encryption zones, how do we ensure the data is transferred properly? I presume that as the data is read by distcp on primary_cluster, it is transparently decrypted by primary_cluster's KMS, sent over the wire, and re-encrypted on the DR side using dr_cluster's KMS. -> You are talking about encryption at rest. When you read the data, it should be decrypted automatically using whatever mechanism is used when you access your data otherwise (for example, how is data decrypted when you run a Hive query? The same mechanism should automatically kick in, unless there are authorization issues, which of course you'll have to take care of regardless).

6. If the above statement is incorrect, should I run the distcp command on the /abc/.reserved/raw and /xyz/.reserved/raw directories and securely transfer the appropriate KMS keys? What would be the impact in this case if I intend to run distcp using HDFS snapshots? -> Number 5 should work. Honestly, I don't fully understand question 6, but 5 should work.
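To make points 3 and 4 concrete, here is a rough sketch; the realm, principal, host names and paths are placeholders for illustration only:

# core-site.xml auth_to_local example: map abc-usr@AD.EXAMPLE.COM to the short name abc-usr
#   RULE:[1:$1@$0](.*@AD\.EXAMPLE\.COM)s/@.*//
#   DEFAULT
# core-site.xml proxyuser example (lets the distcp user impersonate the data owners):
#   hadoop.proxyuser.distcp-user.hosts  = distcp-node.example.com
#   hadoop.proxyuser.distcp-user.groups = *

# Run the copy from the DR side as the dedicated distcp user
kinit -kt /etc/security/keytabs/distcp.keytab distcp-user@AD.EXAMPLE.COM
hadoop distcp -update -skipcrccheck \
  hdfs://primary-nn.example.com:8020/abc \
  hdfs://dr-nn.example.com:8020/abc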
07-27-2016
07:23 PM
@ScipioTheYounger Here is how you can add multiple principals to the same keytab. Go to kadmin or kadmin.local and then run:

kadmin: xst -norandkey -k <desired keytab file name> principal1/<host fully qualified domain name> principal2/fully.qualified.domain.name

You can also use the ktadd command to add a principal to an existing keytab. Please see the following link. http://web.mit.edu/kerberos/krb5-1.5/krb5-1.5.4/doc/krb5-admin/Adding-Principals-to-Keytabs.html

kadmin: ktadd -k <your existing keytab file> principal
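For example (the principal and file names here are made-up placeholders):

# From the shell, export two service principals into one keytab file
kadmin.local -q "xst -norandkey -k /etc/security/keytabs/combined.keytab \
  nn/host1.example.com@EXAMPLE.COM dn/host1.example.com@EXAMPLE.COM"

# Verify which principals the keytab now contains
klist -kt /etc/security/keytabs/combined.keytab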
07-27-2016
03:12 AM
The user needs to exist on whichever machines you are reading file blocks from, based on POSIX permissions. Like I said, you might not need them if you are using Ranger and/or dfs.permissions.enabled = false in hdfs-site.xml. When you are in HUE and run a Hive query, it runs as the hive user, not as HUE. You want to make sure you have a user named "hive" on the host where you have HiveServer2. Then you can enable Hive impersonation to decide who gets what access. https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.0.0/bk_ambari_views_guide/content/_configuring_your_cluster_for_hive_view.html http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/ Ambari is just a management tool, so you can have Ambari accounts for people who need access to Ambari, and those are independent of the cluster users. See this link to create "local" users for Ambari.
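A minimal sketch of the settings involved (these are the standard Hive/Hadoop property names; the values are only illustrative and are normally changed through Ambari):

# hive-site.xml: run queries as the end user instead of the hive service user
#   hive.server2.enable.doAs = true

# core-site.xml: allow the hive service user to impersonate other users
#   hadoop.proxyuser.hive.hosts  = *
#   hadoop.proxyuser.hive.groups = *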