Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4049 | 08-20-2018 08:26 PM |
| | 1951 | 08-15-2018 01:59 PM |
| | 2375 | 08-13-2018 02:20 PM |
| | 4107 | 07-23-2018 04:37 PM |
| | 5019 | 07-19-2018 12:52 PM |
08-04-2016
03:16 AM
@Amila De Silva There is a simple way to do this. In Ambari, go to YARN, click Quick Links, then Resource Manager UI. Click on the job name and you have full access to all of the logs.
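If you prefer the command line, the same logs can usually be pulled with the YARN CLI once you know the application ID (a minimal sketch; the application ID shown is a placeholder):

# List recent applications to find the ID of your job
yarn application -list -appStates ALL | head

# Dump the aggregated container logs for that application
yarn logs -applicationId application_1234567890123_0042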
08-04-2016
02:59 AM
3 Kudos
@Anil Khiani What version of HDP are you using? Can you verify that the Pig client is installed on the node you are trying to run the Pig grunt shell on?
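A quick way to check from that node (a minimal sketch; run it as the same user that launches the grunt shell):

# Confirm the pig launcher is on the PATH and report its version
which pig
pig -version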
08-03-2016
04:24 PM
@Vijaya Narayana Reddy Bhoomi Reddy You can leverage Kerberos impersonation and maintain your read/write policies for the user you plan on impersonating through Ranger. Set up a Ranger policy on cluster 1 that allows the user to read, and a Ranger policy on cluster 2 that allows the user to write. Have you looked into Apache Falcon? It might be easier for setting up the replication. Confirm that hadoop.security.authorization is set to true. To enable Kerberos impersonation, add the following to core-site.xml:

<property>
  <name>hadoop.proxyuser.yourapp.groups</name>
  <value>ImpersonationGrp1,ImpersonationGrp2</value>
</property>
<property>
  <name>hadoop.proxyuser.yourapp.hosts</name>
  <value>host</value>
</property>

Replace yourapp with your service principal name, replace ImpersonationGrp1 and ImpersonationGrp2 with the groups your user is allowed to impersonate, and finally replace host with your app server.
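For what it's worth, the cluster-to-cluster copy itself (separate from the proxy-user configuration above) could be done with DistCp along these lines. This is only a hedged sketch; the keytab path, principal, and NameNode addresses are placeholders:

# Authenticate as the service principal that is allowed to run the copy
kinit -kt /etc/security/keytabs/yourapp.keytab yourapp@YOUR.REALM

# Copy from cluster 1 to cluster 2; the Ranger policies on each side must
# allow the effective user to read the source and write the target
hadoop distcp hdfs://cluster1-nn:8020/data/source hdfs://cluster2-nn:8020/data/target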
08-03-2016
03:45 AM
OK, I found the defaults here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC. But why does DESCRIBE EXTENDED not show me the default values? It assumes I know them.
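Follow-up for anyone else looking: the same defaults are also exposed as Hive configuration properties, so they can be printed from the session instead of digging through the wiki (a sketch; these hive.exec.orc.default.* property names are what I believe the Hive releases shipped with HDP 2.x use):

-- SET with a property name but no value prints its current setting
SET hive.exec.orc.default.compress;
SET hive.exec.orc.default.buffer.size;
SET hive.exec.orc.default.stripe.size;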
08-03-2016
03:44 AM
Is there any way to see the system-level defaults used when I create a Hive ORC table? For example, when I create a Hive table, what is the default for orc.compress.size? I set this manually during table creation; however, I need to know all the defaults. Another example is the compression type, zlib. When I do not call it out and run DESCRIBE EXTENDED, it does not show me the compression type. I assume that means it is the default type, but what is that default? Why does DESCRIBE EXTENDED not read out all the defaults? It assumes I know them.
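For context, this is the kind of DDL I mean; the properties only show up in DESCRIBE EXTENDED or SHOW TBLPROPERTIES when they are set explicitly like this (a minimal sketch, with made-up table and column names):

-- ORC table with the compression settings called out explicitly
CREATE TABLE example_orc (id INT, name STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB', 'orc.compress.size'='262144');

-- Only explicitly set properties are listed here
SHOW TBLPROPERTIES example_orc;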
Labels:
- Apache Hive
08-02-2016
08:33 PM
@jayaprakash gadi Please try this (I haven't tested it yet):

C = foreach (join A by ($1,$2), B by ($1,$2)) generate B.*;
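If the nested form gives you trouble, the same thing split into two statements may be safer (also untested here; it assumes A has three fields, so B's columns start at $3 in the joined relation):

C = JOIN A BY ($1, $2), B BY ($1, $2);
-- project everything from B's side of the join onward
D = FOREACH C GENERATE $3 ..;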
08-02-2016
07:30 PM
7 Kudos
I have been working on EDW for the last 10 years, and in applying well-established relational concepts to Hadoop I have seen many anti-patterns. How about some patterns which work? Let's get to work.

Slowly changing dimensions are a known and well-established design pattern. Those patterns were established on relational theory. Why? Because relational databases were the dominant database technology used by virtually everyone. This article in no way expresses the only way to do SCD on Hadoop; I am sharing with you a few patterns which lead to victory.

What is relational theory, you ask? "In physics and philosophy, a relational theory is a framework to understand reality or a physical system in such a way that the positions and other properties of objects are only meaningful relative to other objects." - wiki

So now we have a challenge. Hadoop and all the integrated animals were not based or founded on relational theory; Hadoop was built on software engineering principles. This is extremely important to understand and absorb. Do not expect 1:1 functionality. The platform paradigms are completely different. There is no lift-and-shift operation or turnkey solution, and if a vendor is selling you that, challenge them. Understand relational theory and how it differs from software engineering principles.

Now that that is out of the way, let's start focusing on slowly changing dimension type 1. What is SCD type 1? "This methodology overwrites old with new data, and therefore does not track historical data." - Wiki. This, in my opinion, is the easiest of the several SCD types: simply upsert based on the surrogate key.

Requirements:
- Data ingested needs simple processing
- Target tables (facts and dims) are of type 1: simply upsert based on a surrogate or natural key
- There are known and unknown query patterns (consumption of the end product)
- There are known query patterns (during integration/ETL)

Step 1

We will first build staging tables in Phoenix (HBase). Why Phoenix? Phoenix/HBase handles upserts very well and handles known query patterns like a champ, so I'm going with Phoenix. ETL will be performed on the staging tables, and the results finally loaded into our product/final output tables. The final output tables are the golden records; they will host all your post-ETL data and will be available for end consumption by downstream BI, analytics, ETL, etc. Using Apache NiFi, simply drag and drop your sources and your Phoenix staging tables onto the canvas and connect them. Do any simple transformations you wish here as well. Some requirements may state to land raw data and store transformed data in another table, essentially creating a System of Record. That will work as well; we will mostly work with the post-System of Record tables here.

Step 2

Next we want to assign a primary key to all records in the staging table. This primary key can be either a surrogate key or a natural key hash. Build a Pig script to join the stage and final dimension records on the natural key. For records which have a match, use the existing primary key and upsert the stage table for those records. For records that do not match, you will need to generate a primary key; again, either generate a surrogate key or use a natural key hash. Important: ExecuteProcess is a processor within Apache NiFi; I am just calling it out to be clear about what needs to be done during the workflow. The part I purposely left out is the "how" of generating a surrogate key. There are many ways to skin a cat. Disgusting. I hate that phrase, but you get the idea. Here are some ways of generating a surrogate key:
- http://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
- https://github.com/manojkumarvohra/hive-hilo
- http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
- Using the Pig RANK function
- http://amintor.com/1/post/2014/07/implement-scd-type-2-in-hadoop-using-hive-transforms.html

Another option to point out: use an RDBMS. I know many cringe when they hear this. I don't care; it works. Do the SCD1 processing for that incremental data set on a free and open source RDBMS, then use the RDBMS table to update the Phoenix stage table. Want to join both data sets? You can also use Spark to join the RDBMS tables and the HBase table; the connector information is here. Then you can do the Step 2 processing in Spark. I plan to write another article on this in the coming days/weeks, so stay tuned. This may end up being the dominant pattern.

Step 4

Referential integrity. What is referential integrity? Referential integrity is a property of data which, when satisfied, requires every value of one attribute (column) of a relation (table) to exist as a value of another attribute (column) in a different (or the same) relation (table). For this topic I plan on creating a separate article. Basically, you either code up all your validation here or build a rules engine; the rules engine will be leveraged to manage referential integrity. Bottom line: Hadoop does not adhere to relational theory, and applying relational theory concepts does not come naturally. There is some thinking involved; I call it engineering. Don't be afraid to take this on. Again, I will post an article on this.

Step 5

Now we have a stage table with our beautiful surrogate keys. Time to update our final tables. But notice I do not only update the Phoenix tables; I have built the same tables and data set in Hive. Why? For known query patterns Phoenix kicks butt, but for unknown query patterns (ad hoc BI queries) I would rather leverage Hive on Tez. Therefore, using Apache NiFi, pull from your stage tables and upsert into the Phoenix and Hive final tables (a minimal Phoenix sketch of this stage-to-final upsert follows at the end of this article). Hive ACID is in technical preview; if you would rather not do upserts in Hive, then this will involve another processing setup. This is well documented here, so no reason for me to regurgitate it.

I hope this helps with your SCD type 1 design on Hadoop. Leverage Apache NiFi and the other animals in Hadoop. Many ways to skin... ahh, I'm going there. In the next article I will post a design pattern on Hadoop for SCD type 2.
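To make Step 5 a bit more concrete, here is a minimal sketch of the stage-to-final load in Phoenix. The table and column names are made up; the point is that UPSERT ... SELECT relies on the final table's primary key, so matching rows are overwritten in place, which is exactly the SCD type 1 behaviour described above.

-- Final (golden) dimension table, keyed on the surrogate key
CREATE TABLE IF NOT EXISTS dim_customer (
    cust_sk    BIGINT NOT NULL PRIMARY KEY,
    cust_name  VARCHAR,
    cust_city  VARCHAR
);

-- Type 1 load: rows whose cust_sk already exists are overwritten in place,
-- rows with a new surrogate key are simply inserted
UPSERT INTO dim_customer (cust_sk, cust_name, cust_city)
SELECT cust_sk, cust_name, cust_city
FROM stg_customer;

The Hive copy of the final table would be refreshed separately, for example following the incremental-update strategy linked above.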
08-02-2016
06:31 PM
Another workaround which worked for me is having the Hive interpreter as the first one called out in your binding. Good friend Binu Matthew found that one.
08-02-2016
04:51 PM
1 Kudo
I am using HDP 2.4.2, and 2.3.2 in another environment. I heard through another blog that specifying the YARN queue during a beeline connection has changed, but I can't find any documentation on this. Here is what I am using:

beeline
!connect jdbc:hive2://your.host:your.port/data_base?mapred.job.queue.name=your_queue_name

If this is no longer valid, please advise how I should be doing it.
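For reference, the in-session alternative I am aware of looks like this (a sketch; which property applies depends on whether the query runs on Tez or MapReduce):

-- on Tez (the default engine on HDP 2.x)
set tez.queue.name=your_queue_name;
-- on MapReduce
set mapreduce.job.queuename=your_queue_name;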
Labels:
- Apache Hive