Member since: 05-05-2016
Posts: 147
Kudos Received: 223
Solutions: 18
My Accepted Solutions
Title | Views | Posted
--- | --- | ---
 | 3654 | 12-28-2018 08:05 AM
 | 3610 | 07-29-2016 08:01 AM
 | 2969 | 07-29-2016 07:45 AM
 | 6889 | 07-26-2016 11:25 AM
 | 1366 | 07-18-2016 06:29 AM
07-12-2016
08:13 AM
1 Kudo
The yarn application -status {Application ID} command returns "Aggregate Resource Allocation" in terms of "MB-seconds" and "vcore-seconds".
For example, -status for one of my applications returned:
Aggregate Resource Allocation : 12865641 MB-seconds, 1041 vcore-seconds
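For reference, MB-seconds means megabytes of memory held multiplied by the number of seconds they were held, summed over all of the application's containers; vcore-seconds is the same idea for virtual cores. A minimal Python sketch of that calculation, using made-up container numbers, looks like this:

# Each tuple is (memory_mb, vcores, seconds_held) for one container (hypothetical values).
containers = [(2048, 1, 520), (4096, 1, 125)]

mb_seconds = sum(mem * secs for mem, _, secs in containers)
vcore_seconds = sum(cores * secs for _, cores, secs in containers)
print(mb_seconds, "MB-seconds,", vcore_seconds, "vcore-seconds")

# To make the reported number easier to read, convert to GB-hours:
# 12865641 MB-seconds / 1024 / 3600 is roughly 3.5 GB-hours.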
07-11-2016
04:39 PM
Composite key: <LEVELNAME>_<ENTITYNAME>_Key. Note: if multiple keys are needed, put multiple keys in the fact tables.
Oozie job naming: <VENDOR>_<ENTITY>_<LEVELNAME>_<FREQUENCY>_[<CALC>|<AGRT>|<DownStream>].xml
File extensions for Hadoop: HQL files ".hql", Java files ".java", property files ".properties", shell scripts ".sh", Oozie config files ".xml", data definition files ".ddl".
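As a quick illustration only, here is a tiny Python helper that assembles an Oozie job file name from the template above (the vendor, entity, and level values are hypothetical):

def oozie_job_name(vendor, entity, level, frequency, suffix):
    # Follows <VENDOR>_<ENTITY>_<LEVELNAME>_<FREQUENCY>_<CALC|AGRT|DownStream>.xml
    return "_".join([vendor, entity, level, frequency, suffix]) + ".xml"

print(oozie_job_name("ACME", "SALES", "L1", "DAILY", "AGRT"))  # hypothetical values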
07-11-2016
10:09 AM
1 Kudo
Check if this helps: https://community.hortonworks.com/articles/26030/apache-metron-deployment-first-steps-in-the-cloud.html
07-11-2016
07:47 AM
2 Kudos
Is the notebook server running on a different machine, or in a virtual machine? localhost means 'this computer', so the default settings require the notebook server to be running on the same machine as the browser. Also, if you are using Chrome, check Settings -> Enable guest browsing.
07-08-2016
10:04 AM
3 Kudos
Check the URL below for a manual, step-by-step installation of HBase, in case you missed some configuration during setup: http://linuxpitstop.com/configure-distributed-hbase-cluster-on-centos-linux-7/
07-08-2016
06:55 AM
4 Kudos
It seems that HBase is having an issue and cannot retrieve the rows; region servers that are down or not responding can cause this.
Restarting the region servers can resolve the issue.
07-07-2016
06:04 PM
1 Kudo
These conventions are for business applications that are now ready for, or planning, a migration to Hadoop. You don't need to reinvent the convention wheel; we have already done a lot of brainstorming on this.
07-07-2016
06:04 PM
6 Kudos
I have worked with almost 20 to 25 applications. Whenever I start working, I first have to understand each application's naming convention, and I keep wondering why we don't all follow a single convention. Since Hadoop is evolving rapidly, I would like to share my naming conventions, so that if you ever come to my project you will feel comfortable, and so will I if you follow them too.

Database names:
If the application serves a technology: <APPID>_<TECHNOLOGY>_TBLS and <APPID>_<TECHNOLOGY>_VIEW
If the application serves a vendor: <APPID>_<VENDORNAME>_TBLS and <APPID>_<VENDORNAME>_VIEW
If the application further needs to be divided by module: <APPLID>_<MODULE>_TBLS and <APPLID>_<MODULE>_VIEW

Fact table names: TFXXX_<FREQUENCY>_<AGRT>
Note: AGRT is omitted for the table that stores the lowest granularity; it is added only to aggregated data tables.
XXX: range from 001 to 999 (we can set the number according to our requirement)
FREQUENCY: HOURLY (range from 201 to 399), DAILY (range from 401 to 599)

External table names: TEXXX_<FREQUENCY>

Dim table names: TDXXX_<DIM_TYPE_NAME>, XXX: range from 001 to 999

Lookup/config tables: TLXXX_<REF>, XXX: range from 001 to 999

Control tables: TCXXX_<TABLENAME>, XXX: range from 001 to 999

Temporary tables:
TMP_<JOBNAME>_<Name> Note: used for tables that a job creates and drops while it is executing.
PRM_<JOBNAME>_<Name> Note: used for tables into which a job inserts data and then drops while it is executing.

View names: VFXXX_<FREQUENCY>_<AGRT>
Note: AGRT is omitted for the view on the lowest-granularity table; it is added only to aggregated data views.
XXX: range from 001 to 999
FREQUENCY: HOURLY, DAILY, etc.

Column names: should not start with a number, should not contain any special characters except "_", and should start with a capital letter. Note that a few downstream databases have a column-name limit of 128 characters.

Stored procs or HQL queries: PSXXX_[<FREQUENCY>|<CALC>|<AGRT>|<DownStream>]
Example: PS001_ENGINEERING_HOURLY
XXX: range from 001 to 999

Macros: MCXXX_<MODULENAME>, XXX: range from 001 to 999

UDFs (Hadoop): UDFXXX_<MODULENAME>, XXX: range from 001 to 999

Index names: TFXXX_PRI_IDX#_<NUSI/USI>
IDX = constant for the primary index
# = secondary index sequential number (1, 2, 3, 4, ...)
PRI = primary index (used to distribute data across AMPs and for access performance)
NUSI = non-unique secondary index, used for access performance
USI = unique secondary index, used for access performance

In the next article I'll share more naming conventions for Oozie, file naming, and data types...
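As a quick illustration (not part of any project tooling), here is a minimal Python sketch that checks table names against the fact, dim, lookup, and control patterns above:

import re

# Regexes follow the conventions above; the _AGRT suffix is optional on fact tables.
PATTERNS = {
    "fact": re.compile(r"^TF\d{3}_(HOURLY|DAILY)(_AGRT)?$"),
    "dim": re.compile(r"^TD\d{3}_[A-Z_]+$"),
    "lookup": re.compile(r"^TL\d{3}_[A-Z_]+$"),
    "control": re.compile(r"^TC\d{3}_[A-Z_]+$"),
}

def classify(table_name):
    # Return the first convention the name matches, or None if it matches nothing.
    for kind, pattern in PATTERNS.items():
        if pattern.match(table_name):
            return kind
    return None

print(classify("TF401_DAILY_AGRT"))  # fact (hypothetical name)
print(classify("TD001_CUSTOMER"))    # dim (hypothetical name)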
07-07-2016
06:03 PM
4 Kudos
We have a Hive table over HBase, and let's say there are a few columns with the INT datatype, with data loaded from Hive. If we now want to delete data based on the values in one of those INT columns, it is not possible through the usual route: the values are converted to binary, and even the HBase API filter (SingleColumnValueFilter) returns wrong results if we query that column's values from HBase.

Problem to solve: how do we purge Hive INT-datatype column data from HBase?

This is the first, textual part of the series containing the resolution of the above problem. In the next part I'll record a short video running the code and cover other datatypes too. In this scenario we can't use the standard API and can't apply filters on binary column values; the solution is the JRuby program below.

You have already heard many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query your data. I am not going to explain the full use case for why we need HBase under Hive; the simple reason is better visibility/representation of the data in tabular format. I came across this problem a few days back when we needed to purge HBase data after its retention period ended, and we got stuck deleting data from the HBase table using the HBase APIs and filters when a particular column is of INT datatype in Hive.

Below is a sample use case. There are two storage formats for Hive data in HBase: 1. Binary 2. String. Storing data in binary blocks in HBase has its own advantages. The script below creates sample tables in both HBase and Hive.

HBase:
create 'tiny_hbase_table1', 'ck', 'o', {NUMREGIONS => 16, SPLITALGO => 'UniformSplit'}

Hive:
CREATE EXTERNAL TABLE orgdata (
key INT,
kingdom STRING,
kingdomkey INT,
kongo STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b,o:kongo#b")
TBLPROPERTIES(
"hbase.table.name" = "tiny_hbase_table1",
"hbase.table.default.storage.type" = "binary"
);
insert into orgdata values(1,'London',1001,'victoria secret');
insert into orgdata values(2,'India',1001,'Indira secret');
insert into orgdata values(3,'Saudi Arabia',1001,'Muqrin');
insert into orgdata values(4,'Swaziland',1001,'King Mswati');
hbase(main):080:0> scan 'tiny_hbase_table1'
ROW COLUMN+CELL
\x00\x00\x00\x01 column=o:kingdom, timestamp=1467806798430, value=Swaziland
\x00\x00\x00\x01 column=o:kingdomKey, timestamp=1467806798430, value=\x00\x00\x03\xE9
\x00\x00\x00\x02 column=o:kingdom, timestamp=1467806928329, value=India
\x00\x00\x00\x02 column=o:kingdomKey, timestamp=1467806928329, value=\x00\x00\x03\xE9
\x00\x00\x00\x03 column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
\x00\x00\x00\x03 column=o:kingdomKey, timestamp=1467806933574, value=\x00\x00\x03\xE9
\x00\x00\x00\x04 column=o:kingdom, timestamp=1467807030737, value=Swaziland
\x00\x00\x00\x04 column=o:kingdomKey, timestamp=1467807030737, value=\x00\x00\x03\xE9
4 row(s) in 0.0690 seconds
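The row keys and the kingdomKey values come back as raw bytes because the #b mapping stores them in binary: a Hive INT becomes a 4-byte big-endian integer. A quick Python sketch (illustration only) shows why 1001 appears as \x00\x00\x03\xE9 in the scan:

import struct

# A Hive INT stored through the #b (binary) mapping is a 4-byte big-endian integer.
encoded = struct.pack(">i", 1001)
print(encoded)                       # b'\x00\x00\x03\xe9', exactly what the scan shows
print(struct.unpack(">i", encoded))  # (1001,)

# A string filter value such as '1001' is the ASCII bytes b'1001', which can never
# match the stored b'\x00\x00\x03\xe9', so string-based filters miss these cells.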
Now let's apply an HBase filter; we get no result:
hbase(main):001:0> scan 'tiny_hbase_table1', {FILTER => "(PrefixFilter ('\x00\x00\x00\x01')
hbase(main):002:1" scan 'tiny_hbase_table1', {FILTER => "(PrefixFilter ('1')
If we don't know the binary equivalent of an INT column value such as kingdomkey, it is not possible to apply a filter this way. And as you can see below, we get wrong results; even SingleColumnValueFilter fails in this scenario:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'tiny_hbase_table1', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('o'), Bytes.toBytes('kingdomKey'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('1001')), COLUMNS => 'o:kingdom' }
ROW COLUMN+CELL
\x00\x00\x00\x01 column=o:kingdom, timestamp=1467806798430, value=Swaziland
\x00\x00\x00\x02 column=o:kingdom, timestamp=1467806928329, value=India
\x00\x00\x00\x03 column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
\x00\x00\x00\x04 column=o:kingdom, timestamp=1467807030737, value=Swaziland
4 row(s) in 0.3640 seconds
The solution is the JRuby program below. With it you get proper, readable results, and inside the loop you can issue the HBase deleteall command to remove a candidate record as soon as you find it:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Result;
import java.util.ArrayList;
def delete_get_some()
  # Scan only the o:kingdomKey column and decode the binary row keys and
  # values back into readable integers.
  var_table = "tiny_hbase_table1"
  htable = HTable.new(HBaseConfiguration.new, var_table)
  rs = htable.getScanner(Bytes.toBytes("o"), Bytes.toBytes("kingdomKey"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN\+CELL"
  rs.each { |r| r.raw.each { |kv|
    row = Bytes.toInt(kv.getRow)           # row key stored as a 4-byte INT
    ql = Bytes.toString(kv.getQualifier)   # column qualifier, e.g. kingdomKey
    val = Bytes.toInt(kv.getValue)         # cell value stored as a 4-byte INT
    output.add "#{row} #{ql} #{val}"
    # A delete (e.g. deleteall) can be issued here as soon as a row qualifies.
  }
  }
  output.each {|line| puts "#{line}\n"}
end
delete_get_some
ROW COLUMN+CELL
1 kingdomKey 1001
2 kingdomKey 1001
3 kingdomKey 1001
4 kingdomKey 1001
You can declare a variable and apply a custom filter on the decoded values, then delete the row key based on the readable values, for example:
if val <= myVal and row.to_s.include? 'likeme'
  output.add "#{val} #{row} <<<<<<<<<<<<<<<<<<<<<<<<<<- Candidate for deletion"
  deleteall var_table, row
end
Note that deleteall is an HBase shell command, so this snippet is meant to run inside the HBase shell.
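If you would rather do the same purge from Python instead of JRuby, a rough sketch using the happybase Thrift client (this assumes an HBase Thrift server is running; the host, port, and cut-off value below are hypothetical) could look like this:

import struct
import happybase

# Connect through the HBase Thrift server (hypothetical host/port).
connection = happybase.Connection('thrift-host.example.com', 9090)
table = connection.table('tiny_hbase_table1')

cutoff = 1001  # hypothetical retention cut-off for the INT column

# Scan only the binary INT column, decode it client-side, and delete matching rows.
for row_key, data in table.scan(columns=[b'o:kingdomKey']):
    value = struct.unpack('>i', data[b'o:kingdomKey'])[0]
    if value <= cutoff:
        print('deleting row', struct.unpack('>i', row_key)[0])
        table.delete(row_key)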
Hope this solves a problem you are facing too. Let me know if you have any queries or suggestions...
07-06-2016
01:24 PM
1 Kudo
Thank you!!! Yes, a syntax error caused the issue, and now I am able to run the program successfully...