Member since: 05-05-2016
Posts: 147
Kudos Received: 223
Solutions: 18
My Accepted Solutions
Title | Views | Posted
--- | --- | ---
 | 3654 | 12-28-2018 08:05 AM
 | 3610 | 07-29-2016 08:01 AM
 | 2969 | 07-29-2016 07:45 AM
 | 6889 | 07-26-2016 11:25 AM
 | 1366 | 07-18-2016 06:29 AM
07-12-2016
08:13 AM
1 Kudo
The yarn application -status {Application ID} command returns "Aggregate Resource Allocation" in terms of "MB-seconds" and "vcore-seconds".
For example, -status for one of my applications returned:
Aggregate Resource Allocation : 12865641 MB-seconds, 1041 vcore-seconds
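For reference, MB-seconds means megabytes of memory held multiplied by the number of seconds they were held, summed over all of the application's containers; vcore-seconds is the same idea for virtual cores. A minimal Python sketch of that calculation, using made-up container numbers, looks like this:

# Each tuple is (memory_mb, vcores, seconds_held) for one container (hypothetical values).
containers = [(2048, 1, 520), (4096, 1, 125)]

mb_seconds = sum(mem * secs for mem, _, secs in containers)
vcore_seconds = sum(cores * secs for _, cores, secs in containers)
print(mb_seconds, "MB-seconds,", vcore_seconds, "vcore-seconds")

# To make the reported number easier to read, convert to GB-hours:
# 12865641 MB-seconds / 1024 / 3600 is roughly 3.5 GB-hours.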
07-11-2016
04:39 PM
Composite key: <LEVELNAME>_<ENTITYNAME>_Key. Note: if multiple keys are needed, put multiple keys in the fact tables.
Oozie job naming: <VENDOR>_<ENTITY>_<LEVELNAME>_<FREQUENCY>_[<CALC>|<AGRT>|<DownStream>].xml
File extensions for Hadoop: HQL files ".hql", Java files ".java", property files ".properties", shell scripts ".sh", Oozie config files ".xml", data definition files ".ddl".
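As a quick illustration only, here is a tiny Python helper that assembles an Oozie job file name from the template above (the vendor, entity, and level values are hypothetical):

def oozie_job_name(vendor, entity, level, frequency, suffix):
    # Follows <VENDOR>_<ENTITY>_<LEVELNAME>_<FREQUENCY>_<CALC|AGRT|DownStream>.xml
    return "_".join([vendor, entity, level, frequency, suffix]) + ".xml"

print(oozie_job_name("ACME", "SALES", "L1", "DAILY", "AGRT"))  # hypothetical values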
07-11-2016
10:09 AM
1 Kudo
Check if this helps: https://community.hortonworks.com/articles/26030/apache-metron-deployment-first-steps-in-the-cloud.html
07-11-2016
07:47 AM
2 Kudos
Is the notebook server running on a different machine, or in a virtual machine? localhost means 'this computer', so the default settings require the notebook server to be running on the same machine as the browser. Also, if you are using Chrome, check Settings -> Enable guest browsing.
07-08-2016
10:04 AM
3 Kudos
Check the URL below for a manual, step-by-step installation of HBase, in case you missed some configuration during setup: http://linuxpitstop.com/configure-distributed-hbase-cluster-on-centos-linux-7/
07-08-2016
06:55 AM
4 Kudos
It seems that HBase is having an issue and cannot retrieve the rows; region servers that are down or not responding can cause this.
Restarting the region servers can resolve the issue.
07-07-2016
06:04 PM
1 Kudo
These conventions are for business applications that are now ready for, or planning, a migration to Hadoop. You don't need to reinvent the convention wheel; we have already done a lot of brainstorming on this.
07-07-2016
06:04 PM
6 Kudos
I have worked with almost 20 to 25 applications. Whenever I start working, I first have to understand each application's naming convention, and I keep wondering why we don't all follow a single convention. Since Hadoop is evolving rapidly, I would like to share my naming conventions, so that if you ever come to my project you will feel comfortable, and so will I if you follow them too.

Database names:
If the application serves a technology: <APPID>_<TECHNOLOGY>_TBLS and <APPID>_<TECHNOLOGY>_VIEW
If the application serves a vendor: <APPID>_<VENDORNAME>_TBLS and <APPID>_<VENDORNAME>_VIEW
If the application further needs to be divided by module: <APPLID>_<MODULE>_TBLS and <APPLID>_<MODULE>_VIEW

Fact table names: TFXXX_<FREQUENCY>_<AGRT>
Note: AGRT is omitted for the table that stores the lowest granularity; it is added only to aggregated data tables.
XXX: range from 001 to 999 (we can set the number according to our requirement)
FREQUENCY: HOURLY (range from 201 to 399), DAILY (range from 401 to 599)

External table names: TEXXX_<FREQUENCY>

Dim table names: TDXXX_<DIM_TYPE_NAME>, XXX: range from 001 to 999

Lookup/config tables: TLXXX_<REF>, XXX: range from 001 to 999

Control tables: TCXXX_<TABLENAME>, XXX: range from 001 to 999

Temporary tables:
TMP_<JOBNAME>_<Name> Note: used for tables that a job creates and drops while it is executing.
PRM_<JOBNAME>_<Name> Note: used for tables into which a job inserts data and then drops while it is executing.

View names: VFXXX_<FREQUENCY>_<AGRT>
Note: AGRT is omitted for the view on the lowest-granularity table; it is added only to aggregated data views.
XXX: range from 001 to 999
FREQUENCY: HOURLY, DAILY, etc.

Column names: should not start with a number, should not contain any special characters except "_", and should start with a capital letter. Note that a few downstream databases have a column-name limit of 128 characters.

Stored procs or HQL queries: PSXXX_[<FREQUENCY>|<CALC>|<AGRT>|<DownStream>]
Example: PS001_ENGINEERING_HOURLY
XXX: range from 001 to 999

Macros: MCXXX_<MODULENAME>, XXX: range from 001 to 999

UDFs (Hadoop): UDFXXX_<MODULENAME>, XXX: range from 001 to 999

Index names: TFXXX_PRI_IDX#_<NUSI/USI>
IDX = constant for the primary index
# = secondary index sequential number (1, 2, 3, 4, ...)
PRI = primary index (used to distribute data across AMPs and for access performance)
NUSI = non-unique secondary index, used for access performance
USI = unique secondary index, used for access performance

In the next article I'll share more naming conventions for Oozie, file naming, and data types...
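As a quick illustration (not part of any project tooling), here is a minimal Python sketch that checks table names against the fact, dim, lookup, and control patterns above:

import re

# Regexes follow the conventions above; the _AGRT suffix is optional on fact tables.
PATTERNS = {
    "fact": re.compile(r"^TF\d{3}_(HOURLY|DAILY)(_AGRT)?$"),
    "dim": re.compile(r"^TD\d{3}_[A-Z_]+$"),
    "lookup": re.compile(r"^TL\d{3}_[A-Z_]+$"),
    "control": re.compile(r"^TC\d{3}_[A-Z_]+$"),
}

def classify(table_name):
    # Return the first convention the name matches, or None if it matches nothing.
    for kind, pattern in PATTERNS.items():
        if pattern.match(table_name):
            return kind
    return None

print(classify("TF401_DAILY_AGRT"))  # fact (hypothetical name)
print(classify("TD001_CUSTOMER"))    # dim (hypothetical name)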
07-07-2016
06:03 PM
4 Kudos
We have a Hive table over HBase, and let's say there are a few columns with the INT datatype, with data loaded from Hive. If we now want to delete data based on the values in one of those INT columns, it is not possible through the usual route: the values are converted to binary, and even the HBase API filter (SingleColumnValueFilter) returns wrong results if we query that column's values from HBase.

Problem to solve: how do we purge Hive INT-datatype column data from HBase?

This is the first, textual part of the series containing the resolution of the above problem. In the next part I'll record a short video running the code and cover other datatypes too. In this scenario we can't use the standard API and can't apply filters on binary column values; the solution is the JRuby program below.

You have already heard many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query your data. I am not going to explain the full use case for why we need HBase under Hive; the simple reason is better visibility/representation of the data in tabular format. I came across this problem a few days back when we needed to purge HBase data after its retention period ended, and we got stuck deleting data from the HBase table using the HBase APIs and filters when a particular column is of INT datatype in Hive.

Below is a sample use case. There are two storage formats for Hive data in HBase: 1. Binary 2. String. Storing data in binary blocks in HBase has its own advantages. The script below creates sample tables in both HBase and Hive.

HBase:
create 'tiny_hbase_table1', 'ck', 'o', {NUMREGIONS => 16, SPLITALGO => 'UniformSplit'}

Hive:
CREATE EXTERNAL TABLE orgdata (
key INT,
kingdom STRING,
kingdomkey INT,
kongo STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b,o:kongo#b")
TBLPROPERTIES(
"hbase.table.name" = "tiny_hbase_table1",
"hbase.table.default.storage.type" = "binary"
);
insert into orgdata values(1,'London',1001,'victoria secret');
insert into orgdata values(2,'India',1001,'Indira secret');
insert into orgdata values(3,'Saudi Arabia',1001,'Muqrin');
insert into orgdata values(4,'Swaziland',1001,'King Mswati');
hbase(main):080:0> scan 'tiny_hbase_table1'
ROW COLUMN+CELL
\x00\x00\x00\x01 column=o:kingdom, timestamp=1467806798430, value=Swaziland
\x00\x00\x00\x01 column=o:kingdomKey, timestamp=1467806798430, value=\x00\x00\x03\xE9
\x00\x00\x00\x02 column=o:kingdom, timestamp=1467806928329, value=India
\x00\x00\x00\x02 column=o:kingdomKey, timestamp=1467806928329, value=\x00\x00\x03\xE9
\x00\x00\x00\x03 column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
\x00\x00\x00\x03 column=o:kingdomKey, timestamp=1467806933574, value=\x00\x00\x03\xE9
\x00\x00\x00\x04 column=o:kingdom, timestamp=1467807030737, value=Swaziland
\x00\x00\x00\x04 column=o:kingdomKey, timestamp=1467807030737, value=\x00\x00\x03\xE9
4 row(s) in 0.0690 seconds
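The row keys and the kingdomKey values come back as raw bytes because the #b mapping stores them in binary: a Hive INT becomes a 4-byte big-endian integer. A quick Python sketch (illustration only) shows why 1001 appears as \x00\x00\x03\xE9 in the scan:

import struct

# A Hive INT stored through the #b (binary) mapping is a 4-byte big-endian integer.
encoded = struct.pack(">i", 1001)
print(encoded)                       # b'\x00\x00\x03\xe9', exactly what the scan shows
print(struct.unpack(">i", encoded))  # (1001,)

# A string filter value such as '1001' is the ASCII bytes b'1001', which can never
# match the stored b'\x00\x00\x03\xe9', so string-based filters miss these cells.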
Now let's apply an HBase filter; we get no result:
hbase(main):001:0> scan 'tiny_hbase_table1', {FILTER => "(PrefixFilter ('\x00\x00\x00\x01')
hbase(main):002:1" scan 'tiny_hbase_table1', {FILTER => "(PrefixFilter ('1')
If we don't know the binary equivalent of an INT column value such as kingdomkey, it is not possible to apply a filter this way. And as you can see below, we get wrong results; even SingleColumnValueFilter fails in this scenario:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'tiny_hbase_table1', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('o'), Bytes.toBytes('kingdomKey'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('1001')), COLUMNS => 'o:kingdom' }
ROW COLUMN+CELL
\x00\x00\x00\x01 column=o:kingdom, timestamp=1467806798430, value=Swaziland
\x00\x00\x00\x02 column=o:kingdom, timestamp=1467806928329, value=India
\x00\x00\x00\x03 column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
\x00\x00\x00\x04 column=o:kingdom, timestamp=1467807030737, value=Swaziland
4 row(s) in 0.3640 seconds
The solution is the JRuby program below. With it you get proper, readable results, and inside the loop you can issue the HBase deleteall command to remove a candidate record as soon as you find it:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Result;
import java.util.ArrayList;
def delete_get_some()
  # Scan only the o:kingdomKey column and decode the binary row keys and
  # values back into readable integers.
  var_table = "tiny_hbase_table1"
  htable = HTable.new(HBaseConfiguration.new, var_table)
  rs = htable.getScanner(Bytes.toBytes("o"), Bytes.toBytes("kingdomKey"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN\+CELL"
  rs.each { |r| r.raw.each { |kv|
    row = Bytes.toInt(kv.getRow)           # row key stored as a 4-byte INT
    ql = Bytes.toString(kv.getQualifier)   # column qualifier, e.g. kingdomKey
    val = Bytes.toInt(kv.getValue)         # cell value stored as a 4-byte INT
    output.add "#{row} #{ql} #{val}"
    # A delete (e.g. deleteall) can be issued here as soon as a row qualifies.
  }
  }
  output.each {|line| puts "#{line}\n"}
end
delete_get_some
ROW COLUMN+CELL
1 kingdomKey 1001
2 kingdomKey 1001
3 kingdomKey 1001
4 kingdomKey 1001
You can declare a variable and apply a custom filter on the decoded values, then delete the row key based on the readable values, for example:
if val <= myVal and row.to_s.include? 'likeme'
  output.add "#{val} #{row} <<<<<<<<<<<<<<<<<<<<<<<<<<- Candidate for deletion"
  deleteall var_table, row
end
Note that deleteall is an HBase shell command, so this snippet is meant to run inside the HBase shell.
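If you would rather do the same purge from Python instead of JRuby, a rough sketch using the happybase Thrift client (this assumes an HBase Thrift server is running; the host, port, and cut-off value below are hypothetical) could look like this:

import struct
import happybase

# Connect through the HBase Thrift server (hypothetical host/port).
connection = happybase.Connection('thrift-host.example.com', 9090)
table = connection.table('tiny_hbase_table1')

cutoff = 1001  # hypothetical retention cut-off for the INT column

# Scan only the binary INT column, decode it client-side, and delete matching rows.
for row_key, data in table.scan(columns=[b'o:kingdomKey']):
    value = struct.unpack('>i', data[b'o:kingdomKey'])[0]
    if value <= cutoff:
        print('deleting row', struct.unpack('>i', row_key)[0])
        table.delete(row_key)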
Hope this solves a problem you are facing too. Let me know if you have any queries or suggestions...
07-06-2016
01:24 PM
1 Kudo
Thank you!!! Yes, a syntax error caused the issue, and now I am able to run the program successfully...