Member since
05-05-2016
147
Posts
223
Kudos Received
18
Solutions
10-07-2016
03:42 PM
2 Kudos
The PostgreSQL extension PG-Strom allows users to customize the data scan and run queries faster. CPU-intensive workload is identified and offloaded to the GPU, taking advantage of the GPU's powerful parallel execution to complete the data task. The combination of a CPU's few powerful cores and high RAM bandwidth with a GPU is uniquely advantageous: GPUs typically have hundreds of processor cores and RAM bandwidth several times larger than a CPU's, so they can handle large numbers of computations in parallel very efficiently. PG-Strom is based on two basic ideas:
1. On-the-fly native GPU code generation. 2. Asynchronous pipelined execution. The figure below shows how a query is submitted to the execution engine. During the query optimization phase, PG-Strom detects whether a given query is fully or partially executable on the GPU, and then determines whether the query can be offloaded. If it can, PG-Strom creates the source code for a native GPU binary on the fly, starting a just-in-time compilation process before the execution phase. Next, PG-Strom loads the extracted row set into the DMA buffer (a buffer defaults to 15MB) and asynchronously starts the DMA transfers and GPU kernel execution. The CUDA platform allows these tasks to be executed in the background, so PostgreSQL can keep the current process running ahead; these asynchronous, overlapping slices also hide the general latency of GPU acceleration. After loading PG-Strom, running SQL on the GPU does not require special instructions: it allows the user to customize the way PostgreSQL scans tables and provides additional scan/join paths that can run on the GPU. If the expected cost is reasonable, the task manager places the custom scan node in place of the built-in query execution logic.
The graph below shows the benchmark results for PG-Strom and PostgreSQL; the x-axis is the number of tables and the y-axis is the query execution time. In this test, all relevant inner relations could be loaded into GPU RAM in one pass, and pre-aggregation greatly reduced the number of rows the CPU needed to process. For more details, the test code can be viewed at https://wiki.postgresql.org/wiki/PGStrom As can be seen from the figure, PG-Strom is much faster than PostgreSQL alone.
There are a few ways to improve the performance of PostgreSQL: 1. Homogeneous vertical scaling 2. Heterogeneous vertical scaling 3. Horizontal scaling. PG-Strom uses the heterogeneous vertical scaling approach, which maximizes the hardware benefit for the workload's characteristics. In other words, PG-Strom dispatches simple, numerous numerical calculations to the GPU device instead of running them on the CPU cores. https://www.linkedin.com/pulse/pg-storm-let-postgresql-run-faster-gpu-mukesh-kumar?trk=prof-post Evolution, Right...
07-20-2016
02:24 PM
2 Kudos
Heterogeneous Storage in HDFS Hadoop version 2.6.0 introduced a new feature, heterogeneous storage. With heterogeneous storage, different storage media can each play to their strengths according to their read/write characteristics. This is very suitable for cold data: cold data needs large, cheap capacity and does not require high read/write performance, so the most common disks work well for it, while hot data can be stored on SSD this way. On the other hand, when we require very efficient read performance, even ten or a hundred times the read/write speed of an ordinary disk, data can be stored directly in memory and lazily persisted to HDFS. The point of HDFS heterogeneous storage is that we no longer need to build two separate clusters to store hot and cold data; both classes can be handled within one cluster, so this feature has great practical significance. Here I introduce the heterogeneous storage types and how to configure heterogeneous storage flexibly!
Typical scenarios by storage type:
ARCHIVE - Ultra-cold data; hard-disk archival storage is very inexpensive (e.g., bank-statement or video-archive systems).
DISK - Large-scale deployment scenarios with sequential I/O reads and writes; the default storage type.
SSD - Efficient data queries, visualization, and external data sharing; improves performance.
RAM_DISK - For extreme performance.
Hybrid disk - An SSD plus an HDD (SATA or SAS).
HDFS Storage Types:
ARCHIVE - Archival storage is for very dense storage and is useful for rarely accessed data. This storage type is typically cheaper per TB than normal hard disks.
DISK - Hard disk drives are relatively inexpensive and provide sequential I/O performance. This is the default storage type.
SSD - Solid state drives are useful for storing hot data and I/O-intensive applications.
RAM_DISK - This special in-memory storage type is used to accelerate low-durability, single-replica writes.
HDFS Storage Policies (six preconfigured policies):
Hot - All replicas are stored on DISK.
Cold - All replicas are stored on ARCHIVE.
Warm - One replica is stored on DISK and the others are stored on ARCHIVE.
All_SSD - All replicas are stored on SSD.
One_SSD - One replica is stored on SSD and the others are stored on DISK.
Lazy_Persist - The replica is written to RAM_DISK and then lazily persisted to DISK.
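The policy-to-placement mapping above can be summarized as a small lookup table. A minimal sketch (the helper function and dictionary below are mine, purely illustrative, not part of HDFS):

```python
# Which storage type holds each replica under the six preconfigured
# HDFS storage policies listed above (first replica vs. the rest).
STORAGE_POLICIES = {
    "Hot":          {"first": "DISK",     "rest": "DISK"},
    "Cold":         {"first": "ARCHIVE",  "rest": "ARCHIVE"},
    "Warm":         {"first": "DISK",     "rest": "ARCHIVE"},
    "All_SSD":      {"first": "SSD",      "rest": "SSD"},
    "One_SSD":      {"first": "SSD",      "rest": "DISK"},
    "Lazy_Persist": {"first": "RAM_DISK", "rest": "DISK"},
}

def replica_placement(policy, replication=3):
    """Return the storage type of each of the `replication` replicas."""
    p = STORAGE_POLICIES[policy]
    return [p["first"]] + [p["rest"]] * (replication - 1)

print(replica_placement("Warm"))     # one replica on DISK, rest on ARCHIVE
print(replica_placement("One_SSD"))  # one replica on SSD, rest on DISK
```

Note this only models the placement rule; the actual fallback behavior (what HDFS does when the preferred storage type is unavailable) is handled by the NameNode.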
In the next article I'll show practical usage of HDFS storage settings and a storage policy for HDFS using Ambari. To be continued...
07-19-2016
08:44 AM
3 Kudos
Apache Shiro's design is an intuitive and simple way to secure an application. For more detail on the Apache Shiro project, go to http://shiro.apache.org/what-is-shiro.html Software design is generally driven by user stories, that is, the user interface or service API is designed based on how users will interact with the system. For example, a user story might say: after a user logs in, display a button to view personal account information; if the user is not registered, display a registration button. The user story implies the main things the application must accomplish for the user. Even when the "user" here is not a person but a third-party system, the code still treats whatever interacts with the system as a "user". Apache Shiro reflects this concept in its design: by exposing these intuitive notions to developers, Apache Shiro is easy to use in almost any application. Outline: Shiro has three main top-level concepts: Subject, SecurityManager, and Realms. The following diagram describes the interactions between these concepts, and each is introduced one by one below. Subject: a Subject is a security-specific view of the current user. "User" usually implies a person, but a Subject can be a person, a third-party service, a daemon account, or a cron job; anything interacting with the system can be called a Subject. All Subject instances must be bound to a SecurityManager, so that interacting with a Subject is in fact transformed into interacting with the SecurityManager associated with that Subject. SecurityManager: the SecurityManager is the core of the Shiro framework. It exists as an "umbrella" object, coordinating its internal security components, which form an object graph. Once the SecurityManager and its internal objects are configured for the application, the SecurityManager takes a back seat, and developers spend most of their time with the Subject API.
To understand the SecurityManager in more depth, again: when you interact with a Subject, it is really the SecurityManager hidden behind the Subject that does the heavy lifting for any security operation. This is also reflected in the figure above. Realms: a Realm acts as the bridge, or connector, between Shiro and your application's security data sources. When Shiro needs user accounts for authentication (login) or authorization (access control), it looks up the Realm (one or more) configured for this job in the application to obtain the security data. In this sense, a Realm is essentially a security-specific DAO: it encapsulates the connection details of the data source and provides the data Shiro needs. When you configure Shiro, you must provide at least one Realm for authentication and authorization; you can configure multiple Realms, but at least one is required. Shiro ships with a number of Realms that connect to common security data sources such as LDAP, relational databases (JDBC), INI-style text configuration files, and properties files. If the built-in Realms cannot meet your needs, you can plug in your own Realm implementation representing a custom data source. Like the other internal components, the SecurityManager manages how Realms are used to obtain the security and identity information associated with Subjects. The figure below shows the core concepts of the Shiro framework, each briefly described one by one. Original link: http://shiro.apache.org/architecture.html
07-19-2016
12:25 AM
3 Kudos
This article is the first in a series of three; the coming articles will include some code and the replication mechanism present in the latest version of HBase.
HBase Replication
The HBase Replication solution can address cluster safety, data safety, read/write separation, operations and maintenance, and user operating errors; it is easy to manage and configure and provides powerful support for online applications. HBase replication is currently rare in industry, for several reasons: HDFS already keeps multiple backup copies, which helps secure the underlying HBase data; relatively few companies run clusters at large scale; and some data simply is not important enough, for example a logging system, or a second warehouse of historical data used to split off a large number of read requests, where lost data can be regenerated or restored from elsewhere (a database cluster). In such cases a slave replication cluster becomes dispensable and its fundamental value is not reflected. Therefore, if your HBase platform hosts only services with low security requirements and nothing essential, the following discussion of replication clusters may not be worth your reading time.
HBase currently hosts very important applications, both online and offline, so the security of HBase data is also very important. Problems that often arise with a single cluster include:
1. Operator failures and irreversible DDL operations.
2. Corruption of underlying HDFS file blocks.
3. Excessive short-term read pressure on the cluster; adding servers to deal with this situation wastes resources.
4. System upgrades, maintenance, and problem diagnosis increase the cluster's unavailable time.
5. Atomicity of double writes is difficult to guarantee.
6. Unpredictable events (e.g. machine-room power loss, large-scale hardware damage, network disconnection).
7. Offline MapReduce computation competing with online reads and writes, causing larger delays for the online application.
If you worry about the above problems, then a replication cluster is a good choice, and we have done some simple research in this area. Below are the problems we encountered in its use and the approaches we took.
Comparison of popular online backup schemes
Backup schemes for a redundant data center can be analyzed from several angles: consistency, transactionality, latency, throughput, data loss, and failover. We currently have several options:
Simple backup: dump the cluster on a schedule, usually via a snapshot at a set timestamp. This can be designed elegantly, with low or no interference to the online data center. However, the scheme has a major disadvantage: if an unexpected event occurs just before the snapshot time point, all data since the previous snapshot is inevitably lost, which many people cannot accept.
Master-slave mode (Master-Slave): this model has many more advantages than simple backup. It ensures eventual consistency of the data, data flows from the primary cluster to the standby cluster with low latency, and asynchronous writes put little or no performance pressure on the primary cluster; when an incident occurs, less data is lost, and the primary cluster's data is also safeguarded in the standby cluster. It is usually implemented by building a good log system plus checkpoints. It supports read/write separation: the primary cluster serves both reads and writes, while the standby cluster generally bears only read services.
Master-master mode (Master-Master): the principle is similar to master-slave overall; the difference is that the two clusters back each other up, and both can bear read and write services.
Two-phase commit: such schemes ensure strong consistency and transactions; a success returned to the client indicates that the data has definitely been backed up, so no data loss occurs. Each server can bear reads and writes. The disadvantage is higher latency and lower overall cluster throughput.
Paxos algorithm: schemes implemented on the strongly consistent Paxos algorithm ensure that clients connected to different servers see consistent data. The disadvantages are complexity, and latency and throughput that vary across the clustered servers.
For HBase, simple backup mode is relatively easy to handle if the table is offline: you can copy the table, use distcp, or snapshot the table. If the table is online and cannot be taken offline, only the snapshot scheme can back up an online table.
HBase Replication in master-slave mode works by specifying a standby cluster; the primary sends HLog data asynchronously to the standby cluster, with basically no performance impact on the primary cluster and a short data delay. The main cluster provides read and write services, the standby cluster provides read services, and if the primary cluster fails, you can quickly switch to the backup cluster. Looking back at HBase backup status: HBase can offer online and offline backup through the three modes above (simple backup, master-slave, and master-master).
HBase Replication in master-master mode lets two clusters back each other up, with both providing read and write services and separating reads and writes between them.
By comparison, our overall opinion is that the HBase Replication solution can address cluster safety, data safety, read/write separation, operations and maintenance, and user operating errors; it is easy to manage and configure and provides powerful support for online applications. To be continued...
07-15-2016
11:35 AM
8 Kudos
Apache Kylin origin In today's era of big data, Hadoop has become the de facto standard, and a large number of tools have been built around the Hadoop platform to address the needs of different scenarios. For example, Hive is a data warehouse tool on Hadoop: data files stored on the HDFS distributed file system can be mapped to database tables and queried with SQL. Hive's execution engine converts SQL into MapReduce jobs, which is ideally suited to data warehouse analysis.
Another example is HBase, a highly available, high-performance, column-oriented, scalable distributed storage system based on Hadoop, with HDFS providing highly reliable underlying storage for HBase. Existing business analytics tools such as Tableau have significant limitations here: they are difficult to scale horizontally, cannot handle very large-scale data, and lack support for Hadoop. Apache Kylin (Chinese name: Kirin) appeared to solve these problems on top of Hadoop. Apache Kylin is an open-source distributed analytics engine originally developed at eBay and contributed to the open-source community. It provides a SQL query interface and multidimensional analysis (OLAP) capability on Hadoop for large-scale data; it can handle TB- to PB-level analysis tasks, query huge Hive tables in sub-second time, and support high concurrency.
Apache Kylin scenarios
(1) Your data exists on the Hadoop HDFS distributed file system, you use Hive to build a data warehouse system on HDFS for data analysis, and the data volume is huge, e.g. TB level. (2) You also use HBase on the Hadoop platform for data storage and use HBase row keys for fast data query applications. (3) Your Hadoop platform accumulates a huge amount of data daily and you would like to do multidimensional analysis on it. If your application is similar to the above, Apache Kylin is very suitable for multidimensional analysis of large amounts of data.
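Kylin's core "space for time" idea, precomputing multidimensional aggregates once and answering queries by lookup instead of scanning, can be illustrated with a toy sketch. This shows only the general OLAP-cube idea with made-up sample data, not Kylin's actual implementation:

```python
from itertools import combinations
from collections import defaultdict

# Toy fact rows: (country, year, amount) -- hypothetical sample data.
rows = [
    ("IN", 2015, 10), ("IN", 2016, 20),
    ("US", 2015, 30), ("US", 2016, 40),
]
dims = ("country", "year")

# Precompute SUM(amount) for every combination of dimensions (every "cuboid").
cube = defaultdict(int)
for country, year, amount in rows:
    values = {"country": country, "year": year}
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            key = (group, tuple(values[d] for d in group))
            cube[key] += amount

def query(**filters):
    """Answer an aggregate query with a single lookup instead of a scan."""
    group = tuple(d for d in dims if d in filters)
    return cube[(group, tuple(filters[d] for d in group))]

print(query())                         # 100 (grand total)
print(query(country="IN"))             # 30
print(query(country="US", year=2016))  # 40
```

The trade-off is exactly the one described in this article: the cube costs extra storage (one entry per cuboid cell), but every query becomes a constant-time lookup.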
Apache Kylin's core idea is to trade space for time: precomputed multidimensional results are stored in HBase for fast data querying. And because Apache Kylin applies a variety of flexible policies on the query side to further improve space utilization, the balance struck by this approach is worthwhile. Apache Kylin development history
Apache Kylin was open-sourced on GitHub in October 2014, joined the Apache Incubator in November 2014, and officially graduated to a top-level Apache project in November 2015, becoming the first top-level Apache project designed and developed by an entirely Chinese team. The Apache Kylin official website is:
http://kylin.apache.org
In March 2016, Apache Kylin's core developers founded the company Kyligence in Shanghai, to better promote the rapid development of the project and the community. The company's official website is: http://kyligence.io
To support further development, in April 2016 the big data company Kyligence received a multi-million-dollar angel investment round.
07-12-2016
08:38 AM
5 Kudos
I have done data analysis for one of my projects using the approach below, and hopefully it will help you understand the underlying subject. Soon I'll post my project on data analysis with a detailed description of the technologies used: Python (web scraping for data collection), Hadoop, Spark, and R. Data analysis is a highly iterative and non-linear process, better reflected as a series of cycles, in which information is learned at each step; that information then informs whether (and how) to refine and redo the step that was just performed, or whether (and how) to proceed to the next step. Setting the Scene: while a study includes developing and executing a plan for collecting data, a data analysis presumes the data have already been collected. More specifically, a study includes the development of a hypothesis or question, the design of the data collection process (or study protocol), the collection of the data, and the analysis and interpretation of the data. Activities of Data Analysis: there are 5 core activities of data analysis: 1. Stating and refining the question 2. Exploring the data 3. Building formal statistical models 4. Interpreting the results 5. Communicating the results. 1. Stating and Refining the Question: doing data analysis requires quite a bit of thinking, and we believe that when you've completed a good data analysis, you've spent more time thinking than doing. The thinking begins before you even look at a dataset, and it's well worth devoting careful thought to your question. This point cannot be over-emphasized, as many of the "fatal" pitfalls of a data analysis can be avoided by expending the mental energy to get your question right. Types of Questions: Descriptive: a descriptive question is one that seeks to summarize a characteristic of a set of data.
Examples include determining the proportion of males, the mean number of servings of fresh fruits and vegetables per day, or the frequency of viral illnesses in a set of data collected from a group of individuals.
Exploratory: an exploratory question is one in which you analyze the data to see if there are patterns, trends, or relationships between variables. These types of analyses are also called "hypothesis-generating" analyses because, rather than testing a hypothesis as would be done with an inferential, causal, or mechanistic question, you are looking for patterns that would support proposing a hypothesis.
Inferential: an inferential question would be a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data.
Predictive: a predictive question would be one where you ask what types of people will eat a diet high in fresh fruits and vegetables during the next year. In this type of question you are less interested in what causes someone to eat a certain diet, just what predicts whether someone will eat it. For example, higher income may be one of the final set of predictors, and you may not know (or even care) why people with higher incomes are more likely to eat a diet high in fresh fruits and vegetables; what matters most is that income is a factor that predicts this behavior.
Mechanistic: a question that asks how a diet high in fresh fruits and vegetables leads to a reduction in the number of viral illnesses would be a mechanistic question. Its answer tells us, if the diet does indeed cause a reduction in the number of viral illnesses, how the diet leads to that reduction.
2. Exploratory Data Analysis: exploratory data analysis is the process of exploring your data, and it typically includes examining the structure and components of your dataset, the distributions of individual variables, and the relationships between two or more variables. The most heavily relied upon tool for exploratory data analysis is visualizing the data using a graphical representation. The goals of exploratory data analysis are: 1. To determine if there are any problems with your dataset. 2. To determine whether the question you are asking can be answered by the data that you have. 3. To develop a sketch of the answer to your question.
3. Using Models to Explore Your Data: in a very general sense, a model is something we construct to help us understand the real world. A simple summary statistic, such as the mean of a set of numbers, is not enough to formulate a model; a statistical model must also impose some structure on the data. At its core, a statistical model provides a description of how the world works and how the data were generated. The model is essentially an expectation of the relationships between various factors in the real world and in your dataset. What makes a model a statistical model is that it allows for some randomness in generating the data.
4. Comparing Model Expectations to Reality: inference is one of many possible goals in data analysis, so it's worth discussing what exactly the act of making inference is. 1. Describe the sampling process. 2. Describe a model for the population (my data are a subset of the population). Drawing a fake picture: to begin with, we can make some pictures, like a histogram of the data. Reacting to data, refining our expectations: if the model and the data don't match very well, as was indicated by the histogram above, what do we do? We can either 1. Get a different model, or 2. Get different data.
5. Interpreting Your Results and Communicating
Conclusion: communication is fundamental to good data analysis. You gather data by communicating your results, and the responses you receive from your audience should inform the next steps in your data analysis. The types of responses you receive include not only answers to specific questions, but also commentary and questions your audience has in response to your report. References, additional information: MyBlogSite
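Step 4 above, comparing model expectations to reality, can be sketched with a tiny example: assume a model that expects a mean of 5, then check the observed mean against it. The data and the model expectation here are made up purely for illustration:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical daily counts of servings of fresh fruits and vegetables.
data = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]

model_mean = 5.0                      # the model's expectation (assumed)
obs_mean = mean(data)
se = stdev(data) / sqrt(len(data))    # standard error of the mean

# If the observed mean lies many standard errors from the model's
# expectation, the model and the data don't match: refine the model
# (or get different data), as described in step 4 above.
z = (obs_mean - model_mean) / se
print(f"observed mean={obs_mean:.2f}, z={z:.1f}")
if abs(z) > 2:
    print("model and data don't match; revise expectations")
```

In a real analysis you would also draw the histogram mentioned above; the z-score is just the numeric version of the same comparison.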
07-11-2016
04:39 PM
For composite keys: <LEVELNAME>_<ENTITYNAME>_Key. Note: for multiple keys, put the multiple keys in the fact tables.
Oozie job naming: <VENDOR>_<ENTITY>_<LEVELNAME>_<FREQUENCY>_[<CALC>|<AGRT>|<DownStream>].xml
File extensions for Hadoop:
HQL files: ".hql"
Java files: ".java"
Property files: ".properties"
Shell scripts: ".sh"
Oozie config files: ".xml"
Data definition files: ".ddl"
07-07-2016
06:04 PM
1 Kudo
These conventions are for all the business applications that are now ready for, or planning, a migration to Hadoop. So you don't need to reinvent the convention wheel; we already did a lot of brainstorming on this.
07-07-2016
06:04 PM
6 Kudos
I have worked with almost 20 to 25 applications. Whenever I start working, I first have to understand each application's naming convention, and I keep thinking: why don't we all follow a single naming convention? As Hadoop is evolving rapidly, I would like to share my naming conventions, so that if you come to my project you will feel comfortable, and so will I if you follow them too.
Database Names:
If the application serves a technology, the database names would be <APPID>_<TECHNOLOGY>_TBLS and <APPID>_<TECHNOLOGY>_VIEW.
If the application serves a vendor, the database names would be <APPID>_<VENDORNAME>_TBLS and <APPID>_<VENDORNAME>_VIEW.
If the application database needs to be further divided by module, the database names would be <APPID>_<MODULE>_TBLS and <APPID>_<MODULE>_VIEW.
Fact Table Names: TFXXX_<FREQUENCY>_<AGRT>
Note: the <AGRT> suffix is omitted for the table that stores the lowest granularity; it is added only to aggregate data tables.
XXX: range from 001 to 999 (we can set numbers according to our requirements).
FREQUENCY: HOURLY (range from 201 to 399), DAILY (range from 401 to 599).
External Table Names: TEXXX_<FREQUENCY>
Dim Table Names: TDXXX_<DIM_TYPE_NAME>. XXX: range from 001 to 999.
Lookup/Config Tables: TLXXX_<REF>. XXX: range from 001 to 999.
Control Tables: TCXXX_<TABLENAME>. XXX: range from 001 to 999.
Temporary Tables:
TMP_<JOBNAME>_<Name> - used for tables that are created and dropped by a job while it is executing.
PRM_<JOBNAME>_<Name> - used for tables into which data is inserted and then dropped while a job is executing.
View Names: VFXXX_<FREQUENCY>_<AGRT>
Note: as with fact tables, <AGRT> is omitted for the lowest-granularity view and added only for aggregate data. XXX: range from 001 to 999. FREQUENCY: HOURLY, DAILY, etc.
Column Names: should not start with a number; should not contain any special characters except "_"; should start with a capital letter. Note that some downstream databases have a 128-character column-name limit.
Stored Procs or HQL Queries: PSXXX_[<FREQUENCY>|<CALC>|<AGRT>|<DownStream>]. Example: PS001_ENGINEERING_HOURLY. XXX: range from 001 to 999.
Macros: MCXXX_<MODULENAME>. XXX: range from 001 to 999.
UDFs (Hadoop): UDFXXX_<MODULENAME>. XXX: range from 001 to 999.
Indexes: TFXXX_PRI_IDX#_<NUSI/USI>
IDX = constant for an index; # = sequential numeric number of the secondary index (1, 2, 3, 4, ...); PRI = primary index (used to distribute data across AMPs and for access performance); NUSI = non-unique secondary index, used for access performance; USI = unique secondary index, used for access performance.
In the next article I'll share more naming conventions for Oozie, file naming, and data types...
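The table-naming rules above can be encoded in a small helper so that names stay consistent across a team. A hypothetical sketch (the function names are mine, not part of any standard):

```python
def fact_table_name(seq, frequency, aggregate=False):
    """Build a fact table name: TFXXX_<FREQUENCY>[_AGRT].

    Per the convention above, the AGRT suffix is added only for
    aggregate tables; the lowest-granularity table omits it.
    """
    if not 1 <= seq <= 999:
        raise ValueError("sequence number must be in 001-999")
    name = f"TF{seq:03d}_{frequency.upper()}"
    return f"{name}_AGRT" if aggregate else name

def dim_table_name(seq, dim_type):
    """Build a dimension table name: TDXXX_<DIM_TYPE_NAME>."""
    if not 1 <= seq <= 999:
        raise ValueError("sequence number must be in 001-999")
    return f"TD{seq:03d}_{dim_type.upper()}"

print(fact_table_name(1, "daily"))         # TF001_DAILY
print(fact_table_name(2, "hourly", True))  # TF002_HOURLY_AGRT
print(dim_table_name(1, "customer"))       # TD001_CUSTOMER
```

A helper like this also gives you one place to enforce the column- and name-length rules mentioned above.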
07-07-2016
06:03 PM
4 Kudos
We have a Hive table over an HBase table, and let's say a few columns have the INT datatype, with data loaded from Hive. If we now want to delete data based on the values present in one of those INT columns, it is not possible with the standard tools: the values are converted to binary, and even the HBase API filter (SingleColumnValueFilter) returns wrong results if we query that column's values from HBase. Problem to solve: how do we purge a Hive INT-datatype column's data from HBase? This is the first, textual part of the series resolving this problem. In the next part I'll create a short video of the running code and cover other datatypes too. In such a scenario we can't use the standard API and are unable to apply filters on binary column values; the solution is the JRuby program below. You have probably already heard the many advantages of storing data in HBase (especially in binary block format) and creating a Hive table on top of it to query your data. I am not going to explain the full use case for HBase under Hive here; a simple reason is better visibility and representation of the data in tabular format. I came across this problem a few days back when we needed to purge HBase data after completion of its retention period, and we were stuck deleting data from the HBase table using HBase APIs and filters when a particular column is of INT datatype in Hive. Below is a sample use case. There are two storage formats for Hive data in HBase: 1. Binary 2. String. Storing data in binary blocks in HBase has its own advantages. The script below creates sample tables in both HBase and Hive:
HBase:
create 'tiny_hbase_table1', 'ck', 'o', {NUMREGIONS => 16, SPLITALGO => 'UniformSplit'}
Hive:
CREATE EXTERNAL TABLE orgdata (
key INT,
kingdom STRING,
kingdomkey INT,
kongo STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b,o:kongo#b")
TBLPROPERTIES(
"hbase.table.name" = "tiny_hbase_table1",
"hbase.table.default.storage.type" = "binary"
);
insert into orgdata values(1,'London',1001,'victoria secret');
insert into orgdata values(2,'India',1001,'Indira secret');
insert into orgdata values(3,'Saudi Arabia',1001,'Muqrin');
insert into orgdata values(4,'Swaziland',1001,'King Mswati');
hbase(main):080:0> scan 'tiny_hbase_table1'
ROW COLUMN+CELL
\x00\x00\x00\x01 column=o:kingdom, timestamp=1467806798430, value=Swaziland
\x00\x00\x00\x01 column=o:kingdomKey, timestamp=1467806798430, value=\x00\x00\x03\xE9
\x00\x00\x00\x02 column=o:kingdom, timestamp=1467806928329, value=India
\x00\x00\x00\x02 column=o:kingdomKey, timestamp=1467806928329, value=\x00\x00\x03\xE9
\x00\x00\x00\x03 column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
\x00\x00\x00\x03 column=o:kingdomKey, timestamp=1467806933574, value=\x00\x00\x03\xE9
\x00\x00\x00\x04 column=o:kingdom, timestamp=1467807030737, value=Swaziland
\x00\x00\x00\x04 column=o:kingdomKey, timestamp=1467807030737, value=\x00\x00\x03\xE9
4 row(s) in 0.0690 seconds
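The reason plain-text filters fail becomes obvious once you see how the binary storage type encodes an INT: HBase's Bytes.toBytes(int) writes a 4-byte big-endian value. A quick Python sketch of the same encoding (struct stands in for the HBase Bytes class here):

```python
import struct

def hbase_int_bytes(value):
    """Encode an int the way HBase's Bytes.toBytes(int) does:
    4 bytes, big-endian."""
    return struct.pack(">i", value)

# The row keys and kingdomKey values from the scan output above:
print(hbase_int_bytes(1))     # b'\x00\x00\x00\x01'
print(hbase_int_bytes(1001))  # b'\x00\x00\x03\xe9'

# A filter comparing against the string '1001' sees four ASCII bytes
# (0x31 0x30 0x30 0x31) instead of the binary encoding:
print(hbase_int_bytes(1001) == b"1001")  # False, which is why a
                                         # string-based filter finds nothing
```

So any filter must compare against the binary form of the value, or the values must be decoded back to integers first, which is exactly what the JRuby program below does.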
Now let's apply an HBase filter; we get no result:
hbase(main):001:0> scan 'tiny_hbase_table1', {FILTER => "PrefixFilter('\x00\x00\x00\x01')"}
hbase(main):002:0> scan 'tiny_hbase_table1', {FILTER => "PrefixFilter('1')"}
If we don't know the binary equivalent of an INT column value such as kingdomkey, it is not possible to apply the filter. And as you can see below, we get wrong results; SingleColumnValueFilter also fails in this scenario:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'tiny_hbase_table1', {LIMIT => 10, FILTER => SingleColumnValueFilter.new(Bytes.toBytes('o'), Bytes.toBytes('kingdomKey'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('1001')), COLUMNS => 'o:kingdom' }
ROW COLUMN+CELL
\x00\x00\x00\x01 column=o:kingdom, timestamp=1467806798430, value=Swaziland
\x00\x00\x00\x02 column=o:kingdom, timestamp=1467806928329, value=India
\x00\x00\x00\x03 column=o:kingdom, timestamp=1467806933574, value=Saudi Arabia
\x00\x00\x00\x04 column=o:kingdom, timestamp=1467807030737, value=Swaziland
4 row(s) in 0.3640 seconds
The solution is the JRuby program below. It decodes the binary values so you get proper, readable results, and inside the loop you can issue an HBase delete (e.g. the shell's deleteall command) for each candidate record as soon as you find it:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Result;
import java.util.ArrayList;
def delete_get_some()
  var_table = "tiny_hbase_table1"
  htable = HTable.new(HBaseConfiguration.new, var_table)
  # Scan only the o:kingdomKey column
  rs = htable.getScanner(Bytes.toBytes("o"), Bytes.toBytes("kingdomKey"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN\+CELL"
  rs.each { |r| r.raw.each { |kv|
      row = Bytes.toInt(kv.getRow)          # decode the 4-byte binary row key
      ql  = Bytes.toString(kv.getQualifier)
      val = Bytes.toInt(kv.getValue)        # decode the 4-byte binary INT value
      output.add "#{row} #{ql} #{val}"
    }
  }
  output.each { |line| puts "#{line}\n" }
end
delete_get_some
ROW COLUMN+CELL
1 kingdomKey 1001
2 kingdomKey 1001
3 kingdomKey 1001
4 kingdomKey 1001
You can declare a variable, apply a custom filter on the decoded values, and delete the row key based on the readable values, for example inside the loop:
if val <= myVal and row.include? 'likeme^'
  output.add "#{val} #{row} <<<<<<<<<<<<<<<<<<<<<<<<<<- Candidate for deletion"
  deleteall var_table, row
end
Hope this solves a problem you are facing too. Let me know if you have any queries or suggestions...