Member since: 10-01-2015
Posts: 3933
Kudos Received: 1150
Solutions: 374
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3662 | 05-03-2017 05:13 PM |
| | 3018 | 05-02-2017 08:38 AM |
| | 3280 | 05-02-2017 08:13 AM |
| | 3223 | 04-10-2017 10:51 PM |
| | 1690 | 03-28-2017 02:27 AM |
02-05-2016
07:56 PM
@Kuldeep Kulkarni has this been resolved? Please accept best answer or provide your own solution.
02-05-2016
07:56 PM
@zaenal rifai has this been resolved? Please accept best answer or provide your own solution.
02-05-2016
07:53 PM
@keerthana gajarajakumar before the ExclamationTopology topology name, provide the full class name, e.g. com.myclass.ClassName.
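For example (mytopology.jar and the com.myclass package are placeholders for your own build): the fully qualified main class goes right after the jar, and the topology name follows as its argument.
storm jar mytopology.jar com.myclass.ClassName ExclamationTopology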
02-05-2016
07:48 PM
@Mehdi TAZI please accept best answer or provide your own solution.
02-05-2016
07:47 PM
@Guilherme Braccialli are you still having issues with this? Please close.
... View more
02-05-2016
07:41 PM
@sedara this question is related to the following excerpt from the HDFS architecture article, section 8.2.3 (DataNodes):

Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block's metadata including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is half full it needs only half of the space of the full block on the local drive.

During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down. The namespace ID is assigned to the filesystem instance when it is formatted and is persistently stored on all nodes of the cluster. Nodes with a different namespace ID will not be able to join the cluster, thus protecting the integrity of the filesystem. A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.

After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.

A DataNode identifies block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length for each block replica the server hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.

During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.

Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress. These statistics are used for the NameNode's block allocation and load balancing decisions.

The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send instructions to the DataNodes, including commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node. These commands are important for maintaining the overall system integrity, and therefore it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.

See the linked article for the full explanation.
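For reference, the three-second heartbeat and the ten-minute dead-node window map onto two HDFS settings; one quick way to check them on a running cluster is shown below (the defaults in the comments are the usual ones, so verify against your release):
# DataNode heartbeat interval, in seconds (default is typically 3)
hdfs getconf -confKey dfs.heartbeat.interval
# NameNode heartbeat recheck interval, in milliseconds (default is typically 300000)
hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval
# dead-node timeout is roughly 2 * recheck-interval + 10 * heartbeat interval, i.e. about 10.5 minutes with the defaults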
02-05-2016
07:35 PM
@Rahul Tikekar are you still having issues? Please accept the best answer or provide your own solution.
02-05-2016
07:34 PM
@xueqin pang are you still having issues? Please accept the best answer or provide your own solution.
02-05-2016
07:01 PM
@Enis @Devaraj Das @vrodionov @nmaillard @Guilherme Braccialli please review and advise
02-05-2016
06:59 PM
5 Kudos
# SANDBOX must have only a Host-only network; the quorum is sandbox.hortonworks.com
# /etc/hosts has: 192.168.56.101 sandbox.hortonworks.com

# create an HBase table from Hive
CREATE TABLE IF NOT EXISTS hbase_hive_table(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:json")
TBLPROPERTIES ("hbase.table.name" = "hbase_hive_table");
# in the HBase shell, access the table
hbase(main):001:0> describe 'hbase_hive_table'
Table hbase_hive_table is ENABLED
hbase_hive_table
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.2860 seconds

# insert into the HBase table through Hive
INSERT OVERWRITE TABLE HBASE_HIVE_TABLE SELECT CODE, DESCRIPTION FROM SAMPLE_07;

# access data in HBase through Hive
SELECT * FROM HBASE_HIVE_TABLE;

# access data in HBase through the HBase shell
hbase(main):001:0> scan 'hbase_hive_table', {LIMIT => 10}

# create a table in HBase first
hbase(main):001:0> create 'JsonTable', 'cf'
0 row(s) in 1.4450 seconds
=> Hbase::Table - JsonTable
hbase(main):002:0> describe 'JsonTable'
Table JsonTable is ENABLED
JsonTable
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.1000 seconds

# run Java code to load data, code called HBaseLoad (source code linked here)

# count rows in HBase
> count 'JsonTable'
Current count: 139000, row: fe671e34-b723-4134-9317-7f31fe2715dd
139861 row(s) in 9.9540 seconds
=> 139861
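The HBaseLoad source is only linked above, not reproduced here; as a rough sketch of what each loaded row looks like, a single row could be written by hand from the HBase shell (the row key and JSON value below are made up):
put 'JsonTable', '00000000-aaaa-bbbb-cccc-000000000000', 'cf:json', '{"id":"1","person":{"first_name":"Jane","last_name":"Doe","email":"jane@example.com"}}'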
# create an HBase-mapped external table
CREATE EXTERNAL TABLE hbase_json_table(key string, json string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:json")
TBLPROPERTIES ("hbase.table.name" = "JsonTable");
# count using Hive
SELECT COUNT(*) FROM HBASE_JSON_TABLE;

# query using get_json_object
SELECT get_json_object(json, '$.id') AS ID,
get_json_object(json, '$.person.last_name') AS LastName,
get_json_object(json, '$.person.first_name') AS FirstName,
get_json_object(json, '$.person.email') AS email,
get_json_object(json, '$.person.location.address') AS Address,
get_json_object(json, '$.person.location.city') AS City,
get_json_object(json, '$.person.location.state') AS State,
get_json_object(json, '$.person.location.zipcode') AS Zip,
get_json_object(json, '$.person.text') AS Text,
get_json_object(json, '$.person.url') AS URL
FROM HBASE_JSON_TABLE;
# query using json_tuple
SELECT id, lastName, firstName, email, city, state, text, url FROM hbase_json_table A
LATERAL VIEW json_tuple(A.json, 'id', 'person') B AS id, person
LATERAL VIEW json_tuple(person, 'last_name', 'first_name', 'email',
'text', 'url', 'location') C as lastName, firstName, email, text, url, loc
LATERAL VIEW json_tuple(loc, 'city', 'state') D AS city, state;
### Analytics over HBase snapshots ###

# create a snapshot
hbase(main):006:0> snapshot 'hbase_hive_table', 'hbase_hive_table_snapshot'
0 row(s) in 0.3390 seconds

# use list_snapshots to list all available snapshots

# create a restore location
sudo -u hdfs hdfs dfs -mkdir /tmp/hbase_snapshots

# register the snapshot in Hive and query it (THE TABLE MUST ALREADY BE MAPPED IN HIVE)
# NOTE: the set command doesn't work in Ambari Views yet, so run the following in a script: https://issues.apache.org/jira/browse/HIVE-6584
# To query against a snapshot instead of the online table, specify the snapshot name via hive.hbase.snapshot.name.
# The snapshot will be restored into a unique directory under /tmp. This location can be overridden by setting a path via hive.hbase.snapshot.restoredir.
set hive.hbase.snapshot.name=hbase_hive_table_snapshot;
set hive.hbase.snapshot.restoredir=/tmp/hbase_snapshots;
select * from hbase_hive_table;

# set it back to point to the table rather than the snapshot, and delete the snapshot
hive -e "set hbase.table.name=hbase_hive_table;"
echo "delete_snapshot 'hbase_hive_table_snapshot'" | hbase shell
# create a Hive table mapped to the HBase table and reference the HBase timestamp for the column family
# HIVE-2828 (the JIRA for referencing each cell's timestamp) is still not patched
CREATE EXTERNAL TABLE hbase_json_table(key string, json string, time timestamp)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:json,:timestamp")
TBLPROPERTIES ("hbase.table.name" = "JsonTable");

# create a Hive table using the JSON SerDe
DROP TABLE IF EXISTS json_serde_table;
CREATE EXTERNAL TABLE json_serde_table (
  id string,
  person struct<email:string, first_name:string, last_name:string,
                location:struct<address:string, city:string, state:string, zipcode:string>,
                text:string, url:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/tmp/json/';

# upload the json file to a location on hdfs
hdfs dfs -put data.json /tmp/json/

# query the table as you normally would
SELECT id, person.first_name, person.last_name, person.email,
       person.location.address, person.location.city, person.location.state, person.location.zipcode,
       person.text, person.url
FROM json_serde_table LIMIT 5;

# hbase mapped table with multiple values
DROP TABLE IF EXISTS HBASE_TABLE_FROM_SERDE;
CREATE EXTERNAL TABLE HBASE_TABLE_FROM_SERDE(key String, ID string, fn string, ln string, email string,address string, city string, state string, zip string, text string, url string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:id,cf:fn,cf:ln,cf:e,cf:addr,cf:city,cf:state,cf:zip,cf:txt,cf:url")
TBLPROPERTIES ("hbase.table.name" = "serde_table"); # hbase mapped table with multiple values INSERT OVERWRITE TABLE hbase_table_from_serde SELECT id as key, id, person.first_name, person.last_name, person.email,person.location.address, person.location.city, person.location.state, person.location.zipcode, person.text, person.urlFROM json_serde_table LIMIT 5; # view in hive SELECT * FROM hbase_table_from_serde LIMIT 5; # view in hbase scan 'serde_table', {LIMIT => 10}
get 'serde_table', '00043df9-7630-41c5-8b68-73fe5eb7d636'