10-04-2018 04:46 PM - 3 Kudos
Objective
The objective of this article is to present a workflow to capture and republish data in Kafka topics. Note that NiFi is NOT a Kafka replication/backup tool. This is just a workflow-based approach to capture Kafka data and publish it back to a fresh topic if necessary.
Although storing Kafka data is not mandatory in several cases, it may be essential in cases such as:
- Archival: dump Kafka data in HDFS.
- Corrections/Inspections: in cases where an invalid event may disrupt a downstream system, to correct and restore the messages.
Start by creating two processor groups: one for the backup workflow, a second for the restore workflow.
Backup Workflow
Backup is a long-running workflow which archives data to HDFS. A file in HDFS may contain a batch of events or a single event, based on how frequently messages are published to the source Kafka topic. Hence, the user does not have to stop/start this backup workflow frequently.
Restore Workflow
Restore is a manually triggered job/workflow in NiFi, and it has to be modeled based on the topic being restored.
Some considerations:
- Restore data to a temporary topic. A restore to the original topic may duplicate the data in the original topic.
- If you want to restore data to the original topic, do that only after cleaning the original topic (re-create the original topic).
- Stop the backup workflow while re-publishing to the original topic, to avoid NiFi backing up the restored data to HDFS again and duplicating the backed-up data. An alternative approach is to move or rename the HDFS backup directory for the topic.
Steps:-
1. Stop the Backup ConsumeKafka processor.
2. Edit the Restore Workflow - ListHDFS processor - to update the HDFS directory to the topic which needs to be restored.
3. Edit the Restore Workflow - PublishKafka processor - to update the topic which needs to be restored.
4. Delete and recreate the restore topic (if it already exists); a sketch of this step is shown after the list.
5. Start the Restore Workflow.
6. After the restore is complete and verified:
- Stop the restore workflow.
- Move the backup directory (which has the original backup files).
- Start the backup process (this should take all the data to a new directory to avoid duplication).
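As a rough sketch of step 4 (recreating the restore topic) and the directory move in step 6, the commands might look like the following. The ZooKeeper address, topic and directory names, and the /usr/hdp/current/kafka-broker path are assumptions to adapt to your environment; topic deletion also only works when delete.topic.enable=true on the brokers.
# Step 4: delete and recreate the temporary restore topic (names/addresses are examples)
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper zk1:2181 --delete --topic topic1_restore
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --zookeeper zk1:2181 --create --topic topic1_restore --partitions 1 --replication-factor 3
# Step 6: move the old backup directory aside before restarting the backup workflow
hdfs dfs -mv /edl_data/backup/kafka/topic1 /edl_data/backup/kafka/topic1_restored_20181004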
Backup Workflow
Some technical considerations for the backup workflow:-
1. Backup location is HDFS.
2. The Kafka key is to be saved along with the Kafka payload.
3. Message metadata, like partition info and message offset, should be stored along with the message.
4. Message format is either text or JSON.
Variables Used in the Backup Workflow
The following variables have been used in the following processors to avoid hardcoding values:
kafka.brokers
kafka.kerberos-service-name
kerberos.principal
kerberos.keytab
hadoop.configuration.resources
Destination Directory: /edl_data/backup/kafka/${kafka.topic}
Data Format
Each file thus created may have a batch of Kafka messages. Each message would have the following format. Note that in case there is a missing key or payload, that part would just be left empty.
[topicname-partition-offset]<@>[message key]<@>[message payload]
Example:-
topic1-0-231<@> {"ID":"535"}<@> {"vwcustomer":{"id":"13631918","name":"Mikel","age":51,"date":"2018-10-04T15:16:06Z"}}
ConsumeKafka
This processor has a list of topic(s) whose messages should be consumed and eventually backed up in HDFS.
Topic Names: can have a list of topics that would be consumed by this processor.
Group ID: the NiFi Kafka consumer group id. If you want to consume all the messages, reset this to a new consumer group for a new backup.
Max Poll Records: why is this set to 1? To get the key for each Kafka message. If we poll a batch of Kafka messages, the message key is lost and not stored as an attribute on the flow file. In order to save the Kafka key, Max Poll Records should be set to 1. This way each message is sent to a flow file having a kafka.key attribute.
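As a quick way to check where a consumer group currently sits before starting a fresh backup, the Kafka CLI can describe the group. This is only an illustrative sketch; the script path is the usual HDP location, and the broker and group names are placeholders.
/usr/hdp/current/kafka-broker/bin/kafka-consumer-groups.sh --bootstrap-server broker1:6667 --describe --group nifi-kafka-backup-group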
UpdateAttribute
This processor captures the Kafka metadata that would be used later to store this information along with the message key and value.
ReplaceText
This processor is used to prepend the kafka.topic.metadata and the kafka.key to the beginning of the message.
Note that the delimiter used to separate metadata, key and value is "<@>". If this delimiter is updated, make sure to update the same in the ExtractText processor of the Restore Workflow.
MergeContent
If we do NOT use this processor, then each Kafka message (metadata<@>key<@>value) would be written to a separate HDFS file. Instead, we want to merge multiple Kafka messages into a given file in HDFS, if possible. We use the MergeContent processor to do this. Note that we are merging content based on the attribute kafka.topic; hence messages with the same topic should end up together.
For information on how batching occurs and other properties, check:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeContent/index.html
PutHDFS
Eventually, we back up the files to HDFS. Note that if the kafka.topic destination directory does not exist, PutHDFS will create it.
Restore Kafka Messages
Some technical considerations for the restore workflow:-
1. Restore messages from the HDFS backup location to a Kafka topic.
2. Restore key and payload.
3. Order of restored messages: an attempt has been made to preserve order while restoring Kafka messages. Kafka maintains order only within a partition. Hence, this restore workflow is for topics with 1 partition.
4. If the order of restored messages is not important, then this workflow can be made significantly simpler.
5. As per the backup workflow, this is for messages in text or JSON format.
ENSURE THAT BACKUP IS STOPPED BEFORE RESTORE IS STARTED. This prevents a backup/restore loop, which would duplicate the messages.
ListHDFS and FetchHDFS
These processors fetch the HDFS files from the directory location configured in ListHDFS. The files will not be deleted in the source directory.
SplitText and ExtractText
These processors are used to extract the Kafka offset from the metadata section stored in the backed-up HDFS files.
The extracted offset attribute is used to sequence the messages using a "priority" attribute and the PriorityAttributePrioritizer.
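For intuition only: the offset is the last dash-separated token of the metadata field in front of the first "<@>" delimiter. A command-line equivalent of that extraction (the actual workflow does this with an ExtractText regular expression) would be:
echo 'topic1-0-231<@>{"ID":"535"}<@>{"vwcustomer":{"id":"13631918"}}' | awk -F'<@>' '{print $1}' | awk -F'-' '{print $NF}'
# prints 231, the offset used to populate the "priority" attribute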
UpdateAttribute, EnforceOrder and MergeContent
UpdateAttribute creates a new 'priority' attribute in every flow file (Kafka message), and assigns the offset value to it.
This 'priority' attribute is used in the following relationships to prioritize processing of the messages.
EnforceOrder is used to order flow files based on the attribute "kafka.offset".
The wait timeout is set to 10 seconds. This means that if there is a missing offset (order attribute) in the sequence, the processor waits for a 10-second window; if the event arrives within that window, the processor continues and routes the message to the success relationship. If the event does not arrive within the window, the message is routed to the overtook relationship.
Refer:-
https://issues.apache.org/jira/browse/NIFI-3414
https://gist.github.com/ijokarumawak/88fc30a2300845b3c27a79113fc72d41
MergeContent batches the messages back into one flow file.
ExtractText, UpdateAttribute, PublishKafka
Finally, we extract the metadata, key and value, and publish the Kafka key and value using the PublishKafka processor.
Note that the Kafka topic name should be updated in the PublishKafka processor.
We use the UpdateAttribute processor yet again to order messages before publishing them.
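After the restore completes, a quick spot check of the restored topic can be done with a console consumer. This is only a sketch; the script path is the usual HDP location, and the broker and topic names are placeholders.
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --bootstrap-server broker1:6667 --topic topic1_restore --from-beginning --max-messages 10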
10-04-2018 04:50 AM - 4 Kudos
Objective
- To store multiple row versions in HBase and evaluate the impact on performance when reading all versions vs. getting the latest version. To put this differently: would storing multiple versions affect the performance when querying the latest version?
- To use NiFi to quickly ingest millions of rows into HBase.
Warning
- Do not store more than a few versions in HBase. This can have negative impacts. HBase is NOT designed to store more than a few versions of a cell.
Step 1: Create Sample Workflow using NiFi to ingest data into HBase table
Create the HBase table. Dataset: https://www.citibikenyc.com/system-data
create 'venkataw:nycstations','nycstationfam'
0 row(s) in 1.3070 seconds
hbase(main):014:0> desc 'venkataw:nycstations'
Table venkataw.nycstations is ENABLED
venkataw.nycstations
COLUMN FAMILIES DESCRIPTION
{NAME => 'nycstationfam', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS =>
'0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.1870 seconds
put 'venkataw:nycstations', 1224, 'nycstationfam:name', 'citiWunnava'
put 'venkataw:nycstations',1224,'nycstationfam:short_name','citiW'
put 'venkataw:nycstations',1224,'nycstationfam:lat','-90.12'
put 'venkataw:nycstations',1224,'nycstationfam:lon','.92'
put 'venkataw:nycstations',1224,'nycstationfam:region_id','9192'
put 'venkataw:nycstations',1224,'nycstationfam:capacity','100202'
put 'venkataw:nycstations',1224,'nycstationfam:rental_url','http://www.google.com/'
hbase(main):016:0> scan 'venkataw:nycstations'
ROW COLUMN+CELL
1224 column=nycstationfam:capacity, timestamp=1538594876306, value=100202
1224 column=nycstationfam:lat, timestamp=1538594875626, value=-90.12
1224 column=nycstationfam:lon, timestamp=1538594875643, value=.92
1224 column=nycstationfam:name, timestamp=1538594875555, value=citiWunnava
1224 column=nycstationfam:region_id, timestamp=1538594875660, value=9192
1224 column=nycstationfam:rental_url, timestamp=1538594902755, value=http://www.google.com/
1224 column=nycstationfam:short_name, timestamp=1538594875606, value=citiW
alter 'venkataw:nycstations', NAME=>'nycstationfam',VERSIONS => 10000
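To confirm the alter took effect and to see several versions of a single cell, the HBase shell can also be driven from the command line. This is just a sketch; it assumes the table and the example row created above.
# verify the column family now shows VERSIONS => '10000'
echo "desc 'venkataw:nycstations'" | hbase shell
# read up to 10 versions of one cell for row 1224
echo "get 'venkataw:nycstations', 1224, {COLUMN => 'nycstationfam:capacity', VERSIONS => 10}" | hbase shell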
Step 2: NiFi Workflow to publish data to HBase table
The above NiFi workflow consumes messages from a web server and publishes them to HBase. Configuration for the processors is as follows:
- The GetHTTP processor reads the REST endpoint every 5 seconds.
- We extract the stations object using the SplitJson processor.
- Finally, we use the PutHBaseJson processor to ingest the data into the destination HBase table created above. Notice that I am trying to randomly assign the row identifier so that eventually I get multiple row versions for the same identifier.
- The PutHBaseJson processor uses the HBase Client Controller Service to connect to HBase using Kerberos credentials.
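For reference, the station feed behind this workflow is the Citi Bike GBFS endpoint; the exact URL below is an assumption (the publicly documented station_information feed, not a value taken from this article), but it shows the JSON that GetHTTP pulls and that SplitJson splits on the stations array.
curl -s https://gbfs.citibikenyc.com/gbfs/en/station_information.json | head -c 500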
Step 3: Run queries to read the latest version and all available versions
I tried querying all versions vs. the latest version in HBase with the following queries.
hbase(main):002:0> get 'venkataw:nycstations', 99886 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url']}
COLUMN CELL
nycstationfam:capacity timestamp=1507322481470, value=31
nycstationfam:lat timestamp=1507322481470, value=40.71286844
nycstationfam:lon timestamp=1507322481470, value=-73.95698119
nycstationfam:name timestamp=1507322481470, value=Grand St & Havemeyer St
nycstationfam:region_id timestamp=1507322481470, value=71
nycstationfam:rental_url timestamp=1507322481470, value=http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=471
nycstationfam:short_name timestamp=1507322481470, value=5267.08
nycstationfam:station_id timestamp=1507322481470, value=471
8 row(s) in 0.0600 seconds
get 'venkataw:nycstations', 99828 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url'],VERSIONS => 100}
(the same get, with VERSIONS => 100, was repeated for different row ids; the timings are below)
24 row(s) in 0.0200 seconds
16 row(s) in 0.0300 seconds
8 row(s) in 0.0310 seconds
232 row(s) in 0.1850 seconds
8 row(s) in 0.0570 seconds
152 row(s) in 0.0380 seconds
184 row(s) in 0.0420 seconds
208 row(s) in 0.1550 seconds
1 row:-
8 row(s) in 0.0050 seconds
8 row(s) in 0.0040 seconds
8 row(s) in 0.0060 seconds
all versions:-
14765 row(s) in 2.4350 seconds
14351 row(s) in 1.1620 seconds
14572 row(s) in 2.4210 seconds
In the above results, the reads returning 8 row(s) are latest-version reads (one cell per requested column), while the larger row counts are all-version reads. Notice how the latest-version reads are fairly consistent and have smaller response times. Also notice that as the number of stored versions increases, the response times for all-version reads keep increasing. So, based on this observation, and as expected, a query that gets the latest version consistently performs well compared to a query which returns 'n' versions.
10-01-2017 02:21 AM - 2 Kudos
The following link has some sample code to connect to a secure HiveServer2 via the default connection string and via Knox: https://github.com/vspw/hiveJDBC
It uses the following jar files:-
commons-configuration-xx.jar (UGI set configurations and metrics)
commons-logging-xx.jar (*not mandatory)
hadoop-auth-xx.jar
hadoop-common-xx.jar (UGI stuff, hadoop configurations)
hive-jdbc-1.2xx-standalone.jar
log4j-xx-api-xx.jar (for log4j)
log4j-api-xx.jar (for log4j)
log4j-core.xx.jar (for log4j)
xercesImpl-xx.jar
It also reads a number of properties, like "keytabLocationWindows", "jdbcConnStringDirect", "hiveQuery" etc., from a properties file, and logging is enabled with log4j2.xml. The class "HiveJDBCKnox" has some instructions to make Knox connectivity work on the Windows platform as well.
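For a quick sanity check of both connection paths without compiling the Java sample, Beeline can be pointed at equivalent JDBC URLs. The hostnames, ports, principal and Knox httpPath below are placeholders, not values taken from the repository.
# Direct, Kerberized HiveServer2 (assumes a valid ticket from kinit)
beeline -u "jdbc:hive2://hs2host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"
# Via Knox (HTTPS gateway, HTTP transport mode)
beeline -u "jdbc:hive2://knoxhost.example.com:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive" -n username -p password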
03-27-2017 03:24 PM - 9 Kudos
Objective:-
Atlas, by default, comes with certain types for Hive, Storm, Falcon etc. However, there might be cases where you would like to capture some custom metadata in Atlas. This can be metadata related to ETL processes, enterprise operations etc.
This article explains how to create custom Atlas types and provides some insight on establishing lineage between these types.
Use Case:-
Consider a simple use case where raw textual data is analyzed via an ML process and the results are stored in HDFS. For instance, the raw source data is a dump of access logs of professors and research assistants referring to research papers. The ML process would try to come up with recommendations of research papers for further reading for these end users. To capture metadata and lineage for this workflow, we would want to have three custom types in Atlas:
a.) ResearchPaperAccessDataset: to capture the metadata for the input dataset.
b.) ResearchPaperRecommendationResults: to capture the metadata for the resultant output after the ML process has completed its analysis.
c.) ResearchPaperMachineLearning: to capture the metadata for the ML process itself, which analyzes the input dataset.
The eventual lineage we want to capture would look something like this: ResearchPaperAccessDataset -> ResearchPaperMachineLearning -> ResearchPaperRecommendationResults.
Bonus: the last part of this article has some information on creating new traits using the REST API and then associating them with an existing Atlas entity.
Files:-
The files being used in this article are present in github:
a.) atlas_type_ResearchPaperDataSet.json
b.) atlas_entity_ResearchPaperDataSet.json
c.) atlas_type_RecommendationResults.json
d.) atlas_entity_RecommendationResults.json
e.) atlas_type_process_ML.json
f.) atlas_entity_process_ML.json
Steps:-
1. Create Custom Atlas ResearchPaperAccessDataset Type:- https://github.com/vspw/atlas-custom-types/blob/master/atlas_type_ResearchPaperDataSet.json
[root@zulu atlas]# curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/types' -d @atlas_type_ResearchPaperDataSet.json
Enter host password for user 'admin':*****
{"requestId":"qtp84739718-14 - bed149b3-b360-4bf5-b46b-8f25ac7692c3","types":[{"name":"ResearchPaperAccessDataset"}]}
Notice the superType for the "ResearchPaperAccessDataset" type: ["DataSet"].
"DataSet" in turn has superTypes of ["Referenceable","Asset"].
The "Asset" type has attributes such as: name, description, owner.
The "Referenceable" type has attributes such as: qualifiedName.
Depending on whether these attributes are mandatory or not (based on the multiplicity required), the entity we create next for the "ResearchPaperAccessDataset" type should have definitions for these attributes.
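To double-check what Atlas registered, including the inherited attributes, the type definition can be fetched back with a GET on the same legacy types endpoint (host and credentials are the same placeholders as above):
curl -u admin -H 'Accept: application/json' 'http://yellow.hdp.com:21000/api/atlas/types/ResearchPaperAccessDataset'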
2. Create Entity for ResearchPaperAccessDataset Type:-
https://github.com/vspw/atlas-custom-types/blob/master/atlas_entity_ResearchPaperDataSet.json
[root@zulu atlas]# curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/entities' -d @atlas_entity_ResearchPaperDataSet.json
{"requestId":"qtp84739718-15 - 827d5151-a6fb-4ccb-909f-f4ac5f8d8f26","entities":{"created":["40dc03dc-16d6-4281-826d-c4884cd1dad5"]},"definition":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference","id":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"40dc03dc-16d6-4281-826d-c4884cd1dad5","version":0,"typeName":"ResearchPaperAccessDataset","state":"ACTIVE"},"typeName":"ResearchPaperAccessDataset","values":{"name":"GeoThermal-1224","createTime":"2017-03-25T20:07:12.000Z","description":"GeoThermal Research Input Dataset 1224","resourceSetID":1224,"researchPaperGroupName":"WV-SP-INT-HWX","qualifiedName":"ResearchPaperAccessDataset.1224-WV-SP-INT-HWX","owner":"EDM_RANDD"},"traitNames":[],"traits":{}}}
3. Create Custom ResearchPaperRecommendationResults Type:-
https://github.com/vspw/atlas-custom-types/blob/master/atlas_type_RecommendationResults.json
[root@zulu atlas]# curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/types' -d @atlas_type_RecommendationResults.json
Enter host password for user 'admin':
{"requestId":"qtp84739718-15 - 9da58639-479f-41fb-819d-b11b4464011e","types":[{"name":"ResearchPaperRecommendationResults"}]}
4. Create Entity for ResearchPaperRecommendationResults Type:-
https://github.com/vspw/atlas-custom-types/blob/master/atlas_entity_RecommendationResults.json
[root@zulu atlas]# curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/entities' -d @atlas_entity_RecommendationResults.json
Enter host password for user 'admin':
{"requestId":"qtp84739718-16 - b7ebe7d8-e671-4e94-a6c7-506947c7d5e5","entities":{"created":["43b6da13-31ee-4bbe-980e-84ed4b759f11"]},"definition":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference","id":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"43b6da13-31ee-4bbe-980e-84ed4b759f11","version":0,"typeName":"ResearchPaperRecommendationResults","state":"ACTIVE"},"typeName":"ResearchPaperRecommendationResults","values":{"name":"RecommendationsGeoThermal-4995149","createTime":"2017-03-25T21:00:12.000Z","description":"GeoThermal Recommendations Mar 2017","qualifiedName":"ResearchPaperRecommendationResults.4995149-GeoThermal","researchArea":"GeoThermal","hdfsDestination":"hdfs:\/\/xena.hdp.com:8020\/edm\/data\/prod\/recommendations","owner":"EDM_RANDD","recommendationsResultsetID":4995149},"traitNames":[],"traits":{}}} 5. Create a Special Process Type (ResearchPaperMachineLearning) which would complete the lineage information:-
https://github.com/vspw/atlas-custom-types/blob/master/atlas_type_process_ML.json
Notice the superTypes for "ResearchPaperMachineLearning" - ["Process"],
The "Process" type in turn constitutes superTypes "Referenceable" and "Asset".
And besides the attributes inherited from the above superTypes, "Process" has the following attributes:-
- inputs
- outputs
Our custom type (ResearchPaperMachineLearning) has attributes such as: operationType, userName, startTime and endTime.
Hence, we need to collectively define all these attributes in the entity we create after we are done with creating this type.
[root@zulu atlas]# curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/types' -d @atlas_type_process_ML.json
Enter host password for user 'admin':
{"requestId":"qtp84739718-135 - 4f4cf931-0922-4d5c-b876-061f1bc1e7af","types":[{"name":"ResearchPaperMachineLearning"}]}
6. Create an entity for the Process Type:-
https://github.com/vspw/atlas-custom-types/blob/master/atlas_entity_process_ML.json
[root@zulu atlas]# curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/entities' -d @atlas_entity_process_ML.json
Enter host password for user 'admin':****
{"requestId":"qtp84739718-18 - abbc3513-fa09-4a63-a8e5-af4b7b5f2d9a","entities":{"created":["4bd5263e-761b-4c0c-b629-c3d9fc87626f"]},"definition":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Reference","id":{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"4bd5263e-761b-4c0c-b629-c3d9fc87626f","version":0,"typeName":"ResearchPaperMachineLearning","state":"ACTIVE"},"typeName":"ResearchPaperMachineLearning","values":{"name":"ML_Iteration567019","startTime":"2017-03-26T20:20:13.675Z","description":"ML_Iteration567019 For GeoThermal DataSets","operationType":"DecisionTreeAndRegression","outputs":[{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"43b6da13-31ee-4bbe-980e-84ed4b759f11","version":0,"typeName":"DataSet","state":"ACTIVE"}],"endTime":"2017-03-26T20:27:23.675Z","inputs":[{"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Id","id":"40dc03dc-16d6-4281-826d-c4884cd1dad5","version":0,"typeName":"DataSet","state":"ACTIVE"}],"qualifiedName":"ResearchPaperMachineLearning.ML_Iteration567019","owner":"EDM_RANDD","clusterName":"turing","queryGraph":null,"userName":"hdpdev-edm-appuser-recom"},"traitNames":[],"traits":{}}} So after creating all the necessary Types and Entities we should be able to see the respective types created in Atlas UI and query entities and create new entities as usual. In this case we had a java application that used to create and deliver the entity json files for the above workflow after each iteration of the ML process completed successfully (Since the attributes values in the entities json file should be altered dynamically based on the iteration and results) You should also be able to see the types created thus far in the search objects. Creating a Trait and Associating tagging an Atlas Entity:- Note that we can create new Trait/Tag types in Atlas similar to how we have created our custom types. https://github.com/vspw/atlas-custom-types/blob/master/atlas_trait_type.json [root@zulu atlas]# curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/types' -d @atlas_trait_type.json Associating a trait to an existing Entity:-
curl -i -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' -u admin 'http://yellow.hdp.com:21000/api/atlas/entities/b58571af-1ef1-40e4-a89b-0a2ade4eeab3/traits' -d @associate_trait.json
associate_trait.json:
{
"jsonClass":"org.apache.atlas.typesystem.json.InstanceSerialization$_Struct",
"typeName":"PublicData",
"values":{
"name":"addTrait"
}
}
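To verify the association, the entity can be read back by GUID; its traitNames array should now contain the trait (same host, credentials and GUID placeholders as above):
curl -u admin -H 'Accept: application/json' 'http://yellow.hdp.com:21000/api/atlas/entities/b58571af-1ef1-40e4-a89b-0a2ade4eeab3'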
05-24-2016 03:34 AM
Hi Vijay, can you please check whether the datanode disks are working and formatted properly? The problem might be related to the disks configured under dfs.datanode.data.dir not behaving as expected.
04-21-2016 07:08 PM
@marksf What's the version of the MySQL connector that you are using? Can you please verify whether it's 5.1.35 or greater?
04-21-2016 03:00 PM
Hi Saumil,
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Space_Reclamation
As the documentation says, to enable trash collection for a certain period, you can set it to a value greater than zero. fs.trash.interval can be set to 360 minutes (6 hours) or 1440 minutes (24 hours), depending on how long you want to keep your trash. The downside of keeping more trash is that the namenode cannot reclaim the blocks for those files until the trash expires. fs.trash.checkpoint.interval can be set to something smaller than fs.trash.interval (1 hour or 3 hours). The process which runs at this interval creates new checkpoints and deletes any older checkpoints that have expired based on fs.trash.interval. Hope this helps.
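As a small illustration of how trash behaves once fs.trash.interval is greater than zero (the file path is just an example):
hdfs dfs -rm /tmp/example.txt                 # with trash enabled, this moves the file into the user's .Trash instead of deleting it
hdfs dfs -ls /user/$USER/.Trash/Current/tmp   # the deleted file sits here until a checkpoint expires
hdfs dfs -expunge                             # creates a checkpoint and removes checkpoints older than fs.trash.interval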
04-20-2016 05:30 PM - 6 Kudos
The document lists the steps to be performed to enable users on a Windows workstation to access an HDP cluster hosted on a different realm.
Windows Workstation (realm: WIN.EXAMPLE.COM)
HDP (realm: HDP.EXAMPLE.COM)
Accessing WebUI components for Namenode, YARN, MapReduce, Oozie etc. involves the following steps:
1. Install and Setup MIT Kerberos
2. Install Firefox
3. Enable MIT Kerberos on Firefox
4. Get Kerberos Ticket using MIT Kerberos Utility
5. Open NN, RM UIs using Firefox
Install and Setup MIT Kerberos
For 64-bit machines (includes both 32 and 64 bit libraries): http://web.mit.edu/kerberos/dist/kfw/4.0/kfw-4.0.1-amd64.msi
For 32-bit machines (includes only 32 bit libraries): http://web.mit.edu/kerberos/dist/kfw/4.0/kfw-4.0.1-i386.msi
The default location of the configuration file on a Windows machine is the "C:\Program Files\MIT\Kerberos" directory.
Note: this is a hidden folder.
Copy the krb5.conf file (from the HDP KDC) to the above-mentioned location and rename krb5.conf to krb5.ini.
Configure the following environment properties for MIT Kerberos:
- KRB5_CONFIG: path for the kerberos ini file.
- KRB5CCNAME: path for the kerberos credential cache file.
Create a writable directory, e.g. c:\temp (or any path the user has access to; make a note of the path), for the krb5cache file specified in KRB5CCNAME. Save and reboot the machine.
Install Firefox
Install Mozilla Firefox from the link below and follow the instructions on the webpage for installation: https://www.mozilla.org/en-US/firefox/new/
Enable MIT Kerberos on Firefox
Open Firefox, type about:config in the URL bar and hit enter.
Search for and change the parameters below:
network.negotiate-auth.trusted-uris = .domain.com
network.negotiate-auth.using-native-gsslib = false
network.negotiate-auth.gsslib = C:\Program Files\MIT\Kerberos\bin\gssapi32.dll
network.auth.use-sspi = false
network.negotiate-auth.allow-non-fqdn = true
Get Kerberos Ticket using MIT Kerberos Utility
1. Click the Start button
2. Click All Programs
3. Click Kerberos for windows program group
4. Use "MIT Kerberos Ticket Manager" to obtain a ticket for the principal that will be used to connect to HDP cluster. a. Click Get Ticket b. Enter Principal and Password as below. Open NN, RM UIs using Firefox Open Firefox, ambari UI -> HDFS -> Quick Links -> NN UI or directly open the NN UI: http://xx-xxx-xxxx.aws.hdp.com:50070 If you don't have a kerberos ticket, then this opens a dialog box for MIT kerberos client requesting Principal and password for the HDP KDC user. (user@HDP.EXAMPLE.COM and ******). If your ticket is invalid or has expired, then you would see an "Unauthorized User" exception on the browser. Upon successful authentication this should open the NN UI and others.
Remember to refresh the ticket as it expires over time (using Get Ticket in the MIT Kerberos client utility). This authentication is also required to establish a connection to HiveServer2 for operating on Hive tables.
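If you prefer the command line, MIT Kerberos for Windows also installs kinit and klist, so the ticket can be obtained and inspected from a prompt (the principal is the example one used in this article):
kinit user@HDP.EXAMPLE.COM
klist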
Appendix:-
https://bugzilla.mozilla.org/show_bug.cgi?id=628210
The above link has some debugging techniques which can be used to debug Firefox authentication using gsslib.
03-09-2016 05:21 PM - 1 Kudo
Hi Edgar, can you please check whether HiveServer2 authorization is enabled? hive.server2.enable.authorization
10-02-2015 08:35 PM
We are trying to reduce the number of empty regions in a table (informs_search). This table has around 5900 regions (including thousands of empty regions) and 8 TB worth of data. We tried an export/import approach on a sample of the data (16,819,569 rows).
Backup informs_search:
disable 'informs_search'
snapshot 'informs_search', 'informs_search_snpsht'
clone_snapshot 'informs_search_snpsht', 'informs_search_backup'
delete_snapshot 'informs_search_snpsht'
enable 'informs_search'
Export informs_search:
/usr/hdp/current/hbase-client/bin/hbase org.apache.hadoop.hbase.mapreduce.Export 'informs_search' /db/support/hexport/inform_search_bk 1 1 1443738964000
Truncate informs_search:
truncate 'informs_search'
Import informs_search:
hbase org.apache.hadoop.hbase.mapreduce.Import 'informs_search' /db/support/hexport/inform_search_bk
Observations:-
Before we ran these steps, we had 9 regions (6+3) across two region servers. After we ran these steps, we have 2 regions across 1 region server.
* In production, after running the same, would that reduce to 2 regions as well?
* Is there any way to predict/configure the resultant number of regions and region servers?
* Also, how many major compactions will it take so that data will be distributed across the region servers (and regions)?
Labels: Apache HBase