Member since: 05-28-2019
Posts: 46
Kudos Received: 16
Solutions: 2
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4739 | 01-04-2017 07:03 PM |
 | 750 | 01-04-2017 06:00 PM |
12-22-2017
04:08 PM
Use Livy to kick off Spark jobs from NiFi.
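NiFi would typically make that call with something like an InvokeHTTP processor; purely as an illustration of Livy's batch REST API, here is a rough Python sketch (the Livy host/port, HDFS path, and job arguments are placeholders, not values from this thread):
import json
import time
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

# Submit a Spark job as a Livy batch; the file must be reachable by the cluster (e.g. on HDFS).
payload = {"file": "hdfs:///apps/jobs/my_job.py", "args": ["2017-12-22"]}
resp = requests.post(LIVY_URL + "/batches", data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
batch = resp.json()
print("Submitted batch %s, state: %s" % (batch["id"], batch["state"]))

# Poll the batch until it reaches a terminal state.
state = batch["state"]
while state not in ("success", "dead", "killed"):
    time.sleep(10)
    state = requests.get("{}/batches/{}".format(LIVY_URL, batch["id"])).json()["state"]
print("Final state: %s" % state)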
11-18-2017
02:07 AM
Let's assume the data is in the LLAP cache. What problems are associated with changing the underlying HDFS file while the data is in the LLAP cache?
Tags: llap
Labels: HDFS
10-23-2017
02:04 PM
Got it working. The script doesn't generate any data when the scale is set to 1. The error is a one-liner in the console output, so it was difficult for me to spot. Even though the data did not get generated, the script continued to create the table rather than exiting.
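For illustration, a minimal sketch of the kind of guard that would have surfaced this earlier (the output directory path is hypothetical, not taken from the actual script):
import os
import sys

data_dir = "/tmp/ssb-generate/data"  # hypothetical output directory of the data generator

# Abort before any table-creation step if the generator produced no files.
if not os.path.isdir(data_dir) or not os.listdir(data_dir):
    sys.exit("Data generation produced no output at %s; aborting before table creation." % data_dir)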
10-22-2017
01:12 PM
Steps: https://github.com/cartershanklin/hive-druid-ssb
Error:
INFO : Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Moving data to directory hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/ssb_druid from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/.hive-staging_hive_2017-10-19_09-48-11_929_1281256378284617965-1/-ext-10002
INFO : Starting task [Stage-4:DDL] in serial mode
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: java.io.FileNotFoundException: File /tmp/workingDirectory/.staging-hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10/segmentsDescriptorDir does not exist.
INFO : Resetting the caller context to HIVE_SSN_ID:ae1bcabc-646b-4e05-96cb-40fe4990a916
INFO : Completed executing command(queryId=hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10); Time taken: 6.296 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: java.io.FileNotFoundException: File /tmp/workingDirectory/.staging-hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10/segmentsDescriptorDir does not exist. (state=08S01,code=1)
Closing: 0: jdbc:hive2://sandbox.hortonworks.com:10500/default
[root@sandbox hive-druid-ssb]#
Failing load step:
[root@sandbox hive-druid-ssb]# sh 00load.sh 1 sandbox.hortonworks.com:10500 sandbox.hortonworks.com root hadoop
Connecting to jdbc:hive2://sandbox.hortonworks.com:10500/default
Connected to: Apache Hive (version 2.1.0.2.6.1.0-129)
Driver: Hive JDBC (version 1.2.1000.2.6.1.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.metadata.username=${DRUID_USERNAME};
No rows affected (0.14 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.metadata.password=${DRUID_PASSWORD};
No rows affected (0.007 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.metadata.uri=jdbc:mysql://${DRUID_HOST}/druid;
No rows affected (0.006 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.indexer.partition.size.max=1000000;
No rows affected (0.004 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.indexer.memory.rownum.max=100000;
No rows affected (0.007 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.broker.address.default=${DRUID_HOST}:8082;
No rows affected (0.006 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.coordinator.address.default=${DRUID_HOST}:8081;
No rows affected (0.007 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.storage.storageDirectory=/apps/hive/warehouse;
No rows affected (0.004 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.tez.container.size=1024;
No rows affected (0.007 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500> set hive.druid.passiveWaitTimeMs=180000;
No rows affected (0.005 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10500>
0: jdbc:hive2://sandbox.hortonworks.com:10500> CREATE TABLE ssb_druid
0: jdbc:hive2://sandbox.hortonworks.com:10500> STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
0: jdbc:hive2://sandbox.hortonworks.com:10500> TBLPROPERTIES (
0: jdbc:hive2://sandbox.hortonworks.com:10500> "druid.datasource" = "ssb_druid",
0: jdbc:hive2://sandbox.hortonworks.com:10500> "druid.segment.granularity" = "MONTH",
0: jdbc:hive2://sandbox.hortonworks.com:10500> "druid.query.granularity" = "DAY")
0: jdbc:hive2://sandbox.hortonworks.com:10500> AS
0: jdbc:hive2://sandbox.hortonworks.com:10500> SELECT
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(c_city as string) c_city,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(c_nation as string) c_nation,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(c_region as string) c_region,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(d_weeknuminyear as string) d_weeknuminyear,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(d_year as string) d_year,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(d_yearmonth as string) d_yearmonth,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(d_yearmonthnum as string) d_yearmonthnum,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(lo_discount as string) lo_discount,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(lo_quantity as string) lo_quantity,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(p_brand1 as string) p_brand1,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(p_category as string) p_category,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(p_mfgr as string) p_mfgr,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(s_city as string) s_city,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(s_nation as string) s_nation,
0: jdbc:hive2://sandbox.hortonworks.com:10500> cast(s_region as string) s_region,
0: jdbc:hive2://sandbox.hortonworks.com:10500> lo_revenue,
0: jdbc:hive2://sandbox.hortonworks.com:10500> lo_extendedprice * lo_discount discounted_price,
0: jdbc:hive2://sandbox.hortonworks.com:10500> lo_revenue - lo_supplycost net_revenue
0: jdbc:hive2://sandbox.hortonworks.com:10500> FROM
0: jdbc:hive2://sandbox.hortonworks.com:10500> ssb_${SCALE}_flat_orc.customer, ssb_${SCALE}_flat_orc.dates, ssb_${SCALE}_flat_orc.lineorder,
0: jdbc:hive2://sandbox.hortonworks.com:10500> ssb_${SCALE}_flat_orc.part, ssb_${SCALE}_flat_orc.supplier
0: jdbc:hive2://sandbox.hortonworks.com:10500> where
0: jdbc:hive2://sandbox.hortonworks.com:10500> lo_orderdate = d_datekey and lo_partkey = p_partkey
0: jdbc:hive2://sandbox.hortonworks.com:10500> and lo_suppkey = s_suppkey and lo_custkey = c_custkey;
INFO : Compiling command(queryId=hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10): CREATE TABLE ssb_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.datasource" = "ssb_druid",
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(c_city as string) c_city,
cast(c_nation as string) c_nation,
cast(c_region as string) c_region,
cast(d_weeknuminyear as string) d_weeknuminyear,
cast(d_year as string) d_year,
cast(d_yearmonth as string) d_yearmonth,
cast(d_yearmonthnum as string) d_yearmonthnum,
cast(lo_discount as string) lo_discount,
cast(lo_quantity as string) lo_quantity,
cast(p_brand1 as string) p_brand1,
cast(p_category as string) p_category,
cast(p_mfgr as string) p_mfgr,
cast(s_city as string) s_city,
cast(s_nation as string) s_nation,
cast(s_region as string) s_region,
lo_revenue,
lo_extendedprice * lo_discount discounted_price,
lo_revenue - lo_supplycost net_revenue
FROM
ssb_1_flat_orc.customer, ssb_1_flat_orc.dates, ssb_1_flat_orc.lineorder,
ssb_1_flat_orc.part, ssb_1_flat_orc.supplier
where
lo_orderdate = d_datekey and lo_partkey = p_partkey
and lo_suppkey = s_suppkey and lo_custkey = c_custkey
INFO : We are setting the hadoop caller context from HIVE_SSN_ID:ae1bcabc-646b-4e05-96cb-40fe4990a916 to hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:__time, type:timestamp, comment:null), FieldSchema(name:c_city, type:string, comment:null), FieldSchema(name:c_nation, type:string, comment:null), FieldSchema(name:c_region, type:string, comment:null), FieldSchema(name:d_weeknuminyear, type:string, comment:null), FieldSchema(name:d_year, type:string, comment:null), FieldSchema(name:d_yearmonth, type:string, comment:null), FieldSchema(name:d_yearmonthnum, type:string, comment:null), FieldSchema(name:lo_discount, type:string, comment:null), FieldSchema(name:lo_quantity, type:string, comment:null), FieldSchema(name:p_brand1, type:string, comment:null), FieldSchema(name:p_category, type:string, comment:null), FieldSchema(name:p_mfgr, type:string, comment:null), FieldSchema(name:s_city, type:string, comment:null), FieldSchema(name:s_nation, type:string, comment:null), FieldSchema(name:s_region, type:string, comment:null), FieldSchema(name:lo_revenue, type:double, comment:null), FieldSchema(name:discounted_price, type:double, comment:null), FieldSchema(name:net_revenue, type:double, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10); Time taken: 4.972 seconds
INFO : We are resetting the hadoop caller context to HIVE_SSN_ID:ae1bcabc-646b-4e05-96cb-40fe4990a916
INFO : Setting caller context to query id hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10
INFO : Executing command(queryId=hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10): CREATE TABLE ssb_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.datasource" = "ssb_druid",
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(d_year || '-' || d_monthnuminyear || '-' || d_daynuminmonth as timestamp) as `__time`,
cast(c_city as string) c_city,
cast(c_nation as string) c_nation,
cast(c_region as string) c_region,
cast(d_weeknuminyear as string) d_weeknuminyear,
cast(d_year as string) d_year,
cast(d_yearmonth as string) d_yearmonth,
cast(d_yearmonthnum as string) d_yearmonthnum,
cast(lo_discount as string) lo_discount,
cast(lo_quantity as string) lo_quantity,
cast(p_brand1 as string) p_brand1,
cast(p_category as string) p_category,
cast(p_mfgr as string) p_mfgr,
cast(s_city as string) s_city,
cast(s_nation as string) s_nation,
cast(s_region as string) s_region,
lo_revenue,
lo_extendedprice * lo_discount discounted_price,
lo_revenue - lo_supplycost net_revenue
FROM
ssb_1_flat_orc.customer, ssb_1_flat_orc.dates, ssb_1_flat_orc.lineorder,
ssb_1_flat_orc.part, ssb_1_flat_orc.supplier
where
lo_orderdate = d_datekey and lo_partkey = p_partkey
and lo_suppkey = s_suppkey and lo_custkey = c_custkey
INFO : Query ID = hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Session is already open
INFO : Tez session missing resources, adding additional necessary resources
INFO : Dag name: CREATE TABLE ssb_druid
STORED BY...c_custkey(Stage-1)
INFO : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO : Status: Running (Executing on YARN cluster with App id application_1508387771410_0016)
--------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------
Map 1 llap SUCCEEDED 0 0 0 0 0
Map 2 llap SUCCEEDED 0 0 0 0 0
Map 4 llap SUCCEEDED 0 0 0 0 0
Map 5 llap SUCCEEDED 0 0 0 0 0
Map 6 llap SUCCEEDED 0 0 0 0 0
Reducer 3 ...... llap SUCCEEDED 1 1 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/06 [==========================>>] 100% ELAPSED TIME: 3.16 s
--------------------------------------------------------------------------------
INFO : Status: DAG finished successfully in 2.49 seconds
INFO :
INFO : Query Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : OPERATION DURATION
INFO : ----------------------------------------------------------------------------------------------
INFO : Compile Query 4.97s
INFO : Prepare Plan 2.05s
INFO : Submit Plan 0.72s
INFO : Start DAG 0.71s
INFO : Run DAG 2.49s
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO : Map 1 0.00 0 0 0 0
INFO : Map 2 0.00 0 0 0 0
INFO : Map 4 0.00 0 0 0 0
INFO : Map 5 0.00 0 0 0 0
INFO : Map 6 0.00 0 0 0 0
INFO : Reducer 3 1411.00 0 0 0 0
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : LLAP IO Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : VERTICES ROWGROUPS META_HIT META_MISS DATA_HIT DATA_MISS ALLOCATION USED TOTAL_IO
INFO : ----------------------------------------------------------------------------------------------
INFO : Map 1 0 0 0 0B 0B 0B 0B 0.00s
INFO : Map 2 0 0 0 0B 0B 0B 0B 0.00s
INFO : Map 4 0 0 0 0B 0B 0B 0B 0.00s
INFO : Map 5 0 0 0 0B 0B 0B 0B 0.00s
INFO : Map 6 0 0 0 0B 0B 0B 0B 0.00s
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : FileSystem Counters Summary
INFO :
INFO : Scheme: HDFS
INFO : ----------------------------------------------------------------------------------------------
INFO : VERTICES BYTES_READ READ_OPS LARGE_READ_OPS BYTES_WRITTEN WRITE_OPS
INFO : ----------------------------------------------------------------------------------------------
INFO : Map 1 0B 0 0 0B 0
INFO : Map 2 0B 0 0 0B 0
INFO : Map 4 0B 0 0 0B 0
INFO : Map 5 0B 0 0 0B 0
INFO : Map 6 0B 0 0 0B 0
INFO : Reducer 3 0B 1 0 43B 1
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : Scheme: FILE
INFO : ----------------------------------------------------------------------------------------------
INFO : VERTICES BYTES_READ READ_OPS LARGE_READ_OPS BYTES_WRITTEN WRITE_OPS
INFO : ----------------------------------------------------------------------------------------------
INFO : Map 1 0B 0 0 0B 0
INFO : Map 2 0B 0 0 0B 0
INFO : Map 4 0B 0 0 0B 0
INFO : Map 5 0B 0 0 0B 0
INFO : Map 6 0B 0 0 0B 0
INFO : Reducer 3 0B 0 0 0B 0
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : org.apache.tez.common.counters.DAGCounter:
INFO : NUM_SUCCEEDED_TASKS: 1
INFO : TOTAL_LAUNCHED_TASKS: 1
INFO : AM_CPU_MILLISECONDS: 2810
INFO : AM_GC_TIME_MILLIS: 13
INFO : File System Counters:
INFO : FILE_BYTES_READ: 0
INFO : FILE_BYTES_WRITTEN: 0
INFO : FILE_READ_OPS: 0
INFO : FILE_LARGE_READ_OPS: 0
INFO : FILE_WRITE_OPS: 0
INFO : HDFS_BYTES_READ: 0
INFO : HDFS_BYTES_WRITTEN: 43
INFO : HDFS_READ_OPS: 1
INFO : HDFS_LARGE_READ_OPS: 0
INFO : HDFS_WRITE_OPS: 1
INFO : org.apache.tez.common.counters.TaskCounter:
INFO : REDUCE_INPUT_RECORDS: 0
INFO : OUTPUT_RECORDS: 0
INFO : SHUFFLE_BYTES_DECOMPRESSED: 0
INFO : HIVE:
INFO : RECORDS_OUT_1_default.ssb_druid: 0
INFO : TaskCounter_Reducer_3_INPUT_Map_2:
INFO : REDUCE_INPUT_RECORDS: 0
INFO : SHUFFLE_BYTES_DECOMPRESSED: 0
INFO : TaskCounter_Reducer_3_OUTPUT_out_Reducer_3:
INFO : OUTPUT_RECORDS: 0
INFO : Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Moving data to directory hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/ssb_druid from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/.hive-staging_hive_2017-10-19_09-48-11_929_1281256378284617965-1/-ext-10002
INFO : Starting task [Stage-4:DDL] in serial mode
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: java.io.FileNotFoundException: File /tmp/workingDirectory/.staging-hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10/segmentsDescriptorDir does not exist.
INFO : Resetting the caller context to HIVE_SSN_ID:ae1bcabc-646b-4e05-96cb-40fe4990a916
INFO : Completed executing command(queryId=hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10); Time taken: 6.296 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: java.io.FileNotFoundException: File /tmp/workingDirectory/.staging-hive_20171019094811_def7c784-7ab4-4803-943e-65c5b36f8d10/segmentsDescriptorDir does not exist. (state=08S01,code=1)
Closing: 0: jdbc:hive2://sandbox.hortonworks.com:10500/default
[root@sandbox hive-druid-ssb]#
Labels: Apache Hive
10-02-2017
07:52 PM
3 Kudos
The use case: pull updates from multiple databases using a timestamp column, apply some simple transformations in NiFi, and stream the data into Hive transactional tables using the PutHiveStreaming processor. If you are not careful, this can easily lead to platform-wide instability for Hive.
Key points:
- ACID is for slowly changing tables, not for hundreds of concurrent queries trying to update the same partition.
- ACID tables are bucketed tables. Choose the correct number of buckets and aim for a uniform data distribution among them; otherwise data skew means a single CPU core ends up writing one bucket.
- The transaction manager and lock manager live in the Hive metastore: the transaction manager keeps the transaction state (open, commit, abort) and the lock manager maintains the necessary locks for transactional tables. The recommendation is to separate the Hive, Oozie, and Ambari databases and configure high availability for them.
- NiFi can overwhelm Hive ACID tables. NiFi streams data using the Hive Streaming API via the PutHiveStreaming processor, and the default timer-driven scheduling interval of 0 seconds hammers the Hive metastore. The recommendation is to microbatch the data from NiFi with a scheduling interval of around 1 minute or more (the higher the better). Batching 5,000-10,000 records gave the best throughput.
- Compaction of the table is necessary for ACID read performance. Compaction can be automatic or on-demand.
- Make sure to enable email alerting on the Hive metastore database when the aborted transaction count reaches a threshold of around 100,000 (a monitoring sketch follows the transaction counts below):
hive=# select count(*) from TXNS where txn_state='a' and TXN_ID not in (select tc_txnid from txn_components);
 count
-------
  3992
(1 row)
One of the data sources was overwhelming the metastore. After proper batching and scheduling, the metastore was able to clean up by itself. Also optimize the Hive metastore DB for performance.
10/02/2017 (7pm):
hive=# select txn_user, count(*) from txns where txn_state='a' group by txn_user ;
 txn_user |  count
----------+---------
 hive     |   73763
 nifi     | 1241297
(2 rows)
10/02/2017 (9am):
hive=# select txn_user, count(*) from txns where txn_state='a' group by txn_user ;
 txn_user | count
----------+-------
 hive     | 58794
 nifi     | 26962
(2 rows)
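As referenced above, a rough monitoring sketch that runs the same aborted-transaction query against a Postgres-backed metastore and flags the ~100,000 threshold (the connection details and the alerting hook are placeholders):
import psycopg2  # assumes a Postgres-backed Hive metastore, as in this setup

THRESHOLD = 100000  # alert threshold suggested above

conn = psycopg2.connect(host="metastore-db-host", dbname="hive",
                        user="hive", password="***")  # placeholder credentials
cur = conn.cursor()
# Aborted transactions that the cleaner has not yet removed (same query as above).
cur.execute("""
    SELECT count(*) FROM txns
    WHERE txn_state = 'a'
      AND txn_id NOT IN (SELECT tc_txnid FROM txn_components)
""")
aborted = cur.fetchone()[0]
if aborted > THRESHOLD:
    print("ALERT: %d aborted transactions in the metastore" % aborted)  # wire up email alerting here
conn.close()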
Tags: Data Ingestion & Streaming, FAQ, hive-streaming, NiFi
09-20-2017
03:29 PM
Did you solve this?
09-07-2017
12:53 AM
You can get each state of the YARN application from the YARN logs.
09-01-2017
02:05 PM
1 Kudo
Use NiFi to make the API call to Google Analytics and push the data into HDFS, preferably in files close to the HDFS block size to avoid the small-files problem. https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_user-guide/content/introduction.html
09-01-2017
01:43 AM
This article covers the configuration of HDF processors to integrate with HDP components.
Integration for the PutHiveStreaming processor:
Prerequisites: In HDP Ambari, enable the below properties for Hive.
hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on = true
hive.compactor.worker.threads > 0
BUG: The HiveStreaming processor does not pick up the updated Hive metastore principal from hive-site.xml.
Resolution:
1. Copy hive-site.xml, core-site.xml, hdfs-site.xml to the conf directory of NiFi
2. Clear the Hive Configuration Resources property
3. Create an ExecuteScript processor on the canvas, scheduled to run as often as the ticket needs to be refreshed (every hour should do for most setups), with the following Groovy script, replacing nifi@HDF.COM with your principal and /etc/nifi.headless.keytab with your keytab.
Steps: Create /etc/hdp/tgt.groovy and copy the code below into the file on all NiFi nodes.
Note: Change the Kerberos principal and keytab location as necessary.
import org.apache.nifi.nar.NarClassLoader
import org.apache.nifi.nar.NarClassLoaders

NarClassLoaders.instance.extensionClassLoaders.each { c ->
    if (c instanceof NarClassLoader && c.workingDirectory.absolutePath.contains('nifi-hive')) {
        def originalClassloader = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(c);
        try {
            def configClass = c.loadClass('org.apache.hadoop.conf.Configuration', true)
            def hiveConfigurator = c.loadClass('org.apache.nifi.util.hive.HiveConfigurator', true).newInstance();
            def config = hiveConfigurator.getConfigurationFromFiles('')
            hiveConfigurator.preload(config)
            c.loadClass('org.apache.hadoop.security.UserGroupInformation', true)
                .getMethod('setConfiguration', configClass).invoke(null, config)
            c.loadClass('org.apache.hadoop.security.UserGroupInformation', true)
                .getMethod('loginUserFromKeytab', String.class, String.class)
                .invoke(null, 'nifi@HDF.NET', '/etc/security/keytabs/nifi.headless.keytab')
            log.info('Successfully logged in')
            session.transfer(session.create(), REL_SUCCESS)
        } catch (Exception e) {
            log.error('Unable to login with keytab', e)
            session.transfer(session.create(), REL_FAILURE)
        } finally {
            Thread.currentThread().setContextClassLoader(originalClassloader);
        }
    }
}
Run the following on all the NiFi nodes:
chown nifi:nifi /etc/hdp/tgt.groovy; chmod +x /etc/hdp/tgt.groovy;
cp /etc/hdp/hive-site.xml /etc/nifi/conf/; cp /etc/hdp/core-site.xml /etc/nifi/conf/; cp /etc/hdp/hdfs-site.xml /etc/nifi/conf/;
chown nifi:nifi /etc/nifi/conf/hive-site.xml /etc/nifi/conf/core-site.xml /etc/nifi/conf/hdfs-site.xml;
Integration for the HBase processor:
Add the IP address and hostname of all HDP nodes to the /etc/hosts file on all NiFi nodes:
<IpAddress1> <Hostname1>
<IpAddress2> <Hostname2>
Integration for the Kafka processor:
Create a file at /etc/hdp/zookeeper-jaas.conf and copy in the configuration below.
Note: Change the Kerberos principal and keytab location as necessary.
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/nifi.headless.keytab"
  storeKey=true
  useTicketCache=false
  principal="nifi@HDF.COM";
};
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  renewTicket=true
  serviceName="kafka"
  useKeyTab=true
  keyTab="/etc/security/keytabs/nifi.headless.keytab"
  principal="nifi@HDF.COM";
};
Add the below configuration to the advanced nifi-bootstrap.conf in Ambari and restart the NiFi service.
java.arg.20=-Djava.security.auth.login.config=/etc/hdp/zookeeper-jaas.conf
Tags: Data Ingestion & Streaming, FAQ, NiFi
08-22-2017
11:14 PM
I have tested my Mac laptop against kerberized client clusters, so the issue is not the Kerberos client on the local Mac. The krb5.conf was copied from the KDC server, and I copied the Storm headless keytab from the sandbox to my local machine.
08-22-2017
11:11 PM
krb5kdc only requires port 88, and it is open. I don't use kadmin, so I didn't check port 749. It looks like a network problem between the local machine and the VM to me as well.
08-22-2017
01:40 PM
The sandbox services are working fine after the Kerberos setup, and I am able to obtain tickets on the VM. The problem is with obtaining a ticket on the local Mac to authenticate to the various UIs. I am using a headless keytab to obtain the ticket on the local Mac. Error while obtaining a ticket on the local Mac:
HW:Downloads nbalaji-elangovan$ env KRB5_TRACE=/dev/stdout kinit -kt /Users/nbalaji-elangovan/Downloads/storm.headless.keytab storm-sandbox@EXAMPLE.COM
2017-08-22T09:29:28 set-error: -1765328242: Reached end of credential caches
2017-08-22T09:29:28 set-error: -1765328243: Principal storm-sandbox@EXAMPLE.COM not found in any credential cache
2017-08-22T09:29:28 set-error: -1765328234: Encryption type des-cbc-md5-deprecated not supported
2017-08-22T09:29:28 Adding PA mech: ENCRYPTED_CHALLENGE
2017-08-22T09:29:28 Adding PA mech: ENCRYPTED_TIMESTAMP
2017-08-22T09:29:28 krb5_get_init_creds: loop 1
2017-08-22T09:29:28 KDC sent 0 patypes
2017-08-22T09:29:28 fast disabled, not doing any fast wrapping
2017-08-22T09:29:28 Trying to find service kdc for realm EXAMPLE.COM flags 0
2017-08-22T09:29:28 configuration file for realm EXAMPLE.COM found
2017-08-22T09:29:28 submissing new requests to new host
2017-08-22T09:29:28 connecting to host: udp ::1:kerberos (sandbox-hdf.hortonworks.com) tid: 00000001
2017-08-22T09:29:28 Queuing host in future (in 3s), its the 2 address on the same name: udp 127.0.0.1:kerberos (sandbox-hdf.hortonworks.com) tid: 00000002
2017-08-22T09:29:28 writing packet: udp ::1:kerberos (sandbox-hdf.hortonworks.com) tid: 00000001
2017-08-22T09:29:28 reading packet: udp ::1:kerberos (sandbox-hdf.hortonworks.com) tid: 00000001
2017-08-22T09:29:28 host completed: udp ::1:kerberos (sandbox-hdf.hortonworks.com) tid: 00000001
2017-08-22T09:29:28 set-error: -1765328378: Client unknown
2017-08-22T09:29:28 krb5_sendto_context EXAMPLE.COM done: 0 hosts 1 packets 1 wc: 0.067794 nr: 0.001019 kh: 0.000823 tid: 00000002
2017-08-22T09:29:28 krb5_get_init_creds: loop 2
2017-08-22T09:29:28 krb5_get_init_creds: processing input
2017-08-22T09:29:28 krb5_get_init_creds: got an KRB-ERROR from KDC
2017-08-22T09:29:28 set-error: -1765328378: Client (storm-sandbox@EXAMPLE.COM) unknown
2017-08-22T09:29:28 krb5_get_init_creds: KRB-ERROR -1765328378/Client (storm-sandbox@EXAMPLE.COM) unknown
kinit: krb5_get_init_creds: Client (storm-sandbox@EXAMPLE.COM) unknown
krb5.conf on the local Mac:
HW:Downloads nbalaji-elangovan$ cat /etc/krb5.conf
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = EXAMPLE.COM
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
EXAMPLE.COM = {
admin_server = sandbox-hdf.hortonworks.com
kdc = sandbox-hdf.hortonworks.com
}
/etc/hosts file on the local Mac:
HW:Downloads nbalaji-elangovan$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
#127.0.0.1 sandbox.hortonworks.com
255.255.255.255 broadcasthost
::1 localhost
127.0.0.1 localhost sandbox-hdf.hortonworks.com
#172.17.0.2 sandbox-hdf.hortonworks.com
## vagrant-hostmanager-start id: e133f37a-cb74-4747-9f85-151944869ced
#192.168.66.121 node1
## vagrant-hostmanager-end
Labels: Apache Storm, Kerberos
06-04-2017
08:17 PM
@Graham Martin Do the same steps work for CPU servers?
06-03-2017
11:55 PM
Hi @zhoussen, I am also trying to find an approach for TensorFlow. Did you figure it out?
06-03-2017
10:04 PM
I have read the Data Lake 3.0 blog on TensorFlow. Please provide suggestions for customers currently on an HDP 2.5.3 cluster. The HDP cluster is installed on RHEL 7.3; how about running TensorFlow in Docker on an HDP edge node? Does anyone have experience running TensorFlow on Spark?
05-23-2017
02:57 AM
Got it working.
Step 1: yum install -y R R-devel libcurl-devel openssl-devel
Step 2: In the Spark context, run install.packages("knitr")
Step 3: ln -s /etc/hbase/conf/hbase-site.xml /etc/spark/conf/hbase-site.xml
Step 4:
/usr/hdp/2.5.3.0-37/spark/bin/sparkR --master local \
--packages com.hortonworks:shc:1.0.0-1.6-s_2.10 \
--repositories http://repo.hortonworks.com/content/groups/public/ \
--conf "spark.executor.extraClassPath=/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar:/usr/hdp/2.5.3.0-37/spark/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-common-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-client-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-server-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-protocol-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/hbase/lib/guava-12.0.1.jar:/etc/hbase/conf/hbase-site.xml" \
--conf "spark.driver.extraClassPath=/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar:/usr/hdp/2.5.3.0-37/spark/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-common-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-client-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-server-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-protocol-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/hbase/lib/guava-12.0.1.jar:/etc/hbase/conf/hbase-site.xml"
05-23-2017
02:48 AM
SparkR fails only for the Hive table that uses HBase as storage; a Hive managed table works as expected.
/usr/hdp/2.5.3.0-37/spark/bin/sparkR --master local \
> --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 \
> --repositories http://repo.hortonworks.com/content/groups/public/ \
> --conf "spark.executor.extraClassPath=/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar:/usr/hdp/2.5.3.0-37/spark/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-common-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-client-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-server-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-protocol-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/hbase/lib/guava-12.0.1.jar:/etc/hbase/conf/hbase-site.xml" \
> --conf "spark.driver.extraClassPath=/usr/hdp/current/hive-client/lib/hive-hbase-handler.jar:/usr/hdp/2.5.3.0-37/spark/lib/spark-examples-1.6.2.2.5.3.0-37-hadoop2.7.3.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.5.3.0-37.jar:/usr/hdp/current/phoenix-client/phoenix-client.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-common-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-client-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-server-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/hbase-protocol-1.1.2.2.5.3.0-37.jar:/usr/hdp/2.5.3.0-37/hbase/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/hbase/lib/guava-12.0.1.jar:/etc/hbase/conf/hbase-site.xml"
05-23-2017
02:46 AM
17/05/22 21:01:11 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:327)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:302)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:167)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:162)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
at org.apache.hadoop.hbase.client.MetaScanner.allTableRegions(MetaScanner.java:324)
at org.apache.hadoop.hbase.client.HRegionLocator.getAllRegionLocations(HRegionLocator.java:89)
at org.apache.hadoop.hbase.util.RegionSizeCalculator.init(RegionSizeCalculat
Labels: Apache HBase, Apache Hive, Apache Spark
05-08-2017
04:23 PM
I don't see any messages in the HBase RegionServer logs when I initiate the request. All ports are open between HDF and HDP, and telnet to the RegionServer port 16020 works from the NiFi nodes. The HBase setup has 2 masters and 6 RegionServers, with Kerberos and Ranger enabled.
05-04-2017
01:37 PM
Fails with the below error in NiFi:
ExecuteSQL[id=9f743ca0-425b-16a5-acdb-64784f4e18db] Unable to execute SQL select query select * from TABLE1 due to org.apache.nifi.processor.exception.ProcessException: org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Thu May 04 09:32:46 EDT 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68339: row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=uXX,16020,1490984879149, seqNum=0
). No FlowFile to route to failure: org.apache.nifi.processor.exception.ProcessException: org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Thu May 04 09:32:46 EDT 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68339: row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=XXX,16020,1490984879149, seqNum=0
JDBC string: jdbc:phoenix:XXX:2181:nifi@USE.UCDP.NET:/etc/security/keytabs/nifi.headless.keytab
05-03-2017
12:32 AM
This will be used as a lookup table. I am looking for this integration because HDF 2.1.1 doesn't have a processor to get the value of a specific row key in HBase.
Labels: Apache NiFi, Apache Phoenix
03-13-2017
04:55 AM
Worked. Thanks
03-12-2017
05:12 AM
That doesn't work; I tried it.
03-12-2017
04:51 AM
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read \
.format("org.apache.phoenix.spark") \
.option("table", "TABLE1") \
.option("zkUrl", "<host>:2181:/hbase-secure") \
.load()
print(df)
Run using:
spark-submit --master local --jars /usr/hdp/current/phoenix-client/phoenix-client.jar,/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.5.3.0-37.jar --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar" spark_phoenix.py
Error:
17/03/11 23:20:03 INFO ZooKeeper: Initiating client connection, connectString=<host>:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@653a27b2
17/03/11 23:20:03 INFO ClientCnxn: Opening socket connection to server <host>/10.18.106.4:2181. Will not attempt to authenticate using SASL (unknown error)
17/03/11 23:20:03 INFO ClientCnxn: Socket connection established to <host>/10.18.106.4:2181, initiating session
17/03/11 23:20:03 INFO ClientCnxn: Session establishment complete on server <host>/10.18.106.4:2181, sessionid = 0x15a7b6220540b7e, negotiated timeout = 40000
Traceback (most recent call last):
File "/d3/app/bin/spark_phoenix.py", line 10, in <module>
.option("zkUrl", "<host>:2181:/hbase-secure") \
File "/usr/hdp/2.5.3.0-37/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 139, in load
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/2.5.3.0-37/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/usr/hdp/2.5.3.0-37/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load.
: java.sql.SQLException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
Sat Mar 11 23:20:03 EST 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68236: row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=usor7dhc01w06.use.ucdp.net,16020,1488988574688, seqNum=0
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2590)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$13.call(ConnectionQueryServicesImpl.java:2327)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:78)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2327)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:233)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:142)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:277)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:106)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:57)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Labels: Apache Phoenix, Apache Spark
02-25-2017
08:09 PM
Yes, I can use Phoenix and store the offset in NoSQL. I am actually looking for example code in PySpark.
02-25-2017
06:00 AM
1 Kudo
Store the Kafka offsets in the Ambari Postgres or Hive MySQL database and start consuming from the stored offsets in the next microbatch.
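A rough PySpark (Spark 1.6 direct-stream) sketch of that pattern; the load_offsets/save_offsets helpers, the topic name, and the broker address are placeholders for whatever table you keep the offsets in:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

sc = SparkContext(appName="kafka-offset-tracking")
ssc = StreamingContext(sc, 60)  # 60-second microbatch

def load_offsets():
    # Hypothetical helper: read the last committed (topic, partition, offset)
    # rows from your database; return {} on the very first run.
    return {TopicAndPartition("events", 0): 0}

def save_offsets(offset_ranges):
    # Hypothetical helper: upsert (topic, partition, untilOffset) into your database.
    for o in offset_ranges:
        print("%s %d %d" % (o.topic, o.partition, o.untilOffset))

kafka_params = {"metadata.broker.list": "broker-host:6667"}
stream = KafkaUtils.createDirectStream(ssc, ["events"], kafka_params,
                                       fromOffsets=load_offsets())

def process(rdd):
    offset_ranges = rdd.offsetRanges()  # capture offsets before transforming the RDD
    # ... process the microbatch here ...
    save_offsets(offset_ranges)         # persist offsets only after processing succeeds

stream.foreachRDD(process)
ssc.start()
ssc.awaitTermination()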
Labels: Apache Kafka, Apache Spark
02-20-2017
08:11 PM
1 Kudo
How do I submit a Python Spark job with a Kerberos keytab and principal?