Member since: 05-16-2016
Posts: 26
Kudos Received: 20
Solutions: 0
02-06-2018
04:59 PM
5 Kudos
In this tutorial we will walk through the steps to enable YARN node labels so that we can run LLAP on a specific set of nodes in the cluster (Stack Version HDP 2.6.3). The first step is to check whether any node labels already exist on the cluster:

[$] yarn cluster --list-node-labels
18/02/05 10:32:29 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8050
18/02/05 10:32:31 INFO client.AHSProxy: Connecting to Application History server at stocks.hdp/192.168.1.200:10200
Node Labels:

The empty list shows that there are no node labels configured on the cluster. We now try to add a new node label, which we will later assign to the llap queue so that LLAP runs only on those specific node(s). For now we will keep the node label exclusive, so that only the llap queue can access these nodes.

[$] yarn rmadmin -addToClusterNodeLabels "interactive(exclusive=true)"
18/02/05 10:33:25 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8141
addToClusterNodeLabels: java.io.IOException: Node-label-based scheduling is disabled. Please check yarn.node-labels.enabled

We get an exception saying node-label-based scheduling is disabled. You may or may not get this error depending on your cluster configuration; if you do, you will have to set the yarn.node-labels.enabled property to true in yarn-site.xml.
(Note: if you don't specify "(exclusive=...)", exclusive defaults to true.)

In Ambari you can enable node labels from the YARN configs; while we are at it, we will also enable preemption. With preemption enabled, higher-priority applications do not have to wait while lower-priority applications hold the available capacity: under-served queues can begin to claim their allocated cluster resources almost immediately, without waiting for applications in other queues to finish. We can now create the node labels:

[yarn@stocks ~]$ yarn rmadmin -addToClusterNodeLabels "interactive(exclusive=true)"
18/02/05 17:00:42 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8141
[yarn@stocks ~]$ yarn rmadmin -addToClusterNodeLabels "high-mem(exclusive=true)"
18/02/05 17:00:54 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8141
[yarn@stocks ~]$ yarn rmadmin -addToClusterNodeLabels "balanced(exclusive=true)"
18/02/05 17:02:37 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8141

We list the node labels again:

[yarn@stocks ~]$ yarn cluster --list-node-labels
18/02/06 09:41:34 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8050
18/02/06 09:41:35 INFO client.AHSProxy: Connecting to Application History server at stocks.hdp/192.168.1.200:10200
Node Labels: <high-mem:exclusivity=true>,<interactive:exclusivity=true>,<balanced:exclusivity=true>

We now have three node labels: high-mem will be used for Spark jobs, interactive will be used for LLAP, and balanced will be used for the rest of the jobs.
We can now go back to Ambari and open the YARN Queue Manager; the node labels are now available to the YARN queues. We assign the interactive label to the llap queue, the high-mem label to the spark queue, and the balanced label to the default queue, giving 100% of each node label's capacity to its queue (exclusive; adjust this based on your workload or if multiple queues share a node label).

Node Label Assignment for LLAP Queue
Node Label Assignment for Spark Queue
Node Label Assignment for Default Queue

If you need to remove node labels, use the command below:

[yarn@stocks ~]$ yarn rmadmin -removeFromClusterNodeLabels "<Node Label Name1>,<Node Label Name2>"

Make sure the node labels are not assigned to any queue, otherwise you will get an exception like "cannot remove nodes label=<label>, because queue=blahblah is using this label. Please remove label on queue before remove the label".
If we look at the YARN Resource Manager UI, we can now see the node labels. Note that none of the node labels have any active NodeManagers assigned to them yet, and hence no resources (<memory:0, vCores:0>). Let's assign some nodes to each of the labels and revisit the YARN UI. Since there are only 3 nodes in this environment, we will assign a single node to each node label; you can decide how many nodes to assign to each label based on the amount of memory and vCores each type of workload needs.

[yarn@stocks ~]$ yarn rmadmin -replaceLabelsOnNode "dn1.stocks.hdp=interactive"
18/02/06 10:10:26 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8141
[yarn@stocks ~]$ yarn rmadmin -replaceLabelsOnNode "dn2.stocks.hdp=balanced"
18/02/06 10:10:44 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8141
[yarn@stocks ~]$ yarn rmadmin -replaceLabelsOnNode "stocks.hdp=interactive"
18/02/06 10:10:55 INFO client.RMProxy: Connecting to ResourceManager at stocks.hdp/192.168.1.200:8141

When we revisit the YARN UI, we will see the resources assigned to each node label. We now start LLAP on the cluster, assigning it the llap queue backed by the two interactive-labeled nodes, with memory per daemon ~78 GB, in-memory cache per daemon ~29 GB, and 12 executors per daemon. If we look at the running applications in YARN, we will see 1 Tez AM container, 1 Slider Application Master, and 2 LLAP daemons running.
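Since the spark queue now owns the high-mem label, a Spark application can be pinned to those nodes. A minimal sketch, assuming the queue and label names above and the standard Spark-on-YARN node-label properties:

import org.apache.spark.SparkConf

// Sketch only: queue and label names come from this article; adjust for your own cluster.
val conf = new SparkConf()
  .setAppName("HighMemJob")
  .set("spark.yarn.queue", "spark")                           // queue that owns the high-mem label
  .set("spark.yarn.am.nodeLabelExpression", "high-mem")       // where the ApplicationMaster may run
  .set("spark.yarn.executor.nodeLabelExpression", "high-mem") // where executors may run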
11-19-2017
07:31 PM
It's probably a Spark config issue. Can you share the detailed log? The information you've shared isn't enough to identify the root cause.
11-19-2017
07:28 PM
Check this post https://community.hortonworks.com/questions/34815/zeppelin-unable-to-run-multiple-hive-queries.html
11-19-2017
07:25 PM
Can you share the YARN queue configs for "data-science-queue"? To debug, can you set the PySpark submit args in your Python code, increasing the number of initial executors, to see if it consumes more than 23 vcores?

import os

memory = '4g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
11-19-2017
07:17 PM
Spark uses lazy execution, so all of the code above only runs when you execute the final step of converting the DataFrame to an R data frame.

a) Cache the table after you register the temp table: sqlContext.cacheTable("someTable")
b) Check the Spark UI to see the job stages and find out which step is taking the longest (probably the Oracle query): https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
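In Scala, that caching pattern looks like the sketch below (Spark 1.6-style sqlContext; the JDBC options and table names are placeholders standing in for the Oracle query, not part of the original question):

// Placeholder JDBC read standing in for the Oracle query; cache before repeated use.
val oracleDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // placeholder connection string
  .option("dbtable", "SALES")                              // placeholder table name
  .load()
oracleDF.registerTempTable("someTable")
sqlContext.cacheTable("someTable")                         // lazy: marks the table for caching
sqlContext.sql("SELECT COUNT(*) FROM someTable").show()    // first action materializes the cache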
11-19-2017
06:56 PM
You might want to add a main method in your Scala object:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(conf)
    val data = sc.textFile("C:\\Users\\scala.txt")
    val result = data.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    result.collect.foreach(println)
  }
}
08-12-2017
05:10 AM
3 Kudos
You can modify the Hive View 2.0 settings, set "use.hive.interactive.mode" to true, and restart the Hive View to run in LLAP mode (see the attached screenshot: screen-shot-2017-08-12-at-10848-am.png).
07-15-2017
01:35 AM
Can you check nifi-app.log? It is located at <nifi-directory>/logs/nifi-app.log.
07-14-2017
01:55 AM
This article shows how to read CSV files that do not have header information in the first row. We will then specify the schema for both DataFrames and join them together.
import org.apache.spark.sql.types._
val pathA = "hdfs:/tpc-ds/data/store_sales"
val pathB = "hdfs:/tpc-ds/data/store/"
// For Spark 2.x use -> val df = spark.read.option("header", true).csv(path)
val A_df = sqlContext.read.format("com.databricks.spark.csv")
.option("header","false")
.option("inferSchema","false")
.option("delimiter","|")
.load(pathA)
// Assign column names to the Store Sales dataframe
val storeSalesDF = A_df.select(
A_df("_c0").cast(IntegerType).as("SS_SOLD_DATE_SK"),
A_df("_c1").cast(IntegerType).as("SS_SOLD_TIME_SK"),
A_df("_c2").cast(IntegerType).as("SS_ITEM_SK"),
A_df("_c7").cast(IntegerType).as("SS_STORE_SK")
)
val B_df = sqlContext.read.format("com.databricks.spark.csv")
.option("header","false")
.option("inferSchema","false")
.option("delimiter","|")
.load(pathB)
// Assign column names to the Region dataframe
val storeDF = B_df.select(
B_df("_c0").cast(IntegerType).as("S_STORE_SK"),
B_df("_c1").cast(StringType).as("S_STORE_ID")
B_df("_c5").cast(StringType).as("S_STORE_NAME")
)
val joinedDF = storeSalesDF.join(storeDF,
storeSalesDF("SS_STORE_SK") === storeDF("S_STORE_SK")
)
joinedDF.take(5)
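On Spark 2.x the external spark-csv package is no longer needed. Below is a minimal sketch of the same join with the built-in CSV reader; the SparkSession named spark, the column choices, and the broadcast-join hint are assumptions rather than part of the original example:

import org.apache.spark.sql.functions.{broadcast, col}
import org.apache.spark.sql.types.{IntegerType, StringType}

// Read the headerless, pipe-delimited files and name only the columns we need.
val sales2 = spark.read
  .option("header", "false")
  .option("delimiter", "|")
  .csv(pathA)
  .select(
    col("_c0").cast(IntegerType).as("SS_SOLD_DATE_SK"),
    col("_c7").cast(IntegerType).as("SS_STORE_SK"))

val store2 = spark.read
  .option("header", "false")
  .option("delimiter", "|")
  .csv(pathB)
  .select(
    col("_c0").cast(IntegerType).as("S_STORE_SK"),
    col("_c5").cast(StringType).as("S_STORE_NAME"))

// The store dimension is small, so hint a broadcast join and avoid shuffling store_sales.
val joined2 = sales2.join(broadcast(store2), sales2("SS_STORE_SK") === store2("S_STORE_SK"))
joined2.show(5)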
07-14-2017
01:30 AM
There is no setCassandraConf function in the org.apache.spark classes. Check out the link below for how to connect to Cassandra with the spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md
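For reference, a minimal connection sketch following the linked connector documentation; the host, keyspace, and table names are placeholders, and the spark-cassandra-connector jar must be on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable() to SparkContext

val conf = new SparkConf()
  .setAppName("CassandraRead")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder Cassandra contact point
val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "my_table")    // placeholder keyspace/table
println(rdd.count())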
07-14-2017
01:24 AM
Where is the data located (Hive, HDFS)? Can you share your cluster specs: how many nodes, and the number of cores / RAM per node?
07-13-2017
05:11 PM
Have you tried using the SQL explode function? I haven't tried this with SparkR, but I have used it previously to flatten a hierarchical JSON structure.

import org.apache.spark.sql.functions.explode
val jsonRDD = sc.wholeTextFiles("mydataset.json")
val jsonDF = sqlContext.read.json(jsonRDD.values)
jsonDF.createOrReplaceTempView("jd")
val sqlDF = spark.sql("SELECT explode(data) FROM jd")
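The same flattening can also be written with the DataFrame API instead of a SQL string, using the explode function imported above (this assumes data is an array column, as in the query):

val flattenedDF = jsonDF.select(explode(jsonDF("data")).as("record"))
flattenedDF.show(5)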
07-12-2017
07:28 PM
@Srinivasarao Daruna, can you give details of the hardware and Spark config?
07-12-2017
07:26 PM
Have you tried caching the tables (or a subset) before executing the queries? Keep in mind that caching a DataFrame is lazy, which means Spark only caches the rows used by the next processing event. So if you run a query on that DataFrame and only scan 100 rows, only those rows are cached, not the entire table. If you run CACHE TABLE MyTableName in SQL, however, it defaults to eager caching and caches the entire table. You can choose lazy caching in SQL like so: CACHE LAZY TABLE Sales_Data_1998
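A short sketch of the difference, assuming a Spark 2.x SparkSession named spark and the Sales_Data_1998 table from the example:

// Lazy DataFrame caching: only the partitions touched by the next action are materialized.
val salesDF = spark.table("Sales_Data_1998")
salesDF.cache()
salesDF.limit(100).collect()                   // caches only what this scan reads

// Eager SQL caching: scans and caches the whole table immediately.
spark.sql("CACHE TABLE Sales_Data_1998")

// Lazy caching from SQL, and releasing the cache when done.
spark.sql("CACHE LAZY TABLE Sales_Data_1998")
spark.sql("UNCACHE TABLE Sales_Data_1998")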
06-29-2017
05:48 PM
import re
from pyspark.sql import Row

APACHE_ACCESS_LOG_PATTERN = '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'

# Returns a Row containing the parts of the Apache access log.
def parse_apache_log_line(logline):
    match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
    if match is None:
        # Optionally, you can change this to just ignore the line if each line of data is not critical.
        # For this example, we want to ensure that the format is consistent.
        raise Exception("Invalid logline: %s" % logline)
    return Row(
        ipAddress = match.group(1),
        clientIdentd = match.group(2),
        userId = match.group(3),
        dateTime = match.group(4),
        method = match.group(5),
        endpoint = match.group(6),
        protocol = match.group(7),
        responseCode = int(match.group(8)),
        contentSize = long(match.group(9)))

log_files = "hdfs://dataset/apache_logs/"
raw_log_files = sc.textFile(log_files)
raw_log_files.count()
parsed_log_files = raw_log_files.map(parse_apache_log_line)
parsed_log_files.toDF().registerTempTable("log_data")
%scala
// HELPER FUNCTION - Register parseDate UDF to use with queries later
def parseDate(rawDate: String): Long = {
  val dtParser = new java.text.SimpleDateFormat("dd/MMM/yyyy:hh:mm:ss")
  val splitted = rawDate.split(" ")
  val futureDt = splitted(0)
  val offset = splitted(1).asInstanceOf[String].toLong
  val hourOffset = (offset.toInt / 100)
  val minuteOffset = (offset - hourOffset * 100).toInt
  val totalOffset = hourOffset * 60 * 60 + minuteOffset * 60
  (dtParser.parse(futureDt).getTime() / 1000) + totalOffset
}

val example = "21/Jun/2014:10:00:00 -0730"
parseDate(example)

sqlContext.udf.register("parseDate", parseDate(_: String): Long)
%sql
-- Used the parsed logs and the date helper UDF to execute SQL Query
select responseCode, ipAddress, to_date(cast(parseDate(dateTime) as timestamp)) as date, count(*) as NoOfRequests, sum(contentSize) as TotalContentSize
from log_data
group by responseCode,ipAddress,to_date(cast(parseDate(dateTime) as timestamp))
order by count(*) desc
06-19-2017
06:35 PM
2 Kudos
Hi Abhi,

The tutorial below from Micheal goes through how to modify JSON using the ReplaceText and/or UpdateAttribute processors for Twitter data:
https://community.hortonworks.com/articles/57803/using-nifi-gettwitter-updateattributes-and-replace.html

Regards, Muji
05-12-2017
05:44 PM
1 Kudo
Have you checked out the gist below? It gives an example of how to call the API; you can use NiFi to enrich the data periodically before you join. https://gist.github.com/ramhiser/f09a71d96a4dec80994c
05-12-2017
05:35 PM
1 Kudo
A better option would be to use NiFi for Hive streaming; NiFi has a pre-built processor for streaming data into Hive. Check out the post below for an example: https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
05-12-2017
05:33 PM
1 Kudo
Looks like there is an exception causing the SparkContext to shut down. Can you check which application shows as FAILED in the YARN Resource Manager UI? Click through it and you can find the logs of the individual containers, which should show the failure. (My hunch is that it's probably an unhandled exception while reading from the JDBC connection.)
05-12-2017
05:27 PM
Can you share an example where a filter executes more slowly on a DataFrame than on an RDD? Which version of Spark are you on, and are you using Python or Scala?
02-07-2017
07:36 PM
2 Kudos
HAWQ is built on Postgres 8.2 (additions and backporting have been done recently for some features, but the fork was from 8.2).
Below is a document which highlights the differences between DB2 SQL and Postgres SQL:
https://wiki.postgresql.org/images/d/d1/DB2UDB-to-PG.pdf

There are also automated tools which can migrate SQL from DB2 to Postgres, e.g. Ispirer (http://www.ispirer.com/products/db2-luw-to-postgresql-migration).
01-26-2017
04:27 PM
Getting error "Stack creation failed, cannot create cluster" when trying to create a cluster on GCP Followed instructions (http://sequenceiq.com/cloudbreak-docs/latest/gcp/) Used the following image from S3 curl -Ls s3.amazonaws.com/public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_1.6.2_$(uname)_x86_64.tgz | sudo tar -xz -C /bin cbd Log Snipper Below, created credential, enable API's, create P12 key for the service account. /cbreak_cloudbreak_1 | 2017-01-26 15:03:25,408 [http-nio-8080-exec-9] create:166 INFO c.s.c.s.c.AmbariClusterService - [owner:7b4ff2b0-12a6-4a9b-ba22-86ab71158a4f] [type:STACK] [id:2] [name:test1] Cluster requested [BlueprintId: 1]
/cbreak_cloudbreak_1 | 2017-01-26 15:03:25,417 [http-nio-8080-exec-9] toResponse:22 ERROR c.s.c.c.m.SpringBadRequestExceptionMapper - [owner:cloudbreak] [type:cloudbreakLog] [id:undefined] [name:cb] Stack creation failed, cannot create cluster. Looks like a credential issue not able to pinpoint what? no entries in GCP logs, Any Ideas??
Labels: Hortonworks Cloudbreak
11-21-2016
09:37 PM
3 Kudos
The scope of this article is to explore ways to simulate a slowly changing dimension (SCD Type 2) given the append-only nature of HAWQ. The idea is to capture the delta from source systems and include a flag in each row that shows whether the row is an Append (A), Update (U), or Delete (D), so that we can capture this information in our base tables in HAWQ.
In an enterprise data lake you will come across scenarios where you have to capture updates/deletes from source systems. Since HAWQ is append-only, we need an alternative way to simulate this, which we discuss in this article.
For each source table we will create a corresponding base table, and then create a view on top of it for access. The base table has two additional attributes, "update_version" and "delete_flag". The "update_version" column, of type "bigserial", is an auto-incrementing number which tracks which version of a row is the latest; the largest number signifies the most recent record. The "delete_flag" column indicates that the row has been deleted; older versions are still retained, so the base table holds the history of all changes made to the record in the source system. The script creates the table DDL with parquet orientation and snappy compression (these give a good balance of performance and compression); they can be modified overall or for individual tables as required.
-- Create Base Table --
DROP TABLE IF EXISTS customer_table CASCADE;
CREATE TABLE customer_table (
update_version BIGSERIAL,
delete_flag BOOLEAN DEFAULT FALSE,
key VARCHAR(255),
fieldA text,
fieldB text,
fieldC text
)
WITH (APPENDONLY=TRUE, ORIENTATION=PARQUET, COMPRESSTYPE=SNAPPY);
-- Create View on Base Table
CREATE OR REPLACE VIEW customer_view AS
SELECT key, fieldA, fieldB, fieldC FROM
  (SELECT DISTINCT ON (key)
     key, delete_flag, fieldA, fieldB, fieldC
   FROM customer_table
   ORDER BY key, update_version DESC, delete_flag, fieldA, fieldB, fieldC) AS latest
WHERE delete_flag IS FALSE;
In this article we will take SQL Server as an example source system. The first step is to create the tables in HAWQ from SQL Server; usually this runs into hundreds and sometimes a couple of thousand tables. The helper T-SQL script below reads the source tables, converts the data types to HAWQ data types, and generates the DDL, which will help migrate the table DDL from SQL Server to HAWQ.

-- The script uses a metadata table in the public schema to capture the primary key information from SQL Server; create this table before executing the view generation script
CREATE TABLE public.metadata(
  schemaName text,
  tableName text,
  key text
);

Generate the HAWQ DDL from the SQL Server tables using the T-SQL script below:
if exists (select * from sys.objects where type = 'P' AND name = 'generate_HAWQ_DDL')
drop procedure generate_HAWQ_DDL;
GO
create procedure generate_HAWQ_DDL
as
DECLARE @tblName varchar(100), @SchemaName varchar(100),@ColName varchar(200), @ColType varchar(200), @i int,@KeyCol varchar(200), @KeyFull varchar(2000)
PRINT '---Starting DDL Generation ---'
PRINT 'CREATE TABLE public.metadata(schemaname text, tablename text, key text);'
DECLARE tbl_cursor CURSOR FOR
select table_schema, table_name
from information_schema.tables
where TABLE_TYPE = 'BASE TABLE'
OPEN tbl_cursor
FETCH NEXT FROM tbl_cursor INTO @SchemaName, @tblName
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT 'CREATE TABLE ' + @SchemaName + '.' + @tblName + '('
PRINT ' update_version BIGSERIAL,'
PRINT ' delete_flag BOOLEAN DEFAULT FALSE,'
DECLARE col_cursor CURSOR FOR
SELECT sub.column_name,
CASE WHEN sub.datatype = 'char' THEN 'character'
WHEN sub.datatype = 'nchar' THEN 'character'
WHEN sub.datatype = 'datetime' THEN 'timestamp'
WHEN sub.datatype = 'datetime2' THEN 'timestamp'
WHEN sub.datatype = 'datetimeoffset' THEN 'timestamptz'
WHEN sub.datatype = 'decimal' THEN 'numeric'
WHEN sub.datatype = 'float' THEN 'float8'
WHEN sub.datatype = 'real' THEN 'float8'
WHEN sub.datatype = 'int' THEN 'integer'
WHEN sub.datatype = 'bit' THEN 'boolean'
WHEN sub.datatype = 'nvarchar' THEN 'varchar'
WHEN sub.datatype = 'smalldatetime' THEN 'timestamp'
WHEN sub.datatype = 'smallmoney' THEN 'numeric'
WHEN sub.datatype = 'money' THEN 'numeric'
WHEN sub.datatype = 'sysname' THEN 'varchar'
WHEN sub.datatype = 'tinyint' THEN 'smallint'
WHEN sub.datatype = 'uniqueidentifier' THEN 'varchar(36)'
ELSE sub.datatype END + CASE WHEN sub.datatype in ('nchar', 'char', 'varchar', 'nvarchar', 'sysname')
AND sub.length <> -1 THEN '(' + cast(sub.length as varchar) + ')'
ELSE '' END as datatype
FROM (SELECT REPLACE(REPLACE(LOWER(sc.name), '\"', ''), '.', '_') column_name,
st.name as datatype,
sc.max_length as length,
sc.column_id
FROM sys.objects so
JOIN sys.columns sc ON so.object_id = sc.object_id
JOIN sys.schemas su ON so.schema_id = su.schema_id
JOIN sys.types st ON sc.system_type_id = st.system_type_id
AND st.system_type_id = st.user_type_id
WHERE so.type in ('U', 'V')
AND su.name = @SchemaName
AND so.name = @tblName) sub
WHERE sub.datatype not in ('binary', 'image', 'timestamp', 'xml', 'varbinary', 'text', 'ntext', 'sql_variant', 'hierarchyid')
ORDER BY sub.column_id
OPEN col_cursor
SET @i = 0
FETCH NEXT FROM col_cursor INTO @ColName, @ColType
WHILE @@FETCH_STATUS = 0
BEGIN
IF (@i = 0 )
PRINT @ColName + ' ' + @ColType
ELSE
PRINT ',' + @ColName + ' ' + @ColType
FETCH NEXT FROM col_cursor INTO @ColName, @ColType
SET @i= @i + 1
END
CLOSE col_cursor
DEALLOCATE col_cursor
PRINT ') '
PRINT ' WITH (APPENDONLY=TRUE, ORIENTATION=parquet, COMPRESSTYPE=snappy);'
DECLARE key_cursor CURSOR FOR
SELECT LOWER(c.name) as COLUMN_NAME
FROM sys.objects o
JOIN sys.schemas s ON o.schema_id = s.schema_id
JOIN sys.indexes i ON o.object_id = i.object_id
JOIN sys.index_columns ic ON o.object_id = ic.object_id AND i.index_id = ic.index_id
JOIN sys.columns c ON o.object_id = c.object_id AND ic.column_id = c.column_id
WHERE o.name = @tblName
AND s.name = @SchemaName
AND i.is_primary_key = 1
ORDER BY ic.key_ordinal
OPEN key_cursor
FETCH NEXT FROM key_cursor INTO @KeyCol
SET @i = 0;
SET @KeyFull = '';
WHILE @@FETCH_STATUS = 0
BEGIN
if (@i = 0)
SET @KeyFull = @KeyCol
else
SET @KeyFull = @KeyFull + ',' + @KeyCol
SET @i = @i + 1
FETCH NEXT FROM key_cursor INTO @KeyCol
END
PRINT ' INSERT INTO public.metadata (schemaName, tableName, key) VALUES(''' + @SchemaName +
''',''' + @tblName + ''','''+ @KeyFull + ''');'
CLOSE key_cursor
DEALLOCATE key_cursor
FETCH NEXT FROM tbl_cursor INTO @SchemaName, @tblName
END
CLOSE tbl_cursor
DEALLOCATE tbl_cursor
GO
-- Execute the Procedure ---
EXEC generate_HAWQ_DDL;
Once the base tables have all been created in HAWQ, and as source data is inserted into HAWQ, we need to make sure we set the delete flag to true or false based on the source CDC operation and let the update_version bigserial auto-increment. Don't forget to analyze the base tables once the data is populated, and again over time after more than 10% of a table has changed. Below is the helper script which generates the views on top of the base tables; execute it on HAWQ using either psql or pgAdmin. It produces a script with the view DDLs; execute that output script to create all the views. We've added a "_table" suffix to all base tables so that the views can keep the source system table names, minimizing changes to existing SQL queries.

Helper PL/pgSQL Script to Generate Views on top of Base Tables

create or replace function gen_meta_cdc4()
returns void as
$body$
declare
SQL text;
kv RECORD;
cv RECORD;
col_list text;
i int;
mk text;
begin
FOR kv IN select * from information_schema.tables WHERE table_schema NOT IN ('pg_catalog','information_schema','hawq_toolkit') AND table_type = 'BASE TABLE' LOOP
col_list := '';
FOR cv IN SELECT column_name
FROM information_schema.columns X
WHERE x.table_schema = kv.table_schema and x.table_name = kv.table_name LOOP
col_list := col_list || ',' || cv.column_name;
END LOOP;
--RAISE NOTICE 'Column List:% for Table:% ', kv.table_name, col_list ;
SELECT key into mk FROM public.metadata WHERE tablename = kv.table_name AND schemaname = kv.table_schema ;
RAISE NOTICE 'CREATE OR REPLACE VIEW %.%_view AS SELECT % FROM (SELECT DISTINCT ON (%) % FROM % ORDER BY % , %) AS latest WHERE delete_flag IS FALSE ;',kv.table_schema,kv.table_name, substring(col_list, 2),mk, substring(col_list, 2), kv.table_name, mk, substring(col_list, 2);
END LOOP;
end
$body$
LANGUAGE 'plpgsql';
-- SELECT gen_meta_cdc4();
Credits to Antonio Petrole for the view generation script and to Micheal Andre Pearce for the design.
08-18-2016
01:07 PM
2 Kudos
I have a customer who is going through a hardware upgrade cycle on their clusters and is looking for guidance regarding the use of IPA Server for DNS, LDAP, and Kerberos in multiple environments. The client has multiple HDP clusters: Dev, QA, Prod. Is it better to set up a separate IPA server for each environment, or can one IPA server be used in a multi-tenancy mode across environments/realms? I am curious how most HDP customers handle this. Do they set up separate domains for each environment (e.g. dev.client.com, qa.client.com, prod.client.com) that map to different Kerberos realms? It seems simpler to have one realm managed by a single IPA server. Multi-tenancy documentation for IPA Server seems pretty sparse; I have found only one source, and it covers version 3, while we are using version 4 as it has some fixes related to setting up trusts with Active Directory. If IPA multi-tenancy works, would that be an alternative worth looking into?