Member since: 10-05-2015 · Posts: 5 · Kudos Received: 20 · Solutions: 1

My Accepted Solutions: 1 accepted solution (2695 views, posted 12-20-2016 04:38 PM)
12-20-2016 04:38 PM · 1 Kudo
The QlikView Hybrid Solution is also referred to as "QlikView Direct Discovery": data is not cached in QlikView but is instead loaded on demand, directly from the source. This allows Qlik to run against the vast amount of data in a Hadoop Data Lake without memory restrictions. One simple modification to the load script is replacing the word "SELECT" with "DIRECT QUERY" within the SQL statement. You can find the documentation from Qlik at https://help.qlik.com/en-US/connectors/Subsystems/ODBC_connector_QV_help/Content/1.0/ApacheHive/Load-Apache-hive-data-with-Edit-script.htm. There are other functions related to the "DIRECT QUERY" option; further documentation is at https://help.qlik.com/en-US/qlikview/12.1/Subsystems/Client/Content/DirectDiscovery/direct-discovery-introduction.htm.
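As a rough sketch of that change (the table and field names below are placeholders, and the split between DIMENSION and MEASURE fields depends on your data model; see the Qlik documentation linked above):

```
// Standard in-memory load (data is cached in QlikView):
// SQL SELECT product_id, region, sales_amount FROM hive_db.sales;

// Direct Discovery (data stays in the Hadoop source and is queried on demand):
DIRECT QUERY
    DIMENSION product_id, region
    MEASURE sales_amount
    FROM hive_db.sales;
```

Fields declared as DIMENSION are loaded into memory so associations can be made, while MEASURE fields remain in the source and are aggregated on demand.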
04-25-2016 04:32 PM · 6 Kudos
Incremental Updates: A Strategy for Appending Records in Hive

As part of the Stinger.next initiative, we expect to see Hive supporting full CRUD operations (Insert, Select, Update, Delete). While some transaction support exists for Hive today, it is still evolving with regard to performance and scalability. In the meantime, customers may still need to work with the traditional options for Hive table integration: OVERWRITE or APPEND.

The OVERWRITE option requires moving the complete record set from source to Hadoop. While this approach may work for smaller data sets, it may be prohibitive at scale. The APPEND option can limit data movement to only new or updated records. As true Inserts and Updates are not yet available in Hive, we need to consider a process of preventing duplicate records as Updates are appended to the cumulative record set.

In this article, we will look at a strategy for appending Updates and Inserts from delimited and RDBMS sources to existing Hive table definitions. While there are several options within the Hadoop platform for achieving this goal, our focus will be on a process that uses standard SQL within the Hive toolset.
Hive provides multiple table definition options: External, Local, and View.

External Tables are the combination of Hive table definitions and HDFS-managed folders and files. The table definition exists independently of the data, so that, if the table is dropped, the HDFS folders and files remain in their original state.

Local Tables are Hive tables that are directly tied to the source data. The data is physically tied to the table definition and will be deleted if the table is dropped.

Views, just as with traditional RDBMS, are stored SQL queries that support the same READ interaction as Hive tables, yet do not store any data of their own. Instead, the data is stored and sourced from the Hive tables referenced in the stored SQL query.
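A minimal sketch of the difference in drop behavior, using hypothetical table names and paths:

```
-- External table: Hive stores only the definition; the data lives in an HDFS folder.
CREATE EXTERNAL TABLE external_example (id string, val string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external_example';

-- Dropping it removes the definition; the files under /user/hive/external_example remain.
DROP TABLE external_example;

-- Managed ("local") table: Hive owns both the definition and the data.
CREATE TABLE managed_example (id string, val string);

-- Dropping it removes the definition AND deletes the underlying data files.
DROP TABLE managed_example;
```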
The following process outlines a workflow that leverages all of the above in four steps (Ingest, Reconcile, Compact, Purge). The tables and views that will be a part of the Incremental Update Workflow are:

- base_table: A Hive Local table that initially holds all records from the source system. After the initial processing cycle, it maintains a copy of the most up-to-date synchronized record set from the source. At the end of each processing cycle, it is overwritten by the reporting_table (as explained in Step 4: Purge).
- incremental_table: A Hive External table that holds the incremental change records (INSERTS and UPDATES) from the source system. At the end of each processing cycle, it is cleared of content (as explained in Step 4: Purge).
- reconcile_view: A Hive View that combines and reduces the base_table and incremental_table content to show only the most up-to-date records. It is used to populate the reporting_table (as explained in Step 3: Compact).
- reporting_table: A Hive Local table that holds the most up-to-date records for reporting purposes. It is also used to overwrite the base_table at the end of each processing run.
Step 1: Ingest

Depending on whether direct access to the RDBMS source system is available, you may opt for either a File Processing method (when no direct access is available) or RDBMS Processing (when database client access is available). Regardless of the ingest option, the processing workflow in this article requires:

- A one-time, initial load to move all data from the source table to Hive.
- On-going, "Change Only" data loads from the source table to Hive.

Below, both File Processing and database-direct (Sqoop) ingest are discussed.

File Processing

For this article, we assume that a file or set of files within a folder will have a delimited format and will have been generated from a relational system (i.e., records have unique keys or identifiers). Files will need to be moved into HDFS using standard ingest options:

- WebHDFS: Primarily used when integrating with applications, a Web URL provides an upload end-point into a designated HDFS folder.
- NFS: Appears as a standard network drive and allows end-users to use standard copy-paste operations to move files from standard file systems into HDFS.

Once the initial set of records is moved into HDFS, subsequent scheduled events can move files containing only new Inserts and Updates.
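For example, a delimited change file could be landed in the external table folder with either approach; the host names, ports, and file names below are placeholders:

```
# HDFS client (e.g., from an edge node or over the NFS gateway)
hdfs dfs -put /local/staging/changes_2016-04-25.csv /user/hive/incremental_table/

# WebHDFS: the first call returns a redirect Location on a datanode,
# and the second call PUTs the file to that Location.
curl -i -X PUT "http://<namenode-host>:50070/webhdfs/v1/user/hive/incremental_table/changes_2016-04-25.csv?op=CREATE"
curl -i -X PUT -T /local/staging/changes_2016-04-25.csv "<redirect-location-from-first-response>"
```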
RDBMS Processing (Sqoop)

Sqoop is the JDBC-based utility for integrating with traditional databases. A Sqoop import allows for the movement of data into either HDFS (a delimited format can be defined as part of the import definition) or directly into a Hive table. The entire source table can be moved into HDFS or Hive using the "--table" parameter:

sqoop import --connect jdbc:teradata://{host name or ip address}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc --password dbc --table SOURCE_TBL \
  --target-dir /user/hive/incremental_table -m 1

After the initial import, subsequent imports can leverage Sqoop's native support for "Incremental Import" by using the "check-column", "incremental" and "last-value" parameters:

sqoop import --connect jdbc:teradata://{host name or ip address}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc --password dbc --table SOURCE_TBL \
  --target-dir /user/hive/incremental_table -m 1 \
  --check-column modified_date --incremental lastmodified --last-value {last_import_date}

Alternately, you can leverage the "--query" parameter and have a SQL select statement limit the import to new or changed records only:

sqoop import --connect jdbc:teradata://{host name or ip address}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc --password dbc \
  --target-dir /user/hive/incremental_table -m 1 \
  --query 'select * from SOURCE_TBL where modified_date > {last_import_date} AND $CONDITIONS'

Note: For the initial load, substitute "base_table" for "incremental_table" in the target directory. For all subsequent loads, use "incremental_table".
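As a variation on the incremental import above (not part of the original workflow), a Sqoop saved job can track the last imported value in its metastore so that {last_import_date} does not need to be managed by hand; the job name here is arbitrary:

```
# Create a reusable incremental-import job; Sqoop records the last value after each run.
sqoop job --create hive_incremental_import -- import \
  --connect jdbc:teradata://{host name or ip address}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc --password dbc --table SOURCE_TBL \
  --target-dir /user/hive/incremental_table -m 1 \
  --check-column modified_date --incremental lastmodified --last-value '1970-01-01 00:00:00'

# Run the saved job on each processing cycle.
sqoop job --exec hive_incremental_import
```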
Step 2: Reconcile

In order to support an on-going reconciliation between current records in Hive and new change records, two tables should be defined: base_table and incremental_table.

base_table

The example below shows DDL for the Hive table "base_table" that will include any delimited files located in HDFS under the '/user/hive/base_table' directory. This table will house the initial, complete record load from the source system. After the first processing run, it will house the on-going, most up-to-date set of records from the source system:

CREATE TABLE base_table (
    id string,
    field1 string,
    field2 string,
    field3 string,
    field4 string,
    field5 string,
    modified_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/base_table';

incremental_table

The example below shows an external Hive table "incremental_table" that will include any delimited files with incremental change records, located in HDFS under the '/user/hive/incremental_table' directory:

CREATE EXTERNAL TABLE incremental_table (
    id string,
    field1 string,
    field2 string,
    field3 string,
    field4 string,
    field5 string,
    modified_date string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/incremental_table';
reconcile_view

This view combines record sets from both the Base (base_table) and Change (incremental_table) tables and is reduced to only the most recent record for each unique "id". The view (reconcile_view) is defined as follows:

CREATE VIEW reconcile_view AS
SELECT t2.id, t2.field1, t2.field2, t2.field3, t2.field4, t2.field5, t2.modified_date
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date DESC) rn
    FROM (
        SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table
    ) t1
) t2
WHERE rn = 1;

Example: Consider the UNION of the base_table and incremental_table, where new updates exist for "id" values 1 and 2 (found as the last two records in the union) and the record for "id" 3 remains unchanged. The reconcile_view should show only one record for each unique "id", based on the latest "modified_date" field value. The result of "select * from reconcile_view" is therefore only three records, based on both unique "id" and "modified_date".
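A hypothetical illustration of this example (extra field columns omitted; the values are placeholders):

```
-- UNION of base_table and incremental_table (new updates for id 1 and 2 appear last):
 id | field1         | modified_date
  1 | original value | 2016-01-01
  2 | original value | 2016-01-01
  3 | original value | 2016-01-01
  1 | updated value  | 2016-01-15
  2 | updated value  | 2016-01-15

-- select * from reconcile_view returns one record per id, keeping the latest modified_date:
 id | field1         | modified_date
  1 | updated value  | 2016-01-15
  2 | updated value  | 2016-01-15
  3 | original value | 2016-01-01
```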
Step 3: Compact

The reconcile_view now contains the most up-to-date set of records and is synchronized with changes from the RDBMS source system. For BI Reporting and Analytical tools, a reporting_table can be generated from the reconcile_view. Before creating this table, any previous instance of it should be dropped, as in the example below.

reporting_table

DROP TABLE reporting_table;

CREATE TABLE reporting_table AS
SELECT * FROM reconcile_view;

Moving the reconciled data (reconcile_view) into a reporting table (reporting_table) reduces the amount of processing needed for reporting queries. Further, the data stored in the reporting table is static, unchanging until the next processing cycle, which provides consistency in reporting between processing cycles. In contrast, the reconciled view (reconcile_view) is dynamic and will change as soon as new files (holding change records) are added to or removed from the Change table (incremental_table) folder /user/hive/incremental_table.
Step 4: Purge

To prepare for the next series of incremental records from the source, replace the Base table (base_table) with only the most up-to-date records (reporting_table). Also, delete the previously imported Change record content (incremental_table) by deleting the files located in the external table location ('/user/hive/incremental_table').

From a Hive client:

DROP TABLE base_table;

CREATE TABLE base_table AS
SELECT * FROM reporting_table;

From an HDFS client:

hadoop fs -rm -r /user/hive/incremental_table/*
Final Thoughts

While there are several possible approaches to supporting incremental data feeds into Hive, this example has a few key advantages:

- By maintaining an External Table for updates only, the table contents can be refreshed by simply adding or deleting files in that folder.
- The four steps in the processing cycle (Ingest, Reconcile, Compact and Purge) can be coordinated in a single Oozie workflow. The Oozie workflow can be a scheduled event that corresponds to the data freshness SLA (i.e., daily, weekly, monthly, etc.). A rough command-level sketch of one cycle appears at the end of this post.
- In addition to supporting INSERT and UPDATE synchronization, DELETEs can be synchronized by adding either a DELETE_FLAG or DELETE_DATE field to the import source. This field can then be used as a filter in the Hive reporting table to hide deleted records. For example:

CREATE VIEW reconcile_view AS
SELECT t2.id, t2.field1, t2.field2, t2.field3, t2.field4, t2.field5, t2.modified_date, t2.delete_date
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date DESC) rn
    FROM (
        SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table
    ) t1
) t2
WHERE rn = 1 AND delete_date = "";
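As a rough, command-level illustration of one such cycle (the kind of sequence an Oozie workflow would coordinate), the script below chains the four steps; the script paths, bookkeeping file, and connection details are assumptions, not part of the original workflow:

```
#!/bin/bash
# Hypothetical driver for one processing cycle: Ingest -> Reconcile -> Compact -> Purge.
# In practice, each step would be an action in an Oozie workflow scheduled to the data-freshness SLA.

LAST_IMPORT_DATE=$(cat /var/run/incremental/last_import_date)   # assumed bookkeeping file

# Step 1: Ingest - pull only new/changed records into the external table folder.
sqoop import --connect jdbc:teradata://{host name or ip address}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc --password dbc --table SOURCE_TBL \
  --target-dir /user/hive/incremental_table -m 1 \
  --check-column modified_date --incremental lastmodified --last-value "$LAST_IMPORT_DATE"

# Steps 2-3: Reconcile and Compact - rebuild the reporting table from the reconcile view.
hive -e "DROP TABLE IF EXISTS reporting_table;
         CREATE TABLE reporting_table AS SELECT * FROM reconcile_view;"

# Step 4: Purge - refresh the base table and clear the incremental folder.
hive -e "DROP TABLE IF EXISTS base_table;
         CREATE TABLE base_table AS SELECT * FROM reporting_table;"
hadoop fs -rm -r /user/hive/incremental_table/*

# Record the cutoff for the next cycle.
date '+%Y-%m-%d %H:%M:%S' > /var/run/incremental/last_import_date
```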
04-13-2016 04:58 PM · 12 Kudos
Virtual Integration of Hadoop with External Systems

The challenge of merging data from disparate systems has been a leading driver behind investments in data warehousing systems, as well as in Hadoop. While data warehousing solutions are ready-built for RDBMS integration, Hadoop adds the benefits of economical, virtually unlimited scale, not to mention the variety of structured and non-structured formats that it can handle. Whether using a data warehouse or Hadoop or both, physical data movement and consolidation is the primary method of integration.

However, there will be data in external systems that needs to remain siloed for a variety of reasons (legal, political, economic, etc.). There may also be challenges with synchronizing rapidly changing data from a system of record to a consolidated platform (DW or Hadoop). This introduces the need for "data federation" or "data virtualization", where data is integrated without copying it between systems. As stated in the wiki, "A federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database." (https://en.wikipedia.org/wiki/Federated_database_system)

The following article outlines three patterns to address the challenge of data federation with Hadoop and considers the Pros and Cons associated with each option.
Pattern 1: Off-the-Shelf Data Virtualization Products

For a true, off-the-shelf option, several products have been specifically designed to address data federation challenges. Defining their domain as "data virtualization", these products provide an external metadata layer above multiple source systems. Within these tools, operators can use metadata to model virtual tables or views that combine data from a variety of sources (RDBMS, files, web services), including Hadoop-specific services (Hive, HBase, HDFS). After the combined view has been created, the results can be expressed as either a relational model or via web services.

The intelligence built into these products is focused on reducing the movement of data across the network and on intelligent query re-write. Predicate pushdown (i.e., the ability to limit data via filters) serves to both limit data movement and reduce stress on source systems. Query re-write takes a query against a single "virtual" table and breaks it up into multiple, optimal sub-queries in the language native to each integrated source system.

Consider this option if you require true separation of the semantic layer from the underlying source systems; this is especially useful when source systems change frequently or during system migrations.

Notable Products: Cisco Data Virtualization, Denodo, Informatica

Pros:
- Handles a variety of sources
- Built-in UI for designing and publishing federated views
- Automates query rewrite to source systems
- Can express federated views as a tabular schema or as web services
- 100% external, allowing metadata to live independent of data sources

Cons:
- Adds another system or layer to be managed within the enterprise
- Requires additional hardware
- Requires licenses
Pattern 2: Built-In Federation Features from a Key System

While not built specifically for federation, some data warehouse or analytics platforms offer pass-through integration with other systems. That is, without copying data between systems, metadata can be shared and data can be pulled "on demand" for reporting. Virtual tables from external sources appear side by side with native, locally stored data, so that the end user is abstracted from the back-end storage locations; instead, they work with the familiar interface of a key system within the enterprise. As with data virtualization products, predicate pushdown is also supported.

However, because these products are not specifically designed for data virtualization, the number of supported source systems can be limited, as can their variety (primarily RDBMS).

If your BI reporting is primarily focused on a key data warehousing or analytics application, consider Hadoop integration through the lens of that primary source system.

Notable Products: Teradata QueryGrid, SAP HANA Vora

Pros:
- Built-in UI for designing and publishing federated views
- Automates query rewrite to source systems
- Familiar interface for BI and metadata management
- Leverages existing tools and hardware

Cons:
- May require additional licenses
- Support for external sources may be limited
- Federation strategy is bound to a specific source
Pattern 3: Federation Features of HDP Platform Services

There are services within the Hortonworks Hadoop distribution (HDP) that can also be used to integrate external data. Unlike the option of integrating through the lens of a primary source (e.g., Teradata or SAP), you can leverage a storage-neutral service, like Spark, to integrate native Hadoop stores (Hive, HBase and HDFS) with external RDBMS.

Using Spark, internal or external data can be loaded as RDDs (resilient distributed datasets) and then exposed as SparkSQL tables through an embedded JDBC service for BI reporting; the ability to register a Spark context with a Thrift Server makes this possible. Further, for optimal performance, external tables can be "cached" into local memory, which reduces continuous network traffic and buffers source systems from direct query. While the in-memory data is technically a "copy" from an external source, it is not persisted to local disk and does not need to be managed as a permanent asset.
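A minimal SparkSQL sketch of this pattern (Spark 2.x syntax, run through the Spark Thrift Server or the spark-sql shell); the JDBC URL, credentials, and table names are placeholders, and hive_customers assumes a Hive-enabled Spark session:

```
-- Register an external RDBMS table as a virtual table in SparkSQL.
CREATE TEMPORARY VIEW external_orders
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:postgresql://rdbms-host:5432/sales',
  dbtable 'public.orders',
  user 'report_user',
  password '********'
);

-- Optionally pull the external table into Spark memory to buffer the source system.
CACHE TABLE external_orders;

-- Join the virtual external table with a native Hive table for BI reporting.
SELECT h.customer_id, h.region, SUM(o.order_total) AS total_sales
FROM hive_customers h
JOIN external_orders o ON o.customer_id = h.customer_id
GROUP BY h.customer_id, h.region;
```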
This approach is less "turn-key" than the previous options in that there is no easy-to-use or familiar UI available for data modeling. Instead, it involves some minor coding in Spark to attach to and serve up the content from remote and local sources. But it does have the benefit of running natively in Hadoop and leveraging the power of distributed computing. Also, Spark is a service engine that exists independent of storage, so you are not binding your federation approach to a single source.

Consider this option when you need to serve both Data Science and BI Analytics requirements and/or when Hadoop-based storage options (Hive, HBase, HDFS) are a primary source of data for reporting.

Notable Products: Spark, SparkSQL

Pros:
- Distributed, in-memory option for best performance
- Serves the Data Science team in Spark as well as external BI reporting tools
- Leverages existing tools and hardware
- No licenses

Cons:
- SQL support is less mature than data warehouse or data virtualization options
- Some programming required to stage external tables
Conclusions

Each approach to integrating Hadoop with external systems has merit and should be measured against requirements and long-term strategy. If you want a true, off-the-shelf data federation capability, there are products that are already built for you. If you are heavily invested in an existing data warehouse or analytics approach (e.g., Teradata, SAP), look to your preferred vendors for guidance. If your long-term strategy positions Hadoop as a "Data Lake" or central data management platform, look to leverage your investment and consider Spark. It's not just for Data Scientists anymore (although they love it).

SparkSQL federation article and demo links:
- https://community.hortonworks.com/content/kbentry/29928/using-spark-to-virtually-integrate-hadoop-with-ext.html
- https://community.hortonworks.com/content/repo/29883/sparksql-data-federation-demo.html

Further reading and associated links:
- Cisco Data Virtualization (http://www.cisco.com/c/en/us/services/enterprise-it-services/data-virtualization.html)
- Denodo (http://www.denodo.com/en)
- Teradata QueryGrid (http://www.teradata.com/products-and-services/query-grid/#tabbable=0&tab1=0&tab2=0)
- SAP Vora (http://www.cio.com/article/3043777/analytics/saps-hana-vora-bridges-divide-between-enterprise-and-hadoop-data.html)
- Spark Thrift Server and query from an external BI tool using JDBC (https://www.linkedin.com/pulse/query-internal-rdd-data-spark-streaming-from-outside-bo-yang)