03-27-2018
03:56 PM
8 Kudos
AD (After Druid)

In my opinion, Druid creates a new class of analytic service that I term the real-time EDW. In traditional EDWs, the process of getting from the EDW to an optimized OLAP cube could take hours depending on the size of the EDW; a cube taking six or more hours to build was not unusual for a lot of companies. The process also tended to be brittle and needed to be closely monitored.

In addition to real-time analytics, Druid also facilitates long-term analytics, which effectively provides a Lambda architecture for your data warehouse. Data streams into Druid and is held in memory for a configurable amount of time. While in memory, the data can be queried and visualized. After that period the data is passed to long-term (historical) storage as segments on HDFS. These segments can also be part of the same visualization as the real-time data.

As mentioned previously, all data in Druid contains a timestamp. The other data elements consist of the same properties as traditional EDWs: dimensions and measures. The timestamp simplifies aggregation, and Druid is completely denormalized into a single table. Remember, dimensions are descriptions or attributes, and measures are always additive numbers. Since this is always true, it is easy for Druid to infer which fields in the data are dimensional attributes and which are measures. For each timestamp interval Druid can, in real time, aggregate facts along all dimensional attributes. This makes Druid ideal for topN, timeseries, and group-by queries, with group-by being the least performant.

The challenges around Druid and other NoSQL-type technologies like MongoDB are in the visualization layer as well as the architectural and storage complexities. Druid stores JSON data, and JSON can be difficult to manage and visualize in standard tools such as Tableau or Power BI. This is where the integration between Druid and Hive becomes most useful. There is a three-part series describing the integration:

Druid and Hive Part 1: https://hortonworks.com/blog/apache-hive-druid-part-1-3/
Druid and Hive Part 2: https://hortonworks.com/blog/sub-second-analytics-hive-druid/
Druid and Hive Part 3: https://hortonworks.com/blog/connect-tableau-druid-hive/

The integration provides a single pane of glass over real-time pre-aggregated cubes, standard Hive tables, and historical OLAP data. More importantly, the data can be accessed through standard ODBC and JDBC visualization tools as well as managed and secured through Ambari, Ranger, and Atlas. Druid provides an out-of-the-box Lambda architecture for time-series data and, coupled with Hive, the flexibility and ease of access associated with standard RDBMSs.
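To make the Hive/Druid integration concrete, here is a minimal sketch of the pattern described in the blog series linked above. The table and column names (raw_sales, sale_time, store_id, product_id, sales_amount) are hypothetical, and the exact properties and prerequisites are covered in the linked posts; this only illustrates the storage-handler CTAS approach and the kind of aggregation Hive can push down to Druid.

-- Create a Druid-backed table from Hive; in this integration the first
-- column must be a timestamp named __time (hypothetical source table).
CREATE TABLE sales_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY")
AS
SELECT
  CAST(sale_time AS TIMESTAMP) AS `__time`,  -- required timestamp
  store_id,                                  -- dimensions (descriptive attributes)
  product_id,
  sales_amount                               -- measure (additive number)
FROM raw_sales;

-- A typical aggregation over the Druid-backed table.
SELECT store_id, SUM(sales_amount) AS total_sales
FROM sales_druid
WHERE `__time` >= '2018-01-01 00:00:00'
GROUP BY store_id;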
03-27-2018
01:39 PM
8 Kudos
Druid is an OLAP solution for streaming event data as well as OLAP for long-term storage. All Druid data requires a timestamp. Druid's storage architecture is based off the timestamp, similar to how HBase stores by key.

Following are some key benefits of Druid:
- Real-time EDW on event data (time-series)
- Long-term storage leveraging HDFS
- High availability
- Extremely performant querying over large data sets
- Aggregation and indexing
- High level of data compression
- Hive integration

Druid provides a specific solution for specific problems that could not be handled by any other technology. With that being said, there are instances where Druid may not be a good fit:
- Data without a timestamp
- No need for real-time streaming
- Normalized (transactional) data (no joins in Druid)
- Small data sets
- No need for aggregating measures
- Non-BI queries, such as Spark jobs or streaming lookups

Why Druid?

BD (Before Druid)

In traditional EDWs data is broken into dimension tables and fact tables. Dimensions describe an object. For example, a product dimension will have colors, sizes, names, and other descriptors of a product. Dimensions are always descriptors of something, whether it is a product, a store, or something that is part of every EDW: date. In addition to dimensions, EDWs have facts, or measures. Measures are always numbers that can be added. For example, the number 10 can be a measure but an average cannot. The reason is that you can add 10 to another number, but adding two averages does not make numerical sense.

The reason for dimensions and facts is two-fold. First, it was a means to denormalize the data and reduce joins; most EDWs are architected so that you will not need more than two joins to get any answer. Second, dimensions and facts map easily to business questions (see Agile Data Warehouse Design in the reference section). For example, take the following question: "How many of product x were purchased last month in store y?" We can dissect this sentence in the following way: product, month, and store are all dimensions, while the question "how many" is the fact, or measure. For that single question you can begin building your star schema (a SQL sketch of this kind of query appears at the end of this post).

Figure 1: Star Schema

The fact table will have a single row for each unique product sold in a particular store for a particular time frame. The difference between an EDW and OLAP is that an OLAP system will pre-aggregate this answer. Prior to the query you will run a process that anticipates this question and adds up all the sales totals for all the products for all time ranges. This is fundamentally why, in traditional EDW development, all possible questions needed to be fleshed out prior to building the schemas. The questions being asked define how the model is designed. This makes traditional EDW development extremely difficult, prone to errors, and expensive. Interviewing LOBs to find what questions they may ask the system or, more likely, looking at existing reports and trying to reproduce the data in an EDW design was only the first step.

Once the EDW was built you still had to work on what is called the "semantic layer". This is the point where you instruct the OLAP tool how to aggregate the data. Tools like SQL Server Analysis Services (SSAS) are complicated and require a deep understanding of OLAP concepts. They are based off the Kimball methodology and therefore, to some extent, require the schema to look as much like a star schema as possible.

Figure 2: SSAS

In these tools the first thing you needed to do was define hierarchies. The easiest hierarchy to define is date. Date always follows the pattern: year, month, day, hour, seconds. Other hierarchies include geography: country, state, county, city, zip code. Hierarchies are important in OLAP because they describe how the user will drill through the data and how the data will be aggregated at each level of the hierarchy. The semantic layer is also where you define what the analyst will actually see in their visualization tools. For example, exposing an EDW surrogate key would only confuse the analyst. In the Hadoop space the semantic layer is handled by vendors and software like Jethrodata, AtScale, Kyvos, and Kylin (open source).
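To make the dimensional-modeling discussion concrete, here is a minimal SQL sketch of the kind of star-schema query that answers "How many of product x were purchased last month in store y?" The table and column names (fact_sales, dim_product, dim_store, dim_date) are hypothetical and simply illustrate the fact/dimension join pattern.

-- One fact table joined to three dimension tables; the measure is summed,
-- the dimensions filter and describe the question being asked.
SELECT SUM(f.quantity_sold) AS units_sold
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_store   s ON f.store_key   = s.store_key
JOIN dim_date    d ON f.date_key    = d.date_key
WHERE p.product_name = 'product x'
  AND s.store_name   = 'store y'
  AND d.year = 2018 AND d.month = 2;  -- "last month"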
08-14-2017
02:43 PM
5 Kudos
Many organizations still ask the question, "Can I run BI (Business Intelligence) workloads on Hadoop?" These workloads range from short, low-latency ad-hoc queries to canned or operational reporting. The primary concerns center around user experience. Will a query take too long to return an answer? How quickly can I change my mind with a report and drill down into other dimensional attributes? For almost 20 years vendors have engineered highly customized solutions to solve these problems. Often these solutions require fine-tuned appliances that tightly integrate hardware and software in order to squeeze out every last drop of performance. The challenges with these solutions are mainly around cost and maintenance: they become cost-prohibitive at scale and require large teams to manage and operate. The ideal solution is one that affordably scales but retains the same performance advantages as your appliance. Your analysts should not see the difference between the costly appliance and the more affordable solution. Hadoop is that solution, and this article aims to dispel the myth that BI workloads cannot run on Hadoop by pointing to the solution components.

When I talk to customers, the first thing they say when asking about SQL workloads on Hadoop is that Hive is slow. This largely has to do with both competitor FUD and the history of Hive. Hive grew up as a batch SQL engine because the early use cases were only concerned with providing SQL access to MapReduce so that users would not need to know Java. Hive was seen as a way to increase the use of a cluster over a larger user base. It really wasn't until the Hortonworks Stinger initiative that a serious effort was made to make Hive into a faster query tool. The two main focuses of the Stinger effort were the file format (ORC) and moving away from MapReduce to Tez. To be clear, no one runs Hive on MapReduce anymore. If you are, you are doing it wrong. Also, if you are running Hive queries against CSV files or other non-optimized formats, you are also doing it wrong (a minimal example is sketched below). Here is a great primer to bookmark and make sure anyone working on Hive in your organization reads.
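As a rough illustration of the file-format point, the following is a minimal sketch of converting a text-backed Hive table to ORC and running on Tez. The table names sales_csv and sales_orc are hypothetical.

-- Make sure queries run on Tez rather than MapReduce.
SET hive.execution.engine=tez;

-- Rewrite a hypothetical CSV-backed table into ORC so Hive can take
-- advantage of columnar storage, compression, and predicate pushdown.
CREATE TABLE sales_orc STORED AS ORC AS
SELECT * FROM sales_csv;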
Tez certainly did not alleviate the confusion. Tez got Hive into the race but not across the finish line. Tez provided Hive with a more interactive querying experience over large sets of data, but what it did not provide is good query performance for the typical ad-hoc, drill-down querying we see in most BI reporting. Due to the manner in which Tez and YARN spin containers up and down, and how containers are allocated on a per-job basis, there were limiting performance factors as well as concurrency issues. Hortonworks created LLAP to solve these problems.

Many customers are confused by LLAP because they think it is a replacement for Hive. A better way to think about it is to look at Hive as the query tool (the tool allowing you to use the SQL language) and LLAP as the resource manager for your query execution. For the business user to use LLAP, nothing needs to change. You simply connect to the HiveServer2 instance (you can use ODBC, JDBC, or the Hive View) that has LLAP enabled and you are on your way. The primary design purpose for LLAP was to provide fast performance for ad-hoc querying over semi-large datasets (1TB-10TB) using standard BI tools such as Tableau, Excel, MicroStrategy, or Power BI. In addition to performance, because of the manner in which LLAP manages memory and utilizes Slider, LLAP also provides a high level of concurrency without the cost of container startups.
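As a rough sketch of what this looks like from the analyst's side, the BI tool simply points its JDBC or ODBC connection at the LLAP-enabled HiveServer2 and issues ordinary SQL. The host name and interactive port below are assumptions that depend on your cluster configuration, and the tables are hypothetical.

-- Example connection string (assumption, varies by cluster):
--   jdbc:hive2://llap-host.example.com:10500/default
-- A typical ad-hoc, drill-down style query that LLAP's persistent daemons
-- and in-memory caching are designed to accelerate:
SELECT s.store_name, d.month, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
JOIN dim_date  d ON f.date_key  = d.date_key
WHERE d.year = 2017
GROUP BY s.store_name, d.month
ORDER BY total_sales DESC;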
In summary, you can run ad-hoc queries today on HDP by using Hive with LLAP:

Geisinger Teradata offload: https://www.youtube.com/watch?v=UzgsczrdWbg
Comcast SQL benchmarks: https://www.youtube.com/watch?v=dS1Ke-_hJV0

Your company can now begin offloading workloads from your appliances and running those same queries on HDP. In the next articles I will address the other components for BI workloads: ANSI compliance and OLAP. For more information about Hive, feel free to check out the following book: https://github.com/Apress/practical-hive
07-06-2017
12:03 PM
@Frank Welsch the LLAP blog links back to your HCC article.
10-12-2016
04:43 PM
7 Kudos
Pre-requisites

- Power BI Desktop Edition: https://powerbi.microsoft.com/en-us/desktop/
- HDP installation, either a cluster or the sandbox; can be cloud or on-prem
- Hortonworks ODBC Driver for Apache Hive (v2.1.5). Scroll down to the Add-Ons section: http://hortonworks.com/downloads/#sandbox
- A data set. I used TPC-DS, which can be installed from here: https://github.com/hortonworks/hive-testbench

Note: This exercise assumes a non-Kerberized cluster.

Step 1 – Setup the Environment

How to set up the environment is up to you. You have lots of options. I ended up using Windows Server 2016 Evaluation Edition running on VirtualBox. For HDP I'm running a 4-node cluster in the cloud. Here are a few possible scenarios:

- Leverage existing infrastructure. You will need hardware for HDP as well as a Windows or Mac box to run Power BI Desktop.
- Run both instances in a virtual environment using VMware or VirtualBox and install the HDP 2.5 sandbox.
- Run everything in the cloud with either Azure IaaS, HDInsight, or AWS.
- Run Power BI on-prem and utilize a cloud instance of HDP (be careful of data transfer costs when loading the BI model).
Step 2 – Load the Data

Again, how you get data is up to you. I used TPC-DS data from the hive-testbench (see pre-reqs). I used a 5 GB dataset but this is configurable. Note: depending on the size of your cluster, generating the data set as well as loading data into Power BI can take a long time. Be patient.

Another option would be to use Apache Sqoop to load data from an existing RDBMS data source. Following are instructions for moving data from SQL Server AdventureWorks to Hive:

Setup SQL Server (Windows)
- Install SQL Server
- Create the SQL Server account hadoop\hadoop
- Disable shared memory

Configure Hadoop (Linux)
- Download the SQL Server JDBC driver to /usr/local:
  curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
- Copy the driver to the Sqoop library directory:
  cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/hdp/current/sqoop-server/lib
- List databases to check the connection:
  sqoop list-databases --connect jdbc:sqlserver://<IP Address>:<Port Number> --username <Username> --password <Password>
  sqoop list-databases --connect jdbc:sqlserver://192.168.56.101:1433 --username hadoop --password hadoop
- Import a table:
  sqoop import --connect "jdbc:sqlserver://<IP Address>;database=<Database Name>;username=<Username>;password=<Password>" --table <Table Name> --hive-import -- --schema <Schema Name>
  sqoop import --connect "jdbc:sqlserver://192.168.56.101:1433;database=AdventureWorksDW2014;username=hadoop;password=hadoop" --table DimCustomer --hive-import
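After the import completes, it can be worth a quick sanity check from Hive before moving on. This is only a sketch; the table name below assumes the default naming the import produced in your Hive database.

-- Confirm the Sqoop-imported table exists and contains rows.
SHOW TABLES;
SELECT COUNT(*) FROM dimcustomer;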
My TPC-DS data set looks like this. If you are familiar with data warehouse schemas, you will quickly notice the dataset is comprised of dimension tables and measure (fact) tables.
Step 3 – Connecting to Hive

HiveServer2 allows for connecting via standard ODBC. To do this you will need to download and install the ODBC driver on the system running Power BI Desktop. The documentation can be found here: http://hortonworks.com/wp-content/uploads/2016/08/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf

After the driver installs you will need to set up a System DSN. Open the 64-bit ODBC Data Source Administrator and click on the System DSN tab. Highlight the Sample Hortonworks Hive DSN and click Configure. The following shows the settings for my DSN. The primary areas you will want to consider are the IP address of the HDP cluster, the Authentication Mechanism (which is set to User Name), and finally the name of the user connecting to Hive – in this case it is the user hive. Test the connection to make sure you can connect. Once successful you can open up Power BI and configure the ODBC connection to Hive.

In Power BI click on Get Data in the ribbon and select More… As you can see, Power BI has a wealth of data connection options. For our purposes we are going to select ODBC. From the ODBC wizard you will need to select the ODBC configuration set up in the previous step. Once connected you are ready to load table data.
Step 4 – Loading and Visualizing Hive Data in Power BI

If your connection was successful in Step 3 you should now be able to drill down to your Hive schema using the Power BI navigator. You should see all the tables listed. It is here you can pick and choose which tables you want to view in your reports. I'd suggest only selecting a few, focusing on a single fact table with a few key dimensions. The load time, especially on a small environment, can be quite long. As the data loads you will get a loading screen as Power BI connects to Hive and executes queries to pull the data. Once the tables are loaded they will appear in the right-hand Fields column.

At this point you may be tempted to start visualizing the data, but first we need to let Power BI know the relationships between the tables and model our star schema. We do this by selecting the Manage Relationships icon in the ribbon. If you notice in the screenshot below, I have already set up relationships between the store_sales fact table and the various dimensions. To create additional relationships you can click the New button. In the example below I have selected the item table and highlighted the i_item_sk column, which is the column representing a unique item. I then selected the store_sales table and highlighted ss_item_sk as the foreign key. Clicking OK will set this relationship (the equivalent join is sketched below).
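For readers more comfortable in SQL, the relationship defined above corresponds to a simple surrogate-key join between the TPC-DS fact and dimension tables. This is only an illustrative sketch of what Power BI is modeling, not a query you need to run; column names follow the TPC-DS schema.

-- store_sales (fact) joined to item (dimension) on the surrogate key,
-- the same relationship configured in Manage Relationships.
SELECT i.i_item_id, SUM(ss.ss_quantity) AS total_quantity
FROM store_sales ss
JOIN item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY i.i_item_id;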
Clicking on the relationship button along the far left side will show you the relationships as a logical data model. Note the standard and familiar star schema pattern. Now that the data is loaded and modeled, you are ready to visualize. Power BI is extremely easy to use and navigate. Expanding the tables and clicking on columns will insert them into your report. Feel free to experiment with the data and find out what insights you unlock.
11-23-2015
02:12 AM
This demo will also work on VirtualBox but you will need to add port forwarding for port 9090. https://nsrc.org/workshops/2014/btnog/raw-attachment/wiki/Track2Agenda/ex-virtualbox-portforward-ssh.htm