Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 13380 | 02-20-2018 12:33 PM
 | 1514 | 02-19-2018 05:12 AM
 | 1864 | 12-28-2017 06:13 AM
 | 7149 | 09-28-2017 09:25 AM
 | 12190 | 09-25-2017 11:19 AM
09-17-2017
09:02 AM
Summary

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. The Data Lake, a next-generation data storage and management solution, was developed to meet the ever-evolving needs of increasingly savvy users. This article explores existing challenges with the enterprise data warehouse and other existing data management and analytic solutions. It describes the necessary features of the Data Lake architecture and the capabilities required to leverage a Data and Analytics as a Service (DAaaS) model. It also covers the characteristics of a successful Data Lake implementation and critical considerations for designing a Data Lake.

Current EDW Challenges

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. With the evolution of users' needs coupled with advances in data storage technologies, the inadequacies of current enterprise data warehousing solutions have become more apparent. The following challenges with today's data warehouses can impede usage and prevent users from maximizing their analytic capabilities:

Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process. When users need immediate access to data, even short processing delays can be frustrating and cause users to bypass the proper processes in favor of getting the data quickly themselves. Users also may waste valuable time and resources pulling the data from operational systems, storing and managing it themselves, and then analyzing it.

Flexibility. Users lack not only on-demand access to any data they may need at any time, but also the ability to use the tools of their choice to analyze the data and derive critical insights. Additionally, current data warehousing solutions often store only one type of data, while today's users need to be able to analyze and aggregate data across many different formats.

Quality. Users may view the current data warehouse with suspicion. If where the data originated and how it has been acted on are unclear, users may not trust the data. Also, if users worry that data in the warehouse is missing or inaccurate, they may circumvent the warehouse in favor of getting the data themselves directly from other internal or external sources, potentially leading to multiple, conflicting instances of the same data.

Findability. With many current data warehousing solutions, users have no facility to rapidly and easily search for and find the data they need when they need it. The inability to find data also limits users' ability to leverage and build on existing data analyses.

Advanced analytics users require a data storage solution based on an IT "push" model (not driven by specific analytics projects). Unlike existing solutions, which are specific to one or a small family of use cases, what is needed is a storage solution that enables multiple, varied use cases across the enterprise. This new solution needs to support multiple reporting tools in a self-serve capacity, allow rapid ingestion of new datasets without extensive modeling, and scale to large datasets while delivering performance. It should support advanced analytics, like machine learning and text analytics, and allow users to cleanse and process data iteratively and to track data lineage for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place. The solution that fits all of these criteria is the Data Lake.
The Data Lake Architecture

The Data Lake is a data-centered architecture featuring a repository capable of storing vast quantities of data in various formats. Data from webserver logs, databases, social media, and third-party sources is ingested into the Data Lake. Curation takes place through capturing metadata and lineage and making them available in the data catalog (Datapedia). Security policies, including entitlements, are also applied.

Data can flow into the Data Lake by either batch processing or real-time processing of streaming data. Additionally, the data itself is no longer restrained by initial schema decisions and can be exploited more freely by the enterprise.

Rising above this repository is a set of capabilities that allow IT to provide Data and Analytics as a Service (DAaaS) in a supply-demand model. IT takes the role of the data provider (supplier), while business users (data scientists, business analysts) are the consumers.

The DAaaS model enables users to self-serve their data and analytic needs. Users browse the lake's data catalog (the Datapedia) to find and select the available data and fill a metaphorical "shopping cart" (effectively an analytics sandbox) with data to work with. Once access is provisioned, users can use the analytics tools of their choice to develop models and gain insights. Subsequently, users can publish analytical models or push refined or transformed data back into the Data Lake to share with the larger community.

Although provisioning an analytics sandbox is a primary use, the Data Lake also has other applications. For example, the Data Lake can be used to ingest raw data, curate the data, and apply ETL. This data can then be loaded into an Enterprise Data Warehouse. To take advantage of the flexibility provided by the Data Lake, organizations need to customize and configure the Data Lake to their specific requirements and domains.
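The article does not prescribe a processing engine, but as a minimal sketch of the "not restrained by initial schema decisions" point, here is what schema-on-read ingestion looks like assuming Spark; the lake paths are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Batch path: raw JSON events land in the lake as-is, with no upfront modeling.
# The path is illustrative, not from the article.
raw = spark.read.json("/data/lake/raw/webserver_logs/")

# The schema is inferred at read time (schema-on-read), so new fields in the
# source show up automatically instead of requiring a model change first.
raw.printSchema()

# A curated, columnar copy can be derived later for downstream consumers.
raw.write.mode("append").parquet("/data/lake/curated/webserver_logs/")

The same repository can also be fed by Spark Structured Streaming for the real-time path; the point is that landing data precedes, rather than depends on, schema decisions.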
Characteristics of a Successful Data Lake Implementation

A Data Lake enables users to analyze the full variety and volume of data stored in the lake. This necessitates features and functionalities to secure and curate the data, and then to run analytics, visualization, and reporting on it. The characteristics of a successful Data Lake include:

Use of multiple tools and products. Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor. The cross-engine integration necessary for a successful Data Lake requires multiple technology stacks that natively support structured, semi-structured, and unstructured data types.

Domain specification. The Data Lake must be tailored to the specific industry. A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. The Data Lake requires a business-aware data-locating capability that enables business users to find, explore, understand, and trust the data. This search capability needs to provide an intuitive means of navigation, including keyword, faceted, and graphical search. Under the covers, such a capability requires sophisticated business ontologies, within which business terminology can be mapped to the physical data. The tools used should enable independence from IT so that business users can obtain the data they need when they need it and can analyze it as necessary, without IT intervention.

Automated metadata management. The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake. Attributes like data lineage, data quality, and usage history are vital to usability. Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility. Without a high degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.

Configurable ingestion workflows. In a thriving Data Lake, new sources of external information will be continually discovered by business users. These new sources need to be rapidly onboarded to avoid frustration and to realize immediate opportunities. A configuration-driven ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources (see the sketch after this list).

Integration with the existing environment. The Data Lake needs to meld into and support the existing enterprise data management paradigms, tools, and methods. It needs a supervisor that integrates and manages, when required, existing data management tools, such as data profiling, data mastering and cleansing, and data masking technologies.
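To make the configuration-driven ingestion idea concrete, here is a minimal Python sketch; the descriptor format, field names, and the shape of the captured lineage metadata are all assumptions for illustration, not part of the article:

import json
from datetime import datetime, timezone

# Hypothetical descriptor for a newly discovered source. Onboarding a new
# source means adding a descriptor like this, not writing new pipeline code.
SOURCE_CONFIG = {
    "name": "vendor_prices",
    "format": "csv",
    "path": "/data/lake/raw/vendor_prices/",
    "owner": "market-data-team",
}

def ingest(config):
    """Run one configuration-driven ingestion and record lineage metadata."""
    # ... read config["path"] in config["format"] and land it in the lake ...
    metadata = {
        "source": config["name"],
        "owner": config["owner"],
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "lineage": {"landed_from": config["path"]},
    }
    # Every run mandatorily emits a catalog (Datapedia) entry; this automated
    # metadata capture is what keeps the lake from becoming a Data Swamp.
    print(json.dumps(metadata, indent=2))
    return metadata

ingest(SOURCE_CONFIG)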
Keeping all of these elements in mind is critical to the design of a successful Data Lake.

Designing the Data Lake

Designing a successful Data Lake is an intensive endeavor, requiring a comprehensive understanding of the technical requirements and the business acumen to fully customize and integrate the architecture for the organization's specific needs. Knowledgent's Big Data Scientists and Engineers provide the expertise necessary to evolve the Data Lake into a successful Data and Analytics as a Service solution, including:

DAaaS Strategy Service Definition. Our Informationists define the catalog of services to be provided by the DAaaS platform, including data onboarding, data cleansing, data transformation, datapedias, analytic tool libraries, and others.

DAaaS Architecture. We help our clients achieve a target-state DAaaS architecture, including architecting the environment, selecting components, defining engineering processes, and designing user interfaces.

DAaaS PoC. We design and execute Proofs of Concept (PoCs) to demonstrate the viability of the DAaaS approach. Key capabilities of the DAaaS platform are built and demonstrated using leading-edge technologies and other selected tools.

DAaaS Operating Model Design and Rollout. We customize our DAaaS operating models to meet the individual client's processes, organizational structure, rules, and governance. This includes establishing DAaaS chargeback models, consumption tracking, and reporting mechanisms.

DAaaS Platform Capability Build-Out. We provide the expertise to conduct an iterative build-out of all platform capabilities, including design, development and integration, testing, data loading, metadata and catalog population, and rollout.

Conclusion
The Data Lake can be an effective data management solution for advanced analytics experts and business users alike. A Data Lake allows users to analyze a large variety and volume of data when and how they want. Following a Data and Analytics as a Service (DAaaS) model provides users with on-demand, self-serve data. However, to be successful, a Data Lake needs to leverage a multitude of products while being tailored to the industry and providing users with extensive, scalable customization. Knowledgent's Informationists provide the blend of technical expertise and business acumen to help organizations design and implement their perfect Data Lake.
09-13-2017
06:04 AM
There is currently no specific Architect certification or exam. The available certification exams are listed here: https://hortonworks.com/services/training/certification/
08-18-2017
01:54 PM
1 Kudo
@Saurab Dahal Yes, it's achievable, but a few tweaks are needed. The partitioned table should be created with an additional "month" field alongside sale_date, declared as the partition column. For example (table and column names are illustrative):

CREATE TABLE table_name (col1 STRING, col2 STRING, sale_date DATE)
PARTITIONED BY (month INT);

When inserting into the table, extract the month from sale_date and pass it as the last column of the SELECT (Hive fills dynamic partition columns from the trailing SELECT columns):

INSERT INTO TABLE table_name PARTITION (month)
SELECT col1, col2, sale_date, MONTH(sale_date) FROM source_table;

The above command should work. Make sure the properties below are enabled:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Hope it helps!
08-16-2017
12:56 PM
Thanks @Shawn Weeks
08-10-2017
06:56 AM
@Greg Keys Thanks Greg. I got the information I was looking for.
07-20-2017
07:36 AM
@Bala Vignesh N V Your problem statement can be interpreted in two ways. The first (and to me more logical) way is that a movie has multiple genres, and you want to count how many movies each genre has:

genres = movies.flatMap(lambda line: line.split(',')[2].split('|'))
genres.countByValue()

We map each line into multiple output items (genres); that is why we use flatMap. First we split each line by ',' and take the 3rd column, then we split the genres by '|' and emit them. This gives you:

'Adventure': 2,
'Animation': 1,
'Children': 2,
'Comedy': 4,
'Drama': 1,
'Fantasy': 2,
'Romance': 2

Your 'SQL' query (select genres, count(*)) suggests another approach: if you want to count the combinations of genres, for example movies that are Comedy AND Romance, you can simply use:

genre_combinations = movies.map(lambda line: line.split(',')[2])
genre_combinations.countByValue()

This gives you:

'Adventure|Animation|Children|Comedy|Fantasy': 1,
'Adventure|Children|Fantasy': 1,
'Comedy': 1,
'Comedy|Drama|Romance': 1,
'Comedy|Romance': 1
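If you prefer to stay close to the SQL formulation, the per-genre count can also be written with the DataFrame API. A minimal sketch, assuming the same three-column CSV layout with genres in the 3rd column and a local file path (both assumptions, not from the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# With no header or schema given, columns default to _c0, _c1, _c2, ...
movies_df = spark.read.csv("movies.csv")

# explode() turns the '|'-separated genre string into one row per genre,
# then groupBy/count mirrors "select genre, count(*) ... group by genre".
(movies_df
 .select(F.explode(F.split(F.col("_c2"), r"\|")).alias("genre"))
 .groupBy("genre")
 .count()
 .show())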
01-03-2018
05:39 PM
@Félicien Catherin The tutorial has a typo... you need to create the normal table first using the following syntax:

CREATE TABLE FIREWALL_LOGS( time STRING, ip STRING, country STRING, status INT )
CLUSTERED BY (time) INTO 25 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tmp/server-logs' TBLPROPERTIES("transactional"="true");

Once the above table is created you can convert it to ORC:

CREATE TABLE FIREWALL STORED AS ORC AS SELECT * FROM FIREWALL_LOGS;
06-12-2017
01:44 AM
1 Kudo
1) Edge nodes normally have the Hadoop client installed. Through this client, HDFS copies/moves data to the DataNodes, while the metadata is stored in the NameNode.

2) Does the HDFS client act as a staging/intermediate layer between the DataNode and NameNode? The client contacts the NameNode; the NameNode inserts the file name into the file system hierarchy and allocates a data block for it, then responds to the client with the identity of the DataNode and the destination data block.

3) "In turn the worker node doesn't have any role to play here. Is my understanding right?" No. The actual task is done by the worker nodes, as the job is assigned to them by the ResourceManager. Job workflow: HDFS client -> NameNode -> ResourceManager -> Worker/DataNode -> once all MR tasks are completed, the DataNodes hold the actual data and the metadata is stored in the NameNode.

4) Normally the edge node, master node, data nodes, and resource manager node are kept separate. Edge node: has the batch user ID, which is responsible for running the batch. Data node: contains the physical data of the Hadoop cluster. Name node: holds the metadata of the Hadoop cluster.

Hope this helps!
04-10-2018
02:27 PM
@Bala Vignesh N V Your worker node is the same as your data node. Worker nodes are the ones that actually do the work in the cluster.