Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 13125 | 02-20-2018 12:33 PM |
| | 1451 | 02-19-2018 05:12 AM |
| | 1807 | 12-28-2017 06:13 AM |
| | 7021 | 09-28-2017 09:25 AM |
| | 11963 | 09-25-2017 11:19 AM |
09-17-2017
09:02 AM
Summary

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. The Data Lake, a next-generation data storage and management solution, was developed to meet the ever-evolving needs of increasingly savvy users. This article explores existing challenges with the enterprise data warehouse and other existing data management and analytic solutions. It describes the necessary features of the Data Lake architecture and the capabilities required to leverage a Data and Analytics as a Service (DAaaS) model. It also covers the characteristics of a successful Data Lake implementation and critical considerations for designing a Data Lake.

Current EDW Challenges

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. With the evolution of users' needs, coupled with advances in data storage technologies, the inadequacies of current enterprise data warehousing solutions have become more apparent. The following challenges with today's data warehouses can impede usage and prevent users from maximizing their analytic capabilities:

Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process. When users need immediate access to data, even short processing delays can be frustrating and cause users to bypass the proper processes in favor of getting the data quickly themselves. Users also may waste valuable time and resources pulling the data from operational systems, storing and managing it themselves, and then analyzing it.

Flexibility. Users lack not only on-demand access to any data they may need at any time, but also the ability to use the tools of their choice to analyze the data and derive critical insights. Additionally, current data warehousing solutions often store only one type of data, while today's users need to be able to analyze and aggregate data across many different formats.

Quality. Users may view the current data warehouse with suspicion. If where the data originated and how it has been acted on are unclear, users may not trust the data. Also, if users worry that data in the warehouse is missing or inaccurate, they may circumvent the warehouse in favor of getting the data themselves directly from other internal or external sources, potentially leading to multiple, conflicting instances of the same data.

Findability. With many current data warehousing solutions, users have no function to rapidly and easily search for and find the data they need when they need it. Inability to find data also limits users' ability to leverage and build on existing data analyses.
Advanced analytics users require a data storage solution based on an IT "push" model (not driven by specific analytics projects). Unlike existing solutions, which are specific to one or a small family of use cases, what is needed is a storage solution that enables multiple, varied use cases across the enterprise. This new solution needs to support multiple reporting tools in a self-serve capacity, allow rapid ingestion of new datasets without extensive modeling, and scale to large datasets while delivering performance. It should support advanced analytics, like machine learning and text analytics, and allow users to cleanse and process data iteratively and to track data lineage for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place. The solution that fits all of these criteria is the Data Lake.
The Data Lake Architecture

The Data Lake is a data-centered architecture featuring a repository capable of storing vast quantities of data in various formats. Data from webserver logs, databases, social media, and third-party sources is ingested into the Data Lake. Curation takes place through capturing metadata and lineage and making them available in the data catalog (Datapedia). Security policies, including entitlements, are also applied.

Data can flow into the Data Lake by either batch processing or real-time processing of streaming data. Additionally, the data itself is no longer restrained by initial schema decisions and can be exploited more freely by the enterprise.

Rising above this repository is a set of capabilities that allow IT to provide Data and Analytics as a Service (DAaaS) in a supply-demand model. IT takes the role of the data provider (supplier), while business users (data scientists, business analysts) are consumers.

The DAaaS model enables users to self-serve their data and analytic needs. Users browse the lake's data catalog (a Datapedia) to find and select the available data and fill a metaphorical "shopping cart" (effectively an analytics sandbox) with data to work with. Once access is provisioned, users can use the analytics tools of their choice to develop models and gain insights. Subsequently, users can publish analytical models or push refined or transformed data back into the Data Lake to share with the larger community.

Although provisioning an analytics sandbox is a primary use, the Data Lake also has other applications. For example, it can be used to ingest raw data, curate the data, and apply ETL; this data can then be loaded into an Enterprise Data Warehouse. To take advantage of the flexibility provided by the Data Lake, organizations need to customize and configure it to their specific requirements and domains.

Characteristics of a Successful Data Lake Implementation

A Data Lake enables users to analyze the full variety and volume of data stored in the lake. This necessitates features and functionalities to secure and curate the data, and then to run analytics, visualization, and reporting on it. The characteristics of a successful Data Lake include:

Use of multiple tools and products. Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor. The cross-engine integration necessary for a successful Data Lake requires multiple technology stacks that natively support structured, semi-structured, and unstructured data types.

Domain specification. The Data Lake must be tailored to the specific industry. A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. The Data Lake requires a business-aware data-locating capability that enables business users to find, explore, understand, and trust the data. This search capability needs to provide an intuitive means of navigation, including keyword, faceted, and graphical search. Under the covers, such a capability requires sophisticated business ontologies, within which business terminology can be mapped to the physical data. The tools used should enable independence from IT so that business users can obtain the data they need when they need it and can analyze it as necessary, without IT intervention.

Automated metadata management. The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake. Attributes like data lineage, data quality, and usage history are vital to usability. Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility. Without a high degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.

Configurable ingestion workflows. In a thriving Data Lake, new sources of external information will be continually discovered by business users. These new sources need to be rapidly on-boarded to avoid frustration and to realize immediate opportunities. A configuration-driven ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources (see the sketch below).

Integration with the existing environment. The Data Lake needs to meld into and support the existing enterprise data management paradigms, tools, and methods. It needs a supervisor that integrates and manages, when required, existing data management tools, such as data profiling, data mastering and cleansing, and data masking technologies.

Keeping all of these elements in mind is critical for the design of a successful Data Lake.
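As an illustration of the "configurable ingestion workflows" and "automated metadata management" characteristics above, here is a minimal, hypothetical Python sketch of a configuration-driven ingestion step that records lineage metadata as it lands a new source. The source name, directory layout, and metadata fields are all assumptions for illustration, not part of any specific product:

import datetime
import json
import shutil
from pathlib import Path

# Hypothetical declarative config for one new source: on-boarding another
# source needs only another entry like this, not new code.
SOURCE_CONFIG = {
    "name": "clickstream_vendor_x",
    "landing_dir": "/data/lake/raw/clickstream_vendor_x",
    "format": "json",
    "owner": "marketing-analytics",
}

def ingest(config, source_file):
    """Land a file in the lake and capture basic lineage metadata."""
    landing_dir = Path(config["landing_dir"])
    landing_dir.mkdir(parents=True, exist_ok=True)
    target = landing_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # the actual "ingestion"

    # Automated metadata capture: record who/what/when alongside the data,
    # so the catalog (Datapedia) can answer lineage and quality questions.
    metadata = {
        "source": config["name"],
        "owner": config["owner"],
        "format": config["format"],
        "original_path": str(source_file),
        "landed_path": str(target),
        "ingested_at": datetime.datetime.utcnow().isoformat(),
    }
    with open(str(target) + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

In a real Data Lake the landing zone would typically be HDFS or object storage and the metadata would flow into a proper catalog service; the point of the sketch is the shape of the workflow — a declarative source config plus mandatory metadata capture on every landing.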
Designing the Data Lake

Designing a successful Data Lake is an intensive endeavor, requiring a comprehensive understanding of the technical requirements and the business acumen to fully customize and integrate the architecture for the organization's specific needs. Knowledgent's Big Data Scientists and Engineers provide the expertise necessary to evolve the Data Lake into a successful Data and Analytics as a Service solution, including:

DAaaS Strategy Service Definition. Our Informationists define the catalog of services to be provided by the DAaaS platform, including data onboarding, data cleansing, data transformation, datapedias, analytic tool libraries, and others.

DAaaS Architecture. We help our clients achieve a target-state DAaaS architecture, including architecting the environment, selecting components, defining engineering processes, and designing user interfaces.

DAaaS PoC. We design and execute Proofs of Concept (PoC) to demonstrate the viability of the DAaaS approach. Key capabilities of the DAaaS platform are built and demonstrated using leading-edge technologies and other selected tools.

DAaaS Operating Model Design and Rollout. We customize our DAaaS operating models to meet the individual client's processes, organizational structure, rules, and governance. This includes establishing DAaaS chargeback models, consumption tracking, and reporting mechanisms.

DAaaS Platform Capability Build-Out. We provide the expertise to conduct an iterative build-out of all platform capabilities, including design, development and integration, testing, data loading, metadata and catalog population, and rollout.

Conclusion

The Data Lake can be an effective data management solution for advanced analytics experts and business users alike. A Data Lake allows users to analyze a large variety and volume of data when and how they want. Following a Data and Analytics as a Service (DAaaS) model provides users with on-demand, self-serve data. However, to be successful, a Data Lake needs to leverage a multitude of products while being tailored to the industry and providing users with extensive, scalable customization. Knowledgent's Informationists provide the blend of technical expertise and business acumen to help organizations design and implement their perfect Data Lake.
09-13-2017
06:04 AM
There is no specific Architect certification or exam currently. The available certification exams are listed here: https://hortonworks.com/services/training/certification/
08-18-2017
01:54 PM
1 Kudo
@Saurab Dahal Yes, it's achievable, but a few tweaks have to be made. The table should be created with an additional field ("month") alongside sale_date, declared as the partition column. When inserting into the table, extract the month from sale_date and pass it to the insert statement; in a dynamic-partition insert, the partition column must come last in the SELECT:

INSERT INTO TABLE table_name PARTITION (month)
SELECT col1, col2, sale_date, MONTH(sale_date) FROM source_table;

The above command should work. Make sure dynamic partitioning is enabled first:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Hope it helps!
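If you are driving this from Spark rather than the Hive CLI, the same flow can be scripted end to end. A minimal PySpark sketch, assuming a Hive-enabled SparkSession and hypothetical column names and types (the original schema isn't shown):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-partition-demo")
         .enableHiveSupport()
         .getOrCreate())

# Dynamic partitioning must be enabled in this session too.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Hypothetical DDL: month is declared as the partition column,
# not as a regular column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS table_name (
        col1 STRING,
        col2 STRING,
        sale_date DATE
    )
    PARTITIONED BY (month INT)
""")

# The dynamic partition value (month) comes last in the SELECT.
spark.sql("""
    INSERT INTO TABLE table_name PARTITION (month)
    SELECT col1, col2, sale_date, MONTH(sale_date) FROM source_table
""")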
08-16-2017
12:56 PM
Thanks @Shawn Weeks
08-10-2017
06:56 AM
@Greg Keys Thanks Greg. I got the information I was looking for.
07-20-2017
07:36 AM
@Bala Vignesh N V Your problem statement can be interpreted in two ways. The first (and for me more logical) way is that a movie has multiple genres, and you want to count how many movies each genre has:

genres = movies.flatMap(lambda line: line.split(',')[2].split('|'))
genres.countByValue()

We map each line into multiple output items (genres), which is why we use flatMap. First, we split each line by ',' and take the 3rd column, then we split the genres by '|' and emit them. This gives you:

'Adventure': 2,
'Animation': 1,
'Children': 2,
'Comedy': 4,
'Drama': 1,
'Fantasy': 2,
'Romance': 2

Your 'SQL' query (select genres, count(*)) suggests another approach: if you want to count the combinations of genres, for example movies that are Comedy AND Romance, you can simply use:

genre_combinations = movies.map(lambda line: line.split(',')[2])
genre_combinations.countByValue()

This gives you:

'Adventure|Animation|Children|Comedy|Fantasy': 1,
'Adventure|Children|Fantasy': 1,
'Comedy': 1,
'Comedy|Drama|Romance': 1,
'Comedy|Romance': 1
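To make the first approach runnable end to end, here's a small self-contained sketch; the five sample rows are made up to match the counts above (in practice, movies would come from sc.textFile on your movies file):

from pyspark import SparkContext

sc = SparkContext("local", "genre-count")

# Sample data in movieId,title,genres format, chosen to
# reproduce the genre counts shown above.
movies = sc.parallelize([
    "1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy",
    "2,Jumanji,Adventure|Children|Fantasy",
    "3,Grumpier Old Men,Comedy|Romance",
    "4,Waiting to Exhale,Comedy|Drama|Romance",
    "5,Father of the Bride Part II,Comedy",
])

# Explode each row into one record per genre, then count occurrences.
genres = movies.flatMap(lambda line: line.split(',')[2].split('|'))
print(dict(genres.countByValue()))

Note this simple split(',') assumes titles contain no commas; real catalog files often quote such titles, in which case you'd parse with the csv module instead.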
01-03-2018
05:39 PM
@Félicien Catherin The tutorial has a typo... you need to create the normal table first using the following syntax:

CREATE TABLE FIREWALL_LOGS (time STRING, ip STRING, country STRING, status INT)
CLUSTERED BY (time) INTO 25 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tmp/server-logs' TBLPROPERTIES("transactional"="true");

Once the above table is created, you can convert it to ORC (note that STORED AS comes before AS SELECT):

CREATE TABLE FIREWALL STORED AS ORC AS SELECT * FROM FIREWALL_LOGS;
06-12-2017
01:44 AM
1 Kudo
1) An edge node normally has the Hadoop client installed. Using this, the HDFS client is responsible for copying/moving data to the DataNodes, while the metadata is stored in the NameNode.

2) Does the HDFS client act as a staging/intermediate layer for the DataNode and NameNode? The client contacts the NameNode; the NameNode inserts the file name into the file system hierarchy and allocates a data block for it, then responds to the client request with the identity of the DataNode and the destination data block.

3) "In turn the worker node doesn't have any role to play here. Is my understanding right?" No. The actual task will be done by the worker nodes, as the job is assigned by the ResourceManager. Job workflow: HDFS client -> NameNode -> ResourceManager -> worker/data nodes -> once all MR tasks complete, the DataNodes hold the actual data and the metadata is stored in the NameNode.

4) Normally the edge node, master node, data nodes, and ResourceManager node are kept separate. Edge node: has the batch user ID, which is responsible for running the batch. Data node: contains the physical data of the Hadoop cluster. Name node: holds the metadata of the Hadoop cluster.

Hope this is helpful!
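To make (1) and (2) concrete: from an edge node, "using the HDFS client" usually just means invoking the hdfs CLI. Behind the single command below, the client asks the NameNode for block allocation and then streams the blocks directly to the chosen DataNodes. A minimal sketch (the local and HDFS paths are made up for illustration):

import subprocess

# Copy a local file from the edge node into HDFS. The client contacts
# the NameNode for block placement, then writes straight to DataNodes.
result = subprocess.run(
    ["hdfs", "dfs", "-put", "/tmp/localfile.csv", "/user/example/localfile.csv"],
    capture_output=True, text=True,
)
print(result.returncode, result.stderr)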
04-10-2018
02:27 PM
@Bala Vignesh N V Your worker node is the same as your data node. Worker nodes are the ones that actually do the work in the cluster.