Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 11538 | 02-20-2018 12:33 PM
 | 1155 | 02-19-2018 05:12 AM
 | 1441 | 12-28-2017 06:13 AM
 | 6321 | 09-28-2017 09:25 AM
 | 10941 | 09-25-2017 11:19 AM
11-22-2019
10:56 PM
Hi @mqureshi , you have explained this beautifully. But how does the replication of blocks impact this calculation? Please explain. Regards.
09-25-2018
05:08 PM
I am interested too.
02-02-2018
03:58 PM
Let's approach your problems from the basics.
1. Spark depends on Hadoop's InputFormat, so all input formats that are valid in Hadoop are valid in Spark too.
2. Spark is a compute engine, so the rest of the ideas around compression and shuffle remain the same as in Hadoop.
3. Spark mostly works with the Parquet or ORC file formats, which are block-level compressed (generally gz-compressed in blocks), making the files splittable.
4. If a file is compressed, then depending on the codec (whether it supports splitting or not) Spark will spawn that many tasks. The logic is the same as in Hadoop.
5. Spark handles compression in the same way MR does.
6. Compressed data cannot be processed directly, so data is always decompressed for processing; for shuffling, data is compressed again to optimize network bandwidth usage.
Spark and MR are both compute engines. Compression is about packing data bytes closely so that data can be stored and transferred in an optimized way.
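Point 4 above can be checked quickly in spark-shell: a gzip-compressed text file is not splittable, so it yields a single partition (and a single task), while the uncompressed version splits on HDFS block boundaries. A minimal sketch, assuming a running spark-shell with sc available; the HDFS paths are hypothetical:

```scala
// Hypothetical paths, for illustration only.
// gzip is a stream codec with no split points, so Hadoop/Spark must hand
// the whole file to one task; plain text splits per HDFS block.
val plain   = sc.textFile("hdfs:///data/logs/events.txt")
val gzipped = sc.textFile("hdfs:///data/logs/events.txt.gz")

println(plain.getNumPartitions)   // roughly fileSize / blockSize partitions
println(gzipped.getNumPartitions) // 1, regardless of file size
```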
12-19-2017
11:12 PM
You should be able to use "show table extended ... partition" to see if you can get info on a partition, and skip any partition that is zero bytes. Like this:

scala> var sqlCmd="show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')"
sqlCmd: String = show table extended from mydb like 'mytable' partition (date_time_date='2017-01-01')

scala> var partitionsList=sqlContext.sql(sqlCmd).collectAsList
partitionsList: java.util.List[org.apache.spark.sql.Row] = [[mydb,mytable,false,
Partition Values: [date_time_date=2017-01-01]
Location: hdfs://mycluster/apps/hive/warehouse/mydb.db/mytable/date_time_date=2017-01-01
Serde Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties: [serialization.format=1]
Partition Parameters: {rawDataSize=441433136, numFiles=1, transient_lastDdlTime=1513597358, totalSize=4897483, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numRows=37825} ]]

Let me know if that works and you can avoid the 0-byte partitions that way, or if you still get a null pointer. James
11-13-2017
08:36 PM
@Bala Vignesh N V So you've renamed the partition in Hive and can see the new name there, but when you look on HDFS it still has the original partition name, correct? In my example in the previous post I originally had two partitions (part=a and part=b) in Hive, and I renamed part=a to part=z. On HDFS, part=a never changed, but the PART_NAME column in the metastore database was updated to part=z. In Hive, I can only see part=z and part=b, and if I do a SELECT for the data in part=z, it will look up the LOCATION column from the metastore database for part=z, which still points to the part=a directory on HDFS, and read the data for part=z from there. So for external tables, you can rename the partitions in Hive to whatever you like without affecting the underlying data on HDFS.
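For reference, the rename in that example is a one-line DDL statement; the table and partition names below are hypothetical, and this assumes a spark-shell with Hive support (it works the same from the Hive shell):

```scala
// Hypothetical external table. For external tables this only updates the
// metastore bookkeeping (PART_NAME); the part=a directory on HDFS, which
// LOCATION still points to, is left untouched.
sqlContext.sql("ALTER TABLE mydb.mytable PARTITION (part='a') RENAME TO PARTITION (part='z')")
sqlContext.sql("SHOW PARTITIONS mydb.mytable").show()
```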
11-30-2018
02:17 PM
Hello folks, please help me with the following query in Hive. There are two tables, T1 and T2. Find the sum of the prices: if a customer buys all the products, how much does he have to pay after discount?

Table: T1
ProductID | ProductName | Price
---|---|---
1 | p1 | 1000
2 | p2 | 2000
3 | p3 | 3000
4 | p4 | 4000
5 | p5 | 5000

Table: T2
ProductID | Discount %
---|---
1 | 10
2 | 15
3 | 10
4 | 15
5 | 20
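One way to sketch this (in spark-shell against Hive, assuming the discount column is named Discount) is to join the two tables on ProductID and sum the discounted prices. With the sample data above that is 1000*0.90 + 2000*0.85 + 3000*0.90 + 4000*0.85 + 5000*0.80 = 12700.

```scala
// Sketch only; assumes Hive tables T1(ProductID, ProductName, Price) and
// T2(ProductID, Discount) as in the question, in a spark-shell with Hive
// support. The LEFT JOIN plus COALESCE treats a product with no discount
// row as having a 0% discount.
val total = sqlContext.sql("""
  SELECT SUM(t1.Price * (1 - COALESCE(t2.Discount, 0) / 100.0)) AS total_payable
  FROM T1 t1
  LEFT JOIN T2 t2 ON t1.ProductID = t2.ProductID
""")
total.show()  // with the sample data above: 12700.0
```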
10-30-2017
12:13 PM
@Saurabh This sometimes happens because of a limit on the number of lines displayed in the CLI. Try redirecting the output to a file: hive -e "show create table sample_db.i0001_ivo_hdr;" > ddl.txt
09-17-2017
09:02 AM
Summary

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. The Data Lake, a next-generation data storage and management solution, was developed to meet the ever-evolving needs of increasingly savvy users. This article explores existing challenges with the enterprise data warehouse and other existing data management and analytic solutions. It describes the necessary features of the Data Lake architecture and the capabilities required to leverage a Data and Analytics as a Service (DAaaS) model. It also covers the characteristics of a successful Data Lake implementation and critical considerations for designing a Data Lake.

Current EDW Challenges

Business users are continuously envisioning new and innovative ways to use data for operational reporting and advanced analytics. With the evolution of users' needs coupled with advances in data storage technologies, the inadequacies of current enterprise data warehousing solutions have become more apparent. The following challenges with today's data warehouses can impede usage and prevent users from maximizing their analytic capabilities:

Timeliness. Introducing new content to the enterprise data warehouse can be a time-consuming and cumbersome process. When users need immediate access to data, even short processing delays can be frustrating and cause users to bypass the proper processes in favor of getting the data quickly themselves. Users also may waste valuable time and resources to pull the data from operational systems, store and manage it themselves, and then analyze it.

Flexibility. Users not only lack on-demand access to any data they may need at any time, but also the ability to use the tools of their choice to analyze the data and derive critical insights. Additionally, current data warehousing solutions often store one type of data, while today's users need to be able to analyze and aggregate data across many different formats.

Quality. Users may view the current data warehouse with suspicion. If where the data originated and how it has been acted on are unclear, users may not trust the data. Also, if users worry that the data in the data warehouse is missing or inaccurate, they may circumvent the warehouse in favor of getting the data themselves directly from other internal or external sources, potentially leading to multiple, conflicting instances of the same data.

Findability. With many current data warehousing solutions, users do not have a function to rapidly and easily search for and find the data they need when they need it. Inability to find data also limits the users' ability to leverage and build on existing data analyses.
Advanced analytics users require a data storage solution based on an IT "push" model (not driven by specific analytics projects). Unlike existing solutions, which are specific to one or a small family of use cases, what is needed is a storage solution that enables multiple, varied use cases across the enterprise.

This new solution needs to support multiple reporting tools in a self-serve capacity, to allow rapid ingestion of new datasets without extensive modeling, and to scale to large datasets while delivering performance. It should support advanced analytics, like machine learning and text analytics, and allow users to cleanse and process data iteratively and to track the lineage of data for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place.

The solution that fits all of these criteria is the Data Lake.
The Data Lake Architecture

The Data Lake is a data-centered architecture featuring a repository capable of storing vast quantities of data in various formats. Data from webserver logs, databases, social media, and third-party sources is ingested into the Data Lake. Curation takes place through capturing metadata and lineage and making it available in the data catalog (Datapedia). Security policies, including entitlements, are also applied.

Data can flow into the Data Lake by either batch processing or real-time processing of streaming data. Additionally, the data itself is no longer restrained by initial schema decisions and can be exploited more freely by the enterprise.

Rising above this repository is a set of capabilities that allow IT to provide Data and Analytics as a Service (DAaaS) in a supply-demand model. IT takes the role of the data provider (supplier), while business users (data scientists, business analysts) are the consumers.

The DAaaS model enables users to self-serve their data and analytic needs. Users browse the lake's data catalog (a Datapedia) to find and select the available data and fill a metaphorical "shopping cart" (effectively an analytics sandbox) with data to work with. Once access is provisioned, users can use the analytics tools of their choice to develop models and gain insights. Subsequently, users can publish analytical models or push refined or transformed data back into the Data Lake to share with the larger community.

Although provisioning an analytic sandbox is a primary use, the Data Lake also has other applications. For example, the Data Lake can also be used to ingest raw data, curate the data, and apply ETL. This data can then be loaded into an Enterprise Data Warehouse. To take advantage of the flexibility provided by the Data Lake, organizations need to customize and configure the Data Lake to their specific requirements and domains.

Characteristics of a Successful Data Lake Implementation

A Data Lake enables users to analyze the full variety and volume of data stored in the lake. This necessitates features and functionalities to secure and curate the data, and then to run analytics, visualization, and reporting on it. The characteristics of a successful Data Lake include:

Use of multiple tools and products. Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor. The cross-engine integration necessary for a successful Data Lake requires multiple technology stacks that natively support structured, semi-structured, and unstructured data types.

Domain specification. The Data Lake must be tailored to the specific industry. A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. The Data Lake requires a business-aware data-locating capability that enables business users to find, explore, understand, and trust the data. This search capability needs to provide an intuitive means for navigation, including keyword, faceted, and graphical search. Under the covers, such a capability requires sophisticated business ontologies, within which business terminology can be mapped to the physical data. The tools used should enable independence from IT so that business users can obtain the data they need when they need it and can analyze it as necessary, without IT intervention.

Automated metadata management. The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake. Attributes like data lineage, data quality, and usage history are vital to usability. Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility. Without a high degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.

Configurable ingestion workflows. In a thriving Data Lake, new sources of external information will be continually discovered by business users. These new sources need to be rapidly on-boarded to avoid frustration and to realize immediate opportunities. A configuration-driven ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources.

Integration with the existing environment. The Data Lake needs to meld into and support the existing enterprise data management paradigms, tools, and methods. It needs a supervisor that integrates and manages, when required, existing data management tools, such as data profiling, data mastering and cleansing, and data masking technologies.

Keeping all of these elements in mind is critical for the design of a successful Data Lake.

Designing the Data Lake

Designing a successful Data Lake is an intensive endeavor, requiring a comprehensive understanding of the technical requirements and the business acumen to fully customize and integrate the architecture for the organization's specific needs. Knowledgent's Big Data Scientists and Engineers provide the expertise necessary to evolve the Data Lake into a successful Data and Analytics as a Service solution, including:

DAaaS Strategy Service Definition. Our Informationists define the catalog of services to be provided by the DAaaS platform, including data onboarding, data cleansing, data transformation, datapedias, analytic tool libraries, and others.

DAaaS Architecture. We help our clients achieve a target-state DAaaS architecture, including architecting the environment, selecting components, defining engineering processes, and designing user interfaces.

DAaaS PoC. We design and execute Proofs of Concept (PoCs) to demonstrate the viability of the DAaaS approach. Key capabilities of the DAaaS platform are built and demonstrated using leading-edge databases and other selected tools.

DAaaS Operating Model Design and Rollout. We customize our DAaaS operating models to meet the individual client's processes, organizational structure, rules, and governance. This includes establishing DAaaS chargeback models, consumption tracking, and reporting mechanisms.

DAaaS Platform Capability Build-Out. We provide the expertise to conduct an iterative build-out of all platform capabilities, including design, development and integration, testing, data loading, metadata and catalog population, and rollout.

Conclusion

The Data Lake can be an effective data management solution for advanced analytics experts and business users alike. A Data Lake allows users to analyze a large variety and volume of data when and how they want. Following a Data and Analytics as a Service (DAaaS) model provides users with on-demand, self-serve data. However, to be successful, a Data Lake needs to leverage a multitude of products while being tailored to the industry and providing users with extensive, scalable customization. Knowledgent's Informationists provide the blend of technical expertise and business acumen to help organizations design and implement their perfect Data Lake.