Created on 05-09-2016 10:30 PM - edited 08-17-2019 12:31 PM
Mainframe offload is becoming a major topic in most organizations as they try to reduce costs and move to next-generation technologies. Organizations run critical business applications on mainframe systems that generate and process huge volumes of data and carry high maintenance costs. With Big Data gaining popularity and every industry trying to leverage open source technologies, organizations are now trying to move some or all of their applications to open source platforms. Since open source platforms like the Hadoop ecosystem have become more robust, more flexible, cheaper, and better performing than traditional systems, offloading legacy systems is becoming more popular. With this in mind, this article discusses one of these topics: how to offload data from legacy systems like mainframes into next-generation technologies like Hadoop.
A successful Hadoop journey typically starts with new analytic applications, which lead to a Data Lake. As more and more applications are created that derive value from new types of data from sensors and machines, server logs, clickstreams, and other sources, the Data Lake forms, with Hadoop acting as a shared service that delivers deep insight across a large, broad, diverse set of data at efficient scale, in a way that existing enterprise systems and tools can integrate with and complement the Data Lake journey.
As most of you know, Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. The benefits of Hadoop include computing power, flexibility, low cost, horizontal scalability, fault tolerance, and more.
There are multiple options used in the industry to move data from mainframe legacy systems to the Hadoop ecosystem. I will discuss the top three options that customers can leverage.
Customers have had the most success with Option 1.
Option 1: Syncsort + Hadoop (Hortonworks Data Platform)
Syncsort is a Hortonworks Certified Technology Partner and has over 40 years of experience helping organizations integrate big data…smarter. Syncsort is a good fit for customers who do not currently use Informatica in-house and who want to offload mainframe data into Hadoop.
Syncsort integrates with Hadoop and HDP directly through YARN, making it easier for users to write and maintain MapReduce jobs graphically. Additionally, through the YARN integration, the processing initiated by DMX-h within the HDP cluster will make better use of the resources and execute more efficiently.
Syncsort DMX-h was designed from the ground up for Hadoop - combining a long history of innovation with significant contributions Syncsort has made to improve Apache Hadoop. DMX-h enables people with a much broader range of skills — not just mainframe or MapReduce programmers — to create ETL tasks that execute within the MapReduce framework, replacing complex manual code with a powerful, easy-to-use graphical development environment. DMX-h makes it easier to read, translate, and distribute data with Hadoop.
Syncsort supports mainframe record formats including fixed, variable with block descriptor, and VSAM. Its easy-to-use graphical development environment makes development faster and requires very little coding. Its capabilities include importing COBOL copybooks and performing data transformations such as EBCDIC-to-ASCII conversion on the fly without coding, and it connects seamlessly to all Hadoop sources and targets. It enables developers to build once and reuse many times, and there is no need to install any software on the mainframe.
The diagram below provides a generic solution architecture showing how the technologies mentioned above can be used by any organization that wants to move data from mainframe legacy systems to Hadoop and take advantage of the Modern Data Architecture.
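To make the record formats mentioned above more concrete, here is a minimal Python sketch of what reading variable-length mainframe records involves. It is not how Syncsort DMX-h works internally; it only illustrates the kind of work such a tool does for you. It assumes the file was transferred with its 4-byte record descriptor words (RDWs) intact and block descriptors already stripped, and that the text is encoded in EBCDIC code page 037 (Python's cp037 codec); your system may use a different code page. The function names and the make_rdw_record helper are hypothetical and exist only for this illustration.

```python
import struct


def split_rdw_records(data: bytes):
    """Yield the payload of each variable-length record in `data`.

    Assumes each record starts with a 4-byte RDW: a 2-byte big-endian
    length (which includes the 4 RDW bytes) followed by 2 reserved bytes.
    """
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError(f"truncated RDW at offset {offset}")
        (length,) = struct.unpack_from(">H", data, offset)
        if length < 4 or offset + length > len(data):
            raise ValueError(f"corrupt or truncated record at offset {offset}")
        yield data[offset + 4 : offset + length]
        offset += length


def ebcdic_to_text(payload: bytes, codec: str = "cp037") -> str:
    """Decode an EBCDIC payload into a Python string."""
    return payload.decode(codec)


def make_rdw_record(text: str, codec: str = "cp037") -> bytes:
    """Hypothetical helper: build one RDW-prefixed EBCDIC record for testing."""
    payload = text.encode(codec)
    return struct.pack(">HH", len(payload) + 4, 0) + payload


# Two sample records standing in for a transferred mainframe file.
sample = make_rdw_record("HELLO") + make_rdw_record("MAINFRAME")
for payload in split_rdw_records(sample):
    print(ebcdic_to_text(payload))
```

Plain text decoding like this only works for character fields. Binary and packed-decimal fields must be decoded field by field according to the copybook, which is exactly the part a copybook-aware tool automates.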
Option 2: Informatica PowerExchange + Hadoop (Hortonworks Data Platform)
If the customer wants to offload mainframe data into Hadoop and already has Informatica in their environment for ETL processing, they can use Informatica PowerExchange to convert EBCDIC files to ASCII, which is easier to work with in the Hadoop environment. Informatica PowerExchange makes it easy to extract, convert (EBCDIC to ASCII or vice versa), and filter data, and to make it available without any program code to any target database in memory, as files on an FTP site, or directly in Hadoop. For the EBCDIC-to-ASCII conversion, Informatica PowerExchange uses the COBOL copybooks that describe the VSAM files directly.
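As a rough illustration of what a copybook-driven conversion has to do, the following Python sketch decodes one fixed-length EBCDIC record into a delimited ASCII line. It is not PowerExchange's implementation. The copybook, the field names, the LAYOUT table, and the cp037 code page are all assumptions made up for this example; a real tool derives the layout from the actual copybook.

```python
from decimal import Decimal

# Hypothetical copybook this layout is modelled on:
#   01 CUSTOMER-REC.
#      05 CUST-ID    PIC X(6).
#      05 CUST-NAME  PIC X(10).
#      05 BALANCE    PIC S9(5)V99 COMP-3.
# Each entry: (field name, length in bytes, kind, implied decimal places)
LAYOUT = [
    ("cust_id", 6, "text", 0),
    ("cust_name", 10, "text", 0),
    ("balance", 4, "comp3", 2),
]
RECORD_LENGTH = sum(length for _, length, _, _ in LAYOUT)


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a COMP-3 (packed decimal) field: two digits per byte, sign in the last nibble."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign_nibble = nibbles.pop()
    if any(d > 9 for d in nibbles):
        raise ValueError("invalid packed-decimal digit")
    value = int("".join(str(d) for d in nibbles))
    if sign_nibble in (0x0B, 0x0D):  # 0xB and 0xD mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)


def decode_record(record: bytes, codec: str = "cp037") -> dict:
    """Decode one fixed-length record according to LAYOUT."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} bytes, got {len(record)}")
    row, pos = {}, 0
    for name, length, kind, scale in LAYOUT:
        chunk = record[pos : pos + length]
        if kind == "comp3":
            row[name] = unpack_comp3(chunk, scale)
        else:
            row[name] = chunk.decode(codec).rstrip()
        pos += length
    return row


def to_delimited(row: dict, sep: str = "|") -> str:
    """Render a decoded row as one delimited text line."""
    return sep.join(str(row[name]) for name, _, _, _ in LAYOUT)


# One sample record: two EBCDIC text fields plus a packed balance of -12345.67.
sample_record = (
    "C00042".encode("cp037")
    + "JANE DOE  ".encode("cp037")
    + bytes.fromhex("1234567D")
)
print(to_delimited(decode_record(sample_record)))
```

The packed-decimal field is the reason a byte-for-byte EBCDIC-to-ASCII translation of the whole file is not enough: only the character fields can be translated directly, while numeric fields need the copybook to be interpreted correctly.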
The generated ASCII files can then be transferred via FTP to any SAN, NFS, or local file system, where the Hadoop ecosystem can use them.
There are different ways to move the data into Hadoop: either the traditional scripting route, using tools such as Sqoop, Flume, Java code, or Pig, or Hortonworks DataFlow (Apache NiFi), which makes data ingestion into Hadoop (Hortonworks Data Platform) easier.
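For the scripting route, one of the simplest approaches is to copy the converted files from the landing file system into HDFS with the standard hdfs dfs -put command. The Python sketch below wraps that command; it assumes the hdfs client is installed and configured on the machine where the script runs, and the function names and paths are hypothetical examples only.

```python
import subprocess


def build_put_command(local_path: str, hdfs_dir: str, overwrite: bool = True) -> list:
    """Build the `hdfs dfs -put` command line for copying a local file into HDFS."""
    cmd = ["hdfs", "dfs", "-put"]
    if overwrite:
        cmd.append("-f")  # -f overwrites the destination if it already exists
    cmd += [local_path, hdfs_dir]
    return cmd


def upload(local_path: str, hdfs_dir: str) -> None:
    """Run the copy; raises CalledProcessError if the hdfs command fails."""
    subprocess.run(build_put_command(local_path, hdfs_dir), check=True)


# Example paths are placeholders; upload() is not called here because it
# needs a working Hadoop client.
print(" ".join(build_put_command("/data/landing/customers.txt", "/landing/mainframe")))
```

A script like this works well for occasional batch loads. For continuous or many-source ingestion, a flow-based tool such as Apache NiFi removes the need to maintain this kind of glue code.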
Now that the data is in the Hadoop environment, we can use the capabilities of Hortonworks Data Platform to perform cleansing, aggregation, transformation, machine learning, search, visualization, and so on.
Option 3: Informatica BDE (Big Data Edition) + Hadoop (Hortonworks Data Platform)
With Big Data being the buzzword of the moment, every organization would like to leverage its capabilities, and companies are now trying to integrate their current software into the Big Data ecosystem so that they can do more than pull data from or push data to Hadoop. Informatica has come up with a Big Data Edition that provides an extensive library of prebuilt transformation capabilities on Hadoop, including data type conversions and string manipulations, high-performance cache-enabled lookups, joiners, sorters, routers, aggregations, and many more. Customers can rapidly develop data flows on Hadoop using a codeless graphical development environment that increases productivity and promotes reuse. This also leverages the distributed computing, fault tolerance, and parallelism that Hadoop brings to the table by default.
Customers who already use Informatica for ETL are moving towards Informatica BDE, which runs on top of Hadoop distributions such as Hortonworks Data Platform. The good part is that customers do not need to build up resources with new skills; they can leverage their existing ETL resources and build the ETL process graphically in Informatica BDE. Under the covers, these processes are converted into Hive queries, which are then pushed to the underlying Hadoop cluster and executed there, making use of its distributed computing and parallelism so that the processes run faster and more efficiently.
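To give a feel for what "converted into Hive queries" means, here is a toy Python sketch that turns a small declarative mapping (source, filter, aggregation, target) into a HiveQL statement. This is purely an illustration of the idea and is not Informatica's generator or its mapping format; the function name, its parameters, and the table and column names are all hypothetical.

```python
def mapping_to_hiveql(source, target, group_by, aggregates, where=None):
    """Translate a simple filter-and-aggregate mapping into one HiveQL statement.

    aggregates maps an output column alias to a (function, input column) pair.
    """
    select_items = list(group_by) + [
        f"{func}({col}) AS {alias}" for alias, (func, col) in aggregates.items()
    ]
    sql = (
        f"INSERT OVERWRITE TABLE {target}\n"
        f"SELECT {', '.join(select_items)}\n"
        f"FROM {source}"
    )
    if where:
        sql += f"\nWHERE {where}"
    if group_by:
        sql += f"\nGROUP BY {', '.join(group_by)}"
    return sql


# Hypothetical mapping: total positive balance per region.
query = mapping_to_hiveql(
    source="staging.customers",
    target="marts.balance_by_region",
    group_by=["region"],
    aggregates={"total_balance": ("SUM", "balance")},
    where="balance > 0",
)
print(query)
```

The point of the sketch is that the developer describes the flow once, in a form they already know, and the generated query is what actually runs on the cluster.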