05-09-2016
10:30 PM
Mainframe offload has become a major topic of discussion in many organizations as they try to reduce costs and move to next-generation technologies. Organizations run critical business applications on mainframe systems that generate and process huge volumes of data, and these systems carry very high maintenance costs. As Big Data adoption grows and every industry tries to leverage open-source technologies, organizations are now looking to move some or all of their applications to open source. Since open-source platforms like the Hadoop ecosystem have become more robust, flexible, cost-effective, and better performing than traditional systems, offloading legacy systems is an increasingly popular trend. With that in mind, this article discusses how to offload data from legacy systems such as mainframes into next-generation technologies like Hadoop.

A successful Hadoop journey typically starts with new analytic applications, which lead to a Data Lake. As more and more applications are created that derive value from new types of data (from sensors/machines, server logs, clickstreams, and other sources), the Data Lake forms, with Hadoop acting as a shared service for delivering deep insight across a large, broad, diverse set of data at efficient scale, in a way that existing enterprise systems and tools can integrate with and complement.

As most of you know, Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks or jobs. The benefits of Hadoop include its computing power, flexibility, low cost, horizontal scalability, fault tolerance, and much more.
There are multiple options used in the industry to move data from mainframe legacy systems into the Hadoop ecosystem. I will discuss the top three methods/options that customers can leverage. Customers have had the most success with Option 1.

Option 1: Syncsort + Hadoop (Hortonworks Data Platform)

Syncsort is a Hortonworks Certified Technology Partner with over 40 years of experience helping organizations integrate big data smarter. Syncsort is a good fit for customers who do not currently use Informatica in-house and who are trying to offload mainframe data into Hadoop. Syncsort integrates with Hadoop and HDP directly through YARN, making it easier for users to write and maintain MapReduce jobs graphically. Additionally, through the YARN integration, processing initiated by DMX-h within the HDP cluster makes better use of cluster resources and executes more efficiently. Syncsort DMX-h was designed from the ground up for Hadoop, combining a long history of innovation with significant contributions Syncsort has made to improve Apache Hadoop. DMX-h enables people with a much broader range of skills, not just mainframe or MapReduce programmers, to create ETL tasks that execute within the MapReduce framework, replacing complex manual code with a powerful, easy-to-use graphical development environment. DMX-h makes it easier to read, translate, and distribute data with Hadoop. Syncsort supports mainframe record formats including fixed, variable with block descriptor, and VSAM. Its easy-to-use graphical development environment makes development easier and faster with very little coding. Its capabilities include importing COBOL copybooks and performing data transformations such as EBCDIC-to-ASCII conversion on the fly without coding, and it connects seamlessly to all Hadoop sources and targets.
It enables developers to build once and reuse many times, and there is no need to install any software on the mainframe systems. For more details please check the links below:
Syncsort DMX-h - http://www.syncsort.com/en/Products/BigData/DMXh
Hortonworks Data Platform (Powered by Apache Hadoop) - http://hortonworks.com/products/hdp/
The diagram below provides a generic solution architecture showing how the above technologies can be used by any organization looking to move data from mainframe legacy systems to Hadoop and take advantage of the Modern Data Architecture.
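Tools like DMX-h perform this translation graphically and handle far more than character sets, but to make the core EBCDIC-to-ASCII idea concrete, here is a minimal Python sketch using the standard cp037 (US EBCDIC) codec. The record content is invented for illustration; this is not how Syncsort itself is implemented:

```python
# Sketch: convert an EBCDIC (cp037) encoded text record to a readable string.
# Real offload tools also handle packed decimals, binary fields, copybooks,
# etc.; this only illustrates the basic character-set translation involved.

def ebcdic_to_ascii(record: bytes) -> str:
    """Decode a cp037 (US EBCDIC) byte record into a Python string."""
    return record.decode("cp037")

# Simulate a mainframe record: 'HELLO MAINFRAME' encoded in EBCDIC.
ebcdic_record = "HELLO MAINFRAME".encode("cp037")
print(ebcdic_to_ascii(ebcdic_record))  # -> HELLO MAINFRAME
```

Note that the EBCDIC byte values differ completely from ASCII (for example, 'A' is 0xC1 in cp037), which is why mainframe files look like garbage in Hadoop until this conversion is applied.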
Option 2: Informatica PowerCenter + Apache NiFi + Hadoop (Hortonworks Data Platform)

If the customer is trying to offload mainframe data into Hadoop and already has Informatica in their environment for ETL processing, then they can leverage Informatica PowerExchange to convert EBCDIC files to ASCII, which is more readable for the Hadoop environment. Informatica PowerExchange makes it easy for data to be extracted, converted (EBCDIC to ASCII or vice versa), filtered, and made available to any target database in memory without any program code, or written as files to any FTP site or directly into Hadoop for use. Informatica PowerExchange uses the VSAM (COBOL copybook) definitions directly for the conversion from EBCDIC to ASCII. The generated ASCII files can then be transferred to any SAN, NFS, or local file system, where they can be picked up by the Hadoop ecosystem. There are different ways to move the data into Hadoop: either through traditional tooling such as Sqoop, Flume, Java, or Pig scripts, or by using Hortonworks DataFlow (Apache NiFi), which makes data ingestion into Hadoop (Hortonworks Data Platform) much easier. Once the data is in the Hadoop environment, we can use the capabilities of the Hortonworks Data Platform to perform cleansing, aggregation, transformation, machine learning, search, visualization, and so on.

For more details please check the links below:
Informatica PowerExchange - https://www.informatica.com/products/data-integration/connectors-powerexchange.html#fbid=5M-ngamL9Yh
Hortonworks DataFlow (Powered by Apache NiFi) - http://hortonworks.com/products/hdf/
Hortonworks Data Platform (Powered by Apache Hadoop) - http://hortonworks.com/products/hdp/
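PowerExchange drives the conversion from the COBOL copybook itself. As a rough illustration of what copybook-driven parsing involves, here is a hedged Python sketch that slices one fixed-width EBCDIC record according to a hypothetical copybook layout; the field names, widths, and sample data are all invented for the example:

```python
# Sketch: parse a fixed-width EBCDIC record per a hypothetical copybook:
#   01 CUSTOMER-REC.
#      05 CUST-ID    PIC X(6).
#      05 CUST-NAME  PIC X(20).
#      05 CUST-CITY  PIC X(15).
# Real tools (PowerExchange, DMX-h) also handle COMP-3 packed decimals,
# REDEFINES, OCCURS, etc., which this sketch does not attempt.

LAYOUT = [("cust_id", 6), ("cust_name", 20), ("cust_city", 15)]

def parse_record(record: bytes) -> dict:
    """Slice a fixed-width cp037 record into named, space-trimmed fields."""
    fields, offset = {}, 0
    for name, width in LAYOUT:
        raw = record[offset:offset + width]
        fields[name] = raw.decode("cp037").rstrip()
        offset += width
    return fields

# Build a sample record the way a mainframe stores it: fixed width,
# space padded, EBCDIC encoded.
sample = ("000042" + "JOHN DOE".ljust(20) + "ATLANTA".ljust(15)).encode("cp037")
print(parse_record(sample))
```

Each parsed record could then be written out as a delimited ASCII line ready for ingestion into HDFS via NiFi or `hdfs dfs -put`.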
Option 3: Informatica BDE (Big Data Edition) + Hadoop (Hortonworks Data Platform)

With Big Data being the buzzword of the moment, every organization would like to leverage its capabilities, so vendors are now integrating their existing software into the Big Data ecosystem so that they can not only pull and push data to Hadoop, but also process it there. Informatica has come up with a Big Data Edition that provides an extensive library of prebuilt transformation capabilities on Hadoop, including data type conversions and string manipulations, high-performance cache-enabled lookups, joiners, sorters, routers, aggregations, and many more. Customers can rapidly develop data flows on Hadoop using a codeless graphical development environment that increases productivity and promotes reuse. This also leverages the distributed computing, fault tolerance, and parallelism that Hadoop brings to the table by default. Customers who are already using Informatica for ETL are moving towards Informatica BDE, which runs on top of Hadoop distributions like the Hortonworks Data Platform. The good part is that customers do not need to build up resources with new skills; they can leverage their existing ETL staff to build ETL processes graphically in Informatica BDE. Under the covers, these mappings are converted into Hive queries, which are pushed down to the underlying Hadoop cluster and executed there, taking advantage of its distributed computing and parallelism so that processes run more efficiently and faster.

For more details please check the links below:
Informatica Big Data Edition - https://www.informatica.com/products/big-data/big-data-edition.html#fbid=5M-ngamL9Yh
Hortonworks Data Platform (Powered by Apache Hadoop) - http://hortonworks.com/products/hdp/
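BDE's pushdown translation is proprietary, but the general idea of turning a graphical mapping into a Hive query can be sketched. The mapping below (source table, filter, aggregation) and the generated SQL are entirely hypothetical and only illustrate the shape of the translation, not Informatica's actual output:

```python
# Sketch: translate a tiny, hypothetical "mapping" (source -> filter ->
# aggregate) into the kind of HiveQL a pushdown engine might generate.
# Real BDE mappings are far richer (lookups, routers, sorters, etc.).

def mapping_to_hiveql(source: str, filter_expr: str,
                      group_by: str, agg_expr: str) -> str:
    """Render a simple filter + aggregate mapping as a HiveQL string."""
    return (
        f"SELECT {group_by}, {agg_expr} "
        f"FROM {source} "
        f"WHERE {filter_expr} "
        f"GROUP BY {group_by}"
    )

query = mapping_to_hiveql(
    source="transactions",
    filter_expr="txn_date >= '2016-01-01'",
    group_by="account_id",
    agg_expr="SUM(amount) AS total_amount",
)
print(query)
```

The generated query is what actually runs on the cluster, which is why the mapping inherits Hadoop's parallelism for free: Hive compiles it into distributed jobs.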
04-02-2016
12:22 AM
Step 1 : Log into AWS with your credentials.

Step 2 : From the AWS console, create a user for the demo: Security & Identity --> Identity and Access Management --> Users --> Create New Users

Step 3 : Make note of the credentials:
awsAccessKeyId = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
awsSecretAccessKey = 'yyyyyyyyyyyyyyyyyyyyyyyyyyy';

Step 4 : Add the user to the Admin group by clicking the "User Actions" button, selecting "Add Users to Group", and selecting your user (admin).

Step 5 : Assign the AdministratorAccess policy to the user (admin).

Step 6 : In the AWS console, go to S3 and create a bucket "s3hdptest", picking your region.

Step 7 : Upload the file manually using the upload button. In our example we are uploading the file S3HDPTEST.csv.

Step 8 : In the Hadoop environment, create a user with the same name as the one created in AWS.

Step 9 : In Ambari, add the properties below to both hdfs-site.xml and hive-site.xml, filling in the values you noted in Step 3:
<property>
  <name>fs.s3a.access.key</name>
  <value>xxxxxxxxxxxxxxxxxxxxxxxxxxxxx</value>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>yyyyyyyyyyyyyyyyyyyyyyyyyyy</value>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>
Step 10 : Restart the Hadoop services (HDFS, Hive, and any dependent services).

Step 11 : Ensure NTP is set properly so the local clock matches the AWS timestamps; follow the steps in the link below:
http://www.emind.co/how-to/how-to-fix-amazon-s3-requesttimetooskewed

Step 12 : Run the statements below from the command line to test whether we are able to view the file in S3:
[root@sandbox ~]# su admin
bash-4.1$ hdfs dfs -ls s3a://s3hdptest/S3HDPTEST.csv
-rw-rw-rw- 1 188 2016-03-29 22:12 s3a://s3hdptest/S3HDPTEST.csv
bash-4.1$
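Step 11 matters because S3 rejects requests whose timestamp deviates too far from AWS's clock (the RequestTimeTooSkewed error; the documented limit is 15 minutes). A small Python sketch of that check, with a made-up server 'Date' header for illustration:

```python
# Sketch: why Step 11 (NTP) matters. S3 rejects requests whose timestamp
# differs from AWS's clock by more than 15 minutes (RequestTimeTooSkewed).
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime

MAX_SKEW = timedelta(minutes=15)  # S3's documented tolerance

def clock_ok(local_utc: datetime, server_date_header: str) -> bool:
    """True if the local clock is within S3's allowed skew of the server."""
    server_utc = parsedate_to_datetime(server_date_header).replace(tzinfo=None)
    return abs(local_utc - server_utc) <= MAX_SKEW

# Example: a hypothetical server 'Date' header vs. two local clocks.
header = "Tue, 29 Mar 2016 22:12:00 GMT"
print(clock_ok(datetime(2016, 3, 29, 22, 20), header))  # 8 min off  -> True
print(clock_ok(datetime(2016, 3, 29, 23, 0), header))   # 48 min off -> False
```

Keeping NTP running on every node avoids this class of failure entirely.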
Step 13 : To verify the data you can use the command below:
bash-4.1$ hdfs dfs -cat s3a://s3hdptest/S3HDPTEST.csv

Step 14 : Copy a file from S3 to HDFS:
bash-4.1$ hadoop fs -cp s3a://s3hdptest/S3HDPTEST.csv /user/admin/S3HDPTEST.csv

Step 15 : Copy a file from HDFS to S3:
bash-4.1$ hadoop fs -cp /user/admin/S3HDPTEST.csv s3a://s3hdptest/S3HDPTEST_1.csv

Step 15a : Verify that the file has been stored in the AWS S3 bucket.

Step 16 : To access the data from S3 using Hive, connect to Hive from Ambari using the Hive View or the Hive CLI.

A) Create an external table for the data file in S3:
hive> CREATE EXTERNAL TABLE mydata
(FirstName STRING, LastName STRING, StreetAddress STRING, City STRING, State STRING,ZipCode INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://s3hdptest/';
B) Select the file data from Hive:
hive> SELECT * FROM mydata;

Step 17 : To access the data from S3 using Pig:
[root@sandbox ~]# pig -x tez
grunt> a = load 's3a://s3hdptest/S3HDPTEST.csv' using PigStorage();
grunt> dump a;
Step 18 : To store the data to S3 using Pig:
grunt> store a into 's3a://s3hdptest/OUTPUT' using PigStorage();

Check the created data file in the AWS S3 bucket.

Note: For the article on accessing an AWS S3 bucket using Spark, please refer to the link below:
https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html