Member since 11-02-2015
10 Posts
19 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 957 | 05-17-2016 06:01 PM |
05-17-2016
06:01 PM
1 Kudo
Investigating the error further in the HAWQ logs showed that the HAWQ Master was trying to copy folders and files into /data/hawq/master. The init process expects this folder to be empty, but it was already populated with files and folders, so the Master init failed. I backed up the contents of /data/hawq/master and emptied the folder. When I retried the install through Ambari 2.2.2, it went through: the HAWQ services installed successfully, and both the HAWQ and PXF services were up and running.
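The backup-and-empty step described above can be sketched in Python as follows. This is only an illustration of the idea, not part of the HAWQ or Ambari tooling: the function names and the choice of backup location are assumptions, and the script should be run as a user with write access to the folder while the HAWQ Master is stopped.

```python
import shutil
import tempfile
from pathlib import Path


def backup_and_empty(directory: str, backup_root: str) -> Path:
    """Move every entry of `directory` into a fresh backup folder under `backup_root`.

    Returns the path of the backup folder. `directory` itself is kept, but left empty.
    """
    src = Path(directory)
    # mkdtemp gives a uniquely named folder, so repeated runs never overwrite a backup.
    backup = Path(tempfile.mkdtemp(prefix=src.name + "-backup-", dir=backup_root))
    for entry in list(src.iterdir()):
        shutil.move(str(entry), str(backup / entry.name))
    return backup


def is_empty(directory: str) -> bool:
    """True when `directory` contains no files or sub-folders."""
    return not any(Path(directory).iterdir())


# Hypothetical usage for the case in this post (backup location is an assumption):
#   backup = backup_and_empty("/data/hawq/master", "/data/hawq")
#   assert is_empty("/data/hawq/master")
```

Moving rather than deleting keeps the original content available in case the install has to be rolled back.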
05-17-2016
05:56 PM
Tried installing HAWQ in the HDP 2.4 Sandbox environment. Upgraded Ambari to 2.2.2 and then followed the instructions at the link below: http://hdb.docs.pivotal.io/20/install/install-ambari.html Since port 5432, the default HAWQ port, was already in use in the Sandbox, I checked for an available port and assigned 5433 as the HAWQ Master Port. When I try to install, I get the error below: the HAWQ Master Start step fails and the service cannot be initialized.
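The port check mentioned above can be done in several ways; a minimal Python sketch is shown here. The helper names are hypothetical, and the check only reports whether a TCP port can be bound at the moment it runs, so another process could still claim the port afterwards.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": some service owns the port.
            return False
    return True


def first_free_port(start: int, end: int, host: str = "0.0.0.0"):
    """Return the first bindable port in [start, end], or None if all are taken."""
    for port in range(start, end + 1):
        if port_is_free(port, host):
            return port
    return None


# Hypothetical usage for the case in this post:
#   if not port_is_free(5432):
#       candidate = first_free_port(5433, 5440)
```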
Labels: Hortonworks Data Platform (HDP)
05-09-2016
10:30 PM
4 Kudos
Mainframe offload is becoming a major topic in most organizations as they try to reduce cost and move to next-generation technologies. Organizations run critical business applications on mainframe systems that generate and process huge volumes of data and carry a high maintenance cost. With Big Data becoming more popular and every industry trying to leverage the capabilities of open source technologies, organizations are now trying to move some or all of their applications to open source. Since open source platforms such as the Hadoop ecosystem have become more robust, more flexible, cheaper and better performing than the traditional systems, offloading legacy systems is becoming more popular. With this in mind, this article discusses one topic: how to offload data from legacy systems such as mainframes into next-generation technologies such as Hadoop.

The successful Hadoop journey typically starts with new analytic applications, which lead to a Data Lake. As more and more applications are created that derive value from new types of data from sensors/machines, server logs, clickstreams and other sources, the Data Lake forms, with Hadoop acting as a shared service for delivering deep insight across a large, broad, diverse set of data at efficient scale, in a way that existing enterprise systems and tools can integrate with and complement the Data Lake journey.

As most of you know, Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. The benefits of Hadoop include computing power, flexibility, low cost, horizontal scalability, fault tolerance and more.

There are multiple options used in the industry to move data from mainframe legacy systems to the Hadoop ecosystem. I will discuss the top 3 options that customers can leverage. Customers have had the most success with Option 1.

Option 1: Syncsort + Hadoop (Hortonworks Data Platform)

Syncsort is a Hortonworks Certified Technology Partner and has over 40 years of experience helping organizations integrate big data…smarter. Syncsort is a good fit for customers who do not currently use Informatica in-house and who are trying to offload mainframe data into Hadoop. Syncsort integrates with Hadoop and HDP directly through YARN, making it easier for users to write and maintain MapReduce jobs graphically. Additionally, through the YARN integration, the processing initiated by DMX-h within the HDP cluster makes better use of the resources and executes more efficiently.

Syncsort DMX-h was designed from the ground up for Hadoop, combining a long history of innovation with significant contributions Syncsort has made to improve Apache Hadoop. DMX-h enables people with a much broader range of skills (not just mainframe or MapReduce programmers) to create ETL tasks that execute within the MapReduce framework, replacing complex manual code with a powerful, easy-to-use graphical development environment. DMX-h makes it easier to read, translate, and distribute data with Hadoop.

Syncsort supports mainframe record formats including fixed, variable with block descriptor, and VSAM. Its graphical development environment makes development easier and faster, with very little coding. Its capabilities include importing COBOL copybooks and performing data transformations such as EBCDIC to ASCII conversion on the fly without coding, and it connects seamlessly to all Hadoop sources and targets. It enables developers to build once and reuse many times, and there is no need to install any software on the mainframe systems.

For more details please check the links below:
Syncsort DMX-h - http://www.syncsort.com/en/Products/BigData/DMXh
Hortonworks Data Platform (Powered by Apache Hadoop) - http://hortonworks.com/products/hdp/

The diagram below provides the generic solution architecture showing how the above-mentioned technologies can be used by any organization that wants to move data from mainframe legacy systems to Hadoop and take advantage of the Modern Data Architecture.
Option 2: Informatica PowerCenter + Apache NiFi + Hadoop (Hortonworks Data Platform)

If the customer is trying to do the mainframe offload into Hadoop and also has Informatica in their environment for ETL processing, then the customer can leverage Informatica PowerExchange to convert the EBCDIC file format to ASCII, which is more readable for the Hadoop environment. Informatica PowerExchange makes it easy for data to be extracted, converted (EBCDIC to ASCII or vice versa), filtered, and made available to any target database in memory without any program code, or as files on any FTP site, or directly in Hadoop for use there. Informatica PowerExchange uses the VSAM definitions (COBOL copybooks) directly for the conversion from EBCDIC to ASCII format.

The generated ASCII files can then be FTPed to any SAN, NFS or local file system, where they can be leveraged by the Hadoop ecosystem.

There are different ways to move the data into Hadoop: either the traditional scripting route using Sqoop, Flume, Java, Pig etc., or Hortonworks Data Flow (Apache NiFi), which makes data ingestion into Hadoop (Hortonworks Data Platform) easier.

Now that we have the data in the Hadoop environment, we can use the capabilities of Hortonworks Data Platform to perform cleansing, aggregation, transformation, machine learning, search, visualization etc.

For more details please check the links below:
Informatica PowerExchange - https://www.informatica.com/products/data-integration/connectors-powerexchange.html#fbid=5M-ngamL9Yh
Hortonworks Data Flow (Powered by Apache NiFi) - http://hortonworks.com/products/hdf/
Hortonworks Data Platform (Powered by Apache Hadoop) - http://hortonworks.com/products/hdp/
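To make the EBCDIC to ASCII conversion step more concrete, here is a small Python sketch of what such a conversion does for fixed-length, text-only records. It is not how PowerExchange or DMX-h work internally. The record layout, the function name and the choice of code page cp037 (one common EBCDIC variant) are assumptions for illustration; real copybooks often contain packed-decimal (COMP-3) or binary fields, which must be decoded field by field rather than as text.

```python
def decode_fixed_records(data: bytes, record_length: int, field_widths, codepage: str = "cp037"):
    """Split EBCDIC bytes into fixed-length records and decode each text field.

    field_widths lists the byte width of each field, in order, as a copybook
    would describe them. Trailing bytes that do not fill a record are ignored.
    """
    if record_length <= 0 or sum(field_widths) > record_length:
        raise ValueError("field widths must fit inside a positive record length")
    records = []
    for start in range(0, len(data) - record_length + 1, record_length):
        raw = data[start:start + record_length]
        fields, offset = [], 0
        for width in field_widths:
            # Decode the EBCDIC bytes to a Python string and drop the space padding.
            fields.append(raw[offset:offset + width].decode(codepage).rstrip())
            offset += width
        records.append(fields)
    return records


# Hypothetical two-field layout: 10-byte first name, 10-byte last name.
sample = ("JOHN".ljust(10) + "DOE".ljust(10)).encode("cp037")
rows = decode_fixed_records(sample, record_length=20, field_widths=[10, 10])
csv_lines = [",".join(row) for row in rows]  # delimited text ready to land in Hadoop
```

The decoded rows can then be written out as delimited text and loaded with any of the ingestion routes mentioned above.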
Option 3: Informatica BDE (Big Data Edition) + Hadoop (Hortonworks Data Platform)

With Big Data being the buzzword of the moment, every organization would like to leverage its capabilities, and companies are now trying to integrate their current software into the Big Data ecosystem so that they can do more than pull data from or push data to Hadoop. Informatica has come up with a Big Data Edition that provides an extensive library of prebuilt transformation capabilities on Hadoop, including data type conversions and string manipulations, high-performance cache-enabled lookups, joiners, sorters, routers, aggregations, and many more. Customers can rapidly develop data flows on Hadoop using a codeless graphical development environment that increases productivity and promotes reuse. This also leverages the distributed computing, fault tolerance and parallelism that Hadoop brings to the table by default.

Customers who already use Informatica for ETL are moving towards Informatica BDE, which runs on top of Hadoop distributions such as Hortonworks Data Platform. The good part is that customers do not need to build up resources with new skills: they can leverage their existing ETL resources and build the ETL process graphically using Informatica BDE. Under the covers, these processes are converted into Hive queries, and the generated code is moved to the underlying Hadoop cluster and executed there, making use of Hadoop's distributed computing and parallelism so the process runs faster and more efficiently.

For more details please check the links below:
Informatica Big Data Edition - https://www.informatica.com/products/big-data/big-data-edition.html#fbid=5M-ngamL9Yh
Hortonworks Data Platform (Powered by Apache Hadoop) - http://hortonworks.com/products/hdp/
05-09-2016
05:53 PM
Awesome. Thanks Abdelkrim for the great info, this helps.
05-06-2016
07:17 PM
Working with a customer who currently uses node labels and YARN queues extensively, allocating resources and granting users access to their respective queues for the applications they work on.

For example: let us say we have 2 users (User1 and User2). User1 works on applications related to Hive and User2 works on applications related to Spark. 10 nodes are labelled HiveNodeLabel and another 10 nodes are labelled SparkNodeLabel. The customer then assigned the Hive and Spark node labels to the corresponding HiveQueue and SparkQueue so that each queue can leverage its optimized nodes for processing. User1 is granted access to run Hive applications through HiveQueue, which uses the resources of the nodes labelled HiveNodeLabel; similarly, User2 uses SparkQueue with SparkNodeLabel.

The question is what happens if User1 needs to run both Hive and Spark applications and has been assigned to both queues. How will the cluster know which application the user is currently running (Hive or Spark), and how will it decide whether to run it on the Spark queue and node labels or the Hive queue and node labels? Is there anything like a project or application type we can use to determine which one to use?

Has anyone worked on such a scenario? It would be great if anyone could shed some light and suggest a couple of options for handling this.
Labels: Apache YARN
04-11-2016
12:56 PM
Are there any preferred automated code deployment tools for HDP? Where can I find details on best practices for code deployment in Hadoop?
04-04-2016
03:53 PM
Thanks Ben for the valuable information. You are right, we also suggested using queues and allocating a percentage of the resources within the same cluster, but they are looking into having two separate clusters. So the only suggestion for them would be to use HDF (Apache NiFi) to seamlessly replicate data into the second cluster as it is loaded into the first cluster, which would give them the same data in both clusters at the same time.
04-04-2016
02:49 PM
1 Kudo
One of our customers is currently running a cluster and would like to create a separate cluster for research purposes. They looked into the feasibility of monitoring 2 different clusters from one Ambari Server, and we mentioned that we currently do not support multi-cluster operations through Ambari. So now they are looking into the feasibility of sharing the data nodes so that both clusters can use the same data without moving it across the clusters or replicating it.
04-04-2016
02:02 PM
1 Kudo
Is it possible to share the same datanode between 2 different clusters, each monitored by its own Ambari Server?
Labels: Apache Ambari
04-02-2016
12:22 AM
12 Kudos
Step 1 : Log into AWS with your credentials
Step 2 : From the AWS console, go to the following options and create a user for the demo in AWS
Security & Identity --> Identity and Access Management --> Users --> Create New Users
Step 3 : Make note of the credentials
awsAccessKeyId = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx';
awsSecretAccessKey = 'yyyyyyyyyyyyyyyyyyyyyyyyyyy';
Step 4 : Add the user to the Admin group by clicking the button “User Actions”, selecting the option Add Users to Group, and selecting your user (admin)
Step 5 : Assign the Administration Access Policy to the user (admin)
Step 6 : In the AWS Console, go to S3, create a bucket “s3hdptest” and pick your region
Step 7 : Upload the file manually using the upload button. In our example we are uploading the file S3HDPTEST.csv
Step 8 : In the Hadoop environment, create a user with the same name as the one created in AWS
Step 9 : In Ambari, add the properties below to both hdfs-site.xml and hive-site.xml, setting each value to the credentials noted in Step 3
<property>
<name>fs.s3a.access.key</name>
<value>xxxxxxxxxxxxxxxxxxxxxxxxxxxxx</value>
<description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>yyyyyyyyyyyyyyyyyyyyyyyyyyy</value>
<description>AWS secret key. Omit for Role-based authentication.</description>
</property>
Step 10 : Restart the Hadoop services such as HDFS and Hive, and any dependent services
Step 11 : Ensure NTP is set up properly so that the system clock matches the AWS timestamp; follow the steps at the link below
http://www.emind.co/how-to/how-to-fix-amazon-s3-requesttimetooskewed
Step 12 : Run the statement below from the command line to test whether we are able to view the file from S3
[root@sandbox ~]# su admin
bash-4.1$ hdfs dfs -ls s3a://s3hdptest/S3HDPTEST.csv
-rw-rw-rw- 1 188 2016-03-29 22:12 s3a://s3hdptest/S3HDPTEST.csv
bash-4.1$
Step 13 : To verify the data you can use the command below
bash-4.1$ hdfs dfs -cat s3a://s3hdptest/S3HDPTEST.csv
Step 14 : Copy a file from S3 to HDFS
bash-4.1$ hadoop fs -cp s3a://s3hdptest/S3HDPTEST.csv /user/admin/S3HDPTEST.csv
Step 15 : Copy a file from HDFS to S3
bash-4.1$ hadoop fs -cp /user/admin/S3HDPTEST.csv s3a://s3hdptest/S3HDPTEST_1.csv
Step 15a : Verify that the file has been stored in the AWS S3 bucket
Step 16 : To access the data in S3 using Hive, connect to Hive from Ambari using the Hive View or the Hive CLI
A) Create a table for the data file in S3
hive> CREATE EXTERNAL TABLE mydata
(FirstName STRING, LastName STRING, StreetAddress STRING, City STRING, State STRING, ZipCode INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://s3hdptest/';
B) Select the file data from Hive
hive> SELECT * FROM mydata;
Step 17 : To access the data in S3 using Pig
[root@sandbox ~]# pig -x tez
grunt> a = load 's3a://s3hdptest/S3HDPTEST.csv' using PigStorage();
grunt> dump a;
Step 18 : To store the data to S3 using Pig
grunt> store a into 's3a://s3hdptest/OUTPUT' using PigStorage();
Check the created data file in the AWS S3 bucket.
Note: For the article on accessing an AWS S3 bucket using Spark, please refer to the link below:
https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html
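As a quick sanity check for Step 11, the clock skew between the Hadoop host and AWS can be estimated before running any of the s3a commands. The Python sketch below is only one possible way to do this and is not part of the original procedure: the function names and the endpoint URL are assumptions, and the 15-minute limit is the window after which S3 rejects requests with RequestTimeTooSkewed.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# S3 rejects requests whose timestamp differs from its own clock by more than 15 minutes.
MAX_SKEW_SECONDS = 15 * 60


def clock_skew_seconds(server_date_header: str, local_now=None) -> float:
    """Local time minus server time, in seconds, given an HTTP Date header value."""
    server_time = parsedate_to_datetime(server_date_header)
    if local_now is None:
        local_now = datetime.now(timezone.utc)
    return (local_now - server_time).total_seconds()


def skew_is_acceptable(skew_seconds: float, limit: float = MAX_SKEW_SECONDS) -> bool:
    """True when the absolute skew is within the allowed window."""
    return abs(skew_seconds) <= limit


def fetch_date_header(url: str = "https://s3.amazonaws.com") -> str:
    """Read the Date header from an HTTP response (the endpoint URL is an assumption)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.headers["Date"]
    except urllib.error.HTTPError as err:
        # Error responses still carry the server's Date header.
        return err.headers["Date"]


# Hypothetical usage on the sandbox host:
#   skew = clock_skew_seconds(fetch_date_header())
#   print("clock skew OK" if skew_is_acceptable(skew) else "fix NTP before using s3a://")
```

If the skew is outside the window, fix NTP as described in Step 11 before retrying the hdfs, Hive or Pig commands.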