Member since
05-14-2016
23
Posts
8
Kudos Received
0
Solutions
05-21-2021
12:25 AM
Hi @Rajuambala, as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.
05-31-2018
12:11 PM
1 Kudo
About Apache Spark: Apache Spark is an open-source, distributed, general-purpose cluster computing framework with an in-memory data processing engine. We can do ETL analysis, machine learning, and graph processing on large volumes of data at rest as well as data in motion, with rich, concise, high-level APIs in programming languages such as Python, Java, Scala, R and SQL. The Spark execution engine can process batch data and real-time streaming data from different data sources via Spark Streaming. Spark supports an interactive shell for data exploration and ad hoc analysis. We can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application.

spark-shell - Scala interactive shell
pyspark - Python interactive shell
sparkR - R interactive shell

To run applications distributed across a cluster, Spark requires a cluster manager. A Spark environment can be deployed and managed through the following cluster managers:

Spark Standalone
Hadoop YARN
Apache Mesos
Kubernetes

Spark Standalone is the simplest way to deploy Spark on a private cluster. Here, Spark application processes are managed by the Spark Master and Worker nodes. When Spark applications run under the YARN cluster manager, their processes are managed by the YARN ResourceManager and NodeManagers. YARN controls resource management, scheduling and security of submitted applications. Besides Spark on YARN, a Spark cluster can be managed by Mesos (a general cluster manager that can also run Hadoop MapReduce and service applications) and Kubernetes (an open-source system for automating deployment, scaling, and management of containerized applications). The Kubernetes scheduler is currently experimental.

How does Spark work in cluster mode? Spark orchestrates its operations through the driver program. A driver program is the process that runs the main() function of the application and creates the SparkContext.
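As a quick illustration, each interactive shell accepts a --master flag that selects the cluster manager; the host names below are placeholders, not real endpoints:

```shell
# Scala shell against YARN (HADOOP_CONF_DIR must point at the cluster config)
spark-shell --master yarn

# Python shell against a Spark Standalone master (host:port is a placeholder)
pyspark --master spark://master-host:7077

# R shell running locally with 4 cores
sparkR --master "local[4]"
```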
When the driver program runs, the Spark framework initializes executor processes on the cluster hosts to process our data. The driver program must be network addressable from the worker nodes. An executor is a process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk across them. Each application has its own executors, and each driver and executor process in a cluster is a separate JVM. The following occurs when we submit a Spark application to a cluster:

1. The driver is launched and invokes the main method in the Spark application.
2. The driver requests resources from the cluster manager to launch executors.
3. The cluster manager launches executors on behalf of the driver program.
4. The driver runs the application. Based on the transformations and actions in the application, the driver sends tasks to executors.
5. Tasks are run on executors to compute and save results.
6. If dynamic allocation is enabled, executors that have been idle for a specified period are released.
When the driver's main method exits or calls SparkContext.stop(), it terminates any outstanding executors and releases resources from the cluster manager. Let us focus on running Spark applications on YARN.

Advantages of Spark on YARN over other cluster managers:

We can dynamically share and centrally configure the same pool of cluster resources among all frameworks that run on YARN.
We can use all the features of the YARN schedulers (Capacity Scheduler, Fair Scheduler or FIFO Scheduler) for categorizing, isolating, and prioritizing workloads.
We can choose the number of executors to use; in contrast, Spark Standalone requires each application to run an executor on every host in the cluster.
Spark can run against Kerberos-enabled Hadoop clusters and use secure authentication between its processes.

Deployment Modes in YARN: In YARN, each application instance has an ApplicationMaster process, which is the first container started for that application. The ApplicationMaster is responsible for requesting resources from the ResourceManager. Once the resources are allocated, it instructs NodeManagers to start containers on its behalf. ApplicationMasters eliminate the need for an active client: the process starting the application can terminate, and coordination continues from a process managed by YARN running on the cluster.

Cluster Deployment Mode: In cluster mode, the Spark driver runs in the ApplicationMaster on a cluster host. A single process in a YARN container is responsible for both driving the application and requesting resources from YARN. The client that launches the application does not need to run for the lifetime of the application. Cluster mode is not well suited to using Spark interactively: applications that require user input, such as spark-shell and pyspark, need the Spark driver to run inside the client process that initiates the Spark application.

Client Deployment Mode: In client mode, the Spark driver runs on the host where the job is submitted. The ApplicationMaster is responsible only for requesting executor containers from YARN. After the containers start, the client communicates with them to schedule work.

Why are Java and Scala preferred for Spark application development? Accessing Spark from Java or Scala offers several advantages: platform independence by running inside the JVM, self-contained packaging of code and its dependencies into JAR files, and higher performance because Spark itself runs in the JVM. You lose these advantages when using the Spark Python API.
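The two YARN deployment modes described above differ only in the --deploy-mode flag passed to spark-submit. A small Python sketch that assembles the invocation makes the contrast concrete (the jar and class names are hypothetical):

```python
def spark_submit_args(app_jar, main_class, deploy_mode="cluster"):
    """Build a spark-submit argument list for YARN; deploy_mode picks
    where the driver runs: 'cluster' (inside the ApplicationMaster) or
    'client' (inside the submitting process)."""
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
    ]

# Example: driver runs on the cluster, so the client may exit after submission.
args = spark_submit_args("etl-app.jar", "com.example.EtlJob", "cluster")
```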
Managing dependencies and making them available for Python jobs on a cluster can be difficult. To determine which dependencies are required on the cluster, remember that Spark application code runs in executor processes distributed throughout the cluster. If the Python transformations we define use any third-party libraries, such as NumPy or NLTK, the Spark executors require access to those libraries when they run on remote hosts. It is better to expose a shared Python library path to the worker nodes, for example over NFS, so they can access the third-party libraries.

Spark Extensions:

Spark Streaming: Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. We can also apply Spark's machine learning and graph processing algorithms to data streams. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.

Spark SQL: Spark SQL is a Spark module for structured data processing. Spark SQL supports executing SQL queries on a variety of data sources, such as relational databases and Hive, through a DataFrame interface. A DataFrame is a Dataset organized into named columns; it is conceptually equivalent to a table in a relational database. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. The same execution engine is used to compute a result, independent of which API/language you use to express the computation. This unification means that developers can easily switch back and forth between APIs based on which provides the most natural way to express a given transformation.

Machine Learning Library (MLlib): MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. MLlib provides ML algorithms such as classification, regression, clustering, and collaborative filtering on large datasets. It supports operations such as feature extraction, transformation, dimensionality reduction, and selection, and has pipelining tools for constructing, evaluating and tuning ML pipelines. We can save and load algorithms, models and pipelines. Spark MLlib also has utilities for linear algebra, statistics, data handling, etc.

GraphX: GraphX is a component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD with a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
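Spark Streaming's micro-batch model is easy to picture in plain Python (this is an illustration of the idea, not Spark code): a continuous stream is sliced into fixed-size batches, and each batch is then processed with ordinary batch-style logic.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Slice an (endless) iterator into lists of at most batch_size items,
    mimicking how Spark Streaming turns a live stream into discrete batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch is handled by ordinary batch logic (here: a running word count).
counts = {}
for batch in micro_batches(["a", "b", "a", "c", "b", "a"], 2):
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
```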
Let us see how to develop a streaming application using apache spark streaming module.
03-25-2018
12:51 PM
2 Kudos
Introduction: When I think about a leading open-source cloud computing platform, OpenStack comes to mind. OpenStack controls large pools of compute, storage, and networking resources throughout a datacenter. OpenStack has gained great popularity in the technology market, and it introduces new services and features in every release cycle to address critical IT requirements. A very important requirement for any IT organization is to build a robust platform for data analytics on large data sets. Big data is the latest buzzword in the IT industry. This article primarily focuses on how OpenStack plays a key role in addressing the big data use case.

Big data on OpenStack: Nowadays data is generated everywhere, and its volume is growing exponentially. Data is generated from web servers, application servers and database servers in the form of user information, log files and system state information. Apart from this, a huge volume of data is generated from IoT devices such as sensors, vehicles and industrial devices. Data generated by scientific simulation models is another example of a big data source. It is difficult to store such data and perform analytics on it with traditional software tools. Hadoop can address this issue.

Let me share my use case with you: I have a bulk amount of data stored in an RDBMS environment. It does not perform well as the data grows bigger, and I cannot imagine its performance when it grows even bigger. I do not feel comfortable adopting a NoSQL culture at this stage. I need to store and process a bulk amount of data in a cost-effective way. Should I rely on high-end servers in a non-virtualized environment? My requirement is to scale the cluster at any time and to have a good dashboard to manage all of its components. So I planned to set up a Hadoop cluster on top of OpenStack and created my ETL job environment.
Hadoop is an industry-standard framework for storing and analyzing large data sets, with its fault-tolerant Hadoop Distributed File System and MapReduce implementation. Scalability is a very common concern in a typical Hadoop cluster. OpenStack has introduced a project called Sahara - Data Processing as a Service. OpenStack Sahara aims to provision and manage data processing frameworks such as Hadoop MapReduce, Spark and Storm in a cluster topology. The project is similar to the data analytics platform provided by the Amazon Elastic MapReduce (EMR) service. Sahara can deploy a cluster in a few minutes, and it can scale the cluster by adding or removing worker nodes on demand. Benefits of managing a Hadoop cluster with OpenStack Sahara:
Clusters can be provisioned quickly and are easy to configure.
Like other OpenStack services, the Sahara service can be managed through a powerful REST API, CLI and the Horizon dashboard.
Plugins are available to support multiple Hadoop vendors, such as Vanilla (Apache Hadoop), HDP (Ambari), CDH (Cloudera), MapR, Spark and Storm.
Cluster size can be scaled up and down on demand.
It can be integrated with OpenStack Swift to store the data processed by Hadoop and Spark.
Cluster monitoring can be made simple.

Apart from cluster provisioning, Sahara can be used as "Analytics as a Service" for ad hoc or bursty analytic workloads. Here, we select the framework and load the job data; the cluster behaves as a transient cluster and is automatically terminated after job completion.

Architecture: OpenStack Sahara is architected to leverage the core services and other fully managed services of OpenStack, which makes Sahara more reliable and able to manage Hadoop clusters efficiently. We can optionally use services such as Trove and Swift in the deployment. Let us look into the internals of the Sahara service.
The Sahara service has an API server which responds to HTTP requests from the end user and interacts with other OpenStack services to perform its functions:

Keystone (Identity as a Service) - authenticates users and provides security tokens that are used to work with OpenStack, limiting a user's abilities in Sahara to their OpenStack privileges.
Heat (Orchestration as a Service) - used to provision and orchestrate the deployment of data processing clusters.
Glance (Virtual Machine Image as a Service) - stores VM images with the operating system and pre-installed Hadoop/Spark software packages used to create a data processing cluster.
Nova (Compute as a Service) - provisions virtual machines for data processing clusters.
Ironic (Bare Metal as a Service) - provisions bare metal nodes for data processing clusters.
Neutron (Networking as a Service) - facilitates networking services, from basic to advanced topologies, to access the data processing clusters.
Cinder (Block Storage) - provides persistent storage media for cluster nodes.
Swift (Object Storage) - provides reliable storage to keep job binaries and the data processed by Hadoop/Spark.
Designate (DNS as a Service) - provides a hosted zone to keep DNS records of the cluster instances. Hadoop services communicate with the cluster instances by their hostnames.
Ceilometer (Telemetry as a Service) - collects and stores metrics about the cluster for metering and monitoring purposes.
Manila (File Share as a Service) - can be used to store job binaries and data created by the job.
Barbican (Key Management Service) - stores sensitive data such as passwords and private keys securely.
Trove (Database as a Service) - provides database instances for the Hive metastore and to store the state of the Hadoop services and other management services.

How to set up a Sahara cluster? The OpenStack team provides clear documentation for setting up the Sahara service. Please follow the steps given in the installation guide to deploy Sahara in your environment.
There are several ways to deploy the Sahara service; for experimenting with it, Kolla is a preferable choice. You can also manage the Sahara project through the Horizon dashboard. https://docs.openstack.org/sahara/latest/install/index.html

ETL (Extract, Transform and Load) or ELT (Extract, Load and Transform) with a Sahara cluster: There are numerous ETL tools available in the market. A traditional data warehouse has its own benefits and limitations, and it might be located somewhere other than your data source. The reason I am targeting Hadoop is that it is an ideal platform to run ETL jobs. Data in your datastore comes in a variety of forms, including structured, semi-structured and unstructured. The Hadoop ecosystem has tools to ingest data from different data sources, including databases, files and other data streams, and to store it in the centralized Hadoop Distributed File System (HDFS). As the data grows rapidly, the Hadoop cluster can be scaled, leveraging OpenStack Sahara.

Apache Hive is the data warehouse project built on top of the Hadoop ecosystem, and it is a proven tool for ETL analysis. Once the data is extracted from the data sources with tools such as Sqoop, Flume or Kafka, it should be cleansed and transformed by Hive or Pig scripts using the MapReduce technique. Later, we can load the data into the tables created during the transformation phase and perform analysis on it. Results can be generated as reports and visualized with any BI tool. Another advantage of Hive is that it provides an interactive query engine accessed via the Hive Query Language, which resembles SQL; a database person can therefore execute jobs in the Hadoop ecosystem without prior knowledge of Java and MapReduce concepts. The Hive query execution engine parses a Hive query and converts it into a sequence of MapReduce/Spark jobs for the cluster. Hive can be accessed through JDBC/ODBC drivers and Thrift clients.
Oozie is a workflow engine available in the Hadoop ecosystem. A workflow is a set of tasks that needs to be executed in sequence in a distributed environment. Oozie helps us create anything from a simple workflow to cascaded workflows and coordinated jobs. Although it is ideal for creating workflows for complex ETL jobs, it does not have modules to support every Hadoop-related action. A large organization might have its own workflow engine to execute its tasks, and we can use any workflow engine to carry out our ETL job; OpenStack Mistral (Workflow as a Service) is a good example, and Apache Oozie resembles OpenStack Mistral in some respects. Oozie also acts as a job scheduler, and jobs can be triggered at regular intervals of time.

Let us look at a typical ETL job process with Hadoop. An application stores its data in a MySQL server, and the data stored in the DB server should be analyzed at minimum cost and time.

Extract phase: The very first step is to extract data from MySQL and store it in HDFS. Apache Sqoop can be used to export/import data from a structured data source such as an RDBMS data store. If the data to be extracted is semi-structured or unstructured, we can use Apache Flume to ingest it from sources such as web server logs, Twitter data streams or sensor data.

sqoop import --connect jdbc:mysql://<database host>/<database name> --driver com.mysql.jdbc.Driver --table <table name> --username <database user> --password <database password> --num-mappers 3 --target-dir hdfs://hadoop_sahara_local/user/rkkrishnaa/database
Transform phase: The data extracted in the previous phase is not in its final format; it is just raw data. It should be cleansed by applying proper filters and data aggregation across multiple tables. HDFS serves as our staging and intermediate area for this data. Now we should design a Hive schema for each table and create a database to transform the data stored in the staging area. Typically the data is in CSV format, with each record delimited by a comma (,). We do not need to inspect the HDFS data to know how it is stored; since we imported it from MySQL, we are aware of the schema, and it is almost compatible with Hive except for some data types. Once the database is modelled, we can load the extracted data for cleaning. The data in the table is still de-normalized, so aggregate the required columns from the different tables. Open the hive or beeline shell, or if you have HUE (Hadoop User Experience) installed, open the Hive editor and run the commands given below.

CREATE DATABASE <database name>;
USE <database name>;
CREATE TABLE <table name>
(variable1 DATATYPE ,
variable2 DATATYPE ,
variable3 DATATYPE ,
. ,
. ,
. ,
variablen DATATYPE
)
PARTITIONED BY (Year INT, Month INT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs://hadoop_sahara_local/user/rkkrishnaa/database/files.csv'
INTO TABLE <table name>;
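Light cleansing of the staged CSV can also be done outside Hive before the LOAD. A plain-Python sketch (the column layout and the null convention are assumptions for illustration) that drops malformed rows and normalizes empty fields to Hive's default null marker \N:

```python
import csv
import io

def cleanse_csv(src, dst, expected_cols):
    """Copy CSV rows from src to dst, skipping rows with the wrong column
    count and replacing empty fields with \\N (Hive's default null marker)."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    kept = 0
    for row in reader:
        if len(row) != expected_cols:
            continue  # malformed record: drop it
        writer.writerow([f if f.strip() else r"\N" for f in row])
        kept += 1
    return kept

# Usage against in-memory buffers (these would be real files in practice):
src = io.StringIO("1,alice,30\n2,,25\nbad,row\n")
dst = io.StringIO()
kept = cleanse_csv(src, dst, expected_cols=3)
```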
Similarly, we can aggregate the data from multiple tables with the 'OVERWRITE INTO TABLE' statement. Hive supports partitioning tables to improve query performance by distributing the execution load horizontally; here we prefer to partition on the columns that store Year and Month. Note that a partitioned table sometimes creates more tasks in a MapReduce job.

Load phase: We covered the transform phase above. It is time to load the transformed data into the data warehouse directory in HDFS, which is the final state of the data. Here we can apply SQL queries to get appropriate results. All DML commands can be used to analyse the warehouse data according to the use case. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML Results can be downloaded as CSV, tables or charts for analysis, and can be integrated with popular BI tools such as Talend Open Studio, Tableau, etc.

What next? We had a brief look at the benefits of OpenStack Sahara and a typical example of an ETL task in the Hadoop ecosystem. It is time to automate the ETL job with the Oozie workflow engine. If you have experience using Mistral, you can stick with it; most Hadoop users are acquainted with Apache Oozie, so I take Oozie as the example.

Step 1: Create the job.properties file

nameNode=hdfs://hadoop_sahara_local:8020
jobTracker=<rm-host>:8050
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}
oozie.libpath=/user/root

Step 2: Create the workflow.xml file to define the tasks for extracting data from the MySQL data store to HDFS

<workflow-app xmlns="uri:oozie:workflow:0.2" name="my-etl-job">
<global>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
</global>
<start to="extract"/>
<action name="extract">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<command>import --connect jdbc:mysql://<database host>:3306/<database name> --username <database user> --password <database password> --table <table name> --driver com.mysql.jdbc.Driver --target-dir hdfs://hadoop_sahara_local/user/rkkrishnaa/database${date} -m 3</command>
</sqoop>
<ok to="transform"/>
<error to="fail"/>
</action>
<action name="transform">
<hive xmlns="uri:oozie:hive-action:0.2">
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>/user/rkkrishnaa/oozie/hive-site.xml</value>
</property>
</configuration>
<script>transform.q</script>
</hive>
<ok to="load"/>
<error to="fail"/>
</action>
<action name="load">
<hive xmlns="uri:oozie:hive-action:0.2">
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>/user/rkkrishnaa/oozie/hive-site.xml</value>
</property>
</configuration>
<script>load.q</script>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Sqoop failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
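With job.properties and workflow.xml in place, the workflow is typically submitted with the Oozie CLI; a sketch (the Oozie server URL and HDFS application path are placeholders for your environment):

```shell
# Copy the workflow definition to the application path referenced in job.properties
hdfs dfs -put -f workflow.xml /user/rkkrishnaa/

# Submit and start the workflow; this prints a workflow job id
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# Inspect the status of a running workflow (job id below is illustrative)
oozie job -oozie http://oozie-host:11000/oozie -info <job-id>
```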
Step 3: Create the file transform.q

LOAD DATA INPATH 'hdfs://hadoop_sahara_local/user/rkkrishnaa/database${date}/files.csv'
INTO TABLE <table name>;

Step 4: Create the file load.q

SELECT name, age, department FROM sales_department WHERE experience > 5;

The above workflow job will perform the ETL operation. We can customize the workflow job based on our use case, and we have many mechanisms to handle task failures within a workflow job. An email action helps track the status of workflow tasks. This ETL job can be executed on a daily or weekly basis at any time; usually organizations run these kinds of long-running tasks during weekends or at night. To know more about workflow actions: https://oozie.apache.org/docs/3.3.1/index.html

Conclusion: OpenStack has integrated a very large Hadoop ecosystem into its universe. Many cloud providers offer a Hadoop service within a few clicks on their cloud management portals. OpenStack Sahara proves the ability and robustness of OpenStack infrastructure. According to the OpenStack user survey, only a few organizations have adopted the Sahara service in their private cloud. Sahara supports most of the Hadoop vendor plugins to operate Hadoop workloads effectively. Execute your ETL workflow with OpenStack Sahara. Enjoy learning!

|| Tamasoma Jyodhirgamaya ||

References:
[1] https://www.openstack.org/software/sample-configs#big-data
[2] https://docs.openstack.org/sahara/latest/install/index.html
[3] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
[4] https://oozie.apache.org/docs/3.3.1/index.html
06-30-2016
05:30 AM
Thank you, I will try with this parameter.
06-21-2016
08:27 PM
2 Kudos
@Radhakrishnan Rk
1. Stop the Hue instances, if any:
/etc/init.d/hue stop
2. On the node where Hue is installed, take a backup of hue.ini:
cp /etc/hue/conf/hue.ini /etc/hue/conf/hue.ini.bkup
3. On all the Hue instances, edit /etc/hue/conf/hue.ini:
# Configuration options for connecting to LDAP and Active Directory
# -------------------------------------------------------------------
[[ldap]]
# The search base for finding users and groups
base_dn="DC=mycompany,DC=com"
# URL of the LDAP server
ldap_url=ldap://auth.mycompany.com
# A PEM-format file containing certificates for the CA's that
# Hue will trust for authentication over TLS.
# The certificate for the CA that signed the
# LDAP server certificate must be included among these certificates.
# See more here http://www.openldap.org/doc/admin24/tls.html.
## ldap_cert=
## use_start_tls=true
# Distinguished name of the user to bind as -- not necessary if the LDAP server
# supports anonymous searches
bind_dn="uid=hadoopService,CN=ServiceAccount,DC=mycompany,DC=com"
# Password of the bind user -- not necessary if the LDAP server supports
# anonymous searches
bind_password=
# Pattern for searching for usernames -- Use <username> for the parameter
# For use when using LdapBackend for Hue authentication
ldap_username_pattern="uid=<username>,ou=People,dc=mycompany,dc=com"
# Create users in Hue when they try to login with their LDAP credentials
# For use when using LdapBackend for Hue authentication
create_users_on_login = true
# Synchronize a users groups when they login
sync_groups_on_login=true
# Ignore the case of usernames when searching for existing users in Hue.
ignore_username_case=true
# Force usernames to lowercase when creating new users from LDAP.
force_username_lowercase=true
# Use search bind authentication.
search_bind_authentication=true
# Choose which kind of subgrouping to use: nested or suboordinate (deprecated).
subgroups=suboordinate
# Define the number of levels to search for nested members.
nested_members_search_depth=10
[[[users]]]
# Base filter for searching for users
user_filter="objectclass=*"
# The username attribute in the LDAP schema
user_name_attr=sAMAccountName
[[[groups]]]
# Base filter for searching for groups
group_filter="objectclass=*"
# The username attribute in the LDAP schema
group_name_attr=cn
4. Start Hue and test it:
/etc/init.d/hue start
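Before restarting Hue, it can save time to verify the bind credentials and search base directly with ldapsearch; the values below mirror the hue.ini example and are placeholders for your directory:

```shell
# Bind as the service account and look up one user by sAMAccountName.
# -W prompts for the bind password rather than putting it on the command line.
ldapsearch -H ldap://auth.mycompany.com \
  -D "uid=hadoopService,CN=ServiceAccount,DC=mycompany,DC=com" -W \
  -b "DC=mycompany,DC=com" "(sAMAccountName=jdoe)" cn memberOf
```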
05-31-2016
04:34 AM
As per AMBARI-15241, I believe this will be available in 2.4.0.0. For now you will have to use https://community.hortonworks.com/questions/4792/ambari-audit-log.html
05-25-2016
03:44 PM
Thank you so much for your rapid response. I will try with the information provided by you and get back to you.
05-14-2016
12:40 PM
@Radhakrishnan Rk I don't think you can use Cloudbreak to do this in your case. You can look into https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-AddingHoststoanExistingCluster and automate the scale-up and scale-down process using the API.
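For reference, adding a host to an existing blueprint-deployed cluster boils down to one REST call against the Ambari API; this is a hedged sketch (host names, credentials, cluster, blueprint and host-group names are all placeholders, and the exact payload shape can vary by Ambari version):

```shell
# Map a new host into an existing host group of the cluster's blueprint.
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d '{"blueprint":"my-blueprint","host_group":"worker-nodes"}' \
  http://ambari-server:8080/api/v1/clusters/my-cluster/hosts/new-node.example.com
```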