Member since: 06-29-2017 | Posts: 6 | Kudos Received: 1 | Solutions: 0
12-13-2017
06:38 AM
You need skills, not a certification, to get your career going; there is no official certification for Scala. You can learn Scala through this course: https://goo.gl/7D4Ldn
09-27-2017
10:12 AM
The Hadoop Connector Guide provides a brief introduction to cloud connectors and their features. It gives detailed information on how to set up the connector and run Data Synchronization tasks, along with an overview of the supported features and task operations that can be performed using the Hadoop Connector. Docs for the Hadoop connector for Informatica: https://kb.informatica.com/proddocs/Product%20Documentation/6/IC_Spring2017_HadoopConnectorGuide_en.pdf
09-23-2017
07:05 AM
I searched for the document you are referring to and found the link below: https://discuss.codecademy.com/t/data-base-migration-from-mysql-to-hive-using-sqoop/228062
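In case the link goes stale: a migration like the one discussed there typically comes down to a single Sqoop import with Hive options. A minimal sketch, where the host, database, table, and user are placeholders rather than values from the thread:

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser -P \
  --table orders \
  --hive-import \
  --create-hive-table \
  --hive-table default.orders

This pulls the orders table out of MySQL and creates a matching Hive table in one step; -P prompts for the database password instead of putting it on the command line.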
09-23-2017
06:20 AM
5 common use cases for Apache Spark:

1. Streaming ingest and analytics. Spark isn't the first big data tool for handling streaming ingest, but it is the first one to integrate it with the rest of the analytic environment. Spark is friendly with the rest of the streaming data ecosystem, supporting data sources including Flume, Kafka, ZeroMQ, and HDFS.

2. Exploratory analytics. One of the headline benefits of using Spark is that you no longer need to maintain separate environments for exploratory and production work. The relatively long execution times of a Hadoop MapReduce job make hands-on exploration of data difficult: data scientists typically still must sample data if they want to move quickly. Thanks to the speed of Spark's in-memory capabilities, interactive exploration can now happen completely within Spark, without the need for Java engineering or sampling of the data (see the sketch after this list).

3. Model building and machine learning. Spark's status as a big data tool that data scientists find easy to use makes it ideal for building models for analytical purposes. In a pre-Spark world, big data modelers typically built their models in a language such as R or SAS, then threw them to data engineers to re-implement in Java for production on Hadoop.

4. Graph analysis. By incorporating the GraphX component, Spark brings all the benefits of its environment to graph computation, enabling use cases such as social network analysis, fraud detection, and recommendations.

5. Simpler, faster ETL. Though less glamorous than the analytical applications, ETL is often the lion's share of data workloads. If the rest of your data pipeline is based on Spark, the benefits of using Spark for ETL are obvious, with consequent gains in maintainability and code reuse.
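To make the exploratory analytics point concrete, here is a minimal Scala sketch; the dataset path and the eventType field are hypothetical placeholders, not part of any real dataset:

import org.apache.spark.sql.SparkSession

// Minimal exploratory-analytics sketch. In spark-shell the `spark` session
// and its implicits already exist, so only the three middle lines are needed.
object Explore {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("explore").getOrCreate()
    import spark.implicits._
    val events = spark.read.json("hdfs:///data/events.json") // hypothetical dataset
    events.printSchema()                // inspect the inferred schema
    events.groupBy("eventType").count() // aggregate over the full data,
      .orderBy($"count".desc)           // no sampling needed
      .show(10)
    spark.stop()
  }
}

Because the work happens in memory where possible, a query like this comes back fast enough for interactive, trial-and-error analysis over the full dataset.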
09-16-2017
12:40 PM
How To Install Cassandra and Run a Single-Node Cluster on Ubuntu 14.04

Introduction

Cassandra, or Apache Cassandra, is a highly scalable open-source NoSQL database system that achieves great performance on multi-node setups. In this tutorial, you'll learn how to install it and run a single-node cluster on Ubuntu 14.04.

Prerequisites

To complete this tutorial, you will need the following:

- An Ubuntu 14.04 Droplet
- A non-root user with sudo privileges (Initial Server Setup with Ubuntu 14.04 explains how to set this up)

Installing Cassandra

We'll install Cassandra using packages from the official Apache Software Foundation repositories, so start by adding the repo so that the packages are available to your system. Note that Cassandra 2.2.2 is the latest version at the time of this publication; change the 22x below to match the latest version (for example, use 23x if Cassandra 2.3 is the latest):

echo "deb http://www.apache.org/dist/cassandra/debian 22x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

Then add the repo's source:

echo "deb-src http://www.apache.org/dist/cassandra/debian 22x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
To avoid package signature warnings during package updates, we need to add three public keys from the Apache Software Foundation associated with the package repositories.

Add the first one using this pair of commands, which must be run one after the other:

gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -

Then add the second key:

gpg --keyserver pgp.mit.edu --recv-keys 2B5C1B00
gpg --export --armor 2B5C1B00 | sudo apt-key add -

Then add the third:

gpg --keyserver pgp.mit.edu --recv-keys 0353B12C
gpg --export --armor 0353B12C | sudo apt-key add -
Update the package database:

sudo apt-get update

Finally, install Cassandra:

sudo apt-get install cassandra
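On Ubuntu the Cassandra service normally starts automatically once the package is installed. As a quick sanity check (assuming the default single-node configuration), start the service if needed and query the cluster with nodetool:

sudo service cassandra start
nodetool status

A healthy single-node cluster reports one node whose status/state flags read UN (Up/Normal).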
08-26-2017
09:40 AM
1 Kudo
Apache Spark comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. This means that for every Spark job, a DAG of tasks is created to be executed by the engine. In mathematical parlance, a DAG consists of a set of vertices and directed edges connecting them, and the tasks are executed according to the DAG layout. In the MapReduce case, the DAG consists of only two vertices, with one vertex for the map task and the other for the reduce task.
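As a small illustration, here is a Scala sketch of how a job's DAG builds up; the input path and the word-count logic are assumptions for the example, not anything Spark prescribes:

import org.apache.spark.sql.SparkSession

// Each transformation below adds a vertex to the job's DAG; nothing executes
// until the action (count) hands the DAG to Spark's scheduler.
object DagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dag-demo").getOrCreate()
    val sc = spark.sparkContext
    val words = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // shuffle edge: stage boundary
    println(counts.count())       // action: the DAG is scheduled and executed
    println(counts.toDebugString) // prints the RDD lineage, i.e. the DAG as text
    spark.stop()
  }
}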