Member since: 05-10-2016
Posts: 11
Kudos Received: 13
Solutions: 2
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2370 | 07-03-2016 11:44 AM |
 | 2622 | 07-02-2016 02:25 PM |
10-08-2016
04:17 PM
2 Kudos
All you need is some basic programming skill and a little experience with AWS to get started. Before we proceed, let me explain what a hybrid cloud environment actually is. Hybrid cloud solutions combine the dynamic scalability of public cloud solutions with the flexibility and control of your private cloud. Some of the benefits of hybrid computing include:
- Business continuity
- More opportunity for innovation
- Scalability
- Increased speed to market
- Risk management (test the waters)
- Improved connectivity
- Secure systems

Kafka Mirroring

With Kafka's mirroring feature you can maintain a replica of your existing Kafka cluster. The following diagram shows how to use the MirrorMaker tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster. The tool uses a Kafka consumer to consume messages from the source cluster and re-publishes those messages to the local (target) cluster using an embedded Kafka producer.

Use Case

Demonstrate a hybrid cloud solution using Kafka mirroring across regions.

Environment Architecture

The architecture above represents two cluster environments, private and public cloud respectively. Data is replicated from the source Kafka cluster to the target Kafka cluster with the help of the MirrorMaker tool, and analysis of the data sets is performed using Spark Streaming clusters.

The internal environment stores all the data in HDFS, where it is accessible through Hive external tables. Storing the data in HDFS ensures that the raw data is never changed, so it can be used to resolve any discrepancies that might occur in the real-time layer (target cluster).

The external environment receives the replicated data via MirrorMaker, and a Spark Streaming application processes that data and stores it in Amazon S3. Critical data that requires low latency is retained in Amazon S3 based on a TTL. The data is then pushed to Amazon Redshift, where the user can issue low-latency queries and have the results calculated on the fly.

With the combined power of a hybrid environment and Kafka mirroring, you can perform different types of data analysis over streams of data with low latency.

Technology Stack
- Source system: Twitter
- Messaging system: Apache Kafka
- Target system (internal): HDFS, Apache Hive
- Target system (external): Amazon S3, Amazon Redshift
- AWS instance type: EMR
- Streaming API: Apache Spark
- Programming language: Java
- IDE: Eclipse
- Build tool: Apache Maven
- Operating system: CentOS 7

Workflow
A Flume agent is configured to pull live data from Twitter and push it to the source Kafka broker, which is hosted internally. The data is then replicated to a target Kafka broker hosted on AWS with the help of the Kafka MirrorMaker tool. In the internal environment, a Spark Streaming application picks the data up from Kafka and stores it in HDFS, where you can query it using Hive. In parallel, another Spark Streaming application running in the cloud environment reads the data from the target Kafka cluster and stores it in Amazon S3, from where it is pushed to Amazon Redshift.

Development

The code base for Kafka Mirroring in Hybrid Cloud Environment has been uploaded to GitHub. You can download the source code from https://github.com/XavientInformationSystems/Kakfa-Mirroring-Hybrid-Cloud and follow the instructions for setting up the project.

Stay tuned for more, and help the open-source community grow by actively participating in the work we do to expand the project.
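
To make the consume-and-republish loop from the Kafka Mirroring section a little more concrete, here is a toy in-memory model in Java. It is only a sketch of the idea: it does not use the Kafka client API or MirrorMaker itself, and the names ToyCluster and ToyMirror are hypothetical, made up for this illustration. The "cluster" is a single append-only list standing in for one topic, and the mirror remembers how far into the source it has read.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a Kafka cluster holding one topic (hypothetical, not the Kafka API). */
final class ToyCluster {
    private final List<String> log = new ArrayList<>();

    /** Appends a message to the end of the topic. */
    void publish(String message) {
        log.add(message);
    }

    /** Returns a copy of all messages from the given offset to the end. */
    List<String> readFrom(int offset) {
        return new ArrayList<>(log.subList(offset, log.size()));
    }

    int size() {
        return log.size();
    }
}

/** Toy mirror: consumes new messages from the source and republishes them to the target. */
final class ToyMirror {
    private final ToyCluster source;
    private final ToyCluster target;
    private int offset = 0; // next position in the source that has not been mirrored yet

    ToyMirror(ToyCluster source, ToyCluster target) {
        this.source = source;
        this.target = target;
    }

    /** Copies everything published since the last call; returns the number of messages copied. */
    int mirrorOnce() {
        List<String> batch = source.readFrom(offset);
        for (String message : batch) {
            target.publish(message);
        }
        offset += batch.size();
        return batch.size();
    }

    public static void main(String[] args) {
        ToyCluster internal = new ToyCluster(); // plays the role of the source cluster
        ToyCluster cloud = new ToyCluster();    // plays the role of the target cluster
        ToyMirror mirror = new ToyMirror(internal, cloud);
        internal.publish("tweet-1");
        internal.publish("tweet-2");
        System.out.println("mirrored " + mirror.mirrorOnce() + " message(s)");
    }
}
```

Tracking the offset on the mirror side is what lets repeated calls copy only new messages; the real tool also has to deal with partitions, consumer groups, and failures, none of which this sketch attempts.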
07-19-2016
07:41 AM
@Bernhard Walter Thanks man, it worked. I wrote something similar in Java 🙂
07-18-2016
08:25 AM
Does this delete the directories that have no data in them and leave the directories with data in them?
The point is to only remove directories that have no data.
07-18-2016
08:02 AM
@nyadav I found that already. Any suggestions on how to delete the directories that have no data in them and leave behind the ones with data?
07-18-2016
07:45 AM
Can someone provide the code snippet to delete a directory in HDFS using Spark/Spark-Streaming? I am using spark-streaming to process some incoming data which is leading to blank directories in HDFS as it works on micro-batching, so I want a clean up job that can delete the empty directories. Please provide any other suggestions as well, the solution needs to be in Java.
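
A minimal sketch of the traversal such a clean-up job needs, written in Java. To stay self-contained it works on a local directory tree through java.nio.file rather than on HDFS, so it is an analogue of the task and not HDFS code; the class name EmptyDirCleaner and its methods are hypothetical. The same depth-first logic should carry over to Hadoop's FileSystem API, using listStatus to enumerate children and delete to remove a directory. "Empty" here is assumed to mean that a directory contains no files at any depth.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class EmptyDirCleaner {

    /**
     * Deletes every directory below root that contains no files at any depth.
     * The root itself is always kept. Returns the directories that were deleted.
     */
    static List<Path> deleteEmptyDirs(Path root) throws IOException {
        List<Path> removed = new ArrayList<>();
        for (Path child : listChildren(root)) {
            if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
                pruneIfEmpty(child, removed);
            }
        }
        return removed;
    }

    /** Depth-first: cleans the subdirectories first, then deletes dir if nothing is left in it. */
    private static boolean pruneIfEmpty(Path dir, List<Path> removed) throws IOException {
        boolean empty = true;
        for (Path child : listChildren(dir)) {
            if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
                // &= does not short-circuit, so every subdirectory is still visited
                // after the parent is already known to hold data.
                empty &= pruneIfEmpty(child, removed);
            } else {
                empty = false; // a file (or symlink) counts as data, so the directory stays
            }
        }
        if (empty) {
            Files.delete(dir);
            removed.add(dir);
        }
        return empty;
    }

    /** Reads the directory listing into a list so the stream is closed before anything is deleted. */
    private static List<Path> listChildren(Path dir) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                children.add(p);
            }
        }
        return children;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: EmptyDirCleaner <root-directory>");
            return;
        }
        for (Path p : deleteEmptyDirs(Paths.get(args[0]))) {
            System.out.println("deleted " + p);
        }
    }
}
```

If "no data" should also cover directories that hold only zero-length files, the file branch would need a size check. With a streaming job still writing, it is probably also wise to skip directories newer than some age threshold so that a batch in progress is not removed.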
Labels:
- Apache Hadoop
- Apache Spark
07-04-2016
05:21 AM
1 Kudo
@sharda godara - Kindly accept the answer 🙂
07-03-2016
11:44 AM
2 Kudos
The problem here is that you are running Pig in MapReduce mode, which requires the necessary Hadoop JARs to be on your classpath. You have two options:
1. Set the Hadoop classpath. Assuming you run this from the same folder in which you build your code with Maven, and that all your third-party JARs are in target/libs:
   export HADOOP_CLASSPATH=./target/classes:./target/libs/*
2. Run Pig in local mode. In this mode you only need access to a single machine; all files are installed and run using your local host and local file system.
   pig -x local
Hope this helps 🙂
07-02-2016
02:25 PM
5 Kudos
Here is the solution to your problem, @Dagmawi Mengistu. There are two issues here.

ISSUE 1: If you check your logs, you will see a "java.lang.ClassCastException" after relation "f". The updated steps below explain how to resolve this error (comments are marked with a // prefix here; in an actual Pig script, use -- instead):

a = load '/pigsample/Salaryinfo.csv' USING PigStorage(',');
b = load '/pigsample/Employeeinfo.csv' USING PigStorage(',');
c = filter b by $4 == 'Male';
// In relation "d", note that I have cast the field at index 0 to int. You need to cast explicitly like this to avoid the "java.lang.ClassCastException".
d = foreach c generate (int)$0 as id:int, $1 as firstname:chararray, $2 as lastname:chararray, $4 as gender:chararray, $6 as city:chararray, $7 as country:chararray, $8 as countrycode:chararray;
// Similarly, in relation "e", the field iD must again be cast explicitly to int.
e = foreach a generate (int)$0 as iD:int, $1 as firstname:chararray, $2 as lastname:chararray, $3 as salary:double, ToDate($4, 'MM/dd/yyyy') as dateofhire, $5 as company:chararray;
// Relation "f" now works and no longer throws an exception.
f = join d by id, e by iD;

ISSUE 2: In relation "g", you don't need to write f.d::firstname; that throws "org.apache.pig.backend.executionengine.ExecException". You can reference the fields of relation "d" within relation "f" directly, like this:

g = foreach f generate d::firstname as firstname;
// Print output
DUMP g;

OUTPUT:
(Jonathan)
(Gary)
(Roger)
(Jeffrey)
(Steve)
(Lawrence)
(Billy)
(Joseph)
(Aaron)
(Steve)
(Brian)
(Robert)

Hope this helps 🙂
06-02-2016
12:56 PM
1 Kudo
Hi @Neeraj Sabharwal, has this issue been resolved in HDP 2.4?
05-17-2016
01:30 PM
1 Kudo
Hi @Neeraj Sabharwal, I am trying to save my output results in Spark using saveAsTextFile(""). The result is multiple part files (part-00000, part-00001, and so on) along with .crc files in the output directory. Do you have any idea how I can avoid creating the .crc files?