Member since: 01-15-2016
Posts: 82
Kudos Received: 29
Solutions: 10
My Accepted Solutions
Views | Posted |
---|---|
6542 | 04-03-2017 09:35 PM |
4149 | 12-29-2016 02:22 PM |
1247 | 06-27-2016 11:18 AM |
990 | 06-21-2016 10:08 AM |
1026 | 05-26-2016 01:43 PM |
06-06-2016
09:02 PM
The twitter4j jars are included in Flume's libs by default. However, the Twitter source from Cloudera is built against a different version of the twitter4j framework. I'd recommend removing all *twitter4j* jars from the flume_home/libs folder and adding the proper version (the one mentioned in the pom of Cloudera's source) to aux_lib instead, along with the custom source.
05-26-2016
01:43 PM
@azza messaoudi, check the following Twitter doc: https://dev.twitter.com/streaming/reference/post/statuses/filter
And here is a custom Flume source implementation that supports all Twitter streaming parameters, including the "follow" parameter, which is the one you're actually interested in: http://www.dataprocessingtips.com/2016/04/24/custom-twitter-source-for-apache-flume/
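To illustrate what the "follow" parameter looks like on the wire, here is a minimal sketch that builds the form body for a statuses/filter request. As far as I understand the Twitter doc above, `follow` is a comma-separated list of user IDs and `track` is a comma-separated list of keywords, and at least one filter must be given. The function name `build_filter_body` and the sample IDs are made up for illustration; authentication and the actual request are left out.

```python
from urllib.parse import urlencode


def build_filter_body(follow_ids=(), track_terms=()):
    """Build the URL-encoded form body for a statuses/filter request.

    Hypothetical helper: only the parameter names "follow" and "track"
    come from the Twitter streaming doc.
    """
    params = {}
    if follow_ids:
        # "follow" takes user IDs (not screen names), comma-separated.
        params["follow"] = ",".join(str(user_id) for user_id in follow_ids)
    if track_terms:
        # "track" takes keywords or phrases, comma-separated.
        params["track"] = ",".join(track_terms)
    if not params:
        raise ValueError("at least one of follow_ids/track_terms is required")
    return urlencode(params)


# Sample values are invented for the example.
body = build_filter_body(follow_ids=[12, 34], track_terms=["flume"])
print(body)
```

In the custom Flume source these values would normally come from the agent configuration rather than from code.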
05-04-2016
06:46 PM
I suppose the issue is with loading the data. Try creating an external table instead:
CREATE EXTERNAL TABLE tweets
....
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging/';
05-02-2016
09:57 PM
As I recall, it's something related to nested arrays. We're using another JSON SerDe library, and it works with JSON of any complexity. Here I posted an example of a Twitter table DDL that has been tested well. Regards, Michael
04-15-2016
11:47 AM
The easiest way on Hortonworks Hadoop is to use Ambari to run Flume. It will show you some basic metrics and the status of the agents. If you don't want to use Ambari, or you have a custom Flume installation, I'd recommend reading this doc: http://flume.apache.org/FlumeUserGuide.html#monitoring In any Linux environment you can install at least Ganglia. It will cover most of your needs in terms of agent monitoring.
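If neither Ambari nor Ganglia is available, the monitoring doc above also describes JSON reporting: an agent started with `-Dflume.monitoring.type=http -Dflume.monitoring.port=34545` serves its counters at `/metrics`. Below is a rough sketch of reading that output. The channel name `c1`, the sample numbers, and the helper names are invented for illustration, and the host and port are assumptions about your setup.

```python
import json
from urllib.request import urlopen


def fetch_metrics(host="localhost", port=34545):
    """Fetch the raw JSON metrics from a running agent (assumed host/port)."""
    with urlopen(f"http://{host}:{port}/metrics") as response:
        return response.read().decode("utf-8")


def summarize_channels(metrics_json):
    """Return {channel_name: fill_percentage} from a metrics payload."""
    metrics = json.loads(metrics_json)
    summary = {}
    for component, values in metrics.items():
        # Keys look like "CHANNEL.<name>", "SOURCE.<name>", "SINK.<name>".
        kind, _, name = component.partition(".")
        if kind == "CHANNEL":
            # Flume reports metric values as strings.
            summary[name] = float(values["ChannelFillPercentage"])
    return summary


# Invented sample payload in the same shape as the agent's output.
SAMPLE = """{
  "CHANNEL.c1": {"Type": "CHANNEL", "ChannelSize": "385",
                 "ChannelCapacity": "1000", "ChannelFillPercentage": "38.5"},
  "SOURCE.r1": {"Type": "SOURCE", "EventReceivedCount": "1200"}
}"""

print(summarize_channels(SAMPLE))
```

Against a live agent you would call `summarize_channels(fetch_metrics())` instead of using the sample string; a channel whose fill percentage keeps growing usually means the sink cannot keep up.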
04-15-2016
11:40 AM
Well, based on what we know so far, I'd say 2 Flume agents with the file or JDBC channel should work for you. There will be no overlap in data because that is controlled by MQ itself, so it is not a matter for Flume. On the Flume processing side, we ensure that no data loss happens by using the file or JDBC channel.
04-14-2016
08:22 AM
1 Kudo
It would be great to see the agent's log.
04-14-2016
08:19 AM
1 Kudo
Can you explain the issue with MQ a bit? I'm not an expert in WebSphere, but it seems MQ is supposed to deliver each event only once. So there should be no duplicates by design. Is that correct?
03-21-2016
06:25 PM
1 Kudo
I'd say that, in general, whenever you need to parallelize your algorithm (and I suppose TF-IDF is a good candidate for that), you need to submit the job to the cluster one way or another. It can be streaming, as mentioned by @Lester Martin, or PySpark, as mentioned by @Artem Ervits (just note that Spark is not MapReduce, so if you want to learn MapReduce first, the streaming option is the best for you). And if you have a lightweight algorithm that can run on a client machine, your laptop, an application server, etc., you can just submit a Hive query to the Hadoop cluster and then process the results locally.
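To make the map-reduce shape of TF-IDF more concrete, here is a small local sketch in plain Python. It is not a Hadoop streaming job: the dictionary grouping stands in for the shuffle step, the function names are made up, and the formula is the plain tf * ln(N / df) variant, which is only one of several in use.

```python
import math
from collections import Counter, defaultdict


def map_doc(doc_id, text):
    """Map phase: emit (term, (doc_id, term_frequency)) for one document."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    for term, n in counts.items():
        yield term, (doc_id, n / total)


def reduce_term(term, postings, n_docs):
    """Reduce phase: all postings for one term arrive together."""
    postings = list(postings)
    # Document frequency is the number of documents containing the term.
    idf = math.log(n_docs / len(postings))
    for doc_id, tf in postings:
        yield (term, doc_id), tf * idf


def tf_idf(docs):
    """Run map, group by term (stand-in for shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            grouped[term].append(posting)
    scores = {}
    for term, postings in grouped.items():
        scores.update(reduce_term(term, postings, len(docs)))
    return scores


# Tiny invented corpus for the example.
docs = {"d1": "flume flume hive", "d2": "hive spark"}
scores = tf_idf(docs)
for key in sorted(scores):
    print(key, round(scores[key], 4))
```

With streaming, `map_doc` and `reduce_term` would become separate scripts reading from stdin and writing tab-separated lines, and the number of documents would have to be passed in or computed by an earlier job.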
03-18-2016
10:25 AM
hadoop-annotations-2.7.1.2.3.4.0-3485.jar
hadoop-auth-2.7.1.2.3.4.0-3485.jar
hadoop-aws-2.7.1.2.3.4.0-3485.jar
hadoop-azure-2.7.1.2.3.4.0-3485.jar
hadoop-common-2.7.1.2.3.4.0-3485-tests.jar
hadoop-common-2.7.1.2.3.4.0-3485.jar
hadoop-nfs-2.7.1.2.3.4.0-3485.jar
Double-check that these are the classes from Azure. You also need to add hadoop-hdfs.jar and core-site.xml.