Member since: 01-15-2016
Posts: 82
Kudos Received: 29
Solutions: 10
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6317 | 04-03-2017 09:35 PM |
| | 3945 | 12-29-2016 02:22 PM |
| | 1183 | 06-27-2016 11:18 AM |
| | 970 | 06-21-2016 10:08 AM |
| | 988 | 05-26-2016 01:43 PM |
06-06-2016 09:02 PM
twitter4j jars are included in Flume's libs by default. However, the Twitter source from Cloudera is built against a different version of the twitter4j framework. I'd recommend removing all *twitter4j* jars from the flume_home/libs folder and adding the proper version (the one mentioned in the pom of Cloudera's source) to aux_lib instead (along with the custom source).
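For example, a rough shell sketch of the swap (the flume_home path and the twitter4j version below are assumptions; take the real version from Cloudera's pom):

```
# remove the conflicting twitter4j jars shipped with Flume
# (the flume_home path here is an assumption)
rm -f /usr/hdp/current/flume-server/lib/twitter4j-*.jar

# drop the matching twitter4j version and the custom source jar
# into an aux lib folder that is on Flume's classpath
# (e.g. added via FLUME_CLASSPATH in flume-env.sh)
mkdir -p /usr/hdp/current/flume-server/auxlib
cp twitter4j-core-3.0.3.jar custom-twitter-source.jar /usr/hdp/current/flume-server/auxlib/
```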
05-26-2016 01:43 PM
@azza messaoudi, check the following Twitter doc: https://dev.twitter.com/streaming/reference/post/statuses/filter And here is a custom Flume source implementation with support for all Twitter streaming parameters: http://www.dataprocessingtips.com/2016/04/24/custom-twitter-source-for-apache-flume/ (including the "follow" parameter, which is actually the one you're interested in).
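A rough sketch of how the agent config could look with such a source (the class name and property names below are hypothetical; check the linked post for the real ones):

```
# hypothetical source class and property names
a1.sources = twitter
a1.sources.twitter.type = com.dataprocessingtips.flume.TwitterSource
a1.sources.twitter.consumerKey = YOUR_CONSUMER_KEY
a1.sources.twitter.consumerSecret = YOUR_CONSUMER_SECRET
a1.sources.twitter.accessToken = YOUR_ACCESS_TOKEN
a1.sources.twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
# "follow": comma-separated Twitter user IDs to stream, per the streaming API doc
a1.sources.twitter.follow = 1234567,7654321
```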
05-04-2016 06:46 PM
I suppose it's an issue with loading the data. Try to create an external table instead:

create EXTERNAL table tweets
....
row format serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging/';
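For reference, a fuller sketch of such a DDL (the column list here is an assumption for illustration; mirror whichever JSON fields you actually need):

```
-- hypothetical column set for illustration
create EXTERNAL table tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging/';
```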
05-02-2016 09:57 PM
As I recall, it's something related to nested arrays. We're using another JSON serde lib, and it works with JSON of any complexity. Here I posted an example of a Twitter table DDL which is well tested. Regards, Michael
04-15-2016 11:47 AM
The easiest way in Hortonworks Hadoop is to use Ambari to run Flume. It will show you some basic metrics and the status of the agents. If you don't want to use Ambari, or you have some custom Flume installation, I'd recommend reading this doc: http://flume.apache.org/FlumeUserGuide.html#monitoring In any Linux environment you can at least install Ganglia; it will cover most of your needs in terms of agent monitoring.
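For example, the built-in JSON reporting from that doc can be switched on at startup (the agent name and config file below are placeholders); the metrics are then served at http://<host>:34545/metrics:

```
bin/flume-ng agent --conf conf --conf-file conf/agent.conf --name a1 \
  -Dflume.monitoring.type=http \
  -Dflume.monitoring.port=34545
```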
04-15-2016 11:40 AM
Well, based on what we know so far, I'd say two Flume agents with a file or JDBC channel should work for you. There will be no overlap in the data, because that is controlled by MQ itself, so it's not a matter of Flume. On the Flume processing side, we ensure no data loss happens by using a file or JDBC channel.
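A minimal sketch of a durable file channel (the directory paths are placeholders):

```
# the file channel persists events to disk, so they survive agent restarts
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```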
04-14-2016 08:22 AM
1 Kudo
It would be great to see the log of the agent
04-14-2016 08:19 AM
1 Kudo
Can you explain the issue with MQ a bit? I'm not an expert in WebSphere, but it seems MQ is supposed to deliver each event only once, so there should be no duplicates by design. Is that correct?
03-21-2016 06:25 PM
1 Kudo
I'd say (in general) that whenever you need to parallelize your algorithm, and I suppose TF-IDF is a good candidate for that, you need to submit the job to the cluster one way or another. It can be the streaming approach mentioned by @Lester Martin, or the PySpark approach mentioned by @Artem Ervits (just note: Spark is not MapReduce, so if you want to learn MapReduce first, then the streaming option is the best for you). And in case you have some lightweight algorithm to implement, and it can be done on a client machine/your laptop/an application server etc., you can just submit a Hive query to the Hadoop cluster and then process the results locally.
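If you go the PySpark route, here is a minimal TF-IDF sketch using the pyspark.mllib API (the input path is a placeholder and the tokenization is naive):

```
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-sketch")

# one document per line; split naively on whitespace
docs = sc.textFile("hdfs:///tmp/corpus.txt").map(lambda line: line.split(" "))

# term frequencies via feature hashing, then rescaled by IDF
tf = HashingTF().transform(docs)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

print(tfidf.take(2))
```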
03-18-2016 10:25 AM
hadoop-annotations-2.7.1.2.3.4.0-3485.jar
hadoop-auth-2.7.1.2.3.4.0-3485.jar
hadoop-aws-2.7.1.2.3.4.0-3485.jar
hadoop-azure-2.7.1.2.3.4.0-3485.jar
hadoop-common-2.7.1.2.3.4.0-3485-tests.jar
hadoop-common-2.7.1.2.3.4.0-3485.jar
hadoop-nfs-2.7.1.2.3.4.0-3485.jar
Double-check that these are the classes from Azure. Also, you need to add hadoop-hdfs.jar and core-site.xml.
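Assuming this is for a Flume agent writing to HDFS, one way to get those jars and the config onto the classpath is flume-env.sh (the paths below are assumptions):

```
# flume-env.sh -- paths below are assumptions for your install
# /opt/hadoop-libs holds the hadoop-*.jar files (incl. hadoop-hdfs.jar)
# /opt/hadoop-conf holds core-site.xml (and hdfs-site.xml if needed)
export FLUME_CLASSPATH="/opt/hadoop-libs/*:/opt/hadoop-conf"
```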