Member since
01-15-2016
82
Posts
29
Kudos Received
10
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3888 | 04-03-2017 09:35 PM
 | 1983 | 12-29-2016 02:22 PM
 | 590 | 06-27-2016 11:18 AM
 | 511 | 06-21-2016 10:08 AM
 | 481 | 05-26-2016 01:43 PM
04-05-2017
08:29 PM
@tuxnet it should work with Spark 1.6 as well. You can check the master URL in the spark-defaults.conf file on your cluster. If you set the SPARK_CONF_DIR variable and copy the spark-defaults config from your cluster into it, there is no need to specify the master explicitly.
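For illustration, a minimal spark-defaults.conf dropped into SPARK_CONF_DIR could look like this (the master URL is a placeholder; on YARN it would be yarn-client/yarn-cluster instead of a standalone URL):

# replace with the value from your cluster's spark-defaults.conf
spark.master    spark://my-cluster-master-node:7077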
04-03-2017
09:35 PM
1 Kudo
@tuxnet Sure, you can use any IDE with PySpark. Here are short instructions for Eclipse and PyDev:
- set the HADOOP_HOME variable referencing the location of winutils.exe
- set the SPARK_HOME variable referencing your local Spark folder
- set SPARK_CONF_DIR to the folder where you have the actual cluster config copied (spark-defaults and log4j)
- add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
For testing purposes I'm adding code like spark = SparkSession.builder.master("spark://my-cluster-master-node:7077")... (see the sketch below), but with a proper configuration file in SPARK_CONF_DIR it should work with just SparkSession.builder.getOrCreate(). Alternatively you can set up your run configurations to use spark-submit directly. Hope it helps.
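A minimal sketch of that test bootstrap (the master URL and app name are placeholders):

from pyspark.sql import SparkSession

# explicit master only for local IDE testing; with spark-defaults.conf in
# SPARK_CONF_DIR, SparkSession.builder.getOrCreate() alone is enough
spark = (SparkSession.builder
         .master("spark://my-cluster-master-node:7077")  # placeholder host
         .appName("pydev-test")                          # placeholder name
         .getOrCreate())

print(spark.range(10).count())  # quick smoke test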
02-15-2017
03:52 PM
@Cord thomas
Turn on debug logging and check the log file first
12-29-2016
05:43 PM
I'd recommend https://github.com/stickfigure/batchfb over restfb
because of its nice implementation of the Facebook batch API. It fits very well for any Facebook data-consuming task.
12-29-2016
03:51 PM
@vamsi valiveti it could be an option, right. But for production usage I'd also think about how to stop and how to monitor the agents. In my experience an init.d service script plus Ganglia monitoring is the best option. It lets you start/stop agents easily with commands like /etc/init.d/flume "agent" start/stop, and Ganglia provides a nice web interface for monitoring.
12-29-2016
02:40 PM
@vamsi valiveti the easiest way is to detach the shell from the command using nohup: nohup <my_command> &
Another option is to create a Flume init.d service script and run Flume as a service; I've posted an example script here (search for "Setup flume agent auto startup" on the page), and a minimal sketch follows below. A third option is to use Ambari to control the agents.
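A minimal sketch of such an init.d-style wrapper, assuming a standalone Flume under /opt/flume and an agent named a1 (all paths and names are placeholders):

#!/bin/bash
# /etc/init.d/flume -- minimal start/stop wrapper for a single Flume agent (sketch)
FLUME_HOME=/opt/flume                       # placeholder install dir
AGENT_NAME=a1                               # placeholder agent name
CONF_FILE=$FLUME_HOME/conf/flume.conf       # placeholder agent config
PID_FILE=/var/run/flume-$AGENT_NAME.pid

case "$1" in
  start)
    nohup $FLUME_HOME/bin/flume-ng agent -n $AGENT_NAME -c $FLUME_HOME/conf -f $CONF_FILE \
      > /var/log/flume/$AGENT_NAME.out 2>&1 &
    echo $! > $PID_FILE
    ;;
  stop)
    [ -f "$PID_FILE" ] && kill "$(cat $PID_FILE)" && rm -f "$PID_FILE"
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    ;;
esac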
12-29-2016
02:22 PM
1 Kudo
@vamsi valiveti you can trigger Flume from an Oozie shell action. However, pay attention that the action will be executed on a random cluster node, so all your nodes should have Flume installed. You will also need to somehow control the agents after that, and if you have >10 nodes it becomes a problem. That's why it is not a common scenario of Flume usage. I'd say the good approach is to keep Flume running all the time and schedule Oozie jobs to process the data whenever you need.
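If you do go the Oozie route, a shell action would look roughly like this (a sketch; the script name, workflow path and transition targets are placeholders, and the script itself would call flume-ng on whichever node the action lands on):

<action name="start-flume-agent">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>start_flume.sh</exec>
    <file>${appPath}/start_flume.sh#start_flume.sh</file>
  </shell>
  <ok to="end"/>
  <error to="fail"/>
</action>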
07-26-2016
02:58 PM
The default transactionCapacity for the file channel is 10 000; for the memory channel it is 100. That's why it works for you. Add a transactionCapacity property to your file channel or increase the memory available to the Flume process (e.g. -Xmx1024m).
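For example (agent and channel names are placeholders):

# lower the file channel's transaction batch size
agent1.channels.fileCh.type = file
agent1.channels.fileCh.transactionCapacity = 1000
# or give the Flume JVM more heap, e.g. JAVA_OPTS="-Xmx1024m" in flume-env.sh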
06-27-2016
11:18 AM
Grant write permissions to the /var/log/flume directory. You can also specify an alternative log file for a specific agent: -Dflume.log.file=my_path/my_file.log
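For example (the user, agent name and file paths are assumptions):

# let the user running the agent write to the default log dir
chown -R flume:flume /var/log/flume
# or point this agent at its own log file
flume-ng agent -n a1 -c conf -f conf/a1.properties -Dflume.log.file=/tmp/a1.log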
06-21-2016
10:08 AM
High availability in Flume is just a matter of agent configuration, regardless of whether you're using Ambari or not. Here are a few links you can check:
https://flume.apache.org/FlumeUserGuide.html#flow-reliability-in-flume
https://flume.apache.org/FlumeUserGuide.html#failover-sink-processor
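The core of it is a sink group with a failover processor, e.g. (agent and sink names are placeholders):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000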
06-20-2016
04:15 PM
1 Kudo
I'd say whenever you need some Spark-specific features like ML, GraphX or Streaming, use Spark as the ETL engine, since it provides an all-in-one solution for most use cases. If you have no such requirements, use Hive on Tez. If you have no Tez, use Hive on MR. In any case Hive acts just like a metastore.
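Switching the Hive engine is a one-liner per session (it can also be set globally in hive-site.xml):

SET hive.execution.engine=tez;   -- or mr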
06-06-2016
09:02 PM
twitter4j jars are included in the Flume libs by default. However, the Twitter source from Cloudera is built with another version of the twitter4j framework. I'd recommend removing all *twitter4j* jars from the FLUME_HOME/lib folder and adding the proper version (mentioned in Cloudera's source pom) to aux_lib instead (along with the custom source).
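Roughly (the version x.y.z is a placeholder, take it from the pom of the source you build, and $AUX_LIB_DIR stands for wherever you keep the custom source jar):

cd $FLUME_HOME/lib && rm twitter4j-*.jar                                 # drop the bundled twitter4j jars
cp twitter4j-core-x.y.z.jar twitter4j-stream-x.y.z.jar $AUX_LIB_DIR/    # add the matching versions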
05-26-2016
01:43 PM
@azza messaoudi, check the following Twitter doc: https://dev.twitter.com/streaming/reference/post/statuses/filter And here is a custom Flume source implementation with support for all Twitter streaming parameters: http://www.dataprocessingtips.com/2016/04/24/custom-twitter-source-for-apache-flume/ (including the "follow" parameter, which is the one you're actually interested in)
05-04-2016
06:46 PM
I suppose it is an issue with loading the data. Try to create an external table instead:
create EXTERNAL table tweets
....
row format serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging/';
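For illustration only, a complete version with a couple of placeholder columns might look like this (a real tweet schema has many more fields):

create EXTERNAL table tweets (
  id bigint,
  text string,
  `user` struct<screen_name:string, followers_count:int>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging/';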
05-02-2016
09:57 PM
As I recall it is something related to nested arrays. We're using another JSON serde lib and it works with JSON of any complexity. Here I posted an example of a Twitter table DDL which is well tested. Regards, Michael
04-15-2016
11:47 AM
The easiest way in Hortonworks Hadoop is to use Ambari to run Flume. It will show you some basic metrics and the status of the agents. If you don't want to use Ambari or you have some custom Flume installation, I'd recommend reading this doc: http://flume.apache.org/FlumeUserGuide.html#monitoring In any Linux env you can install at least Ganglia. It will cover most of your needs in terms of agent monitoring.
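For instance, Flume's built-in JSON reporting can be enabled per agent and scraped by whatever monitoring you have (the agent name, config file and port below are placeholders):

flume-ng agent -n a1 -c conf -f conf/a1.properties \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
# metrics are then served as JSON at http://<agent-host>:34545/metrics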
04-15-2016
11:40 AM
Well, based on what we know so far, I'd say 2 Flume agents with a file or JDBC channel should work for you. There will be no overlap in data because that is controlled by the MQ itself, so it is not a matter of Flume. On the Flume processing side we ensure that no data loss happens by using a file or JDBC channel.
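e.g. a file channel per agent, which persists events on disk between source and sink (names and paths are placeholders):

agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/agent1/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/agent1/data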
04-14-2016
08:22 AM
1 Kudo
It would be great to see the log of the agent
04-14-2016
08:19 AM
1 Kudo
Can you explain the issue with MQ a bit? I'm not an expert in WebSphere, but it seems MQ is supposed to deliver each event only once, so there should be no duplicates by design. Is that correct?
03-21-2016
06:25 PM
1 Kudo
I'd say (in general) whenever you need to parallelize your algorithm, and I suppose TF-IDF is a good candidate for it, you need to submit the job to the cluster one way or another. It can be the streaming mentioned by @Lester Martin, or the PySpark mentioned by @Artem Ervits (just note: Spark is not MapReduce, so if you want to learn MapReduce first, then the streaming option is the best for you). And in case you have some lightweight algorithm that can run on a client machine/your laptop/application server etc., you can just submit a Hive query to the Hadoop cluster and then process the results locally.
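For the PySpark route, a minimal TF-IDF sketch (the input path is a placeholder; one document per line, tokenized naively on whitespace):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-example")
docs = sc.textFile("hdfs:///tmp/docs.txt").map(lambda line: line.split(" "))
tf = HashingTF().transform(docs)     # hash each term into a fixed-size feature vector
tf.cache()
tfidf = IDF().fit(tf).transform(tf)  # weight terms by inverse document frequency
print(tfidf.take(2))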
03-18-2016
10:25 AM
hadoop-annotations-2.7.1.2.3.4.0-3485.jar
hadoop-auth-2.7.1.2.3.4.0-3485.jar
hadoop-aws-2.7.1.2.3.4.0-3485.jar
hadoop-azure-2.7.1.2.3.4.0-3485.jar
hadoop-common-2.7.1.2.3.4.0-3485-tests.jar
hadoop-common-2.7.1.2.3.4.0-3485.jar
hadoop-nfs-2.7.1.2.3.4.0-3485.jar
Double-check these are the jars from your Azure cluster. You also need to add hadoop-hdfs.jar and core-site.xml.
03-17-2016
05:27 PM
Use jar files from your Azure cluster, not the sandbox. You need exactly the same versions of the libs used on the Azure cluster. Also copy core-site.xml to the Flume classpath (FLUME_HOME/conf should be fine). Regards
03-16-2016
11:52 AM
Seems to be a common issue for single-node clusters and sandboxes :)
https://www.google.pl/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=hadoop+file+could+only+be+replicated+to+0+nodes+instead+of+minReplication+(%3D1)
03-16-2016
11:45 AM
These are actually the steps for Windows. And I tested it locally - it works.
03-15-2016
08:38 PM
2 Kudos
I can propose much easier steps:
1. Download the Flume binaries - http://flume.apache.org/download.html - and extract them somewhere (this is going to be FLUME_HOME)
2. Download winutils and put it somewhere (e.g. C:/winutils/bin; in this case C:/winutils is going to be HADOOP_HOME)
3. Copy all missing HDFS libs to your FLUME_HOME/lib (you can find them on your Hadoop cluster; it's always preferable to have exactly the same versions as in /usr/hdp/current/hadoop or /usr/hdp/current/hadoop-hdfs)
4. Run the Flume agent with the following command: bin\flume-ng agent -name MyAgent -f conf/MyAgent.properties -property "flume.root.logger=INFO,LOGFILE,console;flume.log.file=MyLog.log;hadoop.home.dir=C:/winutils"
03-15-2016
08:30 PM
2 Kudos
Those should be commons-configuration, commons-io and htrace-core from /usr/hdp/current/hadoop/lib
03-14-2016
03:36 PM
2 Kudos
I've never tried that scenario, but it should be possible. All you need is to install Flume on the Windows machine (just extract the zip file) and add the jars needed to connect to Azure (if any). You can use the hdfs.kerberosPrincipal and hdfs.kerberosKeytab properties if you have a secure HDFS. Regards
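e.g. on the HDFS sink (the agent/sink names, principal and keytab path are placeholders):

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /tmp/flume/events
agent1.sinks.hdfsSink.hdfs.kerberosPrincipal = flume/_HOST@EXAMPLE.COM
agent1.sinks.hdfsSink.hdfs.kerberosKeytab = /etc/security/keytabs/flume.service.keytab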
03-08-2016
06:59 PM
1 Kudo
Just put everything into a single config, like:
Agent1.sources..
Agent1.sinks..
Agent1.channels..
Agent2.sources..
Agent2.sinks..
Agent2.channels..
Note, it is possible to manage those agents separately (Ambari will split them) and each one will run in a separate process.
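A more concrete sketch of such a shared file (two trivial netcat-to-logger agents; the names, ports and component types are just for illustration):

Agent1.sources = src1
Agent1.channels = ch1
Agent1.sinks = snk1
Agent1.sources.src1.type = netcat
Agent1.sources.src1.bind = 0.0.0.0
Agent1.sources.src1.port = 44444
Agent1.sources.src1.channels = ch1
Agent1.channels.ch1.type = memory
Agent1.sinks.snk1.type = logger
Agent1.sinks.snk1.channel = ch1

Agent2.sources = src2
Agent2.channels = ch2
Agent2.sinks = snk2
Agent2.sources.src2.type = netcat
Agent2.sources.src2.bind = 0.0.0.0
Agent2.sources.src2.port = 44445
Agent2.sources.src2.channels = ch2
Agent2.channels.ch2.type = memory
Agent2.sinks.snk2.type = logger
Agent2.sinks.snk2.channel = ch2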
02-03-2016
06:44 PM
3 Kudos
Why not use the Ambari REST API to manage Flume configs outside of the main Ambari screen?
Personally I haven't tested it, but it should be a valid approach. Also, one issue is still there (it prevents us from using Ambari to manage Flume): https://issues.apache.org/jira/browse/AMBARI-9421
And the workaround mentioned doesn't work with more than 1 agent.
02-03-2016
06:35 PM
Replace <ok to="get_run_date" /> with <ok to="join-fork-actions" />. In general, each of the "subflows" should end with the join node.
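Schematically, using the node names from this thread (the fork name, the second path and the join target are placeholders, and action bodies are omitted):

<fork name="fork-actions">
  <path start="get_run_date"/>
  <path start="other_subflow"/>
</fork>

<action name="get_run_date">
  ...
  <ok to="join-fork-actions"/>
  <error to="fail"/>
</action>

<join name="join-fork-actions" to="next-step"/>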