Member since
01-15-2016
82
Posts
29
Kudos Received
10
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3888 | 04-03-2017 09:35 PM
 | 1983 | 12-29-2016 02:22 PM
 | 590 | 06-27-2016 11:18 AM
 | 511 | 06-21-2016 10:08 AM
 | 481 | 05-26-2016 01:43 PM
04-05-2017
08:29 PM
@tuxnet it should work with Spark 1.6 as well. You can check the master URL in the spark-defaults.conf file on your cluster. If you set the SPARK_CONF_DIR variable and copy the spark-defaults config from your cluster into it, there is no need to specify the master explicitly.
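For illustration, a minimal spark-defaults.conf dropped into SPARK_CONF_DIR could look like this (the master URL is a placeholder; on YARN it would be yarn-client/yarn-cluster instead of a standalone URL):

# replace with the value from your cluster's spark-defaults.conf
spark.master    spark://my-cluster-master-node:7077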
04-03-2017
09:35 PM
1 Kudo
@tuxnet Sure, you can use any IDE with PySpark. Here are short instructions for Eclipse and PyDev:
- set the HADOOP_HOME variable referencing the location of winutils.exe
- set the SPARK_HOME variable referencing your local Spark folder
- set SPARK_CONF_DIR to the folder where you have the actual cluster config copied (spark-defaults and log4j)
- add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
For testing purposes I'm adding code like spark = SparkSession.builder.master("spark://my-cluster-master-node:7077")... (see the sketch below), but with a proper configuration file in SPARK_CONF_DIR it should work with just SparkSession.builder.getOrCreate(). Alternatively you can set up your run configurations to use spark-submit directly. Hope it helps.
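A minimal sketch of that test bootstrap (the master URL and app name are placeholders):

from pyspark.sql import SparkSession

# explicit master only for local IDE testing; with spark-defaults.conf in
# SPARK_CONF_DIR, SparkSession.builder.getOrCreate() alone is enough
spark = (SparkSession.builder
         .master("spark://my-cluster-master-node:7077")  # placeholder host
         .appName("pydev-test")                          # placeholder name
         .getOrCreate())

print(spark.range(10).count())  # quick smoke test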
02-15-2017
03:52 PM
@Cord thomas
Turn on debug logging and check the log file first
12-29-2016
05:43 PM
I'd recommend https://github.com/stickfigure/batchfb over restfb
because of its nice implementation of the Facebook batch API. It fits very well for any Facebook data-consuming task.
12-29-2016
03:51 PM
@vamsi valiveti it could be an option, right. But for production usage I'd also think about how to stop and how to monitor the agents. In my experience an init.d service script plus Ganglia monitoring is the best option. It lets you start/stop agents easily with commands like /etc/init.d/flume "agent" start/stop, and Ganglia provides a nice web interface for monitoring.
12-29-2016
02:40 PM
@vamsi valiveti the easiest way is to detach the shell from the command using nohup: nohup <my_command> &
Another option is to create a Flume init.d service script and run Flume as a service; I've posted an example script here (search for "Setup flume agent auto startup" on the page), and a minimal sketch follows below. A third option is to use Ambari to control the agents.
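A minimal sketch of such an init.d-style wrapper, assuming a standalone Flume under /opt/flume and an agent named a1 (all paths and names are placeholders):

#!/bin/bash
# /etc/init.d/flume -- minimal start/stop wrapper for a single Flume agent (sketch)
FLUME_HOME=/opt/flume                       # placeholder install dir
AGENT_NAME=a1                               # placeholder agent name
CONF_FILE=$FLUME_HOME/conf/flume.conf       # placeholder agent config
PID_FILE=/var/run/flume-$AGENT_NAME.pid

case "$1" in
  start)
    nohup $FLUME_HOME/bin/flume-ng agent -n $AGENT_NAME -c $FLUME_HOME/conf -f $CONF_FILE \
      > /var/log/flume/$AGENT_NAME.out 2>&1 &
    echo $! > $PID_FILE
    ;;
  stop)
    [ -f "$PID_FILE" ] && kill "$(cat $PID_FILE)" && rm -f "$PID_FILE"
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    ;;
esac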
12-29-2016
02:22 PM
1 Kudo
@vamsi valiveti you can trigger Flume from an Oozie shell action. However, pay attention that the action will be executed on a random cluster node, so all your nodes should have Flume installed. You will also need to somehow control the agents after that, and if you have >10 nodes it becomes a problem. That's why it is not a common scenario of Flume usage. I'd say the good approach is to keep Flume running all the time and schedule Oozie jobs to process the data whenever you need.
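If you do go the Oozie route, a shell action would look roughly like this (a sketch; the script name, workflow path and transition targets are placeholders, and the script itself would call flume-ng on whichever node the action lands on):

<action name="start-flume-agent">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>start_flume.sh</exec>
    <file>${appPath}/start_flume.sh#start_flume.sh</file>
  </shell>
  <ok to="end"/>
  <error to="fail"/>
</action>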
07-26-2016
02:58 PM
The default transactionCapacity for the file channel is 10 000; for the memory channel it is 100. That's why it works for you. Add a transactionCapacity property to your file channel or increase the memory available to the Flume process (e.g. -Xmx1024m).
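For example (agent and channel names are placeholders):

# lower the file channel's transaction batch size
agent1.channels.fileCh.type = file
agent1.channels.fileCh.transactionCapacity = 1000
# or give the Flume JVM more heap, e.g. JAVA_OPTS="-Xmx1024m" in flume-env.sh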
06-27-2016
11:18 AM
Grant write permissions to the /var/log/flume directory. You can also specify an alternative log file for a specific agent: -Dflume.log.file=my_path/my_file.log
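For example (the user, agent name and file paths are assumptions):

# let the user running the agent write to the default log dir
chown -R flume:flume /var/log/flume
# or point this agent at its own log file
flume-ng agent -n a1 -c conf -f conf/a1.properties -Dflume.log.file=/tmp/a1.log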
06-21-2016
10:08 AM
High availability in Flume is just a matter of agent configuration, regardless of whether you're using Ambari or not. Here are a few links you can check:
https://flume.apache.org/FlumeUserGuide.html#flow-reliability-in-flume
https://flume.apache.org/FlumeUserGuide.html#failover-sink-processor
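The core of it is a sink group with a failover processor, e.g. (agent and sink names are placeholders):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000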
06-20-2016
04:15 PM
1 Kudo
I'd say whenever you need some Spark-specific features like ML, GraphX or Streaming, use Spark as the ETL engine, since it provides an all-in-one solution for most use cases. If you have no such requirements, use Hive on Tez. If you have no Tez, use Hive on MR. In any case Hive acts just like a metastore.
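Switching the Hive engine is a one-liner per session (it can also be set globally in hive-site.xml):

SET hive.execution.engine=tez;   -- or mr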
06-06-2016
09:02 PM
twitter4j jars are included in the Flume libs by default. However, the Twitter source from Cloudera is built with another version of the twitter4j framework. I'd recommend removing all *twitter4j* jars from the FLUME_HOME/lib folder and adding the proper version (mentioned in Cloudera's source pom) to aux_lib instead (along with the custom source).
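Roughly (the version x.y.z is a placeholder, take it from the pom of the source you build, and $AUX_LIB_DIR stands for wherever you keep the custom source jar):

cd $FLUME_HOME/lib && rm twitter4j-*.jar                                 # drop the bundled twitter4j jars
cp twitter4j-core-x.y.z.jar twitter4j-stream-x.y.z.jar $AUX_LIB_DIR/    # add the matching versions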
05-26-2016
01:43 PM
@azza messaoudi, check the following Twitter doc: https://dev.twitter.com/streaming/reference/post/statuses/filter And here is a custom Flume source implementation with support for all Twitter streaming parameters: http://www.dataprocessingtips.com/2016/04/24/custom-twitter-source-for-apache-flume/ (including the "follow" parameter, which is the one you're actually interested in)
05-04-2016
06:46 PM
I suppose it is an issue with loading the data. Try to create an external table instead:
create EXTERNAL table tweets
....
row format serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging/';
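For illustration only, a complete version with a couple of placeholder columns might look like this (a real tweet schema has many more fields):

create EXTERNAL table tweets (
  id bigint,
  text string,
  `user` struct<screen_name:string, followers_count:int>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/tmp/tweets_staging/';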
05-02-2016
09:57 PM
As I recall it is something related to nested arrays. We're using another JSON serde lib and it works with JSON of any complexity. Here I posted an example of a Twitter table DDL which is well tested. Regards, Michael
04-15-2016
11:47 AM
The easiest way in Hortonworks Hadoop is to use Ambari to run Flume. It will show you some basic metrics and the status of the agents. If you don't want to use Ambari or you have some custom Flume installation, I'd recommend reading this doc: http://flume.apache.org/FlumeUserGuide.html#monitoring In any Linux env you can install at least Ganglia. It will cover most of your needs in terms of agent monitoring.
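For instance, Flume's built-in JSON reporting can be enabled per agent and scraped by whatever monitoring you have (the agent name, config file and port below are placeholders):

flume-ng agent -n a1 -c conf -f conf/a1.properties \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
# metrics are then served as JSON at http://<agent-host>:34545/metrics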
04-15-2016
11:40 AM
Well, based on what we know so far, I'd say 2 Flume agents with a file or JDBC channel should work for you. There will be no overlap in data because that is controlled by the MQ itself, so it is not a matter of Flume. On the Flume processing side we ensure that no data loss happens by using a file or JDBC channel.
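e.g. a file channel per agent, which persists events on disk between source and sink (names and paths are placeholders):

agent1.channels = ch1
agent1.channels.ch1.type = file
agent1.channels.ch1.checkpointDir = /var/flume/agent1/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/agent1/data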
04-14-2016
08:22 AM
1 Kudo
It would be great to see the log of the agent
04-14-2016
08:19 AM
1 Kudo
Can you explain the issue with MQ a bit? I'm not an expert in WebSphere, but it seems MQ is supposed to deliver each event only once, so there should be no duplicates by design. Is that correct?
03-21-2016
06:25 PM
1 Kudo
I'd say (in general) whenever you need to parallelize your algorithm, and I suppose TF-IDF is a good candidate for it, you need to submit the job to the cluster one way or another. It can be the streaming mentioned by @Lester Martin, or the PySpark mentioned by @Artem Ervits (just note: Spark is not MapReduce, so if you want to learn MapReduce first, then the streaming option is the best for you). And in case you have some lightweight algorithm that can run on a client machine/your laptop/application server etc., you can just submit a Hive query to the Hadoop cluster and then process the results locally.
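For the PySpark route, a minimal TF-IDF sketch (the input path is a placeholder; one document per line, tokenized naively on whitespace):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-example")
docs = sc.textFile("hdfs:///tmp/docs.txt").map(lambda line: line.split(" "))
tf = HashingTF().transform(docs)     # hash each term into a fixed-size feature vector
tf.cache()
tfidf = IDF().fit(tf).transform(tf)  # weight terms by inverse document frequency
print(tfidf.take(2))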
03-18-2016
10:25 AM
hadoop-annotations-2.7.1.2.3.4.0-3485.jar
hadoop-auth-2.7.1.2.3.4.0-3485.jar
hadoop-aws-2.7.1.2.3.4.0-3485.jar
hadoop-azure-2.7.1.2.3.4.0-3485.jar
hadoop-common-2.7.1.2.3.4.0-3485-tests.jar
hadoop-common-2.7.1.2.3.4.0-3485.jar
hadoop-nfs-2.7.1.2.3.4.0-3485.jar
Double-check these are the jars from your Azure cluster. You also need to add hadoop-hdfs.jar and core-site.xml.
03-17-2016
05:27 PM
Use jar files from your Azure cluster, not the sandbox. You need exactly the same versions of the libs used on the Azure cluster. Also copy core-site.xml to the Flume classpath (FLUME_HOME/conf should be fine). Regards
03-16-2016
11:52 AM
Seems to be a common issue for single-node clusters and sandboxes :)
https://www.google.pl/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=hadoop+file+could+only+be+replicated+to+0+nodes+instead+of+minReplication+(%3D1)
03-16-2016
11:45 AM
These are actually the steps for Windows. And I tested it locally - it works.
03-15-2016
08:38 PM
2 Kudos
I can propose much easier steps:
1. Download the Flume binaries - http://flume.apache.org/download.html - and extract them somewhere (this is going to be FLUME_HOME)
2. Download winutils and put it somewhere (e.g. C:/winutils/bin; in this case C:/winutils is going to be HADOOP_HOME)
3. Copy all missing HDFS libs to your FLUME_HOME/lib (you can find them on your Hadoop cluster; it's always preferable to have exactly the same versions as in /usr/hdp/current/hadoop or /usr/hdp/current/hadoop-hdfs)
4. Run the Flume agent with the following command: bin\flume-ng agent -name MyAgent -f conf/MyAgent.properties -property "flume.root.logger=INFO,LOGFILE,console;flume.log.file=MyLog.log;hadoop.home.dir=C:/winutils"
03-15-2016
08:30 PM
2 Kudos
Those should be commons-configuration, commons-io and htrace-core from /usr/hdp/current/hadoop/lib
03-14-2016
03:36 PM
2 Kudos
I've never tried that scenario, but it should be possible. All you need is to install Flume on the Windows machine (just extract the zip file) and add the jars needed to connect to Azure (if any). You can use the hdfs.kerberosPrincipal and hdfs.kerberosKeytab properties if you have a secure HDFS. Regards
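e.g. on the HDFS sink (the agent/sink names, principal and keytab path are placeholders):

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /tmp/flume/events
agent1.sinks.hdfsSink.hdfs.kerberosPrincipal = flume/_HOST@EXAMPLE.COM
agent1.sinks.hdfsSink.hdfs.kerberosKeytab = /etc/security/keytabs/flume.service.keytab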
03-08-2016
06:59 PM
1 Kudo
Just put everything into a single config, like:
Agent1.sources..
Agent1.sinks..
Agent1.channels..
Agent2.sources..
Agent2.sinks..
Agent2.channels..
Note, it is possible to manage those agents separately (Ambari will split them) and each one will run in a separate process.
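A more concrete sketch of such a shared file (two trivial netcat-to-logger agents; the names, ports and component types are just for illustration):

Agent1.sources = src1
Agent1.channels = ch1
Agent1.sinks = snk1
Agent1.sources.src1.type = netcat
Agent1.sources.src1.bind = 0.0.0.0
Agent1.sources.src1.port = 44444
Agent1.sources.src1.channels = ch1
Agent1.channels.ch1.type = memory
Agent1.sinks.snk1.type = logger
Agent1.sinks.snk1.channel = ch1

Agent2.sources = src2
Agent2.channels = ch2
Agent2.sinks = snk2
Agent2.sources.src2.type = netcat
Agent2.sources.src2.bind = 0.0.0.0
Agent2.sources.src2.port = 44445
Agent2.sources.src2.channels = ch2
Agent2.channels.ch2.type = memory
Agent2.sinks.snk2.type = logger
Agent2.sinks.snk2.channel = ch2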
02-03-2016
06:44 PM
3 Kudos
Why not use the Ambari REST API to manage Flume configs outside of the main Ambari screen?
Personally I haven't tested it, but it should be a valid approach. Also, one issue is still there (it prevents us from using Ambari to manage Flume): https://issues.apache.org/jira/browse/AMBARI-9421
And the workaround mentioned doesn't work with more than 1 agent.
02-03-2016
06:35 PM
Replace <ok to="get_run_date" /> with <ok to="join-fork-actions" />. In general, each of the "subflows" should end with the join node.
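Schematically, using the node names from this thread (the fork name, the second path and the join target are placeholders, and action bodies are omitted):

<fork name="fork-actions">
  <path start="get_run_date"/>
  <path start="other_subflow"/>
</fork>

<action name="get_run_date">
  ...
  <ok to="join-fork-actions"/>
  <error to="fail"/>
</action>

<join name="join-fork-actions" to="next-step"/>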