Member since: 01-15-2016
Posts: 82
Kudos Received: 29
Solutions: 10
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6227 | 04-03-2017 09:35 PM
 | 3912 | 12-29-2016 02:22 PM
 | 1168 | 06-27-2016 11:18 AM
 | 967 | 06-21-2016 10:08 AM
 | 973 | 05-26-2016 01:43 PM
04-05-2017
08:29 PM
@tuxnet it should work with Spark 1.6 as well. You can check the master URL in the spark-defaults.conf file in your cluster. If you set the SPARK_CONF_DIR variable and copy the spark-defaults config from your cluster into it, there is no need to specify the master explicitly.
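For illustration, a minimal spark-defaults.conf sketch; the master URL below is a placeholder, not a value from your cluster:

```
# spark-defaults.conf copied from the cluster into SPARK_CONF_DIR
spark.master    spark://my-cluster-master-node:7077
```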
04-03-2017
09:35 PM
1 Kudo
@tuxnet Sure, you can use any IDE with PySpark. Here are short instructions for Eclipse and PyDev:
- set the HADOOP_HOME variable, referencing the location of winutils.exe
- set the SPARK_HOME variable, referencing your local Spark folder
- set SPARK_CONF_DIR to the folder where you copied the actual cluster config (spark-defaults and log4j)
- add %SPARK_HOME%/python/lib/pyspark.zip and %SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter

For testing purposes I'm adding code like spark = SparkSession.builder.master("spark://my-cluster-master-node:7077")..., but with a proper configuration file in SPARK_CONF_DIR it should work with just SparkSession.builder.getOrCreate(). Alternatively, you can set up your run configurations to use spark-submit directly. Hope it helps.
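A minimal sketch of that test setup, assuming placeholder paths, app name, and master URL (adjust them to your environment; normally the variables would be set in the PyDev run configuration rather than in code):

```python
import os
from pyspark.sql import SparkSession

# Placeholder paths -- adjust to your environment.
os.environ.setdefault("HADOOP_HOME", "C:/hadoop")           # folder containing bin/winutils.exe
os.environ.setdefault("SPARK_HOME", "C:/spark")             # local Spark distribution
os.environ.setdefault("SPARK_CONF_DIR", "C:/cluster-conf")  # spark-defaults and log4j copied from the cluster

# Explicit master for quick testing; with a proper spark-defaults.conf in
# SPARK_CONF_DIR the .master(...) call can be dropped.
spark = (SparkSession.builder
         .appName("pydev-test")
         .master("spark://my-cluster-master-node:7077")
         .getOrCreate())

# Trivial job to confirm the session reaches the cluster.
print(spark.range(10).count())
spark.stop()
```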
02-15-2017
03:52 PM
@Cord thomas
Turn on debug logging and check the log file first.
12-29-2016
03:51 PM
@vamsi valiveti it could be an option, right. But for production usage I'd also think about how to stop the agents and how to monitor them. In my experience, an init.d service script plus Ganglia monitoring is the best option. It lets you start and stop agents easily with commands like /etc/init.d/flume "agent" start/stop, and Ganglia provides a nice web interface for monitoring.
12-29-2016
02:40 PM
@vamsi valiveti the easiest way is to detach the shell from the command using nohup: nohup <my_command> &
Another option is to create a Flume init.d service script and run Flume as a service; I've posted an example script here (search for "Setup flume agent auto startup" on the page). A third option is to use Ambari to control the agents.
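For example, the nohup approach for a Flume agent would look roughly like this (agent name, config paths, and output file are placeholders):

```
nohup flume-ng agent -n agent1 -c /etc/flume/conf -f /etc/flume/conf/agent1.conf \
    > /var/log/flume/agent1.out 2>&1 &
```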
12-29-2016
02:22 PM
1 Kudo
@vamsi valiveti you can trigger Flume from an Oozie shell action. However, note that the action will be executed on a random cluster node, so all your nodes should have Flume installed. You will also need to somehow control the agents after that, and if you have more than 10 nodes it becomes a problem. That's why this is not a common way of using Flume. I'd say the good approach is to keep Flume running all the time and schedule Oozie jobs to process the data whenever you need.
07-26-2016
02:58 PM
The default transactionCapacity for the file channel is 10,000; for the memory channel it is 100. That's why it works for you. Add the transactionCapacity property to your file channel, or increase the memory available to the Flume process (e.g. -Xmx1024m).
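For illustration, a minimal sketch of the relevant channel properties (the agent and channel names are placeholders, not taken from your configuration):

```
# placeholder agent/channel names -- adjust to your configuration
agent1.channels.fileCh.type = file
agent1.channels.fileCh.capacity = 1000000
agent1.channels.fileCh.transactionCapacity = 10000
```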
06-27-2016
11:18 AM
Grant write permissions to the /var/log/flume directory. You can also specify an alternative log file for a specific agent: -Dflume.log.file=my_path/my_file.log
06-21-2016
10:08 AM
High availability in Flume is just a matter of agent configuration, regardless of whether you're using Ambari or not. Here are a few links you can check:
https://flume.apache.org/FlumeUserGuide.html#flow-reliability-in-flume
https://flume.apache.org/FlumeUserGuide.html#failover-sink-processor
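For reference, a failover sink group is configured roughly like this (agent, group, and sink names are placeholders; see the second link above for the full set of options):

```
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
```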
06-20-2016
04:15 PM
1 Kudo
I'd say whenever you need Spark-specific features like ML, GraphX or Streaming, use Spark as the ETL engine, since it provides an all-in-one solution for most use cases. If you have no such requirements, use Hive on Tez. If you have no Tez, use Hive on MapReduce. In any case, Hive acts just as the metastore.