Member since: 09-15-2015
Posts: 116
Kudos Received: 141
Solutions: 40

My Accepted Solutions
Views | Posted |
---|---|
1829 | 02-05-2018 04:53 PM |
2371 | 10-16-2017 09:46 AM |
2058 | 07-04-2017 05:52 PM |
3084 | 04-17-2017 06:44 PM |
2258 | 12-30-2016 11:32 AM |
05-17-2016
05:29 PM
1 Kudo
You can do this with YARN ACLs against the relevant YARN queues. A better way of implementing this is to use Ranger to set YARN access controls on a given set of queues. You would then require Spark users to specify a queue when they submit their jobs, and the authorization would be enforced. To ensure this, prevent user access to the default queue.
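A Spark job would then be submitted against one of the allowed queues along these lines (the queue, class and jar names here are only illustrative):
spark-submit --master yarn --queue analytics --class com.example.MyApp my-app.jar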
05-04-2016
11:58 PM
2 Kudos
The NiFi processor does in fact use the Elasticsearch BulkRequest Java API, so even if you set the batch size to 1 you will be using batch loading from the ES perspective. If you want to send in a file with multiple JSON records, you have two choices. You could use InvokeHttp to post to the REST API directly. For a more NiFi-centric solution, use SplitText to divide up the JSON records, and then process them with a sensible Batch Size using the PutElasticsearch processor. This gives you good control, especially if you need to use back-pressure to even out the flow and protect the ES cluster.
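If you go the InvokeHttp route, the body posted to the _bulk endpoint is newline-delimited JSON, with an action line preceding each document; roughly like this (the index, type and field names are purely illustrative):
{ "index" : { "_index" : "logs", "_type" : "event" } }
{ "message" : "first record" }
{ "index" : { "_index" : "logs", "_type" : "event" } }
{ "message" : "second record" }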
04-18-2016
04:32 PM
1 Kudo
Oregon is US_WEST_2, i.e. region code us-west-2 (per http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region).
04-15-2016
05:17 PM
3 Kudos
While it is quite convenient from an API perspective, Spark is a very heavy solution for inferring schemas from individual CSV and JSON files, unless they are very large. A better approach is to use NiFi to infer the schema. The latest version of HDF includes the InferAvroSchema processor, which will take CSV or JSON files and attach an Avro schema to the flow file as an attribute. You can then use this with the Convert processors to get schema-based data into a database, or onto HDFS, for example.
04-08-2016
03:41 PM
1 Kudo
To configure log levels, add the following to your spark-submit:
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"
This assumes you have a file called log4j-spark.properties on the classpath (usually in the resources of the project you're using to build the jar). This log4j configuration can then control the verbosity of Spark's logging. I usually use something derived from the Spark default, with some customisation like:
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.sql=WARN
# Logging for this application
log4j.logger.com.myproject=INFO
Something else to note here is that in yarn-cluster mode, all your important logs (especially the executor logs) will be aggregated by YARN when the application finishes. You can retrieve them with:
yarn logs -applicationId <application>
This will show you all the logs according to your configured levels.
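Putting the pieces together, a submission might look something like this (the deploy mode, class and jar names are illustrative):
spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties" \
  --class com.myproject.Main my-project.jar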
04-04-2016
12:54 AM
2 Kudos
Exactly as you would in a Hive CTAS:
%spark
sc.parallelize(1 to 10).toDF.registerTempTable("pings_raw")
%sql
create table pings
location '/path/to/pings'
as select * from pings_raw
03-31-2016
11:59 PM
1 Kudo
The --jars option is comma delimited, not colon delimited. Hence the error: it says it can't find a single file with a very long name. Change your ':' to ',' and you should get further.
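For example (the paths, jar and class names are illustrative):
spark-submit --jars /path/to/dep1.jar,/path/to/dep2.jar --class com.example.MyApp my-app.jar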
03-30-2016
09:30 AM
1 Kudo
GraphX works by loading an entire graph into a combination of VertexRDDs and EdgeRDDs, so the underlying database's capabilities are not really relevant to the graph computation, since GraphX won't touch it beyond the initial load. On that basis you can really use anything that will effectively store and scan a list of paired tuples, and a list of ids and other properties. From this perspective HBase or Accumulo would seem like a good bet to attach Spark to, but of course any file in HDFS would do. For the ability to modify a graph prior to analysing it in GraphX, it's more useful to pick a 'proper' graph database. For this it's worth looking at something like Accumulo Graph, which provides a graph database hosted on Accumulo, or possibly another very exciting new project, Gaffer (https://github.com/GovernmentCommunicationsHeadquarters/Gaffer), also hosted on Accumulo. Tinkerpop (https://tinkerpop.apache.org/) provides some other options focussed around the common Gremlin APIs, which are generally well understood in the graph world. Other options might include something like TitanDB hosted on HBase. These will provide you with an interface API for modifying graphs effectively.
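As a minimal sketch of the load-then-compute pattern, assuming a whitespace-delimited "srcId dstId" edge list already sits in HDFS (the path is illustrative):
import org.apache.spark.graphx.GraphLoader

// Load the edge list into GraphX's EdgeRDD/VertexRDD representation
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// After the initial load the source store is no longer touched; e.g. PageRank runs entirely in memory
val ranks = graph.pageRank(0.0001).vertices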
03-29-2016
02:39 PM
1 Kudo
Ultimately in Spark, if you want to write to different tables you will probably end up writing different DataFrames / RDDs from the filter. Alternatively, you could write to multiple HBase tables based on a column within a mapPartitions call; that way you can have a generic HBase write function that is parameterised by the column, but you may have issues achieving nice batch sizes for each HBase write. If your use case is simply routing a stream of records from another source to parameterised locations in HBase, NiFi may actually be a better fit, but if you are doing significant or complex batch processing before landing into HBase, the Spark approach will help.
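A rough sketch of the parameterised approach, assuming an RDD of (table, rowKey, value) string tuples and a hard-coded column family; the RDD shape, family and qualifier names are illustrative and error handling is omitted:
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

rdd.foreachPartition { partition =>
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  try {
    // Group the partition by the routing column so each table gets a single batched put
    partition.toSeq.groupBy(_._1).foreach { case (tableName, rows) =>
      val table = connection.getTable(TableName.valueOf(tableName))
      val puts = rows.map { case (_, rowKey, value) =>
        new Put(Bytes.toBytes(rowKey))
          .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      }
      table.put(puts.asJava)
      table.close()
    }
  } finally {
    connection.close()
  }
}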
03-24-2016
02:06 PM
1 Kudo
RDDs do not really have fields per se, unless, for example, you have an RDD of Row objects. You would usually filter on an index:
rdd.filter(x => x(1) == "thing")
(example in Scala for clarity; the same thing applies to Java)
If you have an RDD of a typed object, the same thing applies, but you can use a getter, for example, in the lambda / filter function.
If you want field references you would do better with the DataFrames API. Compare for example:
sqlContext.createDataFrame(rdd, schema).where($"fieldname" === "thing")
(see Spark Official Docs - DataFrames for more on the schema)