Member since: 09-15-2015
Posts: 116
Kudos Received: 141
Solutions: 40

My Accepted Solutions
Views | Posted |
---|---|
1829 | 02-05-2018 04:53 PM |
2371 | 10-16-2017 09:46 AM |
2058 | 07-04-2017 05:52 PM |
3084 | 04-17-2017 06:44 PM |
2258 | 12-30-2016 11:32 AM |
05-17-2016
05:29 PM
1 Kudo
You can do this with YARN ACLs against the relevant YARN queues. A better way of implementing this is to use Ranger to set YARN access controls on a given set of queues. You would then require Spark users to specify a queue when they submit their jobs, and the authorization would be enforced. To ensure this, prevent user access to the default queue.
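A Spark job would then be submitted against one of the allowed queues along these lines (the queue, class and jar names here are only illustrative):
spark-submit --master yarn --queue analytics --class com.example.MyApp my-app.jar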
05-04-2016
11:58 PM
2 Kudos
The NiFi processor does in fact use the Elasticsearch BulkRequest Java API, so even if you set the batch size to 1 you will be using batch loading from the ES perspective. If you want to send in a file with multiple JSON records, you have two choices. You could use InvokeHttp to post to the REST API directly. For a more NiFi-centric solution, use SplitText to divide up the JSON records, and then process them with a sensible Batch Size using the PutElasticsearch processor. This gives you good control, especially if you need to use back-pressure to even out the flow and protect the ES cluster.
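If you go the InvokeHttp route, the body posted to the _bulk endpoint is newline-delimited JSON, with an action line preceding each document; roughly like this (the index, type and field names are purely illustrative):
{ "index" : { "_index" : "logs", "_type" : "event" } }
{ "message" : "first record" }
{ "index" : { "_index" : "logs", "_type" : "event" } }
{ "message" : "second record" }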
04-18-2016
04:32 PM
1 Kudo
Oregon is US_WEST_2, i.e. region code us-west-2 (per http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region).
04-15-2016
05:17 PM
3 Kudos
While it is quite convenient from an API perspective, Spark is a very heavy solution for inferring schemas from individual CSV and JSON files, unless they are very large. A better approach is to use NiFi to infer the schema. The latest version of HDF includes the InferAvroSchema processor, which will take CSV or JSON files and attach an Avro schema to the flow file as an attribute. You can then use this with the Convert processors to get schema-based data into a database, or onto HDFS, for example.
04-08-2016
03:41 PM
1 Kudo
To configure log levels, add the following to your spark-submit:
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"
This assumes you have a file called log4j-spark.properties on the classpath (usually in the resources of the project you're using to build the jar). This log4j configuration can then control the verbosity of Spark's logging. I usually use something derived from the Spark default, with some customisation like:
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.sql=WARN
# Logging for this application
log4j.logger.com.myproject=INFO
Something else to note here is that in yarn-cluster mode, all your important logs (especially the executor logs) will be aggregated by YARN when the application finishes. You can retrieve them with:
yarn logs -applicationId <application>
This will show you all the logs according to your configured levels.
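Putting the pieces together, a submission might look something like this (the deploy mode, class and jar names are illustrative):
spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties" \
  --class com.myproject.Main my-project.jar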
04-04-2016
12:54 AM
2 Kudos
Exactly as you would in a Hive CTAS:
%spark
sc.parallelize(1 to 10).toDF.registerTempTable("pings_raw")
%sql
create table pings
location '/path/to/pings'
as select * from pings_raw
03-31-2016
11:59 PM
1 Kudo
The --jars option is comma delimited, not colon delimited. Hence the error: it says it can't find a single file with a very long name. Change your ':' to ',' and you should get further.
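For example (the paths, jar and class names are illustrative):
spark-submit --jars /path/to/dep1.jar,/path/to/dep2.jar --class com.example.MyApp my-app.jar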
03-30-2016
09:30 AM
1 Kudo
GraphX works by loading an entire graph into a combination of VertexRDDs and EdgeRDDs, so the underlying database's capabilities are not really relevant to the graph computation, since GraphX won't touch it beyond the initial load. On that basis you can really use anything that will effectively store and scan a list of paired tuples, and a list of ids and other properties. From this perspective HBase or Accumulo would seem like a good bet to attach Spark to, but of course any file in HDFS would do. For the ability to modify a graph prior to analysing it in GraphX, it's more useful to pick a 'proper' graph database. For this it's worth looking at something like Accumulo Graph, which provides a graph database hosted on Accumulo, or possibly another very exciting new project, Gaffer (https://github.com/GovernmentCommunicationsHeadquarters/Gaffer), also hosted on Accumulo. Tinkerpop (https://tinkerpop.apache.org/) provides some other options focussed around the common Gremlin APIs, which are generally well understood in the graph world. Other options might include something like TitanDB hosted on HBase. These will provide you with an interface API for modifying graphs effectively.
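As a minimal sketch of the load-then-compute pattern, assuming a whitespace-delimited "srcId dstId" edge list already sits in HDFS (the path is illustrative):
import org.apache.spark.graphx.GraphLoader

// Load the edge list into GraphX's EdgeRDD/VertexRDD representation
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// After the initial load the source store is no longer touched; e.g. PageRank runs entirely in memory
val ranks = graph.pageRank(0.0001).vertices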
03-29-2016
02:39 PM
1 Kudo
Ultimately in Spark, if you want to write to different tables you will probably end up writing different DataFrames / RDDs from the filter. Alternatively, you could write to multiple HBase tables based on a column within a mapPartitions call; that way you can have a generic HBase write function that is parameterised by the column, but you may have issues achieving nice batch sizes for each HBase write. If your use case is simply routing a stream of records from another source to parameterised locations in HBase, NiFi may actually be a better fit, but if you are doing significant or complex batch processing before landing into HBase, the Spark approach will help.
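A rough sketch of the parameterised approach, assuming an RDD of (table, rowKey, value) string tuples and a hard-coded column family; the RDD shape, family and qualifier names are illustrative and error handling is omitted:
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

rdd.foreachPartition { partition =>
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  try {
    // Group the partition by the routing column so each table gets a single batched put
    partition.toSeq.groupBy(_._1).foreach { case (tableName, rows) =>
      val table = connection.getTable(TableName.valueOf(tableName))
      val puts = rows.map { case (_, rowKey, value) =>
        new Put(Bytes.toBytes(rowKey))
          .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      }
      table.put(puts.asJava)
      table.close()
    }
  } finally {
    connection.close()
  }
}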
03-24-2016
02:06 PM
1 Kudo
RDDs do not really have fields per se, unless, for example, you have an RDD of Row objects. You would usually filter on an index:
rdd.filter(x => x(1) == "thing")
(example in Scala for clarity; the same thing applies to Java)
If you have an RDD of a typed object, the same thing applies, but you can use a getter, for example, in the lambda / filter function.
If you want field references you would do better with the DataFrames API. Compare for example:
sqlContext.createDataFrame(rdd, schema).where($"fieldname" === "thing")
(see Spark Official Docs - DataFrames for more on the schema)