Posts: 1973 | Kudos Received: 1225 | Solutions: 124
05-25-2016
04:30 PM
1 Kudo
Twitter has open sourced another real-time, distributed, fault-tolerant stream processing engine called Heron. They see it as the successor to Storm, and it is backwards compatible with Storm's topology API. First I followed the getting started guide, downloading and installing on Mac OS X:
➜ Downloads ./heron-client-install-0.14.0-darwin.sh --user
Heron client installer
----------------------
Uncompressing......
Heron is now installed!
Make sure you have "/usr/local/bin" in your path.
See http://heronstreaming.io/docs/getting-started.html for how to use Heron.
heron.build.version : 0.14.0
heron.build.time : Tue May 24 22:44:01 PDT 2016
heron.build.timestamp : 1464155053000
heron.build.host : tw-mbp-kramasamy
heron.build.user : kramasamy
heron.build.git.revision : be87b09f348e0ed05f45503340a2245a4ef68a35
heron.build.git.status : Clean
➜ Downloads export PATH=$PATH:/usr/local/bin
➜ Downloads ./heron-tools-install-0.14.0-darwin.sh --user
Heron tools installer
---------------------
Uncompressing......
Heron Tools is now installed!
Make sure you have "/usr/local/bin" in your path.
See http://heronstreaming.io/docs/getting-started.html for how to use Heron.
heron.build.version : 0.14.0
heron.build.time : Tue May 24 22:44:01 PDT 2016
heron.build.timestamp : 1464155053000
heron.build.host : tw-mbp-kramasamy
heron.build.user : kramasamy
heron.build.git.revision : be87b09f348e0ed05f45503340a2245a4ef68a35
heron.build.git.status : Clean
http://twitter.github.io/heron/docs/getting-started/
Run the example topology to make sure everything is installed:
heron submit local ~/.heron/examples/heron-examples.jar com.twitter.heron.examples.ExclamationTopology ExclamationTopology
[2016-05-25 16:16:32 -0400] com.twitter.heron.scheduler.local.LocalLauncher INFO: For checking the status and logs of the topology, use the working directory /Users/tspann/.herondata/topologies/local/tspann/ExclamationTopology
INFO: Topology 'ExclamationTopology' launched successfully
INFO: Elapsed time: 4.722s.
heron activate local ExclamationTopology
[2016-05-25 16:19:38 -0400] com.twitter.heron.spi.utils.TMasterUtils SEVERE: Topology is already activateed
INFO: Successfully activated topology 'ExclamationTopology'
INFO: Elapsed time: 2.739s.
Run the UI:
sudo heron-ui
25 May 2016 16:20:31-INFO:main.py:101: Listening at http://192.168.1.5:8889
25 May 2016 16:20:31-INFO:main.py:102: Using tracker url: http://localhost:8888
To avoid stepping on HDP ports, I changed the tracker port:
sudo heron-tracker --port 8881
25 May 2016 16:24:14-INFO:main.py:183: Running on port: 8881
25 May 2016 16:24:14-INFO:main.py:184: Using config file: /usr/local/herontools/conf/heron_tracker.yaml
Look at the tracker's topologies endpoint: http://localhost:8881/topologies
{"status": "success", "executiontime": 4.291534423828125e-05, "message": "", "version": "1.0.0", "result": {}}
Let's run the UI against that tracker:
sudo heron-ui --port 8882 --tracker_url http://localhost:8881
25 May 2016 16:28:53-INFO:main.py:101: Listening at http://192.168.1.5:8882
25 May 2016 16:28:53-INFO:main.py:102: Using tracker url: http://localhost:8881
Look at the Heron clusters: http://localhost:8881/clusters
{"status": "success", "executiontime": 1.9073486328125e-05, "message": "", "version": "1.0.0", "result": ["localzk", "local"]}
Using the Heron CLI:
heron
usage: heron <command> <options> ...
Available commands:
activate Activate a topology
deactivate Deactivate a topology
help Prints help for commands
kill Kill a topology
restart Restart a topology
submit Submit a topology
version Print version of heron-cli
Getting more help:
heron help <command> Prints help and options for <command>
For detailed documentation, go to http://heronstreaming.io
If you need to restart a topology:
heron restart local ExclamationTopology
INFO: Successfully restarted topology 'ExclamationTopology'
INFO: Elapsed time: 3.928s.
Look at my topology: http://localhost:8881/topologies#/all/all/ExclamationTopology
{
"status": "success", "executiontime": 7.104873657226562e-05, "message": "",
"version": "1.0.0",
"result": {"local": {"default": ["ExclamationTopology"]}}
}
Adding --verbose will add a ton of debug logs. Attached are some screenshots. The Heron UI is decent. I am hoping the Heron screens will be integrated into Ambari.
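When you are finished, the topology can be taken down with the deactivate and kill commands from the CLI list above (a sketch, using the same local cluster and topology name as before):
heron deactivate local ExclamationTopology
heron kill local ExclamationTopology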
05-25-2016
01:46 PM
Was there anything on the Spark History Server or in the logs?
05-24-2016
06:33 PM
In the Zeppelin setup installed with HDP 2.4 it's available. Just run your queries:
%sql
select * from hivetable
%hive
select * from hivetable
You should be able to connect to the Hive interpreter in the standard way.
05-23-2016
03:07 PM
With PXF, you can do anything. I'm not sure Pig + HAWQ is the best combo, though.
05-22-2016
08:29 PM
1 Kudo
Create a Hive table stored as ORC through Spark SQL in Zeppelin:
%sql
create table default.logs_orc_table (clientIp STRING, clientIdentity STRING, user STRING, dateTime STRING, request STRING, statusCode INT, bytesSent FLOAT, referer STRING, userAgent STRING) stored as orc
Load data from a DataFrame into this table:
%sql
insert into table default.logs_orc_table select t.* from accessLogsDF t
I can create a table in the Hive View from Ambari:
CREATE TABLE IF NOT EXISTS survey
( firstName STRING, lastName STRING, gender STRING,
phone STRING, email STRING,
address STRING,
city STRING,
postalcode STRING,
surveyanswer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Then it's really easy to load some data from a CSV file:
LOAD DATA INPATH '/demo/survey1.csv' OVERWRITE INTO TABLE survey;
I can create an ORC-based table in Hive from the Hive View in Ambari, or from the Spark, Spark SQL, or Hive areas in Zeppelin:
create table survey_orc(
firstName varchar(255),
lastName varchar(255),
gender varchar(255),
phone varchar(255),
email varchar(255),
address varchar(255),
city varchar(255),
postalcode varchar(255),
surveyanswer varchar(255)
) stored as orc tblproperties
("orc.compress"="NONE");
I can do the same insert from Hive:
%hive
insert into table default.survey_orc select t.* from survey t
I can query Hive tables from Spark SQL or Hive easily.
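For example, once the ORC table is loaded, it can be queried from either the %sql or %hive interpreter in Zeppelin (a sketch; the aggregate below is just an illustration, not from the original post):
%sql
select surveyanswer, count(*) from default.survey_orc group by surveyanswer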
05-20-2016
03:20 PM
I will give this a try and I'll post the results. For Windows and DBVisualizer, there's an article with step-by-step details: DBVisualizer Windows.
For Tableau: http://kb.tableau.com/articles/knowledgebase/connecting-to-hive-server-2-in-secure-mode
For SQuirreL SQL: https://community.hortonworks.com/questions/17381/hive-with-dbvisualiser-or-squirrel-sql-client.html
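All of these clients end up using the standard HiveServer2 JDBC URL; a sketch for a Kerberos-secured cluster (the host, port, and principal below are placeholders, not values from this thread):
jdbc:hive2://hiveserver-host:10000/default;principal=hive/_HOST@EXAMPLE.COM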
05-19-2016
11:31 PM
2 Kudos
I have used Apache Ignite with Spark 1.6 on a standalone cluster, and it did provide acceleration for file reading.
05-19-2016
04:33 PM
3 Kudos
Example log line:
94.158.95.124 - - [24/Feb/2016:00:11:58 -0500] "GET / HTTP/1.1" 200 91966 "http://creativelabs.biz" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0"
SBT build (build.sbt):
name := "Logs"
version := "1.0"
scalaVersion := "2.10.6"
jarName in assembly := "Logs.jar"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-sql_2.10" % "1.6.1" % "provided"
libraryDependencies += "com.databricks" %% "spark-avro" % "2.0.1"
Scala program pieces:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.{KryoSerializer, KryoRegistrator}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._
case class LogRecord( clientIp: String, clientIdentity: String, user: String, dateTime: String, request:String,statusCode:Int, bytesSent:Long, referer:String, userAgent:String )
object Logs {
val PATTERN = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "(\S+)" "([^"]*)"""".r
def parseLogLine(log: String): LogRecord = {
try {
val res = PATTERN.findFirstMatchIn(log)
if (res.isEmpty) {
println("Rejected Log Line: " + log)
LogRecord("Empty", "-", "-", "", "", -1, -1, "-", "-" )
}
else {
val m = res.get
// NOTE: HEAD does not have a content size.
if (m.group(9).equals("-")) {
LogRecord(m.group(1), m.group(2), m.group(3), m.group(4),
m.group(5), m.group(8).toInt, 0, m.group(10), m.group(11))
}
else {
LogRecord(m.group(1), m.group(2), m.group(3), m.group(4),
m.group(5), m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))
}
}
} catch
{
case e: Exception =>
println("Exception on line:" + log + ":" + e.getMessage);
LogRecord("Empty", "-", "-", "", "-", -1, -1, "-", "-" )
}
}
//// Main Spark Program
def main(args: Array[String]) {
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
Logger.getLogger("com.hortonworks.spark.Logs").setLevel(Level.INFO)
val log = Logger.getLogger("com.hortonworks.spark.Logs")
log.info("Started Logs Analysis")
val sparkConf = new SparkConf().setAppName("Logs")
sparkConf.set("spark.cores.max", "16")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.app.id", "Logs")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "false")
sparkConf.set("spark.suffle.compress", "true")
val sc = new SparkContext(sparkConf)
val logFile = sc.textFile("data/access3.log")
val accessLogs = logFile.map(parseLogLine).filter(!_.clientIp.equals("Empty"))
log.info("# of Partitions %s".format(accessLogs.partitions.size))
try {
println("===== Log Count: %s".format(accessLogs.count()))
accessLogs.take(5).foreach(println)
try {
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df1 = accessLogs.toDF()
df1.registerTempTable("accessLogsDF")
df1.printSchema()
df1.describe("bytesSent").show()
df1.first()
df1.head()
df1.explain()
df1.write.format("avro").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("statusCode").avro("avroresults")
} catch {
case e: Exception =>
log.error("Writing files after job. Exception:" + e.getMessage);
e.printStackTrace();
}
// Calculate statistics based on the content size.
val contentSizes = accessLogs.map(log => log.bytesSent)
val contentTotal = contentSizes.reduce(_ + _)
println("===== Number of Log Records: %s Content Size Total: %s, Avg: %s, Min: %s, Max: %s".format(
contentSizes.count,
contentTotal,
contentTotal / contentSizes.count,
contentSizes.min,
contentSizes.max))
sc.stop()
}
}
}
First, I set up a Scala case class that will hold the parsed record (clientIp, clientIdentity, user, dateTime, request, statusCode, bytesSent, referer, userAgent). Next, I have a regex and a method to parse the logs (one line at a time) into case classes, and I filter out the empty records. I use Spark SQL to examine the data, then write the data out to Avro, and finally do some counts.
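A sketch of how this might be run, assuming the Logs.jar built with sbt assembly above and a local master (the master URL and the data/access3.log path would change on a real cluster):
spark-submit --class Logs --master local[*] target/scala-2.10/Logs.jar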
05-19-2016
01:22 PM
I would like to access SnappyData in-memory data from NiFi (Get and Put). Has anyone tried this? I'm interested in connectors to in-memory stores like Apache Geode, Apache Ignite, SnappyData, Redis, GemFire XD, and Memcached.
Labels: Apache NiFi, Cloudera DataFlow (CDF)
05-19-2016
01:21 PM
I would like to send phone sensor data to NiFi. We could just send REST calls out, but it would be interesting to have NiFi or an agent on the phone. Has anyone tried this?
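For the plain REST route, a minimal sketch: NiFi's ListenHTTP processor exposes an HTTP endpoint that a phone app could POST sensor readings to (the port, path, and JSON payload here are illustrative assumptions, not a tested setup):
curl -X POST -H "Content-Type: application/json" -d '{"device":"phone1","accelX":0.12,"accelY":-0.03,"accelZ":9.81}' http://nifi-host:8081/contentListener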
Labels: Apache NiFi