08-02-2016
12:08 AM
2 Kudos
Tips: Read each processor's documentation carefully; inputs and outputs vary greatly across attributes, JSON, JSONPath statements, and other entries. JSONPath uses a radically different syntax from NiFi Expression Language. Know your Expression Language; it has a lot of useful functions like UUID() and nextInt().

When you are streaming and testing a lot of data, you will find that logs grow huge on your CentOS sandbox. I built up gigabytes of logs in Hive, Hadoop, and NiFi in just a few days of operation. Check your /var/log every few days and watch the Ambari screens for data usage. When your data gets too large, things can start failing or slowing down due to a lack of temporary space, swap space, and space for logs.

Delete test files forever (this bypasses the trash):

hdfs dfs -rm -f -skipTrash /twitter/*.json

Clean up junk files by finding the biggest space consumers:

cd /
du -hsx * | sort -rh | head -10

PutHiveQL requires hiveql.args.1.value and hiveql.args.1.type attributes for every parameterized field. If you are using EvaluateJSONPath, you cannot set the types there; do that in an UpdateAttribute processor afterwards. Then you need to convert your flow's content into SQL for Hive to execute.
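As a minimal sketch of what that looks like (the table, attribute values, and type codes here are illustrative assumptions; the type codes are java.sql.Types constants, e.g. -5 = BIGINT and 12 = VARCHAR), the FlowFile content carries the parameterized statement and the attributes carry the values and types:

INSERT INTO twitter VALUES (?, ?, ?)

hiveql.args.1.value = ${tweet_id}    (set via UpdateAttribute)
hiveql.args.1.type = -5              (java.sql.Types.BIGINT)
hiveql.args.2.value = ${handle}
hiveql.args.2.type = 12              (java.sql.Types.VARCHAR)
hiveql.args.3.value = ${msg}
hiveql.args.3.type = 12              (java.sql.Types.VARCHAR)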
08-01-2016
03:45 AM
I don't have a column called IS_AUTOINCREMENT. That's something that should be standard in JDBC; I wonder if the Hive driver is missing something.
08-01-2016
02:37 AM
How much memory do you have? How much is assigned to Spark? Do you have logging on so you can check the logs and the history UI? Turn off everything else you can. For debugging, run through the Spark shell; Zeppelin adds overhead and takes a decent amount of YARN resources and RAM. Run on Spark 1.6 / HDP 2.4.2 if you can. Allocate as much memory as possible; Spark is an all-memory beast.

sparkConf.set("spark.cores.max", "16") // all the cores you can
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName) // Kryo is faster and more compact than Java serialization
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true") // needed for the history UI
sparkConf.set("spark.app.id", "YourID")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "true")
I like to maximize my resources and performance.
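If you submit from the command line, memory and cores can be set there as well. A minimal sketch (the sizes and jar name are assumptions; tune them to your cluster):

spark-submit --master yarn \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 8 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  your-app.jar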
07-30-2016
08:12 PM
1 Kudo
@dpinkston Not optimal, but this is a nice workaround: use a ReplaceText processor to build the statement.

insert into twitter values
(${tweet_id},
'${handle:urlEncode()}','${hashtag:urlEncode()}',
'${msg:urlEncode()}','${time}',
'${user_name:urlEncode()}','${tweet_id}',
'${unixtime}','${uuid}')

Those are attributes in there. I URL-encode because of quotes and such. I would prefer a prepared statement, a custom processor, or calling a Groovy script, but this works.
07-30-2016
08:12 PM
That did not work.
07-29-2016
09:08 PM
It's very easy to integrate Apache Flume with Apache NiFi, either as a source or a destination (sink). Here is an example of using ExecuteFlumeSource and ExecuteFlumeSink in Apache NiFi. We use a Flume source to read Netcat traffic and send it to Slack, so you can send messages from Linux to a Slack chat. For the Flume sink to HDFS, we read data from a web feed. We can also connect our Flume source to the Flume sink. Configuring a Flume source: if you have written Apache Flume configuration files before, this is the same thing (see the sample configuration below).
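For reference, a minimal Netcat source definition as it would appear in a standard Flume properties file (the agent/source/channel names a1, r1, c1 are assumptions; ExecuteFlumeSource accepts the same style of configuration):

a1.sources = r1
a1.channels = c1
# Netcat source: listens on a TCP port and turns each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory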
Data Received in Slack (Join and See Messages Live)
Checking the Provenance Event
Configure the ExecuteFlumeSink to Store to HDFS:

-rw-r--r-- 3 flume hdfs 28 2016-07-25 16:14 /flume/events/16-07-25/1610/00/events-.1469463242392

To monitor what's going on on the Flume side of the process:

tail -f /var/log/flume/flume-a1.log

25 Jul 2016 16:14:38,854 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.batch.empty == 302
25 Jul 2016 16:14:38,854 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.batch.underflow == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.connection.closed.count == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.connection.creation.count == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.connection.failed.count == 0
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.event.drain.attempt == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.event.drain.sucess == 1
25 Jul 2016 16:14:38,856 INFO [agent-shutdown-hook] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider.stop:83) - Configuration provider stopping
25 Jul 2016 16:14:38,856 INFO [agent-shutdown-hook] (org.apache.flume.source.NetcatSource.stop:190) - Source stopping
25 Jul 2016 16:14:39,357 INFO [agent-shutdown-hook] (org.apache.hadoop.metrics2.sink.flume.FlumeTimelineMetricsSink.stop:74) - Stopping Flume Metrics Sink
Make sure you are not running a regular Flume agent of the same name; they will clash! Using Netcat:

telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Slack, Apache Nifi, Apache Flume, Netcat
OK^]
telnet> quit
Connection closed.

Resources:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.flume.ExecuteFlumeSource/additionalDetails.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.flume.ExecuteFlumeSink/additionalDetails.html
https://community.hortonworks.com/questions/14713/flume-tutorials.html
07-29-2016
03:01 AM
Can I add an HDF/NiFi node to a cluster created in http://hortonworks.github.io/hdp-aws/manage/ using the Hortonworks Cloud?
Labels:
- Hortonworks Cloudbreak
07-28-2016
08:34 PM
5 Kudos
There are a lot of excellent talks from the summit.

Deep Learning
- Apache Spark with Machine Learning like TensorFlow
- Distributed Deep Learning on Hadoop Clusters (Yahoo)

Apache Spark
- Big Data Heterogeneous Mixture Learning on Spark
- Integrating Apache Spark and NiFi for Data Lakes (ThinkBig)

Operations
- Zero Downtime App Deployment Using Hadoop (Hortonworks)
- Debugging YARN Cluster in Production (Hortonworks)
- Yahoo's Experience Running Pig on Tez at Scale
- The DAP: Where YARN, HBase, Kafka and Spark Go to Production (Cask)
- Extend Governance in Hadoop with Atlas Ecosystem (Hortonworks)
- Cost and Resource Tracking for Hadoop (Yahoo)
- Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
- Operating and Supporting Apache HBase: Best Practices and Improvements (Hortonworks)
- Scheduling Policies in YARN (Slides)

Future of Data
- Arun Murthy, Hortonworks - Hadoop Summit 2016 San Jose - #HS16SJ - #theCUBE
- The Future of Hadoop: An Enterprise View (Slides)

Streaming
- The Future of Storm (Hortonworks)
- Streaming ETL for All: Embeddable Data Transformation for Real-Time Streams
- Real-Time, Streaming Advanced Analytics, Approximations, and Recommendations using Apache Spark ML/GraphX, Kafka, Stanford CoreNLP, and Twitter Algebird (Chris Fregly, IBM)
- Fighting Fraud in Real Time by Processing 1M+ TPS Using Storm on Slider/YARN (Rocketfuel)
- Lambda-less Stream Processing at Scale in LinkedIn
- Make Streaming Analytics Work for You: The Devil is in the Details (Hortonworks)
- Lego-Like Building Blocks of Storm and Spark Streaming Pipelines for Rapid IoT and Streaming (StreamAnalytix)
- Performance Comparison of Streaming Big Data Platforms

Machine Learning
- Prescient Keeps Travelers Safe with Natural Language Processing and Geospatial Analytics

IoAT / Internet of Things
- What About Data Storage? (Hortonworks)

YAF (Yet Another Framework)
- Apache Beam: A Unified Model for Batch and Streaming Data Processing (Google)
- Turning the Stream Processor into a Database: Building Online Applications on Streams (Flink / DataArtisans)
- The Next Generation of Data Processing OSS (Google)
- Next Gen Big Data Analytics with Apache Apex

SQL and Friends
- How We Re-Engineered Phoenix with a Cost-Based Optimizer Based on Calcite (Intel and Hortonworks)
- Hive/HBase Metastore: Improving Hive with a Big Data Metadata Storage (Hortonworks)
- Phoenix + HBase: An Enterprise-Grade Data Warehouse Appliance for Interactive Analytics (Hortonworks)
- Presto: What's New in SQL on Hadoop and Beyond (Facebook, Teradata)

DataFlow
- Scalable Optical Character Recognition with Apache NiFi and Tesseract (Hortonworks)
- Building a Smarter Home with NiFi and Spark

General
- It's Time: Launching Your Advanced Analytics Program for Success in a Mature Industry Like Oil and Gas (ConocoPhillips)
- Instilling Confidence and Trust: Big Data Security Governance (Mastercard)
- Hadoop in the Cloud: The What, Why and How from the Experts (Microsoft)
- War on Stealth Cyberattacks that Target Unknown Vulnerabilities
- Hadoop and Cloud Storage: Object Store Integration in Production (Hortonworks)
- There is a New Ranger in Town: End-to-End Security and Auditing in a Big Data as a Service Deployment
- Building a Scalable Data Science Platform with R (Microsoft)
- A Data Lake and a Data Lab to Optimize Operations and Safety Within a Nuclear Fleet
- Reliable and Scalable Data Ingestion at Airbnb
07-28-2016
02:57 PM
6 Kudos
Accessing public social data from Facebook for a company's page is easy:

1. Find your Facebook page, say Hortonworks. Run http://findmyfbid.com/ and get the Page ID (289994161078999) for your page.
2. Create a Facebook application (Add a New Application).
3. Create a Facebook access token in the Graph API Explorer using your application, and test your query there.
4. Create your Facebook Graph API URL: https://graph.facebook.com/v2.7/289994161078999/tagged?access_token=ACCESSTOKENFROMFACEBOOK&limit=100
5. Add the URL to the GetHTTP processor. Because we are using a Facebook app token, the URL is HTTPS/SSL; to access an SSL site in GetHTTP, we need a StandardSSLContextService (Controller Service) with a trust store. For the Sandbox, use the Java SSL trust store. Add the SSL Context Service to the GetHTTP processor.
6. Save to HDFS (PutHDFS).

Download and inspect the result:

[root@sandbox demo]# hdfs dfs -cat /social/facebook1469644415053.json
{
  "data": [
    {
      "message": "Speakers of Crunch Big Data Conference 2016\nCASEY STELLA - Principal Architect of Hortonworks \nTalk: Data Preparation for Data Science: A Field Guide\n\n\"Any data scientist who works with real data will tell you that the hardest part of any data science task is the data preparation. Everything from cleaning dirty data to understanding where your data is missing and how your data is shaped, the care and feeding of your data is a prime task for the working data scientist.\n\nI will describe my experiences in the field and present an open source utility written with Apache Spark to automate some of the necessary but insufficient things that I do every time I'm presented new data. In particular, we'll talk about discovering missing values, values with skewed distributions and discovering likely errors within your data.\"\n\nSee you at Crunch Big Data Conference in 2016!\n#bigdata #dataanalytics #crunchconf #crunch",
      "created_time": "2016-07-22T11:28:03+0000",
      "id": "430175213820486_609858935852112"
    },
    {
      "message": "Get up to date on #Hadoop by checking out Hortonworks top 5 articles on the subject. Then when you need something to monitor your Hadoop, check out Centerity (http://www.centerity.com/big-data-sap-hana/hadoop/)",
      "created_time": "2016-07-12T16:27:00+0000",
      "id": "311930585656230_569713563211263"
    },
    {
      "message": "Hortonworks | Learn how #ApacheMetron detect #bigdata #cybersecurity threat in real-time? SpringPeople is an Authorized Training Partner of Hortonworks and provides hortonworks certified courses: http://bit.ly/29Ibe7G\n\n#hadoop #DataScience",
      "created_time": "2016-07-11T07:26:43+0000",
      "id": "188518004538277_1136733933050008"
    },
    {
      "message": "Learn how to protect your #data lifecycle w/ Hortonworks Data Flow & WANdisco Fusion http://bit.ly/1WO07On",
      "created_time": "2016-06-30T19:30:00+0000",
      "id": "114198121933673_1176359322384209"
    },
    {
      "message": "Hortonworks announces new MSP and ISV programmes #HadoopSummit http://bit.ly/29sXPiI",
      "created_time": "2016-06-30T13:02:28+0000",
      "id": "179830977794_10153982072807795"
    },
    {
      "message": "Breakfast meeting at the Hadoop Summit in San Jose with Vishal Dhanuka of Hortonworks. It's going to be a great day discussing with conference attendees how we can work together to harness the power of big data in healthcare. #HS16SJ",
      "created_time": "2016-06-29T15:09:43+0000",
      "id": "1442034199422403_1596597077299447"
    },
    {
      "message": "#Data lakes need control & safety against failure. That's where we come in http://bit.ly/1WO07On Hortonworks",
      "created_time": "2016-06-29T15:00:01+0000",
      "id": "114198121933673_1176301025723372"
    },

The DataFlow is available for download from GitHub.
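As a quick sanity check outside NiFi, the same Graph API URL can be fetched with curl (a sketch; substitute your own page ID and access token):

curl "https://graph.facebook.com/v2.7/289994161078999/tagged?access_token=ACCESSTOKENFROMFACEBOOK&limit=100"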
07-28-2016
02:02 PM
Andread B, you really want NiFi on a separate server if possible. Sqoop is really fast, as it is designed for accessing RDBMS data. NiFi is a great solution for a continuous feed. See:
https://community.hortonworks.com/questions/25228/can-i-use-nifi-to-replace-sqoop.html
https://community.hortonworks.com/questions/36464/how-to-use-nifi-to-incrementally-ingest-data-from.html
http://www.batchiq.com/database-injest-with-nifi.html
http://funnifi.blogspot.com/2016/04/sql-in-nifi-with-executescript.html