08-02-2016
12:08 AM
2 Kudos
Tips: Read each processor's documentation carefully; inputs and outputs vary greatly across attributes, JSON, JSONPath statements, and other entries. JSONPath uses a radically different syntax from NiFi Expression Language. Know your Expression Language; it has a lot of useful functions like UUID() and nextInt().

When you are streaming and testing a lot of data, you will find that logs grow huge on your CentOS sandbox. I built up gigabytes of logs in Hive, Hadoop, and NiFi in just a few days of operation. Check your /var/log every few days and watch the Ambari screens for data usage. When your data gets too large, things can start failing or slowing down due to a lack of temporary space, swap space, and space for logs.

Delete test files forever (this bypasses the trash):

hdfs dfs -rm -f -skipTrash /twitter/*.json

Clean up junk files by finding the biggest space consumers:

cd /
du -hsx * | sort -rh | head -10

PutHiveQL requires hiveql.args.1.value and hiveql.args.1.type attributes for every parameterized field. If you are using EvaluateJSONPath, you cannot set the types there; do that in an UpdateAttribute processor afterwards. Then you need to convert your flow's content into SQL for Hive to execute.
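As a minimal sketch of what that looks like (the table, attribute values, and type codes here are illustrative assumptions; the type codes are java.sql.Types constants, e.g. -5 = BIGINT and 12 = VARCHAR), the FlowFile content carries the parameterized statement and the attributes carry the values and types:

INSERT INTO twitter VALUES (?, ?, ?)

hiveql.args.1.value = ${tweet_id}    (set via UpdateAttribute)
hiveql.args.1.type = -5              (java.sql.Types.BIGINT)
hiveql.args.2.value = ${handle}
hiveql.args.2.type = 12              (java.sql.Types.VARCHAR)
hiveql.args.3.value = ${msg}
hiveql.args.3.type = 12              (java.sql.Types.VARCHAR)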
08-01-2016
03:45 AM
I don't have a column called IS_AUTOINCREMENT. That's something that should be standard in JDBC; I wonder if the Hive driver is missing something.
08-01-2016
02:37 AM
How much memory do you have? How much is assigned to Spark? Do you have logging on so you can check the logs and the history UI? Turn off everything else you can. For debugging, run through the Spark shell; Zeppelin adds overhead and takes a decent amount of YARN resources and RAM. Run on Spark 1.6 / HDP 2.4.2 if you can. Allocate as much memory as possible; Spark is an all-memory beast.

sparkConf.set("spark.cores.max", "16") // all the cores you can
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName) // Kryo is faster and more compact than Java serialization
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true") // needed for the history UI
sparkConf.set("spark.app.id", "YourID")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "true")
I like to maximize my resources and performance.
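If you submit from the command line, memory and cores can be set there as well. A minimal sketch (the sizes and jar name are assumptions; tune them to your cluster):

spark-submit --master yarn \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 8 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  your-app.jar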
07-30-2016
08:12 PM
1 Kudo
@dpinkston Not optimal, but this is a nice workaround: use a ReplaceText processor to build the statement.

insert into twitter values
(${tweet_id},
'${handle:urlEncode()}','${hashtag:urlEncode()}',
'${msg:urlEncode()}','${time}',
'${user_name:urlEncode()}','${tweet_id}',
'${unixtime}','${uuid}')

Those are attributes in there. I URL-encode because of quotes and such. I would prefer a prepared statement, a custom processor, or calling a Groovy script, but this works.
07-30-2016
08:12 PM
That did not work.
07-29-2016
09:08 PM
It's very easy to integrate Apache Flume with Apache NiFi, either as a source or a destination (sink). Here is an example of using ExecuteFlumeSource and ExecuteFlumeSink in Apache NiFi. We use a Flume source to read Netcat traffic and send it to Slack, so you can send messages from Linux to a Slack chat. For the Flume sink to HDFS, we read data from a web feed. We can also connect our Flume source to the Flume sink. Configuring a Flume source: if you have written Apache Flume configuration files before, this is the same thing (see the sample configuration below).
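For reference, a minimal Netcat source definition as it would appear in a standard Flume properties file (the agent/source/channel names a1, r1, c1 are assumptions; ExecuteFlumeSource accepts the same style of configuration):

a1.sources = r1
a1.channels = c1
# Netcat source: listens on a TCP port and turns each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory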
Data Received in Slack (Join and See Messages Live)
Checking the Provenance Event
Configure the ExecuteFlumeSink to Store to HDFS:

-rw-r--r-- 3 flume hdfs 28 2016-07-25 16:14 /flume/events/16-07-25/1610/00/events-.1469463242392

To monitor what's going on on the Flume side of the process:

tail -f /var/log/flume/flume-a1.log

25 Jul 2016 16:14:38,854 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.batch.empty == 302
25 Jul 2016 16:14:38,854 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.batch.underflow == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.connection.closed.count == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.connection.creation.count == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.connection.failed.count == 0
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.event.drain.attempt == 1
25 Jul 2016 16:14:38,855 INFO [agent-shutdown-hook] (org.apache.flume.instrumentation.MonitoredCounterGroup.stop:177) - Shutdown Metric for type: SINK, name: k1. sink.event.drain.sucess == 1
25 Jul 2016 16:14:38,856 INFO [agent-shutdown-hook] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider.stop:83) - Configuration provider stopping
25 Jul 2016 16:14:38,856 INFO [agent-shutdown-hook] (org.apache.flume.source.NetcatSource.stop:190) - Source stopping
25 Jul 2016 16:14:39,357 INFO [agent-shutdown-hook] (org.apache.hadoop.metrics2.sink.flume.FlumeTimelineMetricsSink.stop:74) - Stopping Flume Metrics Sink
Make sure you are not running a regular Flume agent of the same name; they will clash! Using Netcat:

telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Slack, Apache Nifi, Apache Flume, Netcat
OK^]
telnet> quit
Connection closed.

Resources:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.flume.ExecuteFlumeSource/additionalDetails.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.flume.ExecuteFlumeSink/additionalDetails.html
https://community.hortonworks.com/questions/14713/flume-tutorials.html
07-29-2016
03:01 AM
Can I add an HDF/NiFi node to a cluster created in http://hortonworks.github.io/hdp-aws/manage/ using the Hortonworks Cloud?
Labels:
- Hortonworks Cloudbreak
07-28-2016
08:34 PM
5 Kudos
There are a lot of excellent talks from the summit.

Deep Learning
- Apache Spark with Machine Learning like TensorFlow
- Distributed Deep Learning on Hadoop Clusters (Yahoo)

Apache Spark
- Big Data Heterogeneous Mixture Learning on Spark
- Integrating Apache Spark and NiFi for Data Lakes (ThinkBig)

Operations
- Zero Downtime App Deployment Using Hadoop (Hortonworks)
- Debugging YARN Cluster in Production (Hortonworks)
- Yahoo's Experience Running Pig on Tez at Scale
- The DAP: Where YARN, HBase, Kafka and Spark Go to Production (Cask)
- Extend Governance in Hadoop with Atlas Ecosystem (Hortonworks)
- Cost and Resource Tracking for Hadoop (Yahoo)
- Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
- Operating and Supporting Apache HBase: Best Practices and Improvements (Hortonworks)
- Scheduling Policies in YARN (Slides)

Future of Data
- Arun Murthy, Hortonworks - Hadoop Summit 2016 San Jose - #HS16SJ - #theCUBE
- The Future of Hadoop: An Enterprise View (Slides)

Streaming
- The Future of Storm (Hortonworks)
- Streaming ETL for All: Embeddable Data Transformation for Real-Time Streams
- Real-Time, Streaming Advanced Analytics, Approximations, and Recommendations using Apache Spark ML/GraphX, Kafka, Stanford CoreNLP, and Twitter Algebird (Chris Fregly, IBM)
- Fighting Fraud in Real Time by Processing 1M+ TPS Using Storm on Slider/YARN (Rocketfuel)
- Lambda-less Stream Processing at Scale in LinkedIn
- Make Streaming Analytics Work for You: The Devil is in the Details (Hortonworks)
- Lego-Like Building Blocks of Storm and Spark Streaming Pipelines for Rapid IoT and Streaming (StreamAnalytix)
- Performance Comparison of Streaming Big Data Platforms

Machine Learning
- Prescient Keeps Travelers Safe with Natural Language Processing and Geospatial Analytics

IoAT / Internet of Things
- What About Data Storage? (Hortonworks)

YAF (Yet Another Framework)
- Apache Beam: A Unified Model for Batch and Streaming Data Processing (Google)
- Turning the Stream Processor into a Database: Building Online Applications on Streams (Flink / DataArtisans)
- The Next Generation of Data Processing OSS (Google)
- Next Gen Big Data Analytics with Apache Apex

SQL and Friends
- How We Re-Engineered Phoenix with a Cost-Based Optimizer Based on Calcite (Intel and Hortonworks)
- Hive/HBase Metastore: Improving Hive with a Big Data Metadata Storage (Hortonworks)
- Phoenix + HBase: An Enterprise-Grade Data Warehouse Appliance for Interactive Analytics (Hortonworks)
- Presto: What's New in SQL on Hadoop and Beyond (Facebook, Teradata)

DataFlow
- Scalable Optical Character Recognition with Apache NiFi and Tesseract (Hortonworks)
- Building a Smarter Home with NiFi and Spark

General
- It's Time: Launching Your Advanced Analytics Program for Success in a Mature Industry Like Oil and Gas (ConocoPhillips)
- Instilling Confidence and Trust: Big Data Security Governance (Mastercard)
- Hadoop in the Cloud: The What, Why and How from the Experts (Microsoft)
- War on Stealth Cyberattacks that Target Unknown Vulnerabilities
- Hadoop and Cloud Storage: Object Store Integration in Production (Hortonworks)
- There is a New Ranger in Town: End-to-End Security and Auditing in a Big Data as a Service Deployment
- Building a Scalable Data Science Platform with R (Microsoft)
- A Data Lake and a Data Lab to Optimize Operations and Safety Within a Nuclear Fleet
- Reliable and Scalable Data Ingestion at Airbnb
07-28-2016
02:57 PM
6 Kudos
Accessing public social data from Facebook for a company's page is easy:

1. Find your Facebook page, say Hortonworks. Run http://findmyfbid.com/ and get the Page ID (289994161078999) for your page.
2. Create a Facebook application (Add a New Application).
3. Create a Facebook access token in the Graph API Explorer using your application, and test your query there.
4. Create your Facebook Graph API URL: https://graph.facebook.com/v2.7/289994161078999/tagged?access_token=ACCESSTOKENFROMFACEBOOK&limit=100
5. Add the URL to the GetHTTP processor. Because we are using a Facebook app token, the URL is HTTPS/SSL; to access an SSL site in GetHTTP, we need a StandardSSLContextService (Controller Service) with a trust store. For the Sandbox, use the Java SSL trust store. Add the SSL Context Service to the GetHTTP processor.
6. Save to HDFS (PutHDFS).

Download and inspect the result:

[root@sandbox demo]# hdfs dfs -cat /social/facebook1469644415053.json
{
  "data": [
    {
      "message": "Speakers of Crunch Big Data Conference 2016\nCASEY STELLA - Principal Architect of Hortonworks \nTalk: Data Preparation for Data Science: A Field Guide\n\n\"Any data scientist who works with real data will tell you that the hardest part of any data science task is the data preparation. Everything from cleaning dirty data to understanding where your data is missing and how your data is shaped, the care and feeding of your data is a prime task for the working data scientist.\n\nI will describe my experiences in the field and present an open source utility written with Apache Spark to automate some of the necessary but insufficient things that I do every time I'm presented new data. In particular, we'll talk about discovering missing values, values with skewed distributions and discovering likely errors within your data.\"\n\nSee you at Crunch Big Data Conference in 2016!\n#bigdata #dataanalytics #crunchconf #crunch",
      "created_time": "2016-07-22T11:28:03+0000",
      "id": "430175213820486_609858935852112"
    },
    {
      "message": "Get up to date on #Hadoop by checking out Hortonworks top 5 articles on the subject. Then when you need something to monitor your Hadoop, check out Centerity (http://www.centerity.com/big-data-sap-hana/hadoop/)",
      "created_time": "2016-07-12T16:27:00+0000",
      "id": "311930585656230_569713563211263"
    },
    {
      "message": "Hortonworks | Learn how #ApacheMetron detect #bigdata #cybersecurity threat in real-time? SpringPeople is an Authorized Training Partner of Hortonworks and provides hortonworks certified courses: http://bit.ly/29Ibe7G\n\n#hadoop #DataScience",
      "created_time": "2016-07-11T07:26:43+0000",
      "id": "188518004538277_1136733933050008"
    },
    {
      "message": "Learn how to protect your #data lifecycle w/ Hortonworks Data Flow & WANdisco Fusion http://bit.ly/1WO07On",
      "created_time": "2016-06-30T19:30:00+0000",
      "id": "114198121933673_1176359322384209"
    },
    {
      "message": "Hortonworks announces new MSP and ISV programmes #HadoopSummit http://bit.ly/29sXPiI",
      "created_time": "2016-06-30T13:02:28+0000",
      "id": "179830977794_10153982072807795"
    },
    {
      "message": "Breakfast meeting at the Hadoop Summit in San Jose with Vishal Dhanuka of Hortonworks. It's going to be a great day discussing with conference attendees how we can work together to harness the power of big data in healthcare. #HS16SJ",
      "created_time": "2016-06-29T15:09:43+0000",
      "id": "1442034199422403_1596597077299447"
    },
    {
      "message": "#Data lakes need control & safety against failure. That's where we come in http://bit.ly/1WO07On Hortonworks",
      "created_time": "2016-06-29T15:00:01+0000",
      "id": "114198121933673_1176301025723372"
    },

The DataFlow is available for download from GitHub.
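As a quick sanity check outside NiFi, the same Graph API URL can be fetched with curl (a sketch; substitute your own page ID and access token):

curl "https://graph.facebook.com/v2.7/289994161078999/tagged?access_token=ACCESSTOKENFROMFACEBOOK&limit=100"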
07-28-2016
02:02 PM
Andread B, you really want NiFi on a separate server if possible. Sqoop is really fast, as it is designed for accessing RDBMS data. NiFi is a great solution for a continuous feed. See:
https://community.hortonworks.com/questions/25228/can-i-use-nifi-to-replace-sqoop.html
https://community.hortonworks.com/questions/36464/how-to-use-nifi-to-incrementally-ingest-data-from.html
http://www.batchiq.com/database-injest-with-nifi.html
http://funnifi.blogspot.com/2016/04/sql-in-nifi-with-executescript.html