Member since: 09-14-2015
Posts: 47
Kudos Received: 89
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1108 | 09-22-2017 12:32 PM
 | 7730 | 03-21-2017 11:48 AM
 | 527 | 11-16-2016 12:08 PM
 | 781 | 09-15-2016 09:22 AM
 | 1926 | 09-13-2016 07:37 AM
10-26-2017
11:40 PM
4 Kudos
Description

Learn how to consume real-time data from the Satori RTM platform using NiFi.

Background

Satori is a cloud-based live data platform that provides a publish-subscribe messaging service called RTM, and also makes a set of free real-time data feeds available as part of its Open Data Channels initiative:

https://www.satori.com/docs/using-satori/overview
https://www.satori.com/opendata/channels

This article steps through how to consume from Satori's Open Data Channels in NiFi, using a custom NiFi processor. Note: the article assumes you already have a working NiFi instance up and running.

Link to the code on GitHub: https://github.com/laurencedaluz/nifi-satori-bundle

Installing the custom processor

To create the required nar file, simply clone and build the following repo with Maven:

git clone https://github.com/laurencedaluz/nifi-satori-bundle.git
cd nifi-satori-bundle
mvn clean install

This will produce the following .nar file under the nifi-satori-bundle-nar/target/ directory:

nifi-satori-bundle-nar-<version>.nar

Copy this file into the lib directory of your NiFi instance (if using HDF, it is located at /usr/hdf/current/nifi/lib), then restart NiFi for the nar to be loaded.

Using the ConsumeSatoriRtm processor

The ConsumeSatoriRtm processor accepts a number of configuration properties. At a minimum, you will just need to provide the following (which you can get directly from the Satori open channels site):
- Endpoint
- Appkey
- Channel

In this example, I've chosen to consume from the 'big-rss' feed using the configurations provided here: https://www.satori.com/opendata/channels/big-rss (a minimal property sketch is shown below).

That's it! After starting the ConsumeSatoriRtm processor you will see data flowing.
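For reference only, the big-rss configuration would look something like the following; the endpoint and appkey are placeholders that you copy from the Satori channel page rather than real values:

Endpoint: <wss endpoint shown on the big-rss channel page>
Appkey: <appkey from your Satori account>
Channel: big-rss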
Additional Features

The processor also supports Satori's Streamview filters, which allow you to provide SQL-like queries to select, transform, or aggregate messages from a subscribed channel.
In the 'big-rss' example above, the following filter configuration would limit the stream to messages containing the word "jobs":

select * from `big-rss` where feedURL like '%jobs%'
The processor also supports batching multiple messages into a single FlowFile, producing a newline-delimited list of messages in each file (based on a 'minimum batch size' configuration).
Tags: Data Ingestion & Streaming, How-ToTutorial, Kafka, NiFi, satori, streaming, subscribe
09-22-2017
12:32 PM
2 Kudos
@Biswajit Chakraborty A simple way to do this is to use a ReplaceText processor after the TailFile. ReplaceText lets you choose a Replacement Strategy; 'Prepend' inserts the replacement value at the start of each file or line (depending on what you've set 'Evaluation Mode' to). So a ReplaceText with the following configuration will give you what you need:

Replacement Value: <your hostname or filename>
Replacement Strategy: Prepend
Evaluation Mode: Line-by-Line
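If you want the prepended value to be dynamic, NiFi Expression Language can be used in the Replacement Value. A small sketch, assuming the standard 'filename' core attribute carries the tailed file's name and using '|' as an arbitrary delimiter:

Replacement Value: ${hostname()}|${filename}|
Replacement Strategy: Prepend
Evaluation Mode: Line-by-Line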
03-21-2017
11:48 AM
4 Kudos
@jpetro416 Another possible solution until the Wait and Notify processors are available: you can use MergeContent configured to only merge when it hits 2 files (which may work for you because you don't need the final flow files and just want a trigger in this case). The first flow trigger will be held in the queue before the MergeContent, and will only be merged once the second flow trigger has arrived. The merged flowfile that comes out of the MergeContent will be your final trigger. Note, I used the following configurations to get this to work:
- Back pressure on both 'success' flows going into MergeContent (set to only allow 1 file in the queue at a time before the merge)
- MergeContent configured with:
  - Minimum Number of Entries = 2
  - Maximum Number of Entries = 2
  - Correlation Attribute Name = 'my-attribute' (I added this attribute in each UpdateAttribute processor - this is only required if you want multiple triggers going into the MergeContent and want to control the final trigger based on a correlation attribute)

I've attached the xml flow template for your reference: mergecontent-trigger-wait.xml
03-21-2017
11:02 AM
1 Kudo
@Shikhar Agarwal Have you tried reading the Hive table using a SparkSession instead? The following works for me using Spark 2 on HDP 2.5.3:

/usr/hdp/current/spark2-client/bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val hiveSession = SparkSession
.builder()
.appName("Spark Hive Example")
.enableHiveSupport()
.getOrCreate()
// Run the query through the Hive-enabled session
val hiveResult = hiveSession.sql("SELECT * FROM timesheet LIMIT 10")
hiveResult.show()
Details for this can be found in the Apache Spark docs: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
03-08-2017
12:31 PM
2 Kudos
@Michał Kabocik You can assign access control filters to the shiro [urls] section to restrict access to certain roles. In your case it would look something like this:

[urls]
# authentication method and access control filters
/api/version = anon
/** = authc, roles[role1]
You can even take it a step further and limit access to a subset of configuration pages within Zeppelin (i.e., lock down the 'Interpreter' or 'Configurations' pages to admins only):

[urls]
# authentication method and access control filters
/api/version = anon
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
#/** = anon
/** = authc

See the following documentation for an example Zeppelin shiro config that does this: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_zeppelin-component-guide/content/zepp-shiro-ref.html
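For reference, the role names used above have to be defined somewhere. A minimal sketch using shiro's built-in [users] and [roles] sections (the usernames, passwords, and role names are placeholders; in most deployments the roles would instead come from LDAP/AD group mapping):

[users]
# username = password, role1, role2...
admin1 = changeme, admin
analyst1 = changeme, role1

[roles]
admin = *
role1 = *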
12-22-2016
03:09 AM
1 Kudo
@sprakash That is correct - NiFi LDAP authentication only works over HTTPS. If you want to secure user access via LDAP in NiFi, you need to configure NiFi in HTTPS mode and remove HTTP access. Check out the following article for details of configuring LDAP auth via NiFi: https://community.hortonworks.com/articles/7341/nifi-user-authentication-with-ldap.html
12-15-2016
03:11 PM
3 Kudos
Auto-recovery in Ambari is a useful way of getting cluster components restarted automatically in the event that a component fails, without the need for human intervention. Ambari 2.4.0 introduced dynamic auto-recovery, which allows auto-start properties to be configured without an ambari-agent / ambari-server restart. Currently, the simplest way to manage the auto-recovery features in Ambari is via the REST API (documented in this article), although ongoing work in the community will bring the feature to the UI: https://issues.apache.org/jira/browse/AMBARI-2330

Check Auto-Recovery Settings

To check whether auto-recovery is enabled for all components, run the following command on the Ambari server node:

curl -u admin:<password> -i -H 'X-Requested-By: ambari' -X GET http://localhost:8080/api/v1/clusters/<cluster_name>/components?fields=ServiceComponentInfo/component_name,ServiceComponentInfo/service_name,ServiceComponentInfo/category,ServiceComponentInfo/recovery_enabled

Note, you will need to replace <password> and <cluster_name> with your own values. The output of the above command will look something like this:

...
"items" : [
{
"href" : "http://localhost:8080/api/v1/clusters/horton/components/APP_TIMELINE_SERVER",
"ServiceComponentInfo" : {
"category" : "MASTER",
"cluster_name" : "horton",
"component_name" : "APP_TIMELINE_SERVER",
"recovery_enabled" : "false",
"service_name" : "YARN"
}
},
{
"href" : "http://localhost:8080/api/v1/clusters/horton/components/DATANODE",
"ServiceComponentInfo" : {
"category" : "SLAVE",
"cluster_name" : "horton",
"component_name" : "DATANODE",
"recovery_enabled" : "false",
"service_name" : "HDFS"
}
},
...
Notice the "recovery_enabled" : "false" flag on each component.

Enable Auto-Recovery for HDP Components

To enable auto-recovery for a single component (in this case HBASE_REGIONSERVER):

curl -u admin:<password> -H "X-Requested-By: ambari" -X PUT 'http://localhost:8080/api/v1/clusters/<cluster_name>/components?ServiceComponentInfo/component_name.in(HBASE_REGIONSERVER)' -d '{"ServiceComponentInfo" : {"recovery_enabled":"true"}}'

To enable auto-recovery for multiple HDP components:

curl -u admin:<password> -H "X-Requested-By: ambari" -X PUT 'http://localhost:8080/api/v1/clusters/<cluster_name>/components?ServiceComponentInfo/component_name.in(APP_TIMELINE_SERVER,DATANODE,HBASE_MASTER,HBASE_REGIONSERVER,HISTORYSERVER,HIVE_METASTORE,HIVE_SERVER,INFRA_SOLR,LIVY_SERVER,LOGSEARCH_LOGFEEDER,LOGSEARCH_SERVER,METRICS_COLLECTOR,METRICS_GRAFANA,METRICS_MONITOR,MYSQL_SERVER,NAMENODE,NODEMANAGER,RESOURCEMANAGER,SECONDARY_NAMENODE,WEBHCAT_SERVER,ZOOKEEPER_SERVER)' -d '{"ServiceComponentInfo" : {"recovery_enabled":"true"}}'

Enable Auto-Recovery for HDF Components

The process is the same for an Ambari-managed HDF cluster. Here is an example of enabling auto-recovery for the HDF services:

curl -u admin:<password> -H "X-Requested-By: ambari" -X PUT 'http://localhost:8080/api/v1/clusters/<cluster_name>/components?ServiceComponentInfo/component_name.in(NIFI_MASTER,ZOOKEEPER_SERVER,KAFKA_BROKER,INFRA_SOLR,LOGSEARCH_LOGFEEDER,LOGSEARCH_SERVER,METRICS_COLLECTOR,METRICS_GRAFANA,METRICS_MONITOR)' -d '{"ServiceComponentInfo" : {"recovery_enabled":"true"}}'

Using an older version of Ambari?

If you're using an older version of Ambari (older than 2.4.0), check out the following Ambari doc for details on how to enable auto-recovery via the ambari.properties file: https://cwiki.apache.org/confluence/display/AMBARI/Recovery%3A+auto+start+components
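As a convenience, if you have jq available you can filter the GET output above down to just the components that still have auto-recovery disabled (a small sketch, assuming the same credentials and cluster name):

curl -s -u admin:<password> -H 'X-Requested-By: ambari' \
  'http://localhost:8080/api/v1/clusters/<cluster_name>/components?fields=ServiceComponentInfo/component_name,ServiceComponentInfo/recovery_enabled' \
  | jq -r '.items[].ServiceComponentInfo | select(.recovery_enabled == "false") | .component_name'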
Tags: Ambari, automation, Cloud & Operations, FAQ, hdp2.5, recovery
11-16-2016
12:08 PM
1 Kudo
@Avijeet Dash I'd suggest using NiFi for this. You can read from the weather API using NiFi's GetHTTP processor, and use NiFi to process the data before loading it into another system (I'm not sure what other components you're using, but NiFi integrates with most systems, so you can use it to load the data directly into HDFS, HBase, Kafka, Cassandra, relational DBs, etc.). Check out the "Fun_with_Hbase" template on the NiFi website to help you get started; it fetches random data from an API call before loading it into HBase: https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates Also check out this article, which uses NiFi to pull data from the Google Finance API: https://community.hortonworks.com/content/kbentry/8422/visualize-near-real-time-stock-price-changes-using.html
10-07-2016
02:22 PM
1 Kudo
@jbarnett To answer your question - no, if you have SSSD configured you do not need to also configure core-site group mapping with LDAP. Regarding your issue, it could be related to the space in your group name - could you try removing the space in 'domain users', or test with a group that doesn't contain any spaces?
09-15-2016
09:22 AM
6 Kudos
@Sunil Mukati Your question is very broad, so if you need more info please be specific. Based on the question tags I'm going to assume you're asking about NiFi integration with Spark, Kafka, Storm, and Solr. The short answer is yes - we can integrate NiFi with other Apache software 🙂 NiFi provides an easy way to stream data between different systems, and has built-in processors for dealing with most of the common Apache stack.

Kafka

NiFi has built-in processors to write data to and read data from Kafka: PutKafka, GetKafka, PublishKafka, ConsumeKafka. I'd suggest checking out the following tutorial to get you started: http://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/

Solr

NiFi has a PutSolrContentStream processor that allows you to stream data directly into a Solr index. Check out the following tutorial, which uses NiFi to index Twitter data directly into Solr and visualises it using Banana: https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html

Spark

Spark doesn't provide a mechanism to have data pushed to it; instead it pulls from other sources. You can integrate NiFi directly with Spark Streaming by exposing an Output Port in NiFi that Spark can consume from (see the sketch below). The following article explains how to set up this integration: https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html Note that a typical streaming architecture involves NiFi pushing data to Kafka, and Spark Streaming (or Storm) reading from Kafka.

Storm

As above, NiFi is typically integrated with Storm with Kafka acting as the message buffer. The realtime event processing tutorial linked above covers the details of building a streaming application with NiFi, Kafka & Storm.
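To illustrate the Output Port approach mentioned under Spark above, here is a rough Scala sketch based on the nifi-spark-receiver module that the linked article describes; the NiFi URL, Output Port name, and batch interval are assumptions for illustration, and the article remains the reference:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.{NiFiDataPacket, NiFiReceiver}

object NiFiSparkStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nifi-spark-streaming-sketch")
    val ssc = new StreamingContext(conf, Seconds(10)) // assumed 10s batch interval

    // Site-to-Site config pointing at the NiFi instance and the Output Port exposed in the flow
    val s2sConfig = new SiteToSiteClient.Builder()
      .url("http://nifi-host:8080/nifi")   // assumed NiFi URL
      .portName("Data for Spark")          // assumed Output Port name
      .buildConfig()

    // Receive FlowFiles from NiFi and work with their content as strings
    val packets = ssc.receiverStream(new NiFiReceiver(s2sConfig, StorageLevel.MEMORY_ONLY))
    val lines = packets.map((p: NiFiDataPacket) => new String(p.getContent))
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}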
09-13-2016
12:46 PM
@Vijay Kumar J NiFi is definitely feasible for production use, and it is perfectly suited for your MongoDB to HBase data movement use case. NiFi is a tool for managing dataflow and integration between systems in an automated and configurable way. It allows you to stream, transform, and sort data, and uses a drag-and-drop UI.

Dealing with failures - NiFi is configurable: when you build your flow you can determine how you want to handle failures. In your case, you could build a flow that retries on failure and sends out an email on failure (this is just an example; how you handle failures for fetching and storing data can be configured however you need).

Executing NiFi on an hourly basis - NiFi isn't like traditional data movement schedulers; flows built in NiFi are 'always-on' and data can be streamed constantly as it is received. That said, NiFi does let you schedule each processor if needed, so in your case you could set your GetMongo processor to run once every hour, and your PutHBaseJSON processor to push data to HBase as soon as it is received from the GetMongo processor.

Check out this tutorial for getting started and building your first flow: http://hortonworks.com/hadoop-tutorial/learning-ropes-apache-nifi/
09-13-2016
07:37 AM
3 Kudos
@Vijay Kumar J Have you considered using Apache NiFi for this? NiFi has built-in processors to work with data in both MongoDB and HBase. You could use NiFi's GetMongo processor followed by the PutHBaseJSON processor to move the data from MongoDB to HBase. Check out the following article for more info on using NiFi to interact with MongoDB: https://community.hortonworks.com/articles/53554/using-apache-nifi-100-with-mongodb.html
09-06-2016
05:08 PM
8 Kudos
@justin kuspa In HDP 2.5, R is provided in Zeppelin via the Livy interpreter. Try using the following:

%livy.sparkr

Note, you will need to make sure you have R installed first. If you haven't already, install it with the following (on all nodes):

yum install R R-devel libcurl-devel openssl-devel

Validate that it was installed correctly:

R -e "print(1+1)"

Once it is installed, test out SparkR in Zeppelin with Livy to confirm it is working:

%livy.sparkr
foo <- TRUE
print(foo)
09-05-2016
04:53 PM
2 Kudos
@Piyush Jhawar The Ranger Hive plugin protects Hive data when it is accessed via HiveServer2. When you access these tables using HCatalog in Pig you are not going through HiveServer2; instead Pig uses the files directly from HDFS (HCatalog is just used to map the table metadata to the HDFS files in this case). In order to protect this data, you should also define a Ranger HDFS policy to protect the underlying HDFS directory that stores the marketingDb.saletable data (typically the table's location under the Hive warehouse directory). To clarify:

- Ranger Hive Plugin - used to protect Hive data when accessed via HiveServer2 (e.g., a user connecting to Hive via JDBC)
- Ranger HDFS Plugin - used to protect HDFS files and directories (suitable if users need to access the data outside of HiveServer2 - Pig, Spark, etc.)
07-05-2016
05:47 AM
2 Kudos
@Manikandan Durairaj Within the PutHDFS processor, you can set the HDFS owner and group using the 'Remote Owner' and 'Remote Group' properties. Note, this will only work if NiFi is running as a user that has HDFS super-user privilege to change owner/group.
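For example (the owner and group values here are placeholders to replace with your own):

Remote Owner: etl_user
Remote Group: hadoop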
06-22-2016
11:37 AM
1 Kudo
@Predrag Minovic - A slight update on Storm: we can run multiple Nimbus servers: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Ambari_Users_Guide/content/ch05s05.html This deals with cases where Nimbus can't be automatically restarted (e.g., disk failure on the node). Details of Nimbus HA are outlined here: http://hortonworks.com/blog/fault-tolerant-nimbus-in-apache-storm/
06-17-2016
01:16 AM
@sankar rao To elaborate on the answer provided by @Artem Ervits: the edge node is typically used to install client tools, so it makes sense to install the AWS S3 CLI on the edge. For adding new users to the cluster, you need to ensure that the new users exist on ALL nodes. The reason is that Hadoop takes its user/group mappings from the UNIX users by default, so for Hadoop to 'know' about the new user you've created on the edge node, that same user id should exist on all nodes.
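A quick way to sanity-check this (using 'newuser' as a placeholder account name):

# On each node: confirm the local account exists
id newuser

# From a client node: confirm the groups Hadoop resolves for the user
hdfs groups newuser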
06-09-2016
10:12 AM
2 Kudos
@KC
Your 'InferAvroSchema' is likely capturing the schema as an attribute called 'inferred.avro.schema' (assuming you followed the tutorial here: https://community.hortonworks.com/articles/28341/converting-csv-to-avro-with-apache-nifi.html )
If that's the case, you can view its output by looking at one of the flowfiles in queue after 'InferAvroSchema' (List queue > select a flowfile > view attributes > view inferred.avro.schema property).
If you want to manually define the schema without changing too much of your flow, you can directly replace your 'InferAvroSchema' processor with an 'UpdateAttribute' processor - within the 'UpdateAttribute', define a new property called inferred.avro.schema and paste in your Avro schema (in JSON format) as the value, along the lines of the sketch below.
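A minimal hand-written Avro schema sketch, assuming a hypothetical CSV with id, name, and amount columns (the record name, field names, and types are placeholders to adapt to your data):

{
  "type": "record",
  "name": "CsvRecord",
  "fields": [
    { "name": "id",     "type": "long" },
    { "name": "name",   "type": "string" },
    { "name": "amount", "type": ["null", "double"], "default": null }
  ]
}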
06-09-2016
09:46 AM
4 Kudos
@KC How are you defining your Avro schema? Typically the 'failed to convert' errors occur when the CSV records don't fit the data types defined in your Avro schema. If you're using the 'InferAvroSchema' processor or the Kite SDK to define the schema, it is possible that the inferred schema isn't a true representation of your data (keep in mind that these methods infer the schema based on a subset of the data, so if your data isn't very consistent they are likely to misinterpret the field types and hit errors during conversion). If you know the data, you can get around this by manually defining the Avro schema based on the actual data types.
04-15-2016
05:35 AM
1 Kudo
@Indrajit swain The reason you can't see the Action option is that you are currently logged in as the 'maria_dev' user, which isn't given full Ambari access by default. You can log in as the 'Admin' user account to change this. Note that the default password for the 'Admin' user in the HDP 2.4 sandbox has been changed. Refer to the following thread for details on resetting the admin account password: https://community.hortonworks.com/questions/20960/no-admin-permission-for-the-latest-sandbox-of-24.html
03-13-2016
03:24 AM
1 Kudo
@zhang dianbo Which web browser are you using? I'm able to download the data using the link you've provided without any issues (using Chrome). If you still can't get the box download to work, here is the file you are looking for (uploaded to this post directly): geolocation.zip
12-08-2015
05:42 AM
5 Kudos
How do we manage authorization control over tables within SparkSQL? Will Ranger enforce existing Hive policies when these Hive tables are accessed via SparkSQL? If not, what is the recommended approach?
Labels:
- Apache HCatalog
- Apache Ranger
- Apache Spark
11-13-2015
06:15 AM
2 Kudos
What are the recommended configurations and suggested tuning parameters when building a cluster for Spark workloads? Do we have a best practices guide around this?
Labels:
- Apache Spark
11-13-2015
05:51 AM
1 Kudo
What is the recommended configuration for the vm.overcommit_memory [0|1|2] and vm.overcommit_ratio settings in sysctl? Looking specifically for a Spark cluster. I found the following link that suggests vm.overcommit_memory should be set to 1 for mapreduce streaming use cases: https://www.safaribooksonline.com/library/view/hadoop-operations/9781449327279/ch04.html Do we have any best practices around this?
Labels:
- Apache Spark
11-03-2015
08:14 PM
@Neeraj The docs are conflicting. This page recommends using HiveContext in secure mode (with Kerberos):
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_installing-kerb-spark.html#spark-kerb-access-hive
while under best practices it states that this is not supported with Kerberos:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_practices-spark.html
10-19-2015
01:40 PM
12 Kudos
Here are some details on setting up WASB as the default filesystem (fs.defaultFS) using HDP 2.3.x on Azure IaaS.

Architecture Considerations

Before using WASB as the defaultFS, it is important to understand the architectural impact this will have on the cluster.

What is WASB? + Pros

Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs that interfaces with data stored in an Azure Blob Storage account. One of the key advantages of using WASB is that it creates a layer of abstraction that enables separation of storage from compute. You can also add/remove/modify files in the Azure blob store without regard to the Hadoop cluster, as directory and file references are checked at runtime for each job. This allows you to better utilise the flexibility of a cloud deployment, as data can persist without the need for a cluster / compute nodes.

You have the following deployment options when using WASB:

- HDFS as defaultFS - Deploy an HDFS cluster and configure an external WASB filesystem to interface with a separate Blob Storage account. In this scenario your default Hadoop data would still be stored on HDFS, and you would have some data stored in a Blob Storage account that can be accessed when needed.
- WASB as defaultFS - Deploy the cluster with WASB as the default filesystem to completely replace HDFS with WASB. In this scenario HDFS still technically exists but will be empty, as all files / folders deployed with the cluster are stored in WASB (blob storage). Note, the steps below describe how to configure a cluster for this approach.

In many ways you can interface with the WASB filesystem as you would with HDFS, by specifying WASB URLs:

wasb://<containername>@<accountname>.blob.core.windows.net/<path>

For example, the following Hadoop FileSystem Shell command against WASB would behave similarly to the same call made to HDFS:

hadoop fs -ls wasb://<containername>@<accountname>.blob.core.windows.net/

However, it is important to note that WASB and HDFS are separate filesystems, and WASB currently has some limitations that should be considered before implementing.

Current Limitations - As at October 2015

The following limitations currently exist within the WASB filesystem:

- WebHDFS is not compatible with WASB
  - This limits the functionality of both Hue and Ambari Views (and any application that interfaces with the WebHDFS APIs)
- Security permissions are persisted in WASB, but are not enforced
  - File owner and group are persisted for all directories and files, but the permission model is not enforced at the filesystem level, and all authorisation occurs at the level of the entire Azure Blob Storage account

When deciding to use WASB as the defaultFS, it is recommended that you first review and assess your security and data access requirements and ensure you can meet them given the above limitations.

Installation + Configuration Steps

Important: There is a known bug in HDP 2.3.0 that prevents non-HDFS filesystems (S3, WASB, etc.) being set as the default filesystem [HADOOP-11618, HADOOP-12304]. For the following steps to work, you will need to use either a patched version of HDP 2.3 or HDP 2.3.2 or later.

To configure WASB as the default filesystem in place of HDFS, the following Azure information is required:

- Azure Storage Account URL
- Container Name
- Storage Access Key

This information is used along with the Hadoop configurations within core-site.xml to configure WASB. For a fresh install, follow the standard documentation for installing HDP via Ambari found here until you get to Customize Services, where you will need to configure your WASB properties. Also note that when using WASB as the defaultFS, you do not need to mount any additional data drives to the servers (as you would for an HDFS cluster).

Customize Services

The following configurations should be modified to configure WASB:

fs.defaultFS
wasb://<containername>@<accountname>.blob.core.windows.net

fs.AbstractFileSystem.wasb.impl
org.apache.hadoop.fs.azure.Wasb

fs.azure.account.key.<accountname>.blob.core.windows.net
<storage_access_key>

Even though WASB will be set as the fs.defaultFS, you still need to define DataNode directories for HDFS. As the intent here is to use WASB as the primary FS, you can set the HDFS DataNode directories to the temporary /mnt/resource mount point that is provided with Azure compute servers, if you only plan to use HDFS for temporary job files.
DataNode Directories
/mnt/resource/Hadoop/hdfs/data
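Pulling the settings above together, a sketch of how the relevant core-site.xml entries end up looking (the container, account, and key values are placeholders to substitute):

<!-- core-site.xml (sketch) -->
<property>
  <name>fs.defaultFS</name>
  <value>wasb://<containername>@<accountname>.blob.core.windows.net</value>
</property>
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
  <name>fs.azure.account.key.<accountname>.blob.core.windows.net</name>
  <value><storage_access_key></value>
</property>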
Outside of these core-site.xml configurations, Hive has the following requirements when working with blob storage on Azure. Point archive and jar files to 'wasb://' instead of 'hdfs://':

templeton.hive.archive = wasb:///hdp/apps/${hdp.version}/hive/hive.tar.gz
templeton.pig.archive = wasb:///hdp/apps/${hdp.version}/pig/pig.tar.gz
templeton.sqoop.archive = wasb:///hdp/apps/${hdp.version}/sqoop/sqoop.tar.gz
templeton.streaming.jar = wasb:///hdp/apps/${hdp.version}/mapreduce/hadoop-streaming.jar
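These templeton.* properties are typically set via Ambari in the WebHCat configuration (webhcat-site.xml); as a sketch of one of the entries above:

<!-- webhcat-site.xml (sketch) -->
<property>
  <name>templeton.hive.archive</name>
  <value>wasb:///hdp/apps/${hdp.version}/hive/hive.tar.gz</value>
</property>
<!-- templeton.pig.archive, templeton.sqoop.archive, and templeton.streaming.jar follow the same pattern -->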
Skip Azure metrics in custom webhcat-site

When WASB is used, metrics collection is enabled by default. For the WebHCat server this causes unnecessary overhead, which we can disable:

fs.azure.skip.metrics = true

Further Reading

- WASB - Apache Docs
- MSDN Blog - Understanding WASB
- MSDN Blog - Why WASB