Member since: 01-11-2016
Posts: 355
Kudos Received: 230
Solutions: 74
04-17-2017 12:50 PM
1 Kudo
Hi @Yahya Najjar You can use the ExtractText processor to extract these fields as attributes. Below is a test I did. This configuration will extract your CSV fields as myfields.1, myfields.2, etc. As you can see in the provenance, this information is added to the flow files as attributes. Below is the complete flow.
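To make the mechanics concrete, here is a hedged sketch of what that extraction does: the dynamic property (here named myfields) holds a regex, and each capture group becomes a flow file attribute myfields.1, myfields.2, and so on. The sample line and regex below are illustrative, not your exact data.

```bash
# Hedged sketch: mimic what an ExtractText dynamic property named 'myfields'
# would do on a 4-column CSV line. Each regex capture group maps to an attribute.
echo 'john,doe,42,paris' | \
  sed -E 's/^([^,]+),([^,]+),([^,]+),([^,]+)$/myfields.1=\1 myfields.2=\2 myfields.3=\3 myfields.4=\4/'
# Output: myfields.1=john myfields.2=doe myfields.3=42 myfields.4=paris
```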
03-13-2017 11:26 AM
3 Kudos
Introduction

NiFi Site-to-Site (S2S) is a communication protocol used to exchange data between NiFi instances or clusters. This protocol is useful for use cases where geographically distributed clusters need to communicate. Examples include:

- IoT: collect data from edge nodes (MiNiFi) and send it to NiFi for aggregation/storage/analysis
- Connected cars: collect data locally by city or country with a local HDF cluster, and send it back to a global HDF cluster in the core data center
- Replication: synchronization between two HDP clusters (on-prem/cloud or primary/DR)

S2S provides several benefits such as scalability, security, load balancing and high availability. More information can be found here.

Context

NiFi can be secured by enabling SSL and requiring users/nodes to authenticate with certificates. However, in some scenarios, customers have secured and unsecured NiFi clusters that need to communicate. The objective of this tutorial is to show two approaches to achieve this. Discussions on having secured and unsecured NiFi clusters in the same application are outside the scope of this tutorial.

Prerequisites

Let's assume that we have already installed an unsecured HDF cluster (Cluster2) that needs to send data to a secured cluster (Cluster1). Cluster1 is a 3-node NiFi cluster with SSL: hdfcluster0, hdfcluster1 and hdfcluster2. We can see the HTTPS in the URLs as well as the connected user 'ahadjidj'. Cluster2 is also a 3-node NiFi cluster, but without SSL enabled: hdfcluster20, hdfcluster21 and hdfcluster22.
Option 1: the lazy option

The easiest way to get data from Cluster2 to Cluster1 is to use a pull method. In this approach, Cluster1 uses a Remote Process Group (RPG) to pull data from Cluster2. We will configure the RPG to use HTTP, and no special configuration is required. However, data will go unencrypted over the network. Let's see how to implement this.

Step 1: configure Cluster2 to generate data

- The easiest way to generate data in Cluster2 is to use a GenerateFlowFile processor. Set the File Size to something different from 0 and the Run Schedule to 60 sec.
- Add an output port to the canvas and call it 'fromCluster2'.
- Connect and start the two processors.

At this point, we can see data being generated and queued before the output port.

Step 2: configure Cluster1 to pull data

- Add an RPG and configure it with the HTTP addresses of the three Cluster2 nodes. Use HTTP as the Transport Protocol and enable transmission.
- Add a PutFile processor to grab the data. Connect the RPG to the PutFile processor and choose the 'fromCluster2' output port when prompted.
- Right-click on the RPG and activate the toggle next to 'fromCluster2'.

We should see flow files coming from the RPG and buffering before the PutFile processor.
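Under the hood, the HTTP pull relies on a few site-to-site properties in nifi.properties on the Cluster2 nodes. A minimal, hedged look at them (property names come from the NiFi admin guide; the path and values shown are illustrative for this unsecured cluster):

```bash
# Hedged sketch: site-to-site related entries in nifi.properties on a Cluster2
# node. The conf path may differ in your install; values are illustrative.
grep '^nifi.remote.input' /etc/nifi/conf/nifi.properties
# nifi.remote.input.host=hdfcluster20
# nifi.remote.input.secure=false
# nifi.remote.input.socket.port=
# nifi.remote.input.http.enabled=true
```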
Option 2: the secure option

The first approach was easy to configure, but data was sent unencrypted over the wire. If we want to leverage SSL and send data encrypted even between the two clusters, we need to generate and use certificates for each node in Cluster2. The only difference here is that we don't activate SSL.

Step 1: generate and add Cluster2 certs

I assume that you already know how to generate certificates for the CA and the nodes and add them to the TrustStore/KeyStore; otherwise, there are several HCC articles that explain how to do it. We need to configure Cluster2 with its certificates:

- Upload each node's certificate to that node and add it to the KeyStore (e.g. keystore.pfx). Also set the KeyStore type and password.
- Upload the CA (Certificate Authority) certificate to each node and add it to the TrustStore (e.g. truststore.jks). Also set the TrustStore type and password.
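If you need a starting point for generating these stores, here is a hedged sketch using the NiFi TLS toolkit and keytool. The hostnames, file names, output directory and password are illustrative assumptions, not the exact commands used for this tutorial.

```bash
# Hedged sketch: one way to produce a keystore/truststore per Cluster2 node with
# the NiFi TLS toolkit in standalone mode. Hostnames and paths are placeholders.
./bin/tls-toolkit.sh standalone -n 'hdfcluster2[0-2]' -o /tmp/cluster2-certs

# Alternatively, import an existing CA certificate into a node's truststore with keytool.
keytool -importcert -noprompt -alias nifi-ca \
    -file nifi-cert.pem \
    -keystore truststore.jks -storetype JKS -storepass changeit
```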
Step 2: configure Cluster2 to push data to Cluster1

In Cluster1, add an input port (toCluster1) and connect it to a PutFile processor. Use a GenerateFlowFile processor to generate data in Cluster2 and an RPG to push data to Cluster1. Here we will use HTTPS addresses when configuring the RPG. Cluster2 should be able to send data to Cluster1 via the toCluster1 input port. However, the RPG shows a Forbidden error.

Step 3: add policies to authorize Cluster2 to use the S2S protocol

The previous error is triggered because the nodes belonging to Cluster2 are not authorized to access Cluster1 resources. To solve the problem, let's do the following configuration:

1) Go to the Users menu in Cluster1 and add a user for each node from Cluster2.

2) Go to the Policies menu in Cluster1, and add each node from Cluster2 to the 'retrieve site-to-site details' policy. At this point, the RPG in Cluster2 is working, but the input port is not visible yet.

3) The last step is editing the input port policy in Cluster1 to authorize nodes from Cluster2 to send data through S2S. Select the toCluster1 input port and click on the key to edit its policies. Add the Cluster2 nodes to the list.

4) Now, go back to Cluster2 and connect the GenerateFlowFile processor to the RPG. The input port should be visible and data starts flowing "securely" 🙂
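As a quick check outside the UI, you can ask a Cluster1 node for its site-to-site details using the client certificate of a Cluster2 node; a hedged sketch (the port, certificate paths and file formats are assumptions for this environment):

```bash
# Hedged sketch: verify that a Cluster2 node is allowed to retrieve site-to-site
# details from Cluster1. Port, certificate/key paths and CA file are placeholders.
curl --cacert nifi-cert.pem \
     --cert hdfcluster20.crt --key hdfcluster20.key \
     https://hdfcluster0:9091/nifi-api/site-to-site
```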
03-12-2017 09:15 PM
2 Kudos
Hi @Joe Harvy The easiest way to achieve this is to pull data from the unsecured cluster rather than push to the secure cluster. You can do this by using an output port in the unsecured cluster and, in the secure cluster, a Remote Process Group that connects to this output port. Since the RPG points to an unsecured cluster, there is no need to configure certificates. The other approach is to configure your unsecured cluster by setting the KeyStore/TrustStore as you did for the secure cluster, but without activating SSL. You will also need to add the nodes in the secure cluster and give them the right to retrieve S2S details (see policies). Edit: I've been asked this question several times by customers, so I wrote a tutorial on these two options: https://community.hortonworks.com/articles/88473/site-to-site-communication-between-secured-https-a.html
01-01-2017 10:23 AM
2 Kudos
Hi @Vivek Sharma Documentation for using S3 is available in the HDC doc page. For instance, there are pages on using S3 with Hive and Spark, as well as a page on performance tuning. Regarding your last question, S3 won't replace HDFS. HDFS is still the default storage system, as explained here: "While Amazon S3 can be used as the source and store for persistent data, it cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS. This is important to know, as the fact that it is accessed with the same APIs can be misleading." That being said, you can absolutely access data directly from S3 without copying it to HDFS. There are examples of how to do this in the Hive and Spark doc pages mentioned above.
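For example, a hedged sketch of reading S3 data in place, without copying it to HDFS first; it assumes the s3a connector is configured with valid credentials, and the bucket, paths and table schema are placeholders:

```bash
# Hedged sketch: working with S3 data directly (no copy to HDFS). Assumes the
# s3a connector is configured with credentials; bucket and paths are placeholders.
hdfs dfs -ls s3a://my-bucket/datasets/events/

# A Hive external table backed directly by S3 (schema is illustrative).
hive -e "CREATE EXTERNAL TABLE events (id STRING, type STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION 's3a://my-bucket/datasets/events/';"
```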
12-27-2016 05:15 PM
Hi @Smr Srid This is already available in HDF 2.1. You can install it (doc) or upgrade your existing cluster (doc)
12-27-2016 04:59 PM
I don't have a ready example for your use case. Look at the documentation I gave you; you can find an example with HBase. You just need to adapt it to your needs. Hope this helps.
12-27-2016 02:52 PM
@Manoj Dhake As with any tool, modifying the database directly is dangerous and can lead to inconsistency. For instance, some operations need to create or modify several pieces of data; if you modify the data directly, you may miss a step along the way. Also, Atlas uses an index store (Solr) in addition to the metadata store (HBase). This index must stay up to date and hold the latest information, which you cannot guarantee when writing to the database directly. Atlas comes with integration points that were developed specifically to let you enrich your data governance and customize your management. These integration points are the safe path to implement your logic: the REST API, as described before, and messaging integration through Kafka (see the same documentation that I previously provided).
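As an illustration of the Kafka integration point, a hedged sketch of watching Atlas entity notifications from the command line; the topic name comes from the Atlas documentation, while the broker address and the Kafka install path are placeholders:

```bash
# Hedged sketch: consume Atlas entity change notifications. The ATLAS_ENTITIES
# topic name is from the Atlas docs; broker address and path are placeholders.
/usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh \
    --bootstrap-server broker1:6667 --topic ATLAS_ENTITIES --from-beginning
```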
12-27-2016 12:10 PM
1 Kudo
Hi @Manoj Dhake I've never tried to implement your use case, but this should be possible using the Atlas API. I do not recommend altering data directly in HBase. You can follow these steps:

1) Create a Hive Table entity for "patient_info_raw" if it doesn't exist in Atlas. Use the REST API call "POST http://<Atlas_Server:Atlas_Port>/api/atlas/entities" where the body is the table's EntityDefinition structure. More information on the REST API can be found in this guide.

2) Create a Hive Table entity for "patient_validated_dataset" if it doesn't exist in Atlas, using the same method as above.

3) Create the lineage: to do this, you need two DataSet entities and a Process instance. You already have the two DataSet entities (your Hive tables), since Hive Table is a subtype of DataSet. For the Process instance, you can use an existing process type or create your own. When you create your process instance, set its inputs and outputs to your Hive tables' GUIDs. This models the lineage and the link between the tables.

I hope this helps you implement your use case. My advice is to read the Atlas REST API guide before implementing this: https://atlas.incubator.apache.org/AtlasTechnicalUserGuide.pdf Abdelkrim
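For illustration, a hedged curl sketch of the calls described above; the host, port, credentials, qualified names and JSON payload file are assumptions, so check the Atlas guide linked above for the exact entity structures:

```bash
# Hedged sketch of the Atlas REST calls described above. Host, port, credentials,
# qualified names and the JSON payload file are illustrative placeholders.
ATLAS="http://atlas-host:21000"

# 1) Look up an existing hive_table entity (e.g. patient_info_raw) by qualified name.
curl -s -u admin:admin \
  "$ATLAS/api/atlas/entities?type=hive_table&property=qualifiedName&value=default.patient_info_raw@cluster"

# 2) Create the Process entity whose inputs/outputs reference the two tables' GUIDs
#    (the lineage link). process_entity.json stands in for the entity definition body.
curl -s -u admin:admin -X POST "$ATLAS/api/atlas/entities" \
  -H 'Content-Type: application/json' \
  -d @process_entity.json
```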
12-26-2016 09:27 AM
@David Sheard Can you ping the other nodes from the Ambari node?
10-12-2016 10:01 AM
2 Kudos
Hi @Stéphane Couzigou Ranger in HDP and Ranger in HDF use different plugins, so NiFi is not available in your HDP 2.5 Ranger. Using the same Ambari and Ranger for HDF and HDP is on the roadmap. Regarding your other question, it should be possible to have the HDP tools and NiFi in the same Ranger if you install it manually. I haven't tried it, but it should work. You can look at these links to get an idea of how to activate all the plugins: https://cwiki.apache.org/confluence/display/RANGER/NiFi+Plugin https://github.com/bbende/apache-ranger-vagrant/tree/master/scripts
In Ranger.env.sh you can see the list of installed plugins: SUPPORTED_COMPONENTS=tag,hdfs,hbase,hive,kms,knox,storm,yarn,kafka,solr,atlas,nifi