Member since: 01-07-2019
Posts: 220
Kudos Received: 23
Solutions: 30
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9957 | 08-19-2021 05:45 AM |
| | 2594 | 08-04-2021 05:59 AM |
| | 1313 | 07-22-2021 08:09 AM |
| | 5056 | 07-22-2021 08:01 AM |
| | 4777 | 07-22-2021 07:32 AM |
08-12-2020
02:06 PM
I am not sure if this is still relevant, but the root cause is shown as: "Cannot migrate key if no previous encryption occurred". I did not find much about this error. Did you perhaps change your encryption settings, or under which conditions did this problem occur?
08-12-2020
01:58 PM
The ListSFTP processor does not actually do anything with the file; it just builds a list of the files that exist. Typically this would then feed into a GetSFTP processor. In the GetSFTP processor you can configure whether the original should be deleted; by default this would indeed happen. https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.GetSFTP/index.html
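As a loose illustration of the same division of labor (listing never touches the files; the fetch step is where an optional delete happens), here is a minimal Python sketch using paramiko. The host, credentials, and paths are hypothetical, and this is not how NiFi implements it internally:

```python
# Illustrative only: the "list" step is read-only; deletion (if any) happens in the "get" step.
import paramiko

HOST, USER, PASSWORD = "sftp.example.com", "user", "secret"  # hypothetical credentials

transport = paramiko.Transport((HOST, 22))
transport.connect(username=USER, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

# "ListSFTP" equivalent: only enumerate remote files, nothing is modified.
remote_files = sftp.listdir("/upload")

# "GetSFTP" equivalent: fetch each file, then optionally delete the original.
delete_original = True  # GetSFTP removes the source by default; set False to keep it
for name in remote_files:
    sftp.get(f"/upload/{name}", f"/tmp/{name}")
    if delete_original:
        sftp.remove(f"/upload/{name}")

sftp.close()
transport.close()
```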
08-12-2020
01:55 PM
I believe the root cause here is likely that NiFi has some limits in how accurately it can store numeric data types internally. If you do not want to lose precision, the best course of action is likely to indeed use a string under the hood. I know an improvement has been requested in this area to allow for greater numeric precision, but at this time I do not know its status. --- Alternatively, if I read your question the wrong way: the solution might also be to explicitly define the column type in Hive before writing, to avoid landing on string where it is not needed.
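To make the precision point concrete, here is a small standalone Python sketch (not NiFi internals) showing how a value held as a double silently loses digits, while keeping it as a string or decimal preserves it:

```python
from decimal import Decimal

original = "1234567890123456789.123456789"  # more digits than a 64-bit double can hold

as_double = float(original)     # lossy: doubles keep roughly 15-17 significant digits
as_decimal = Decimal(original)  # exact: arbitrary-precision decimal built from the string

print(as_double)    # 1.2345678901234568e+18  -> precision is gone
print(as_decimal)   # 1234567890123456789.123456789 -> preserved
```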
08-12-2020
01:43 PM
I am not aware of any bulk update capability through the UI, and at a glance I did not see this option via the API either, so the following may be a reasonable workaround (disclaimer: I have not tried this myself):
1. Export the process group in which you want to update all the queues (perhaps manually update one connection first to see what the change looks like).
2. Write a script to update all the connections in the exported template (see the sketch below).
This would still involve manual steps, but if you have a few groups with 100 queues each, it could save a lot of time. Also, don't hesitate to share whether this worked out.
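As a rough sketch of step 2, assuming the export is a NiFi template XML in which each connection carries `backPressureObjectThreshold` and `backPressureDataSizeThreshold` elements (verify the exact element names against your own export first), something like this could rewrite them in bulk:

```python
# Sketch only: adjust element names/paths after inspecting your own exported template.
import xml.etree.ElementTree as ET

tree = ET.parse("exported_process_group.xml")   # hypothetical file name
root = tree.getroot()

updated = 0
for conn in root.iter("connections"):           # one element per connection in the template
    obj_threshold = conn.find("backPressureObjectThreshold")
    size_threshold = conn.find("backPressureDataSizeThreshold")
    if obj_threshold is not None:
        obj_threshold.text = "20000"            # new object-count back pressure threshold
    if size_threshold is not None:
        size_threshold.text = "2 GB"            # new size-based back pressure threshold
    updated += 1

tree.write("exported_process_group_updated.xml")
print(f"Updated {updated} connections")
```

The updated file could then be re-imported as a template, which keeps the manual steps down to an export and an import.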
07-29-2020
02:27 PM
Thanks, I will think about refining the distinction between Kudu and Druid. Currently I would not want to include the fact that Flink has state as 'storage', but regarding Flink SQL, I may actually write another post later about the ways to interact with/access different kinds of data. (As someone also noticed, Impala is not here either, because it is not a store in itself but works with stored data.)
07-28-2020
01:34 AM
Thanks for this reply. I want to upgrade to version 3.1.4, because downloading the packages for version 3.1.5 requires a login/password that I don't have. Regarding the red flag on Oozie, I think it's due to BUG-123169, which I will try to work around with an upgrade of bigtop-tomcat.
07-27-2020
11:43 AM
The important thing to keep in mind is that NiFi is built for distributed processing. As such, there is essentially one queue per node. The position in the queue is therefore unique within that node, but it is expected that each node will have a message with position 1. Side note: it is therefore also expected that, on a three-node cluster, if you set a queue size of 10000 you will end up with a combined queue size of 30000.
07-25-2020
01:52 AM
There will be multiple form factors available in the future; for now, I will assume you have an environment that contains one Data Hub with NiFi and one Data Hub with both Impala and Kudu. (The answer still works if all are on the same Data Hub.)
Prerequisites
Data Hubs with NiFi and Impala+Kudu
Permission to access these (e.g. add a processor, create table)
Know your Workload User Name (CDP Management Console > Your name (bottom left) > Profile)
You should have set your Workload Password in the same location
Steps to write data from NiFi to Kudu in CDP Public Cloud
Unless mentioned otherwise, I have kept everything at its default settings.
In Kudu Data Hub Cluster:
Gather the FQDNs of the Kudu masters and the ports they use: go to the Data Hub > click Kudu Master > click Masters.
Combine the RPC addresses together in the following format: host1:port,host2:port,host3:port
Example:
```master1fqdn.abc:7051,master2fqdn.abc:7051,master3fqdn.abc:7051```
In HUE on Impala/Kudu Data Hub Cluster:
Run the following query to create the Kudu table (the little triangle runs it):
`CREATE TABLE default.micro (id BIGINT, name STRING, PRIMARY KEY(id)) STORED AS KUDU;`
In NiFi GUI:
Ensure that you have some data in NiFi that fits in the table. Use the `GenerateFlowFile` processor.
In Properties, configure the Custom Text to contain the following (copy carefully, or use Shift+Enter for a newline):
id, name
1,dennis
Add the PutKudu processor and configure it as follows:
- Settings
  - Automatically terminate relationships: tick both success and failure
- Scheduling
  - Run Schedule: 1 sec
- Properties
  - Kudu Masters: the combined list we created earlier
  - Table Name: impala::default.micro
  - Kerberos Principal: your Workload User Name (see prerequisites above)
  - Kerberos Password: your Workload Password
  - Record Reader: Create new service > CSV Reader
  - Kudu Operation Type: UPSERT
Right-click the NiFi canvas > Configure > click the little cogwheel of CSV Reader, set the following property, and then Apply: Treat First Line as Header: True
Click the little lightning bolt of CSV Reader > Enable
On the canvas connect your GenerateFlowFile processor to your PutKudu processor and start the flow.
You should now be able to select your table through HUE and see that a single record has been added: `select * from default.micro`
These are the minimal steps; a more extensive explanation can be found in the Cloudera documentation.
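If you prefer to verify the record outside Hue, a small Python sketch with impyla could run the same query. Treat this purely as a sketch: it assumes direct access to an Impala coordinator on the default port 21050 with LDAP (workload) credentials over TLS, and in a CDP Data Hub the exact endpoint, port, and TLS setup may differ.

```python
# Hypothetical endpoint and credentials; adjust to your Data Hub's Impala coordinator.
from impala.dbapi import connect

conn = connect(
    host="coordinator-fqdn.example.com",  # Impala coordinator FQDN (assumption)
    port=21050,                           # default Impala client port (assumption)
    use_ssl=True,
    auth_mechanism="LDAP",
    user="your_workload_user",
    password="your_workload_password",
)

cur = conn.cursor()
cur.execute("SELECT * FROM default.micro")
print(cur.fetchall())   # expect [(1, 'dennis')] after the flow has run once
```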
Potential refinements
A `DESCRIBE EXTENDED` on the table also exposes the master hostnames. For me this also worked, but I am unsure how safe it is to not explicitly define the ports.
07-23-2020
01:49 PM
1 Kudo
The Cloudera Data Platform (CDP) comes with a wide variety of tools that move data; these are the same in any cloud as well as on-premises. Though there is no formal decision tree, I will summarize the key considerations from my personal perspective. In short, it can be visualized like this:
Steps for finding the right tool to move your data:
1. Staying within Hive and SQL queries suffice? > Hive, otherwise
2. No complex operations (e.g. joins)? > NiFi, otherwise
3. Batch? > Spark, otherwise
4. Already have Kafka Streams/Spark Streaming in use? > Kafka Streams/Spark Streaming, otherwise
5. Flink
Some notes:
- If you can solve it with either NiFi or a more complex solution, use NiFi.
- Use Flink as your streaming engine unless there is a good reason not to; it is the latest generation of streaming engines. Currently I do not recommend using Flink for batch processing yet, but that will likely change soon.
- I did not include tools like Impala, HBase/Phoenix, and Druid, as their main purpose is accessing data.
- This is a basic decision tree; it should cover most situations, but do not hesitate to deviate if your situation asks for it.
Also see my related article: Choose the right place to store your data.
Full Disclosure & Disclaimer: I am an employee of Cloudera, but this is not part of the formal documentation of the Cloudera Data Platform. It is purely based on my own experience of advising people in their choice of tooling.
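Purely as an illustration of the decision steps above (not an official Cloudera rule set), the same logic can be written down as a tiny function; the parameter names are made up for this sketch:

```python
def choose_data_movement_tool(stays_in_hive_and_sql_suffices: bool,
                              needs_complex_operations: bool,
                              is_batch: bool,
                              already_uses_kafka_or_spark_streaming: bool) -> str:
    """Encodes the decision steps from the post; deviate when your situation asks for it."""
    if stays_in_hive_and_sql_suffices:
        return "Hive"
    if not needs_complex_operations:
        return "NiFi"
    if is_batch:
        return "Spark"
    if already_uses_kafka_or_spark_streaming:
        return "Kafka Streams / Spark Streaming"
    return "Flink"

# Example: a streaming job with joins and no existing streaming framework -> Flink
print(choose_data_movement_tool(False, True, False, False))
```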
06-29-2020
02:45 PM
Too cool man, great work!