Member since: 07-19-2018
Posts: 613
Kudos Received: 101
Solutions: 117
My Accepted Solutions
| Views | Posted |
|---|---|
| 5092 | 01-11-2021 05:54 AM |
| 3421 | 01-11-2021 05:52 AM |
| 8788 | 01-08-2021 05:23 AM |
| 8383 | 01-04-2021 04:08 AM |
| 36679 | 12-18-2020 05:42 AM |
08-13-2020
05:11 AM
@ManuN Any way you go about this task, you are going to have to execute commands against the tables to get their sizes. With a large number of tables this should be a script, program, or process. The common method is to query each table with Hive:

```sql
-- gives all properties
show tblproperties yourTableName;

-- show just the raw data size
show tblproperties yourTableName("rawDataSize");
```

The most accurate method is to look at the table location in HDFS:

```
hdfs dfs -du -s -h /path/to/table
```

There are also ways to get this data directly from the Hive Metastore, assuming the table is an internal Hive table. In the past I have completed this with a basic bash/shell script; I have also done something similar in NiFi, which I prefer since it avoids coding. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
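For the Metastore route, a query along these lines can pull sizes for every table in one pass. This is only a sketch: it assumes a MySQL-backed Metastore with the standard schema, and the totalSize/rawDataSize parameters are only present for tables whose statistics have been gathered.

```sql
-- Sketch: list table sizes straight from the Hive Metastore backend database.
-- Run this against the metastore DB itself (e.g. with the mysql client),
-- not through HiveQL; names may differ slightly between metastore versions.
SELECT d.NAME        AS db_name,
       t.TBL_NAME    AS table_name,
       p.PARAM_VALUE AS total_size_bytes
FROM   TBLS t
JOIN   DBS d          ON d.DB_ID  = t.DB_ID
JOIN   TABLE_PARAMS p ON p.TBL_ID = t.TBL_ID
WHERE  p.PARAM_KEY = 'totalSize'
ORDER  BY d.NAME, t.TBL_NAME;
```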
08-13-2020
05:04 AM
1 Kudo
@Seetha This is a very common use case for NiFi and JSON processing pipelines. Here is a link that explains a solution (ExecuteScript) you could use: https://community.cloudera.com/t5/Support-Questions/Apache-Nifi-How-to-calculate-SUM-or-AVERAGE-of-values-in-a/td-p/164131 Additionally, @mburgess links in that post a JIRA for a new processor he was trying to work on at the time. The end result of that JIRA is his recommendation that the QueryRecord processor should give you the ability to calculate the sum. Using QueryRecord you would read the values and craft a SQL query to calculate the sums. Then you would use a RecordWriter to rewrite the original JSON object with the sums, or to create a completely different JSON object with the sums. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
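As an illustration, the value of a QueryRecord dynamic property could look like the sketch below; the field name "amount" is only a placeholder for whatever numeric field your JSON actually carries.

```sql
-- Hypothetical QueryRecord query: the incoming flowfile's records are
-- exposed as the FLOWFILE table, and the configured RecordWriter decides
-- how the single result row is written back out (e.g. as JSON).
SELECT SUM(amount) AS total_amount
FROM   FLOWFILE
```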
08-13-2020
04:53 AM
@ang_coder Depending on the number of unique values you need to add, UpdateAttribute + Expression Language will let you create flowfile attributes based on the table results in what I would call a "manual" way. These attributes can be used for routing, or for further manipulating the content (the original database rows) according to your match logic. For example, with ReplaceText you can replace the original value with the original value plus the new value. Additionally, during your flow you can programmatically change the content of the flowfile to add the new column, either using the attribute from above or with a fabricated query. In the latter case you would use a RecordReader/RecordWriter/UpdateRecord on your data; in a nutshell, you create a transformation of the content that includes adding the new field (see the sketch below). This is a common use case for NiFi and there are many different ways to achieve it. To get a more complete reply that better matches your use case, you should provide more information: sample input data, the expected output data, your flow, a template of your flow, and maybe what you have tried already. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
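One way to realize the "fabricated query" variant is a QueryRecord processor. The sketch below is purely illustrative: matched_value is an assumed attribute name set earlier in the flow, and new_column is a placeholder for your real column name.

```sql
-- Hypothetical QueryRecord query: pass every original column through and
-- append a new column whose value comes from a flowfile attribute
-- (QueryRecord evaluates Expression Language before running the query).
SELECT FLOWFILE.*,
       '${matched_value}' AS new_column
FROM   FLOWFILE
```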
08-12-2020
06:53 AM
@Deenag Yes, this is a typical method to filter out flowfiles based on attributes matching Expression Language. You set up the routes you want and ignore the rest.
08-12-2020
06:49 AM
1 Kudo
@Wilber My suggestion would be not to overwrite the existing files, but to write to a staging location as an external table. Then merge the staging data into the final internal Hive table with:

```sql
INSERT INTO final_table SELECT * FROM staging_table;
```
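Roughly, the pattern looks like the sketch below; the column list, file format, and LOCATION path are placeholders and would need to match how the staging files actually land.

```sql
-- Sketch of the staging-then-merge pattern (names and schema are examples).
CREATE EXTERNAL TABLE staging_table (
  id   INT,
  name STRING
)
STORED AS TEXTFILE
LOCATION '/data/staging/my_dataset';

-- Append the staged rows into the managed (internal) Hive table.
INSERT INTO final_table SELECT * FROM staging_table;
```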
08-12-2020
06:42 AM
@Nidutt You should be able to use NiFi Expression Language in the flow to convert the date integers to ISO timestamps. Here is a template you can use that shows many examples of timestamp formatting: https://github.com/steven-matison/NiFi-Templates/blob/master/Working_with_TimeStamps.xml I think you will find that NiFi attributes remain strings in your flow without a strict date type; after all, an ISO timestamp is really a string, and your endpoint database just knows it is a "timestamp". If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
08-07-2020
04:44 AM
@jloormoreira A couple of suggestions:

- If you are using the HDP sandbox, you should be able to spin up the VM and have Ambari and the services installed out of the box; there is no need to run the cluster install wizard. This is the HDP Sandbox: https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html?utm_source=mktg-tutorial
- If you are installing a cluster yourself, you need to back up and complete the documented steps to install ambari-server and ambari-agent and to set up ambari-server. Install as the root user and do not do anything with "SSL" until the base install is complete.
- Complete the host mapping for your FQDN (/etc/hosts) both on your machine and in the VM, and make sure these entries are right. Make sure the VM hostname is correct and persists across reboots.
- Next, in the VM complete the passwordless auth steps for the node by creating SSH keys (ssh-keygen) and adding the ~/.ssh/id_rsa.pub key to root's ~/.ssh/authorized_keys. Be sure to do the initial login (ssh root@yourfqdn) and answer Y the first time.
- Now you can go to Ambari to install the cluster. During cluster install, use the id_rsa key (not id_rsa.pub) to register the host.

Once these basics are completed, the rest of the install should go fine. My last bit of advice: if the size of the VM you are using is limited, do not try to install all Ambari services. It takes a 16-32 GB instance to run the whole stack, and even that is almost too much for a single node. In the 8-16 GB range you will have issues trying to install everything, so I recommend installing only the basics (YARN, HDFS, Ambari Metrics) plus the other components you need.
08-06-2020
06:34 AM
Python 3 is not supported by any version of Ambari. You can reference the following post for more information: https://community.cloudera.com/t5/Support-Questions/Python3-x-Compatibility-in-HDP-2-6-x-and-3-x/td-p/242185 If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ
08-06-2020
06:27 AM
@Mondi You should be able to enable the HBase plugin by editing the hue.ini file from your admin console and pointing it at the HBase Thrift Server. Reference https://docs.gethue.com/administrator/configuration/connectors/#hbase for the information below. Specify the comma-separated list of HBase Thrift servers for your clusters, in the format "(name|host:port)":

```
[hbase]
hbase_clusters=(Cluster|localhost:9090)
```

In the full reference above there are some additional HBase settings for impersonation and Kerberos.
06-30-2020
04:12 AM
@redmonc2 You should update the post with the input data and a screenshot of your flow to get better responses from your peers; if you provide this info I will update my response below. Without being able to see the input data, I believe you just need to adjust your flow so that you are breaking up the input data into multiple flowfiles. For example, if the input data is lines of dates whose format you want to change, your flow should split the lines with SplitText, then use ExtractText with a regex that captures the entire split content into an attribute called date (${date}). With a date attribute on each flowfile, you can then use your Expression Language in UpdateAttribute: ${date:toDate("ddHHmm:ssMMMyy"):format("yyyy/MM/dd HH:mm:ss")} Once you have the format correct for each date, you can proceed with the dates downstream as attributes, or write them back to the content of the flowfiles and merge them together. If this answer resolves your issue or allows you to move forward, please choose to ACCEPT this solution and close this topic. If you have further dialogue on this topic, please comment here or feel free to private message me. If you have new questions related to your use case, please create a separate topic and feel free to tag me in your post. Thanks, Steven @ DFHZ