Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3133 | 08-25-2017 03:09 PM |
 | 1996 | 08-22-2017 06:52 PM |
 | 3456 | 08-09-2017 01:10 PM |
 | 8131 | 08-04-2017 02:34 PM |
 | 8176 | 08-01-2017 11:35 AM |
11-18-2016
05:29 PM
I believe that is a cluster-wide setting for all client interactions with the NameNode/HDFS. I was hoping to isolate the encryption to specific flows on the NiFi side. Thoughts?
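For reference, the kind of cluster-wide settings I mean are the standard HDFS wire-encryption properties, roughly like the following (illustrative only; the exact properties in play depend on your security setup):
In hdfs-site.xml:  dfs.encrypt.data.transfer = true
In core-site.xml:  hadoop.rpc.protection = privacy
Both apply to every client of that HDFS cluster, which is why they do not give per-flow control on the NiFi side.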
11-18-2016
04:40 PM
In NiFi, how do I put data to HDFS with the data encrypted across the wire? The NiFi cluster would be on a separate cluster from HDFS.
11-18-2016
03:16 PM
You should use --validate in your import or export statement (single tables only ... not for an entire db import/export). This validates that the number of rows is identical between the source and target tables. If you want to be super-cautious, you could do a checksum for each; most likely you would have to reload the data landed in Hadoop back into the source db and compare checksums there.
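For example, a minimal sketch of an import with validation (the connection string, credentials, table name and target directory below are placeholders, not from your environment):
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --validate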
11-18-2016
02:16 PM
1 Kudo
From the information given, there is not a load problem, just an explicit warning that the data loaded is being cast to chararray (string) during the filter operation. A couple of points: If you do not specify a type on load, the default is bytearray. When you filter, you are treating the field as a string (chararray type in pig), and pig will convert the bytearray to chararray during this operation. TextLoader() loads each line as a single chararray field (no field delimiters). If you want to load a delimited file (fields, e.g. a CSV), use PigStorage(). You can specify the delimiter, e.g. PigStorage(','); if not specified, it defaults to tab-delimited. See the sketch below. http://pig.apache.org/docs/r0.16.0/basic.html#Data+Types+and+More Not sure if that is what you were looking for ... if so, let me know by accepting the answer; if not, let me know more specifics.
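A minimal sketch of the difference (the file path, field names and types here are made up for illustration):
-- TextLoader: each line becomes one chararray field, no field splitting
lines = LOAD 'input/data.csv' USING TextLoader() AS (line:chararray);
-- PigStorage with an explicit delimiter and schema: fields are typed at load time,
-- so no bytearray-to-chararray cast (and no warning) is needed when you filter
recs = LOAD 'input/data.csv' USING PigStorage(',') AS (id:int, name:chararray, status:chararray);
keep = FILTER recs BY status == 'ACTIVE';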
11-17-2016
03:01 PM
Good to hear. Would love to know why it is working (especially since my stripped-down version replicated the issue).
11-17-2016
02:52 PM
@Sree Kupp I have tested this extensively and believe there is a bug in SPLIT OTHERWISE against pig EVAL functions (like AVG). I will submit a JIRA and update with the number. But there is a workaround shown below.
-------------------------
Here is what I did to replicate the issue and identify a workaround:
Replicate issue
dataset
777576939330699265,0,3
777576939330699261,1,3
777576939330699262,2,2
777576939330699263,3,1
777576939330699264,4,1
script 1 (replicates your issue)
data_input_2 = LOAD 'data_tweets' USING PigStorage(',');
i1 = FOREACH data_input_2 GENERATE $0 as user_id, (int)$1 as user_followers_count, (int)$2 as avl_user_total_retweets;
i2 = GROUP i1 ALL;
i3 = FOREACH i2 GENERATE AVG(i1.user_followers_count) AS avg_user_followers_count, AVG(i1.avl_user_total_retweets) AS avg_avl_user_total_retweets;
SPLIT i1 INTO
top IF(user_followers_count > i3.avg_user_followers_count),
bot IF(user_followers_count < i3.avg_user_followers_count),
med OTHERWISE;
STORE i1 INTO 'tmp/inf_1' USING JsonStorage();
STORE i2 INTO 'tmp/inf_2' USING JsonStorage();
STORE i3 INTO 'tmp/inf_3' USING JsonStorage();
STORE top INTO 'tmp/split_top' USING JsonStorage();
STORE bot INTO 'tmp/split_bot' USING JsonStorage();
STORE med INTO 'tmp/split_med' USING JsonStorage();
Results
tmp/inf_1 and tmp/inf_2: results as expected (5 records, correct grouping)
tmp/inf_3: {"avg_user_followers_count":2.0,"avg_avl_user_total_retweets":2.0}
tmp/split_top
{"user_id":"777576939330699263","user_followers_count":3,"avl_user_total_retweets":1}
{"user_id":"777576939330699264","user_followers_count":4,"avl_user_total_retweets":1}
tmp/split_med: no records
tmp/split_bot
{"user_id":"777576939330699265","user_followers_count":0,"avl_user_total_retweets":3}
{"user_id":"777576939330699261","user_followers_count":1,"avl_user_total_retweets":3} Workaround Here I took the average and hard-coded it into the split. This works (but is an inconvenient hack). script 2 (hard-coding of average, which works) Everything the same except I used the following for split, based on dump of i3 SPLIT i1 INTO
top IF(user_followers_count > 2.0),
bot IF(user_followers_count < 2.0),
med OTHERWISE;
Results
Results were identical to the first script, except tmp/split_med:
{"user_id":"777576939330699262","user_followers_count":2,"avl_user_total_retweets":2} Workaround for your script Find the values in i3 and then hardcode them in the SPLIT statement. ------------------------- If this answers your question (I hope so 🙂 ), please let me know by accepting the answer; else let me know if there are remaining gaps.
11-16-2016
08:37 PM
@Garima Verma See the HCC article posted to the main answer. This will resolve your issues. Regarding merging files, I suggest posing this as a separate question in HCC ... it is distinct from your original question, and you will get more eyes on it 🙂
11-16-2016
08:10 PM
8 Kudos
Introduction
In this article I demonstrate how to use NiFi to manipulate data records structured in columns, by showing how to perform the following three ETL operations in one flow against a dataset:
Remove one or more columns (fields) from the dataset
Filter out rows based on one or more field values
Change field values (e.g. convert all nulls to empty strings)
These operations rely heavily on regular expressions in native NiFi processors. I assume you have basic knowledge of regular expressions and point out the key regular expression pattern needed to work with data structured in columns (e.g. to operate on the 3rd, 5th and 6th columns of a delimited file). For transformations more complex than shown here, you use the same flow and regex pattern, but with more complex expressions inside the pattern. Note that if NiFi starts to feel limited as an ETL tool (extreme complexity of transformations or volumes), consider pig with UDFs or 3rd-party ETL tools. Also keep in mind that if you are using NiFi to land data in Hadoop, it is a best practice to land and persist the raw data and then transform from there (pig is a good tool for this). So, be sure to ask yourself whether transformations should be done in NiFi or elsewhere.

The Overall Flow
The overall flow is shown below, with the processors in gray doing the transformations. The basic flow is:
GetFile and SplitText feed records of a delimited file (e.g. CSV) into the ETL processors.
ExtractText filters out records (in my flow I match records to discard and route the unmatched records onward).
ReplaceText removes the same column(s) from the filtered records.
A process group of ReplaceText processors changes field values in the records, based on specified condition(s).
MergeContent and PutFile append the results into a single file.
Note that the data can be pulled from any source (not necessarily a file as shown here) and put to any target. The only requirement is that the lines input to the ETL processing subflow are delimited into the same number of fields (comma-separated, tab-separated, etc.).

The Regular Expression Pattern to Work with Columns (Fields)
The key regex pattern to work with data in columns is shown below. I will use the example of a comma-separated file.
((?:.*,){n}) represents any n consecutive fields, where .* represents any value of a field (including an empty value) and commas are field delimiters.
^ represents the beginning of a line (record).
So, each record can be represented as ^((?:.*,){n})(.*) where ((?:.*,){n}) is the first n fields and (.*) is the last field.
Note that each outer () defines an expression group. So, in ^((?:.*,){n})(.*), $1 references the first expression group ((?:.*,){n}), which is the first n fields, and $2 references the second expression group (.*), which is the last field.
In ^((?:.*,){2})(.*,)((?:.*,){6})(.*), $1 represents the first 2 fields, $2 represents the 3rd field, $3 represents fields 4-9 and $4 represents the last field.
The usefulness of this regular expression pattern should be clear as shown below.

Example ETL
I am going to use the following simple data set, where the first column is referred to as column 1.
0001,smith,joe,garbage field,10-09-2015,null,6.8,T
0002,gupta,kaya,junk feild,08-01-2014,A,7.9,A
0003,harrison,audrey,no comment,01-17-2016,T,5.1,A
0004,chen,wen,useless words,12-21-2015,B,8.1,A
0005,abad,zaid,*65 fr @abc.com,03-21-2014,A,7.8,null
and perform the following transformations:
Filter out all rows with value T in column 6
Remove column 4 from all records
Convert all nulls to empty strings

The Flow
Extract the data and feed it to the transform processing
I get the data from a file (you could fetch from another type of data store) and split the lines into a sequence of flow files fed to the transformations.
Filter Records
Records are filtered with the ExtractText processor. Recall that SplitText is feeding each record to this processor as a sequence of single-line flow files. The key setting is the attribute "remove" that I have added. Any line that matches this regular expression will not be sent to the next processor (because the connection forwards only unmatched records). Note that if you want to filter based on more conditions, you can either expand the regex shown or add more attributes (which gives OR logic: any record satisfying one or more of the added attributes is considered matched). As explained above, the regex shown will match any record with T as the value of the 6th field.
Remove Columns
Columns are removed with the ReplaceText processor. As explained earlier, the Search Value regular expression defines 4 expression groups (the first 3 columns, the 4th column, the next 3 columns, and the last column). The Replacement Value says to keep the 1st, 3rd and 4th expression groups, thus dropping the 4th column.
Replace Values
I use three ReplaceText processors to replace field values for the first field (no delimiter before the value), middle fields (delimiter before and after the value) and last field (no delimiter after the value). I use this to replace null with an empty string. I also use a process group to organize these conveniently.
Load
I put the data to a file, but you could load to any target. I use the default settings for these processors, except for the path I specify for the target file.
Result
0001,smith,joe,10-09-2015,,6.8,T
0005,abad,zaid,03-21-2014,A,7.8,
0004,chen,wen,12-21-2015,B,8.1,A
0002,gupta,kaya,08-01-2014,A,7.9,A

Conclusion
Using NiFi to transform fields of data (remove columns, change field values) is fairly straightforward if you are strong in regular expressions. For data structured as columns, the key regular expression concept to leverage is repeated expression groups to represent column positions, as described earlier in the article. Beyond this, the more skilled you are in regular expressions, the more complex the transformations you will be able to implement in NiFi. And as stated in the introduction, be sure to ask whether transformations are more appropriate in NiFi or elsewhere.

Resources
http://hortonworks.com/apache/nifi
https://nifi.apache.org/docs/nifi-docs
http://hortonworks.com/hadoop-tutorial/learning-ropes-apache-nifi
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
https://community.hortonworks.com/articles/7999/apache-nifi-part-1-introduction.html
http://www.regular-expressions.info
http://shop.oreilly.com/product/9780596528126.do
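Appendix: Example Processor Settings (sketch)
The original screenshots of the processor configurations are not reproduced above, so the following property values are illustrative assumptions that match the described behavior (filter on T in field 6, drop column 4, convert null to empty), not values copied from the original flow:
ExtractText (filter), added property "remove":     ^((?:.*,){5})T,.*
ReplaceText (remove column 4), Search Value:       ^((?:.*,){3})(.*,)((?:.*,){3})(.*)$
ReplaceText (remove column 4), Replacement Value:  $1$3$4
ReplaceText (null in first field):    Search Value ^null,    Replacement Value ,
ReplaceText (null in middle fields):  Search Value ,null,    Replacement Value ,,
ReplaceText (null in last field):     Search Value ,null$    Replacement Value ,
Note that two adjacent null fields share a delimiter, so the middle-field replacement may need a second pass.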
11-15-2016
06:49 PM
You will need to use OpenCSVSerde: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
Just add this to your create table DDL (and use the appropriate delimiter for the separator character):
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
A limitation is that it stores all fields as string. See the link above and this one: https://community.hortonworks.com/questions/56611/hive-ignoring-data-type-declarations-in-create-tab.html
There are workarounds, like loading with OpenCSVSerde into a temp table and then loading that into an ORC table (CREATE TABLE ... AS SELECT ...), as sketched below. Alternatively, you could use pig to clean the double quotes first and then load that data.
If this is what you were looking for, let me know by accepting the answer; else, let me know of any gaps.
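A minimal sketch of that temp-table-plus-CTAS workaround (the table and column names are placeholders):
-- orders_csv_staging is the table created with OpenCSVSerde, so every column is a string;
-- the CTAS builds a typed ORC copy with explicit casts
CREATE TABLE orders_orc STORED AS ORC AS
SELECT
  CAST(order_id AS INT)           AS order_id,
  CAST(amount   AS DECIMAL(10,2)) AS amount,
  order_date
FROM orders_csv_staging;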
11-15-2016
05:41 PM
OTHERWISE was introduced in pig 0.10 and is a very solid feature. There should not be an issue with it. Could you:
provide the full script
provide a sample data set
verify that i1 has 165 records
You can add them in the comments.