I have a test file containing comma-separated values.
I want to select only the 2nd and 3rd columns from each line and put the result into HDFS. How do I select only the 2nd and 3rd columns from each line? The output should be
If the CSV is simple (i.e. without commas inside string literals), you can use ExtractText with a regular expression to get the 2nd and 3rd values. There is an example template that does something similar: CSV-to-JSON.
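As a minimal sketch (the sample line and the exact pattern are assumptions, not from the template mentioned above), a regex for ExtractText that captures the 2nd and 3rd comma-separated fields could look like the one below; the Python snippet just verifies the pattern behaves as described:

```python
import re

# Hypothetical ExtractText-style pattern: skip the 1st field, capture
# the 2nd and 3rd. Assumes a simple CSV with no quoted commas.
pattern = re.compile(r'^[^,]*,([^,]*),([^,]*)')

line = "id1,alice,42,extra"           # made-up sample input line
m = pattern.match(line)
if m:
    selected = ",".join(m.groups())   # keep only columns 2 and 3
    print(selected)                   # -> alice,42
```

In ExtractText each capture group becomes a flow file attribute, so the two groups can then be reassembled into new content (e.g. with ReplaceText) before writing to HDFS.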
Alternatively, if you are comfortable with a programming language such as Groovy, you can use ExecuteScript to parse the column values. There is an example on my blog showing how to parse lines, split on a delimiter, and select the columns you want.
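The core of that parsing logic can be sketched like this (in Python, which ExecuteScript also supports via Jython; the NiFi session/flowFile boilerplate is omitted, and the function name and sample data are illustrative assumptions):

```python
def select_columns(text, indices=(1, 2), delimiter=","):
    """Keep only the given 0-based column indices from each line."""
    out_lines = []
    for line in text.splitlines():
        cols = line.split(delimiter)
        # Re-join just the requested columns, preserving the delimiter.
        out_lines.append(delimiter.join(cols[i] for i in indices))
    return "\n".join(out_lines)

csv = "a,b,c,d\n1,2,3,4"
print(select_columns(csv))  # prints "b,c" then "2,3"
```

Inside a real ExecuteScript script, this function would be applied to the flow file content in a StreamCallback before transferring the flow file to the success relationship.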
That's a nice approach!
I have had the same problem statement, and my thoughts on it are a little more apropos of a business scenario. Since NiFi integrates quite seamlessly with the Hadoop environment, let's suppose we are dealing with a workflow on a Big Data platform where we need to offload a very large amount of data (read: TBs) from a streaming source to HDFS, and we want the same partial selection of data based on some specific columns.
Now, in your approach we are basically entrusting an external script to do the selection work, which may or may not use the distributed processing capabilities of Hadoop; there is also context-switching overhead. All of this may contribute to a significant performance hit. Our motto was to use the in-house utilities of NiFi to work with the data and select the columns we require, so that the entire workflow can leverage the speed.
Now my questions are:
1. Can this be done in NiFi without using any external script or tweaking the functions and API?
2. If the above is possible, would it be wise to do it with NiFi, keeping the performance question in mind?
3. If external scripts are the only way to do it, would it be a good idea to use Hadoop-specific components like Hive or HBase for the column-extract operation, in terms of performance benefits?