Reading data from a file in a specific format

Super Collaborator

I have a file with data like below. I want to grab the info between each START and START-END line and save it using the KEY value on the first line. How can I do that? Please note that the number of lines between START and START-END varies. I would also like to put this into a Hive database.

START - KEY-VALUE and bunch of info on this line
bunch of info on this line
bunch of info on this line
bunch of info on this line
bunch of info on this line
START-END
bunch of info on this line
START - KEY-VALUE and bunch of info on this line
bunch of info on this line
START-END
bunch of info on this line
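
To make the goal concrete, here is a minimal sketch in plain Python of the grouping I am after (assuming the markers are spelled exactly as in the sample above):

    # Sketch: group the lines between "START -" and "START-END" under the
    # KEY-VALUE found on the opening line. Marker spellings are assumed to
    # match the sample data exactly.
    def parse_records(path):
        records = {}            # key -> list of payload lines
        key, body = None, None
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith("START-END"):
                    # close the current record, if any
                    if key is not None:
                        records[key] = body
                    key, body = None, None
                elif line.startswith("START -"):
                    # first token after "START -" is taken as the record key
                    key = line[len("START -"):].strip().split()[0]
                    body = []
                elif body is not None:
                    body.append(line)
        return records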

regards

1 ACCEPTED SOLUTION

Super Guru

@Sami Ahmad

It would have helped if you had provided some sample data, but as I understand it, you have a data payload with a key value and a variable number of parameters associated with that key. This is a common problem with JSON or XML data, but also with various text logs. There is an important question to ask yourself: "How do I plan to access the data after storing it?" Hive is an option, but a variable number of attributes is a better fit for a columnar store like HBase. That said, Hive lets you store the payload as JSON or Avro in a text field, and if you know how to parse it you can still achieve your goals. If you want each attribute to be a column and don't want to deal with JSON or Avro parsing, then HBase is another option, and you can use Apache Phoenix for SQL on top of HBase. It depends on what type of queries you plan to execute.
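
For example, if you go the Hive route, a minimal sketch of the JSON-in-a-text-field option could look like the following. This is only illustrative: the table and column names are my own assumptions, and the connection details assume a HiveServer2 endpoint reachable via the PyHive client.

    # Sketch only: assumes HiveServer2 on localhost:10000 and the PyHive
    # client installed; table/column names are illustrative assumptions.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cur = conn.cursor()
    # payload holds the variable part of each record as JSON text
    cur.execute("""
        CREATE TABLE IF NOT EXISTS records (
            record_key STRING,
            payload    STRING
        )
    """)
    # attributes are pulled out of the JSON at query time instead of
    # being fixed columns, using Hive's get_json_object UDF
    cur.execute("""
        SELECT record_key,
               get_json_object(payload, '$.lines[0]') AS first_line
        FROM records
    """)
    print(cur.fetchall())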

Anyhow, your question is about parsing and making sense of the data, even before storing it, so let's focus on that. Assuming your data is plain text with the structure described above, you have many ways to parse and format it; however, Hortonworks DataFlow includes Apache NiFi, which is an excellent tool to take your file, split it by line, and convert it to JSON, for example. That will include the key as well as the variable payload. Once you have the data formatted as JSON, you can use another processor available in NiFi to post it to Hive or HBase.
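
For instance, the per-record JSON such a flow could emit might look like the following (a sketch; the field names "key" and "lines" are my assumptions, not fixed NiFi output):

    import json

    # Sketch of one record after the split/convert steps described above.
    record = {
        "key": "KEY-VALUE",
        "lines": [
            "bunch of info on this line",
            "bunch of info on this line",
        ],
    }
    print(json.dumps(record))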

To learn more about NiFi: http://hortonworks.com/apache/nifi/

To see all available processors: https://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2.0.1/bk_Overview/content/index.html

In your specific case, assuming your file is text, you would build a template using processors like FetchFile, SplitText, etc., and once you have the data in the proper format you can use PutHiveQL, PutHBaseJson, and so on. Look at all the processors to see how much productivity you can gain without programming; at most you would have to use a regex, as sketched below.
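
If you do end up reaching for a regex, a sketch of one that captures each key and its variable body (again assuming the literal markers from the question) could be:

    import re

    # One match per record: group 1 is the key on the START line,
    # group 2 is everything up to the matching START-END marker.
    pattern = re.compile(r"START - (\S+)(.*?)START-END", re.DOTALL)

    with open("input.txt") as f:   # the raw file from the question
        text = f.read()

    for key, body in pattern.findall(text):
        lines = [l for l in body.splitlines() if l.strip()]
        print(key, lines)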

Getting started: https://nifi.apache.org/docs/nifi-docs/html/getting-started.html

Look at the following tutorials: http://hortonworks.com/apache/nifi/#tutorials. In your case, the log data tutorials seem like a close enough match.

You can import several NiFi templates from https://github.com/hortonworks-gallery/nifi-templates/tree/master/templates and learn even more.

If this response was helpful, please vote for or accept the answer.

3 REPLIES

Super Collaborator

Thanks for the feedback, I will start reading.

Super Collaborator

Do I need to install the Hortonworks Sandbox for NiFi? I have Hortonworks installed, but if I go to localhost:8080/nifi it says "not found".