
Help on analysis of unstructured dynamic data



We want to analyze the type of data described below using Hive. These are the challenges:

- The source data are flat files from different sources, with multiple source files arriving daily.
- There are no fixed columns: each file has different columns, and both the number and the order of the columns differ between files.
- Each file has a very large number of rows.
- Fields are comma-separated, but field values might contain quotes ("").

Please suggest what the ideal approach would be here. Should we load the data into HBase and create a Hive table on top of that, or is it possible to create a Hive table with a dynamic schema?


Re: Help on analysis of unstructured dynamic data

So is it something like log files with key=value pairs? Or are the column names in the headers?

Makes a big difference.

If it's the first case, your best bet is to transform the rows with Hive or Pig UDFs into maps and arrays: arrays if you don't have the column names, maps if you do.

Scroll down to "Complex Types" for more details:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
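
A minimal sketch of that approach for key=value data (table and column names here are made up): stage the raw lines in a one-column table and use Hive's built-in str_to_map to turn each line into a map.

```sql
-- Staging table holding one raw line per row (hypothetical names).
CREATE TABLE raw_events (line STRING);

-- str_to_map(text, pair_delimiter, key_value_delimiter) turns
-- "k1=v1,k2=v2,..." into a map<string,string> you can index by key.
SELECT str_to_map(line, ',', '=')['event_type'] AS event_type,
       str_to_map(line, ',', '=')               AS all_fields
FROM   raw_events;
```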

You should also pull some of the columns out as fixed columns, which makes your queries on them much faster (assuming you store the tables in ORC).
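
As a sketch of that step (assuming a map column produced by an earlier staging step; all names are hypothetical), a CTAS can promote the frequently queried keys to real columns while keeping the rest in the map:

```sql
-- Promote hot keys to fixed columns, store the result as ORC,
-- and keep everything else in the map for ad-hoc queries.
CREATE TABLE events_orc STORED AS ORC AS
SELECT fields['id'] AS id,
       fields['ts'] AS event_ts,
       fields       AS other_fields
FROM   staged_events;
```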

How do you do the transformation? A Java UDF in Pig or Hive is most likely your best bet.
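
Once such a UDF is built, wiring it into Hive is just a registration step; the jar path and class name below are hypothetical:

```sql
-- Register the custom parsing UDF, then call it like a built-in function.
ADD JAR /tmp/parse-fields-udf.jar;
CREATE TEMPORARY FUNCTION parse_fields AS 'com.example.hive.ParseFieldsUDF';

SELECT parse_fields(line)['status'] AS status
FROM   raw_events;
```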

Alternatives:

- Store the rows as one big string and process them with regular expressions or custom UDFs that can parse the strings.

- Use a flexible format like Avro, which might be good for sparse columns. However, you would need to define all columns up front.

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
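
For the first alternative, a rough sketch (table and field names made up) of pulling values out of one-big-string rows with Hive's built-in regexp_extract:

```sql
-- Each row holds the whole raw line; pull fields out with regexes.
-- regexp_extract(str, pattern, group_index) is a Hive built-in.
-- (Quoted values would need a slightly smarter pattern.)
SELECT regexp_extract(line, 'user=([^,]*)', 1)   AS user,
       regexp_extract(line, 'status=([^,]*)', 1) AS status
FROM   raw_lines;
```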

If, on the other hand, each file has a different header, then you cannot do this, since Hive/Pig UDFs work row by row. You would need to do the transformations outside of Hadoop, or use something like the link below to run a custom InputFormat that reads and processes every file on its own. The output should again be Hive/Pig maps, Avro, or whatever else you decide on.

https://community.hortonworks.com/repos/4576/apache-tika-integration-with-mapreduce.html