Support Questions
Find answers, ask questions, and share your expertise

Loading unstructured CSV files to Hive

Explorer

Hi,

I am struggling to load a csv file into my Hive table. In my csv file I have a column that is completely unstructured, it contains string data from a programming forum so there are lots of commas and other character that could be used as a delimiter. When I load the data into the table I get lots of Null entries and the data is not in the right format.

I am reading about serde from Ogrodnek but I believe I need to install a JAR file in order to get this to work. How would I install this JAR file on the Hortonworks Sandbox that's running on an Azure VM?

Any tips would be greatly appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions

You need to add the jar to the hive process. To do this you have two ( or more ) ways:

a) Create a folder auxlib under your hive directory ( /usr/hdp/<version>/hive/auxlib )

And put the jar in there and restart hiveserver2 or start the hive console. Any job which runs will pick up the jars in that folder.

b) Copy the jar somewhere on the hiveserver and run the ADD JAR command. This is good for temporary testing

c) There is also an aux libs folder variable you can use if you want to use a different folder.

I prefer version a) it seems to be the solution that always works. ( Hive can be a bit tricky since you have multiple processes in a query, the server, the client and the tez jobs that are kicked off ).

View solution in original post

4 REPLIES 4

You need to add the jar to the hive process. To do this you have two ( or more ) ways:

a) Create a folder auxlib under your hive directory ( /usr/hdp/<version>/hive/auxlib )

And put the jar in there and restart hiveserver2 or start the hive console. Any job which runs will pick up the jars in that folder.

b) Copy the jar somewhere on the hiveserver and run the ADD JAR command. This is good for temporary testing

c) There is also an aux libs folder variable you can use if you want to use a different folder.

I prefer version a) it seems to be the solution that always works. ( Hive can be a bit tricky since you have multiple processes in a query, the server, the client and the tez jobs that are kicked off ).

View solution in original post

Explorer

Thanks Benjamin for your detailed response, it was very useful.

With regards to the serdeproperties arguments how can I specify all the necessary properties given that there are so many variations within this particular column?

Explorer

Hi Benjamin,

I've tried the following code but I am receiving an error "hdp/apps/2.4.0.0-169/hive/auxlib/csv-serde-1.1.2-0.11.0-all.jar does not exist". Do I just reference to csv-serde and drop the numerical notation?

add jar hdp/apps/2.4.0.0-169/hive/auxlib/csv-serde-1.1.2-0.11.0-all.jar;

create table dataStore(Id int, Score int, ViewCount int, ParentId string, Body string, DisplayName string, Rank int ) row format serde 'com.bizo.hive.serde.csv.CSVSerde'

with serdeproperties ( "separatorChar" = "\t", "quoteChar" = "'", "escapeChar" = "\\" )

stored as textfile ;

Hello John,

I think there has been a confusion, the Jars need to be on the client/hiveserver nodes of the cluster on the local Linux file system. In /usr/hdp/<version?/hive/auxlib. If you put a jar in there you don't need to do another ADD. If you do an ADD you also need to have the jar on the local file system. This time depending what you use. If you use the hive client then on your client machine if you use beeline or JDBC then on the same machine of the hiveserver2.