Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

csv to hive, data not correctly imported to column

avatar
Contributor

Hi all,

 

I have created a table with the required columns in hive and stored as textfile.

 

I use load data local inpath to put the csv file to the hive table created above, but all the data imported to 1 column, the rest of the columns are empty, I open the csv file in notepad, excel, seems like everything in order, I have use --hive-drop-delims and --fields-terminated-by '\t', the result is still the same.

 

Anyone tried this before?

 

 

8 REPLIES 8

avatar
Mentor
What is your CREATE TABLE statement? Are you specifying the right field delimiter character (\t if you are using that)? The default delimiter otherwise is ^A, which your data likely does not carry.

avatar
Contributor

Hi Harsh J,

 

I use the following command

 

>create table if not exists test(a string, b string, c string, d string, e string) row format delimited fields terminated by '\t' escaped by '\\';

>load data local inpath '/path/file.csv' into table testdraft --hive-drop-import-delims;

 

I have tried '\r', '\n' and '\001' at the row format fields terminated by statement but all give me same result.

 

I open the csv file using notepad the columns are seperated by tab, gedit will recognise the tab as space, when I select * test in hive, I saw a lot of null between the columns.

 

I also tried the following by replacing row format fields terminated by '\N'

create table if not exists test(a string, b string, c string, d string, e string) row format delimited fields terminated by '\N';

 

then the column will be seperated by those words with n in it.

 

I am pretty new with this csv, hive and DB, please shed some light.

 

Thanks!

avatar
Mentor
The first CREATE TABLE specification looks correct to me, for your described file.

Can you also double-inspect your /path/file.csv with "head -n1 /path/file.csv | od -c" command to ensure it does have the actual \t character between each field (vs. using a visual editor)?

avatar
Contributor

Hi Harsh J,

 

When I use the command

#head -n1 file.csv | od -c

 

It shows the following:


0000000 377 376 1 \0 \0 \0 \0 \0 \0 \0
0000020 \0 \0 \0 \0 \0 1 \0 6 \0 1 \0
0000040 8 \0 9 \0 7 \0 \0 \0 \0 \0 \0
0000060 \0 I \0 L \0 L \0 E \0 G \0 A \0 L \0
0000100 \0 D \0 U \0 M \0 P \0 \0 S \0 I \0
0000120 T \0 E \0 \0 ( \0 M \0 P \0 S \0 P \0
0000140 \0 W \0 A \0 T \0 C \0 H \0 \0 N \0
0000160 O \0 . \0 R \0 U \0 J \0 U \0 K \0 A \0
0000200 N \0 \0 1 \0 4 \0 / \0 0 \0 5 \0 / \0
0000220 1 \0 3 \0 3 \0 9 \0 ) \0 \0 A \0 d \0
0000240 a \0 l \0 a \0 h \0 \0 d \0 i \0 m \0
0000260 a \0 k \0 l \0 u \0 m \0 k \0 a \0 n \0
0000300 \0 b \0 a \0 h \0 a \0 w \0 a \0 \0
0000320 s \0 i \0 a \0 s \0 a \0 t \0 a \0 n \0
0000340 \0 M \0 P \0 S \0 P \0 \0 m \0 e \0
0000360 n \0 d \0 a \0 p \0 a \0 t \0 i \0 \0
0000400 k \0 a \0 w \0 a \0 s \0 a \0 n \0 \0
0000420 t \0 e \0 r \0 s \0 e \0 b \0 u \0 t \0
0000440 \0 m \0 e \0 r \0 u \0 p \0 a \0 k \0
0000460 a \0 n \0 \0 t \0 a \0 n \0 a \0 h \0
0000500 \0 h \0 a \0 k \0 \0 m \0 i \0 l \0
0000520 i \0 k \0 \0 p \0 e \0 r \0 s \0 e \0
0000540 n \0 d \0 i \0 r \0 i \0 a \0 n \0 . \0
0000560 \0 W \0 a \0 l \0 a \0 u \0 \0 b \0
0000600 a \0 g \0 a \0 i \0 m \0 a \0 n \0 a \0
0000620 p \0 u \0 n \0 , \0 \0 t \0 i \0 n \0
0000640 d \0 a \0 k \0 a \0 n \0 \0 t \0 e \0
0000660 l \0 a \0 h \0 \0 d \0 i \0 a \0 m \0
0000700 b \0 i \0 l \0 \0 d \0 e \0 n \0 g \0
0000720 a \0 n \0 \0 m \0 e \0 n \0 g \0 e \0
0000740 l \0 u \0 a \0 r \0 k \0 a \0 n \0 \0
0000760 s \0 a \0 t \0 u \0 \0 ( \0 1 \0 ) \0
0001000 \0 N \0 o \0 t \0 i \0 s \0 \0 d \0
0001020 i \0 \0 b \0 a \0 w \0 a \0 h \0 \0
0001040 U \0 n \0 d \0 a \0 n \0 g \0 - \0 U \0
0001060 n \0 \0 C \0 \0 \0 \0 \0 \0
0001100 \0 \0 \0 \0 \0 \0 H \0 A \0
0001120 F \0 I \0 Z \0 A \0 \0 \0 \0 \0
0001140 \0 \0 \0 \0 \0 \0 \0 \0
*
0001420 \0 \0 \0 2 \0 0 \0 1 \0 4 \0 - \0
0001440 0 \0 6 \0 - \0 0 \0 3 \0 \0 1 \0 6 \0
0001460 : \0 2 \0 5 \0 : \0 0 \0 0 \0 . \0 0 \0
0001500 0 \0 0 \0 \0 H \0 A \0 F \0 I \0 Z \0
0001520 A \0 \0 \0 \0 \0 \0 \0 \0
0001540 \0 \0 \0 \0 \0 \0 \0 \0
*
0002020 2 \0 0 \0 1 \0 4 \0 - \0 0 \0 6 \0 - \0
0002040 0 \0 5 \0 \0 1 \0 4 \0 : \0 2 \0 0 \0
0002060 : \0 4 \0 4 \0 . \0 0 \0 0 \0 0 \0 \r \0
0002100 \n
0002101

 

Thanks!

avatar
Mentor
I don't see the tab character. There are lots of interleaved null
characters in the file though, and the closest I can guess your delimiter
to be is a double null byte sequence: \0\0. You may want to format the file
right before use with Hive, if this is so.

Something like the below in Python can do the trick for cleanup, for
example, assuming your delimiter is indeed a double sequence of null bytes:

data = data.replace('\0\0', '\t').replace('\0', '')

avatar
Contributor

Thanks Harsh J,

 

Will try that out!

avatar
New Contributor

I have written a tool to auto generate hive scritps from csv, its written in Java and its open source, you can check at

https://sourceforge.net/projects/csvtohive/?source=directory

avatar
Contributor
Thanks bodivijay! I have use another way to import the hive, will try your tools when I have time to revisit the same issue!