Read csv files from HDFS into R using fread() and grep — lost column names
- Labels: Apache Hadoop
Created 08-09-2016 03:40 PM
I've been trying to read large CSV files from HDFS into R using the data.table package, since in my experience it's a lot faster than the rhdfs package.
I have been able to read entire files with the following command:
data <- fread("/usr/bin/hadoop fs -text /path/to/the/file.csv", fill=TRUE)
Next, I would like to read in only the rows that contain the value "2MS-US". I tried to do it with grep:
data <- fread("/usr/bin/hadoop fs -text /path/to/the/file.csv | grep '2MS-US'", fill=TRUE)
This returns the correct number of rows, but it drops all the column names, which become "V1", "V2", etc.
According to this thread, the issue of losing column names when using grep was resolved in data.table 1.9.6, but I am still experiencing it even though I am using 1.9.7. Any thoughts on this? Thanks!
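The header loss is reproducible outside of R: grep only passes through lines that match the pattern, and the header row doesn't contain "2MS-US", so fread never sees it and falls back to V1, V2, ... names. A minimal sketch with a toy CSV (hypothetical columns, standing in for the actual HDFS file):

```shell
# Toy CSV standing in for the file behind 'hadoop fs -text' (made-up columns).
printf 'id,region,value\n1,2MS-US,10\n2,EU,20\n' > /tmp/demo.csv

# grep emits only matching lines; the header line is filtered out,
# which is exactly what fread receives from the piped command.
grep '2MS-US' /tmp/demo.csv
```

Only the data row `1,2MS-US,10` survives the filter, so any downstream reader has no header to work with.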
Created 08-09-2016 06:39 PM
Fixed the issue by using sed instead:
fread("hadoop fs -text /path/to/the/file.csv | sed -n '1p;/2MS-US/p'", fill=TRUE)
The 1p part prints the first line, which is the header row, so this keeps the headers as well as the rows that match the string.
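The sed script can be checked on the same toy CSV used above (hypothetical columns, not the real file): -n suppresses sed's default output, 1p prints line 1 unconditionally, and /2MS-US/p prints each matching row.

```shell
printf 'id,region,value\n1,2MS-US,10\n2,EU,20\n' > /tmp/demo.csv

# '1p' prints the first line (the header); '/2MS-US/p' prints matching rows;
# '-n' suppresses everything else.
sed -n '1p;/2MS-US/p' /tmp/demo.csv
```

This prints the header followed by the matching row, which is exactly the shape fread needs to pick up column names. One caveat: if the first line itself also matched the pattern, it would be printed twice, which would surface as a duplicate row.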
