
grouping lines in a file

Master Collaborator

I need to group lines in a file based on three keywords A1,A2,A3. how can I do that?

10 REPLIES

Re: grouping lines in a file

Explorer

Since we want to group by three specific keywords, it sounds like we may want to apply a filter and then a group by. Below is an example where we filter for all rows where the first column is either 1, 2, or 4, and then group by the first column.

val dataRdd = {
  // Sample data: each tuple represents one row with four columns.
  val data = Seq(
    (1, 2, 3, 5),
    (4, 3, 2, 3),
    (1, 2, 3, 6),
    (1, 2, 3, 2),
    (6, 9, 0, 9))
  sc.parallelize(data)
} filter {
  // Keep only rows whose first column matches one of the values of interest.
  case (c1, c2, c3, c4) => c1 == 1 || c1 == 2 || c1 == 4
} groupBy {
  // Group the remaining rows by the first column.
  case (c1, c2, c3, c4) => c1
}

dataRdd.collect().foreach(println)

Re: grouping lines in a file

Master Collaborator

I am not grouping by a column but rather by three keywords; how can I do that?

Re: grouping lines in a file

Explorer

@Sami Ahmad I updated the answer. Hopefully this helps.

Re: grouping lines in a file

Master Collaborator

So I have three keywords: "TagID", "Acct#", and "TxnID". I am not understanding how I can use your example to group. Can you please explain using these three specific tags?

thanks

Re: grouping lines in a file

Explorer

In the example above, we are getting all rows in the RDD where the value of the first column is 1, 2, or 4. We then group by the first column, c1. If you print out the group by, it will yield something like:

(1,CompactBuffer((1,2,3,5), (1,2,3,6), (1,2,3,2)))
(4,CompactBuffer((4,3,2,3)))

In the case that you are looking to group by the three values you've listed ("TagID", "Acct#", "TxnID"), you'd first want to get all rows where the first column's value (or whatever column you want to group by) is equal to one of these three values. Once you have filtered successfully, the group by should produce the result you expect.
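
If the keywords live in whole lines of text rather than in a fixed column, a minimal sketch of the same filter-then-group-by idea might look like the following (assuming an existing SparkContext sc; the path "data.txt" is just a placeholder for your file):

val keywords = Seq("TagID", "Acct#", "TxnID")

val groupedLines = sc.textFile("data.txt")
  // Keep only the lines that mention at least one of the keywords.
  .filter(line => keywords.exists(k => line.contains(k)))
  // Key each remaining line by the first keyword it contains.
  .groupBy(line => keywords.find(k => line.contains(k)).get)

groupedLines.collect().foreach(println)

Each result entry is a keyword paired with the collection of lines that contain it.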

Re: grouping lines in a file

Master Collaborator

and why filter for 4 when we have only three keywords?

Re: grouping lines in a file

Guru

A Pig script would be great for this.

If the file is structured (i.e., the keywords are in the same position of a delimited file), then filter on that field. If the file is unstructured (i.e., the only structure is a line of text), you can either use Pig's regex features or write a UDF to do the filtering.

In either case (structured or unstructured), filter the lines containing A1 into one dataset, those containing A2 into another, and those containing A3 into another. You can then union them and store them as one file, or store them as separate files.

These links will get you started if you are new to Pig:

http://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/

http://pig.apache.org/docs/r0.15.0/

Re: grouping lines in a file

Master Collaborator

So create an RDD by filtering A1, then another RDD by filtering A2, and another for A3, and then combine them?

The keywords are not delimited or at a fixed location but rather scattered around the file.
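
For example, a rough sketch of what I have in mind (assuming an existing SparkContext sc and "data.txt" as a placeholder path for my file):

val lines = sc.textFile("data.txt")

// One RDD per keyword: keep only the lines mentioning that keyword.
val a1Lines = lines.filter(_.contains("A1"))
val a2Lines = lines.filter(_.contains("A2"))
val a3Lines = lines.filter(_.contains("A3"))

// Combine the three filtered RDDs into one dataset,
// or save each of them to a separate path instead.
val combined = a1Lines union a2Lines union a3Lines
combined.saveAsTextFile("grouped-output")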

Re: grouping lines in a file

Master Collaborator

Also, does Pig do its computations in memory or on disk? I have read that Spark uses in-memory structures, so it is much faster than MapReduce, which uses HDFS.