Member since
02-07-2016
8
Posts
5
Kudos Received
0
Solutions
07-31-2016
10:27 PM
Also, the map() function expects a Function1, not a Function. I changed it to Function1, but then the arguments are not correct for Function1!
07-31-2016
10:05 PM
Hi Arun, thanks for the great suggestion. The only thing I didn't get is the line below:
dataFrame.write().format("com.databricks.spark.csv").option("delimiter","~").save("apache-logs");
How did we identify that the format is from Databricks? Can you please explain this statement? Cheers!
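For context: "com.databricks.spark.csv" is the fully-qualified name of the third-party spark-csv data source published by Databricks (Spark itself only gained a built-in csv format in 2.0), so the package has to be on the classpath. A minimal sketch of how it is typically pulled in at submit time (the coordinate below assumes Scala 2.10 and spark-csv 1.5.0, and MyApp.jar is a placeholder):

```shell
# spark-csv is a separate library, not part of core Spark; add it when submitting, e.g.:
spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 MyApp.jar
```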
07-28-2016
09:37 AM
Hello folks, I am reading lines from an Apache web server log file into a Spark DataFrame. A sample line from the log file is below:
piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/launchmedium.gif HTTP/1.0" 200 11853
I have split the values into host, timestamp, path, status and content_size, and applied this as the schema of a new DataFrame.
host: piweba4y.prodigy.com
timestamp: 01/Aug/1995:00:00:10 -0400
path: /images/launchmedium.gif
status: 200
content_size: 11853
I have done all of the above in Python using regular expressions, applied the schema (the 5 columns above), and now I would like to do the same in Java, but I have no clue how. I am able to split the values using the regex library in Java; the next step is to create columns in my DataFrame (currently each line is a single column named 'value'). Can someone help with how to do this in Java? It is easy to do in Python, but Java seems to be a little tougher. Below is my Java code:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.Row;
import java.util.regex.*;
public class SparkJavaTest {
SparkContext sc;
String log_file_path;
public static void main(String[] args) {
    new SparkJavaTest();
}
public SparkJavaTest() {
    log_file_path = "/Users/workspace/SparkJavaTest/src/cs100/lab2/apache.access.log.PROJECT";
    ReadFile(log_file_path);
}
public void ReadFile(String filepath) {
    DataFrame base_df, columned_df;
    SparkConf conf = new SparkConf().setAppName("SparkApp").setMaster("local[*]");
    SparkContext sc = new SparkContext(conf);
    SQLContext sqlct = new SQLContext(sc);
    base_df = sqlct.read().text(filepath);
    // Below is where I am trying to find out how to apply a schema to the single "value" column of base_df
    Row[] dfrows = base_df.collect(); // NOT recommended (pulls all rows to the driver), need to see how to do it correctly!
    for (Row r : dfrows) {
        String st = r.getString(0);
        System.out.println(st);
        //columned_df = base_df.col ...
        // Below is the splitting of the different values: host, timestamp, path, status and content size - DONE
        // checking for host info (ex: in24.inetnebr.com)
        PatternChecker("Host: ", "^([^\\s]+\\s)", st);
        // checking for timestamp (ex: [01/Aug/1995:00:00:01 -0400])
        PatternChecker("Timestamp: ", "(\\d\\d/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2} -\\d{4})", st);
        // checking for path (ex: /shuttle/missions/sts-68/news/sts-68-mcc-05.txt)
        PatternChecker("Path: ", "^.*\"\\w+\\s+([^\\s]+)\\s+HTTP.*\"", st);
        // checking for status (ex: 200)
        PatternChecker("Status: ", "\\s+(\\d{3})+\\s", st);
        // checking for content size (ex: 1839)
        PatternChecker("Content Size: ", "\\s+(\\d+$)", st);
    }
}
public static void PatternChecker(String caption, String theRegex, String str2check) {
Pattern patt2check = Pattern.compile(theRegex);
Matcher regexMatcher = patt2check.matcher(str2check);
while (regexMatcher.find()) {
if (regexMatcher.group().length() != 0) {
System.out.println(caption + regexMatcher.group(1).trim());
}
}
//System.out.println("\n");
}
}
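As a side note, the five separate patterns above can be collapsed into one combined regex that captures all fields in a single pass. Below is a minimal plain-Java sketch (the class name LogLineParser and the combined pattern are illustrative, not part of my code above); the extracted fields could then be turned into DataFrame rows, e.g. via RowFactory.create() plus SQLContext.createDataFrame() with a StructType schema.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // One combined pattern capturing host, timestamp, path, status and content_size
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"\\S+ (\\S+) \\S+\" (\\d{3}) (\\d+)$");

    // Returns the five fields in order, or null if the line does not match
    public static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String sample = "piweba4y.prodigy.com - - [01/Aug/1995:00:00:10 -0400] "
            + "\"GET /images/launchmedium.gif HTTP/1.0\" 200 11853";
        String[] fields = parse(sample);
        System.out.println("host: " + fields[0]);
        System.out.println("timestamp: " + fields[1]);
        System.out.println("path: " + fields[2]);
        System.out.println("status: " + fields[3]);
        System.out.println("content_size: " + fields[4]);
    }
}
```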
02-07-2016
07:40 PM
1 Kudo
I am downloading it, and once I finish installing it I will let you know. Thanks for the help.
02-07-2016
12:35 AM
1 Kudo
@neeraj sabharwal any specific reason for Mozilla? BTW, Chrome is my favourite browser and so far I have no complaints.
02-07-2016
12:31 AM
1 Kudo
Thanks Neeraj.
02-07-2016
12:28 AM
1 Kudo
@Neeraj Sabharwal does that mean I have to download the OVA image of HDP? If so, would the installation instructions also be different from the VMware ones? Please clarify. Thanks.
02-07-2016
12:23 AM
1 Kudo
I have installed the HDP 2.3.2 sandbox on my MacBook on VMware Fusion 7. When setting up VMware, it asks to either provide a license key or try VMware for 30 days. I am worried that my VMware will stop working after a month, and this will stop my learning journey, which has just started! Is there a way I can make VMware work beyond its trial period? Please help or suggest an alternative. Thanks.