Member since: 07-01-2016
Posts: 17
Kudos Received: 5
Solutions: 0
01-02-2017
02:34 AM
Something doesn't look right to me, or some of the parameters may have been left out. I see you are manipulating one attribute (ingestionDate), yet all the attributes are being returned in your input/output examples. The way we do this is to treat the attributes and the data payload as distinct: EvaluateJsonPath moves data into attributes, and AttributesToJSON moves attributes back into data. So when you run EvaluateJsonPath, you can leave ingestionDate out since you do not need it yet; what you need are the other JSON values, so they move into the attributes. Then use UpdateAttribute to add ingestionDate with now(). Then use AttributesToJSON to move all the attributes, including the new ingestionDate, into the data. It wouldn't surprise me if internally you either end up with two ingestionDate attributes or your code is ignored to prevent that situation. Either way, we think the approach I described is more maintainable: you can see exactly what is moving between attributes and data at each step, and you don't have to look at a bunch of transformations.
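Purely to illustrate the round trip described above, here is a small plain-Python sketch of the same logic. This is not NiFi code (the flow itself is built from the EvaluateJsonPath, UpdateAttribute, and AttributesToJSON processors), and the payload values are made up:

import json
from datetime import datetime, timezone

# Incoming flowfile content (data payload), e.g. {"id": 1, "name": "foo"}
payload = json.dumps({"id": 1, "name": "foo"})

# EvaluateJsonPath step: pull the JSON values you need into attributes
# (ingestionDate is intentionally left out -- it doesn't exist yet)
attributes = json.loads(payload)

# UpdateAttribute step: add ingestionDate with now()
attributes["ingestionDate"] = datetime.now(timezone.utc).isoformat()

# AttributesToJSON step: move all attributes, including the new
# ingestionDate, back into the data payload
payload = json.dumps(attributes)
print(payload)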
12-07-2016
06:43 PM
3 Kudos
Let me answer what I can... First, when I took it, it was 2.4, but I doubt it has changed that much, if at all, from 4 months ago.
1. Essentially the interface is the same one you get when doing the Hive practice exam. You get a Linux host to code with. I recommend coding your jobs in the default text editor, then submitting them through the command line.
2. Look at the Spark documentation for the command-line switches that tell you how to submit the job on YARN.
3. For documentation there will be a link to the Spark documentation and the Python documentation, so you get that. There's probably a link for Scala as well, but I use Python. I'm told it's the same as the Hive practice, where you get the documentation. You need to know it though, because you don't have a lot of time to do a lot of reading. I recommend you know your way around the documentation pretty well.
4. For #4 and #5, no. You are stuck with submitting through the command line. I thought this was odd because the HDP Spark training was all about Zeppelin, but I think it has to do with how they grade it.
5. I'm not sure why they haven't posted more about the exam. They have the practice for Hive, which gets you used to the environment, but not for Spark. I think they are probably working on that, but it takes time. I haven't tried it, but it would probably be worthwhile looking at that.
6. I think it would be impractical writing these jobs from the shell. First, I'm not sure how they would grade it, and second, you need to be able to start over and rerun everything. You will be time constrained.
7. Since you may get HDP 2.4, I'd be prepared to write a csv file without Spark 2.0 if it is an exam topic (a rough sketch of one way is below). It takes a while to change these exams, I think.
I don't think I've exposed any secrets here that they wouldn't want you to know going in, or that they didn't expose on the Hive certification through the practice.
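On point 7, a minimal sketch of writing a DataFrame out as comma-separated text without the Spark 2.0 csv writer; the toy data, app name, and output path are invented for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv_without_spark2")  # hypothetical app name
sqlContext = SQLContext(sc)

# Toy DataFrame standing in for whatever the exam task gives you
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# df.write.csv() only arrived in Spark 2.0, so drop to the RDD and
# build the comma-separated lines yourself
csv_lines = df.rdd.map(lambda row: ",".join(str(c) for c in row))
csv_lines.saveAsTextFile("/tmp/csv_out")  # assumed output path

# Submitting on YARN from the command line (point 2) looks roughly like:
#   spark-submit --master yarn --deploy-mode client my_job.py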
08-01-2016
06:23 PM
I should have read the post a little closer; I thought you were doing a groupByKey. You are correct, you need to use groupBy to keep the execution within the DataFrame and out of Python. However, you said you are doing an outer join. If it is a left join and the right side is larger than the left, then do an inner join first, and then do your left join on the result. That result will most likely be broadcast to do the left join. This is a pattern that Holden described at Strata this year in one of her sessions.
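A minimal PySpark sketch of that join pattern; the frames, key names, and app name here are invented:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="join_pattern")  # hypothetical app name
sqlContext = SQLContext(sc)

# Toy frames: 'left' is small, 'right' is large
left = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["k", "lval"])
right = sqlContext.createDataFrame([(1, "x"), (1, "y"), (3, "z")], ["k", "rval"])

# Step 1: inner join shrinks the right side down to only the keys
# that actually exist on the left
matched = left.select("k").join(right, "k", "inner")

# Step 2: left join against that much smaller result; Spark can often
# broadcast it instead of shuffling the full right side
result = left.join(matched, "k", "left_outer")
result.show()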
07-31-2016
09:28 PM
In order for the broadcast join to work, you need to run ANALYZE TABLE [tablename] COMPUTE STATISTICS noscan. Otherwise, it won't know the table size needed to do the broadcast map join. Also, you mention you are trying to do a groupBy; you should use a reduceByKey instead. That will dramatically increase the performance. Spark will do the count on the map side, then distribute the results to the reducers (if you want to think about it in map-reduce terms). Switching to reduceByKey alone should solve your performance issue.
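A small RDD sketch of the difference, counting values per key with toy data:

from pyspark import SparkContext

sc = SparkContext(appName="reduce_vs_group")  # hypothetical app name

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])

# groupByKey ships every value across the shuffle before counting
grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey combines on the map side first, so far less data is shuffled
reduced = pairs.reduceByKey(lambda x, y: x + y)

print(grouped.collect())  # e.g. [('a', 3), ('b', 1)]
print(reduced.collect())  # same result, much less shuffle traffic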
07-20-2016
07:49 PM
Can you please run SHOW CREATE TABLE testtable; ? That will show us exactly how your columns are defined and how the ORC format is configured.
07-14-2016
04:04 PM
Essentially I have a Python tuple ('a','b','c','x','y','z') where all the elements are strings. I could just map them into a single concatenation ('a\tb\tc\tx\ty\tz'), then saveAsTextFile(path). But I was wondering if there was a better way than using an external package, which would probably just be encapsulating that .map(lambda x: "\t".join(x)) anyway.
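For what it's worth, a short sketch of that approach for both an RDD and a DataFrame; the data, app name, and output paths are made up:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="tab_delimited_out")  # hypothetical app name
sqlContext = SQLContext(sc)

# RDD of string tuples, written as tab-delimited text
rdd = sc.parallelize([("a", "b", "c"), ("x", "y", "z")])
rdd.map(lambda t: "\t".join(t)).saveAsTextFile("/tmp/rdd_tab_out")

# A DataFrame can be handled the same way by dropping to its RDD of Rows
df = sqlContext.createDataFrame(rdd, ["c1", "c2", "c3"])
df.rdd.map(lambda row: "\t".join(str(c) for c in row)).saveAsTextFile("/tmp/df_tab_out")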
07-13-2016
05:37 AM
I have an RDD I'd like to write as tab-delimited. I also want to write a DataFrame as tab-delimited. How do I do this?
Labels:
- Apache Spark
07-03-2016
03:18 AM
In order to reorder tuples (columns) in Scala, I think you just use a map like in PySpark, with a pattern match on the tuple:

val rdd2 = rdd.map { case (x, y, z) => (z, y, x) }

You should also be able to build key-value pairs this way too:

val pairs = rdd.map { case (x, y, z) => (z, (y, x)) }

This is very handy if you want to follow it up with sortByKey().
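For comparison, a rough PySpark equivalent; the toy data and app name are invented:

from pyspark import SparkContext

sc = SparkContext(appName="reorder_tuples")  # hypothetical app name

rdd = sc.parallelize([("a", "b", "c"), ("x", "y", "z")])

# Reorder the tuple fields
rdd2 = rdd.map(lambda t: (t[2], t[1], t[0]))

# Or build key-value pairs and sort by the new key
pairs = rdd.map(lambda t: (t[2], (t[1], t[0]))).sortByKey()
print(pairs.collect())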
07-03-2016
02:51 AM
Yes. You can think of select() as the "filter" of columns, where filter() filters rows. You want to reduce the impact of the shuffle as much as possible, so perform both of these as early as possible. The groupBy() is going to cause a shuffle by key (most likely). Be careful with the groupBy(): if you can accomplish what you need with a reduceByKey(), you should use that instead. If you mean DataFrame instead of Dataset, Spark SQL will handle much of this optimization for you, but if you are using plain RDDs you will have to deal with these kinds of optimizations on your own.
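A rough PySpark sketch of that ordering; the column names, data, and app name are invented:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="prune_before_shuffle")  # hypothetical app name
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [("us", "web", 3), ("us", "app", 5), ("eu", "web", 2)],
    ["region", "channel", "clicks"])

# Prune columns and rows before the shuffle-inducing groupBy
result = (df.select("region", "clicks")   # "filter" of columns
            .filter(df.clicks > 2)        # filter of rows
            .groupBy("region")            # the shuffle happens here
            .sum("clicks"))
result.show()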
07-02-2016
08:06 AM
You can just use sc.setLogLevel("ERROR") in your code to suppress log information without changing the log4j.properties file.