
HDPCD Spark Exam

Contributor

I have some additional questions about the Spark exam that have not been answered by other questions here.

  1. The current sandbox is HDP 2.5. Is this the version used in the exam?
  2. HDP 2.5 comes with Spark 1.6 and 2.0. Can I choose which version I would like to use to solve the tasks? (2.0 supports reading and writing CSV files out of the box.)
  3. Do I only have to use the Spark shell? If so, why is there the exam objective "Initialize a Spark application"? When using the Spark shell I do not have to do that manually. Furthermore, there is "Run a Spark job on YARN". How is this tested?
  4. Do I have something like Ambari to look at Hive tables or the files in HDFS?
  5. Is there Zeppelin and can I use it?
  6. Can I change the keyboard layout?

I do have project experience with Spark but feel quite uncomfortable not knowing what to expect in the exam.


Expert Contributor

I had almost the same question in mind. Adding to that, will I have access to the Spark documentation, or do I have to write the code from memory?

Contributor

Let me answer what I can...

First, when I took it, it was HDP 2.4, but I doubt it has changed much, if at all, in the four months since.

1. Essentially the interface is the same as what you get when doing the Hive practice exam. You get a Linux host to code on. I recommend writing your jobs in the default text editor, then submitting them through the command line.

2. Look at the Spark documentation for the command-line switches that tell you how to submit a job on YARN; a sketch of such a command is below.
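For illustration only, a typical `spark-submit` invocation might look like the following. The class name, JAR path, and script name are placeholders, not anything taken from the exam.

```bash
# Submit a pre-built Scala/Java JAR to YARN (class and JAR names are placeholders)
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.ExamTask \
  /path/to/provided-task.jar

# A Python job only needs the .py file
spark-submit --master yarn exam_task.py
```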

3. For documentation, there will be a link to the Spark documentation and the Python documentation, so you get that. There's probably a link for Scala as well, but I use Python. I'm told it's the same as the Hive practice, where you also get the documentation. You still need to know the material, though, because you don't have much time for reading. I recommend knowing your way around the documentation pretty well.

4. For #4 and #5, No. You are stuck with submitting through the command line. I thought this was odd because the HDP Spark training was all about Zeppelin. But I think it has to do with how they grade it.

5. I'm not sure why they haven't posted more about the exam. They have a practice exam for Hive, which gets you used to the environment, but not one for Spark. I think they are probably working on that, but it takes time. I haven't tried it, but it would probably be worthwhile to look at.

6. I think it would be impractical to write these jobs in the shell. First, I'm not sure how they would grade it, and second, you need to be able to start over and rerun everything. You will be time-constrained.

7. Since you may get HDP 2.4, I'd be prepared to write a CSV file without Spark 2.0 if it is an exam topic. It takes a while to change these exams, I think.

I don't think I've exposed any secrets here that they wouldn't want you to know going in or that they didn't expose on the Hive certification through the practice.

Contributor

Thank you very much, @Don Jernigan. Your answer helps me a lot. However, I have further questions.

  1. Using Python it is simple to submit a job to YARN, because you do not need more than a .py file. But when I want to use Scala, it is necessary to build a .jar file with Maven, sbt or something like that. I am not sure whether these build tools are available in the exam. Did anyone use Scala in the exam?
  2. Do I have to write CSV files with a header line describing the column names? If so, I think that is not so easy in a distributed environment.
  3. Is the general question pattern "read these files, do something with them, and write the result to here"? So in the end only the results will be checked?

Contributor

@rich You have answered other questions regarding the Spark exam. We would be very grateful if you could answer some questions here.

Contributor

I had my exam three days ago. Let me answer my own question.

  1. I do not know which HDP version it was.
  2. The default version when running `spark-shell` in the terminal was Spark 1.6. I did not try to change it.
  3. Yes, I solved the tasks with Scala in the Spark shell. However, you have to save all your commands in the provided text files. It was not necessary to build and submit a JAR manually, but there could be a task to submit a provided JAR to YARN.
  4. I do not know. You can use `hadoop fs` commands in the terminal to browse HDFS (see the commands after this list).
  5. I do not think so.
  6. You do not have to. Since the VM runs in your browser, it automatically uses your local keyboard layout.
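To illustrate point 4, a few common `hadoop fs` commands are shown below; the paths are made up.

```bash
# Browse HDFS from the terminal (paths are only examples)
hadoop fs -ls /user/exam                       # list a directory
hadoop fs -cat /user/exam/output/part-00000    # print a result file
hadoop fs -tail /user/exam/output/part-00000   # show the last kilobyte of a file
```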

Further information:

  • I think there were some links to the documentation on the desktop, but I did not use them.
  • You do not have to write CSV files with a header. Read carefully: the delimiter is not necessarily the same in all tasks.
  • The general question pattern is: read this from here, do something with it, write the results to here (see the sketch after this list).
  • Because only the output counts, you have to read the tasks carefully (ordering of columns, sorting of the data, delimiter in CSV files, ...).
  • It is up to you how to solve the tasks. You can use the RDD or SparkSQL API.
  • The exam is not really difficult if you work through the exam objectives.
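To make that "read, do something, write" pattern concrete, here is a minimal sketch in the Spark 1.6 shell using the RDD API. The paths, columns, delimiters, and filter condition are invented for illustration; they are not exam content.

```scala
// Minimal sketch of the read -> transform -> write pattern in the Spark 1.6 shell.
// sc is already defined there; all paths, columns and delimiters are invented.
val raw = sc.textFile("/user/exam/input/orders.csv")          // read from HDFS

val result = raw
  .map(_.split(","))                                          // input delimiter: comma
  .filter(fields => fields(2).toDouble > 100.0)               // "do something with it"
  .map(fields => (fields(0), fields(2)))                      // pick and order the columns
  .sortByKey()                                                // sort as the task requires
  .map { case (id, amount) => s"$id|$amount" }                // output delimiter: pipe

result.saveAsTextFile("/user/exam/output/orders-filtered")    // write the results
```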


@Stefan Frankenhauser: For #3, are we supposed to just write the Scala code that will run in the Spark shell, or do we need to create an object/class and define the SparkContext, SQLContext, etc.?

Contributor

You write your code in the Spark shell. The SparkContext (`sc`) and SQLContext (`sqlContext`) are already available.
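For example, in the Spark 1.6 shell both handles already exist when the prompt appears, so you can use them directly (just an illustration):

```scala
// Both handles are created when the Spark shell starts; no initialization needed.
sc           // the SparkContext
sqlContext   // the SQLContext (typically a HiveContext on HDP)
val doubled = sc.parallelize(1 to 5).map(_ * 2).collect()   // quick sanity check
```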

New Contributor

Can we use only the SparkSQL API to solve all tasks?

Contributor

No, I don't think so. You also need some RDD knowledge, for example to read a CSV file and transform it into a DataFrame.
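A minimal sketch of that RDD-to-DataFrame step in the Spark 1.6 shell might look like this; the file path and column names are invented for illustration.

```scala
// Sketch: read a CSV with the RDD API, then convert it to a DataFrame so the
// SparkSQL API can be used. Path and column names are invented.
import sqlContext.implicits._                      // enables .toDF on RDDs

val customers = sc.textFile("/user/exam/input/customers.csv")
  .map(_.split(","))
  .map(f => (f(0).toInt, f(1), f(2)))              // id, name, city
  .toDF("id", "name", "city")

customers.registerTempTable("customers")
sqlContext.sql("SELECT city, COUNT(*) AS n FROM customers GROUP BY city").show()
```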