Member since: 06-06-2017
Posts: 45
Kudos Received: 8
Solutions: 8
My Accepted Solutions

| Title | Views | Posted |
| --- | --- | --- |
|  | 1712 | 11-03-2017 09:04 AM |
|  | 4103 | 10-06-2017 07:57 AM |
|  | 2056 | 09-01-2017 07:01 PM |
|  | 5127 | 08-08-2017 09:20 AM |
|  | 782 | 07-26-2017 02:05 PM |
07-25-2018
01:19 PM
I am sorry, but I couldn't find proper documentation about spark.ui.proxyBase. This property tells Spark that you are accessing its UI through a proxy, and its value is the base path at which the proxy forwards requests to Spark. It is a valid property that has been around for a while, so I am actually a bit surprised that you were able to proxy the Spark History Server through Knox (or any other proxy) in HDP 2.6.2 without setting it. I can confirm it is not a patch; it is the intended way for Spark to work. In HDP 3, however, that setting will no longer be needed (if and only if the proxy in front of the Spark History Server is Knox, or it provides the header mentioned in the JIRA) thanks to SPARK-24209 and the related Knox work.
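For illustration only, a minimal Scala sketch of the same property applied to an application UI sitting behind a proxy; the base path `/gateway/default/spark` is a hypothetical example, and for the History Server the property is instead passed through SPARK_HISTORY_OPTS, as in the answers below.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical base path at which the proxy forwards requests to the Spark UI.
val proxyBase = "/gateway/default/spark"

val spark = SparkSession.builder()
  .appName("proxy-base-example")
  // Tell Spark that its UI is reached through a proxy rooted at proxyBase,
  // so the links it generates are prefixed with that path.
  .config("spark.ui.proxyBase", proxyBase)
  .getOrCreate()
```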
07-25-2018
09:01 AM
I'm not sure what you mean by "working earlier". Without setting that parameter, proxies are not supported.
07-24-2018
03:07 PM
1 Kudo
@lawang mishra if you set export SPARK_HISTORY_OPTS="-Dspark.ui.proxyBase=/gateway/default/sparkhistory" you should see "/gateway/default/sparkhistory" passed to setUIRoot. If that doesn't happen, there is probably something wrong in your configuration. You can check it by executing spark-env.sh and then inspecting the content of SPARK_HISTORY_OPTS.
07-20-2018
07:19 AM
Hi @lawang mishra! Unfortunately I cannot see the picture: I get redirected to your environment, which I cannot access since, of course, I don't have credentials. Can you please check in your browser which API is being called (the SHS uses a REST call to retrieve the application list; can you report here the URL it tries to access)? Moreover, can you please check in the source code of the HTML main page which value is passed as the argument of setUIRoot?
03-10-2018
01:02 PM
Sorry for the late answer. Can you try adding export SPARK_HISTORY_OPTS="-Dspark.ui.proxyBase=/gateway/default/sparkhistory" to the spark-env? This should solve the issue. Thanks.
02-14-2018
07:55 AM
No, writing to a text table is faster than writing ORC, independently of what your source is. Reading, on the other hand, is always faster from an ORC table.
02-13-2018
03:42 PM
@PJ the format of the input table doesn't matter. When Spark reads a table (whatever its format), it brings it into its internal in-memory representation. Then, when Spark writes the new table, it converts that internal representation of the result into the output format. Since the ORC format is far more complex than plain text, it takes longer to write, and even longer if it is compressed: there is simply more work to do to produce an ORC file than a simple text file. ORC is faster than text for reading, but slower for writing. The assumption is that you write a table once and then read it many times, so the critical part is reading; that is why ORC is claimed to be faster, because in general it is the more convenient trade-off. But if you are concerned only about write performance (and you don't care about disk space) and you only ever write your data and never read it (though it hardly makes sense to write something that is never read), then the text format is surely more performant.
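As a rough illustration, a minimal Scala sketch (the table name and output paths are made up) that writes the same DataFrame once as delimited text and once as ORC; the ORC write does strictly more work (columnar encoding plus optional compression), which is where the extra time goes.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-vs-text-write").getOrCreate()

// Hypothetical source table; its format is irrelevant, since Spark reads it
// into its internal representation either way.
val df = spark.table("source_db.events")

// Delimited text output: rows are simply serialized as strings.
df.write.mode("overwrite").format("csv").save("/tmp/events_text")

// ORC output: columnar encoding and (optionally) compression, hence the slower write.
df.write.mode("overwrite").format("orc").save("/tmp/events_orc")
```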
02-13-2018
09:03 AM
Hi, this makes sense. The data needs to be converted to ORC and compressed before being written, so it is normal that it is slower. How much slower depends on multiple factors: for sure, snappy is faster than zlib but takes a bit more disk space, and no compression is faster still but needs even more disk space. The benefits of using ORC over text are numerous, though; among them:
1. It requires less disk space.
2. You need to read less data.
3. Queries on the resulting table read only the columns they need (so if you have many columns and each query touches just a few of them, you get a great performance gain from ORC).
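To make the trade-off concrete, a minimal Scala sketch (paths, table and column names are hypothetical): the compression codec is chosen at write time, and reading back only a couple of columns is where ORC's column pruning pays off.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-compression-example").getOrCreate()

// Hypothetical wide source table.
val df = spark.table("source_db.events")

// Pick the codec explicitly: "snappy" (faster), "zlib" (smaller), or "none".
df.write
  .mode("overwrite")
  .option("compression", "snappy")
  .format("orc")
  .save("/tmp/events_orc_snappy")

// Read back only two columns: with ORC, Spark skips the other columns entirely.
spark.read.format("orc")
  .load("/tmp/events_orc_snappy")
  .select("event_id", "event_time")
  .show(10)
```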
02-08-2018
03:35 PM
For 1, you can enable checkpointing: https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing. Be careful: in older Spark versions (1.x and early 2.x), checkpointing works only if the code is not changed, i.e. if you change your code, you have to delete the checkpoint directory before re-submitting the application with the new code (which means you will lose data, exactly as you are experiencing now). For 2, you have to do that on your own: in your Spark application you can collect the names of the files you have processed and then delete them. Be careful to delete them only when you are sure that Spark has actually processed them.
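For point 1, a minimal Scala sketch of the usual DStream checkpoint recovery pattern (the checkpoint directory, batch interval and file-stream source are placeholders): on restart with unchanged code the context is rebuilt from the checkpoint, otherwise the checkpoint directory has to be deleted first.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder checkpoint location; it should live on HDFS or another reliable filesystem.
val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-stream")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)

  // Placeholder source: watch a directory for new text files.
  val lines = ssc.textFileStream("hdfs:///tmp/incoming")
  lines.count().print()

  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```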
11-03-2017
09:04 AM
1 Kudo
Merge is not happening because you are writing with Spark, not through Hive, so none of those configurations apply. There are two likely causes for the large number of files:
1 - Spark has a default parallelism of 200 and writes one file per partition, so each Spark minibatch writes 200 files. This is easily solved, especially if you are not writing much data in each minibatch, by reducing the parallelism before writing with `coalesce` (possibly down to 1, to write only one file per minibatch), as in the sketch below.
2 - Spark will in any case write (at least) one file per minibatch, and how many files you accumulate depends on how frequently the minibatches are scheduled. In this case the solution is to periodically schedule a CONCATENATE job (but be careful, you might run into HIVE-17280 -> HIVE-17403), or to write your own application with your own concatenation logic.
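For point 1, a minimal Scala sketch (the target table name is hypothetical) of coalescing each minibatch's result down to a single partition before writing, so each minibatch produces one file instead of 200.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper invoked once per minibatch with that batch's result.
def writeBatch(batchDf: DataFrame): Unit = {
  batchDf
    .coalesce(1)                       // one partition => one output file per minibatch
    .write
    .mode(SaveMode.Append)
    .format("orc")
    .saveAsTable("target_db.events")   // hypothetical Hive table
}
```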