Member since: 06-06-2017
Posts: 45
Kudos Received: 8
Solutions: 8
My Accepted Solutions
Views | Posted
---|---
1577 | 11-03-2017 09:04 AM
3769 | 10-06-2017 07:57 AM
1835 | 09-01-2017 07:01 PM
4746 | 08-08-2017 09:20 AM
717 | 07-26-2017 02:05 PM
07-25-2018
01:19 PM
I am sorry, but I couldn't find proper documentation about spark.ui.proxyBase. Anyway, this property just tells Spark that you are accessing its UI through a proxy, and that the base path at which the proxy forwards requests to Spark is the one you specify. It is a valid property which has been there for a while, so I am actually a bit surprised that you were able to proxy the Spark History Server through Knox (or any other proxy) without setting it in HDP 2.6.2. It is not a patch; it is the intended way for Spark to work, I confirm that. In HDP 3, though, that setting will not be needed anymore (if and only if the proxy in front of the Spark History Server is Knox, or it provides the header mentioned in the JIRA) thanks to SPARK-24209 and the related Knox work.
07-25-2018
09:01 AM
I'm not sure what you mean by "working earlier". Without setting that parameter, proxies are not supported.
07-24-2018
03:07 PM
1 Kudo
@lawang mishra if you set export SPARK_HISTORY_OPTS="-Dspark.ui.proxyBase=/gateway/default/sparkhistory", you should see "/gateway/default/sparkhistory" in setUIRoot. If that doesn't happen, there is probably something wrong in your configuration. You can check it by sourcing spark-env.sh and then checking the content of SPARK_HISTORY_OPTS.
07-20-2018
07:19 AM
Hi @lawang mishra! Unfortunately I cannot see the picture. I get redirected to your environment, which I cannot access since I don't have credentials, of course. Can you please check with your browser which API is called (i.e. the SHS uses a REST call to retrieve the application list; can you please report here the URL it tries to access)? Moreover, can you please check in the source code of the HTML main page which value is passed as the argument of setUIRoot?
03-10-2018
01:02 PM
Sorry for the late answer. Can you try adding export SPARK_HISTORY_OPTS="-Dspark.ui.proxyBase=/gateway/default/sparkhistory" to spark-env.sh? This should solve the issue. Thanks.
02-14-2018
07:55 AM
No, writing to a text table is faster than writing to an ORC table, independently of what your source is. Reading, instead, is always faster from an ORC table.
02-13-2018
03:42 PM
@PJ the format of the input table doesn't matter. When Spark reads a table (whatever format it has), it brings it into its internal memory representation. Then, when Spark writes the new table, it "converts" its internal representation of the result into the output format. Since the ORC format is way more complex than the text one, it takes longer to write, and even longer if it is compressed. It takes longer simply because there is more work to do to write an ORC file than a simple text file. ORC is faster than text in reading, but it is slower in writing. The assumption is that you write the table once and then read it many times, so the critical part is reading: that is why ORC is claimed to be faster, because in general it is the more convenient choice. But if you are concerned only about write performance (and you don't care about disk space), and you only ever write your data and never read it (not that it makes much sense to write something which is never read), then the text format is surely more performant.
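To make the comparison concrete, here is a minimal PySpark sketch (the table name and output paths are made up): it writes the same DataFrame once as plain text (CSV) and once as ORC, and only the output format changes the cost of the write.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-vs-text-write").getOrCreate()

# Hypothetical source table: the input format does not matter, Spark reads it
# into its internal representation either way.
df = spark.table("source_table")

# Plain text/CSV write: rows are simply serialized as strings, so it is cheap.
df.write.mode("overwrite").csv("/tmp/output_text")

# ORC write: columnar encoding, indexes and (optional) compression add work.
df.write.mode("overwrite").orc("/tmp/output_orc")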
02-13-2018
09:03 AM
Hi, this makes sense. The data needs to be converted to ORC and compressed before being written, so it is normal that it is slower. How much slower depends on multiple factors. For sure, snappy is faster than zlib, but it takes a bit more disk space. And no compression is even faster, but again you need more disk space. The benefits of using ORC over text are multiple, though, and some of them are:
1. It requires less disk space;
2. You need to read less data;
3. Your queries on the resulting table will read only the columns they need (so if you have a lot of columns and each query on the result table touches just a few of them, you get a great performance gain by using ORC).
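As a rough illustration, this is how you could pick the ORC compression codec when writing from Spark (a sketch; the input table and output paths are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-compression").getOrCreate()
df = spark.table("source_table")  # hypothetical input table

# snappy: faster to write, slightly larger files.
df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/orc_snappy")

# zlib: slower to write, smaller files.
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_zlib")

# no compression: fastest to write, largest files.
df.write.mode("overwrite").option("compression", "none").orc("/tmp/orc_none")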
02-08-2018
03:42 PM
Can you provide the code you are using? A simplified version without your logic, of course. The first thing to do is to cache the RDDs you are reusing. It is hard to say without some actual code, but if you are always starting from the same RDD for the 1000 iterations, you definitely need to cache it before your loop. It might also be worth caching the RDD before your min operations and unpersisting it after the collect. But I might have misunderstood your flow, since there is no code, just your description.
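Just to illustrate what I mean, a minimal PySpark sketch (the input path, the transformation and the 1000 iterations are made up), caching the base RDD once before the loop and unpersisting it afterwards:
from pyspark import SparkContext

sc = SparkContext(appName="cache-before-loop")

# Hypothetical base RDD reused in every iteration: cache it once, before the
# loop, so it is not recomputed from the source at every iteration.
base_rdd = sc.textFile("/tmp/input").map(lambda line: float(line))
base_rdd.cache()

results = []
for i in range(1000):
    # Hypothetical per-iteration transformation; the min() action triggers it.
    results.append(base_rdd.map(lambda x: x * i).min())

base_rdd.unpersist()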
02-08-2018
03:35 PM
For 1, you can enable checkpointing: https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing. Be careful, since in older Spark versions (1.x and early 2.x) checkpointing works only if the code is not changed: i.e. if you change your code, before re-submitting the application with the new code you have to delete the checkpoint directory (which means that you will lose data, exactly as you are experiencing now). For 2, you have to do that on your own. In your Spark application you can collect the names of the files you have processed and then delete them. Be careful to delete them only when you are sure that Spark has actually processed them.
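For point 1, a minimal PySpark sketch of checkpoint recovery (the checkpoint and input directories are made up): the context is rebuilt from the checkpoint directory if one exists, otherwise it is created from scratch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"  # hypothetical path

def create_context():
    sc = SparkContext(appName="checkpointed-streaming")
    ssc = StreamingContext(sc, batchDuration=60)
    ssc.checkpoint(CHECKPOINT_DIR)
    # Hypothetical directory monitored for new input files.
    ssc.textFileStream("hdfs:///tmp/incoming").count().pprint()
    return ssc

# Recover from the checkpoint if present, otherwise build a fresh context.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()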
02-08-2018
03:22 PM
The persisted RDD data is stored either in memory or on disk, according to the specified level. If it is stored in memory, every partition of the RDD is stored on the executor where it is computed. If all the partitions are on a single executor, then all the RDD data is cached in it. In this case, unless you cause a shuffle, all subsequent operations are performed on that executor. A shuffle can be caused explicitly (using repartition, for instance) or implicitly (some operations, like groupBy, can cause it). Anyway, from the Spark UI (port 4040 of the node where the driver is running, or, if you are using YARN, you can access it from the RM UI through the "Application Master" link of your Spark application) you can check where your data is stored (in the Executors tab) and whether subsequent operations are all performed on the same executor or not (from the Stages page of the relevant job).
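For reference, a minimal PySpark sketch of persisting with an explicit storage level (the input path is made up):
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-levels")
rdd = sc.textFile("/tmp/input")  # hypothetical input

# MEMORY_ONLY keeps partitions on the executors that computed them;
# MEMORY_AND_DISK spills to the executors' local disks when memory is short.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.count())  # first action materializes and caches the partitions
print(rdd.count())  # second action reads the cached partitions from the executors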
02-08-2018
03:13 PM
I think you can consider using Livy: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/.
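To give an idea, this is a sketch of submitting a batch through Livy's REST API with plain Python requests (the Livy host, the jar path and the class name are made up; 8999 is Livy's default port):
import json
import requests

LIVY_URL = "http://livy-host:8999"  # hypothetical Livy endpoint

# Submit a batch job through Livy's REST API (POST /batches).
payload = {
    "file": "hdfs:///user/me/my-spark-job.jar",  # hypothetical application jar
    "className": "com.example.MyJob",            # hypothetical main class
    "args": ["2018-02-08"],
}
resp = requests.post(LIVY_URL + "/batches",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
batch = resp.json()

# Check the batch state (poll until it reaches a terminal state).
state = requests.get(LIVY_URL + "/batches/{}".format(batch["id"])).json()["state"]
print(state)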
11-06-2017
08:26 AM
Since your data is made of logs, I assume it arrives in a timely manner (today you are importing today's or yesterday's logs, and so on). In this case, a suitable option would be to partition both tables by day (or you can choose a different time granularity according to how much data you have) and then sync the changed partitions, overwriting the ORC table partitions with the content of the updated JSON table partitions. As a side note, I think you don't need to create the ORC table as external.
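A sketch of what I mean by syncing a changed partition from Spark SQL (the table, column and partition names are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sync-partitions").enableHiveSupport().getOrCreate()

day = "2017-11-05"  # hypothetical partition that changed

# Overwrite only the changed partition of the ORC table with the content of
# the same partition of the JSON table (table and column names are made up).
spark.sql("""
    INSERT OVERWRITE TABLE logs_orc PARTITION (ds = '{d}')
    SELECT col1, col2, col3
    FROM logs_json
    WHERE ds = '{d}'
""".format(d=day))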
11-03-2017
09:04 AM
1 Kudo
Merge is not happening because you are writing with Spark, not through Hive, so all those configurations don't apply. Here you might have two causes for the large number of files:
1 - Spark has a default parallelism of 200 and it writes one file per partition, so each Spark minibatch will write 200 files. This can be easily solved, especially if you are not writing a lot of data at each minibatch, by reducing the parallelism before writing using `coalesce` (possibly down to 1 to write only one file per minibatch).
2 - Spark will anyway write (at least) one file per minibatch, and how many you accumulate depends on the frequency at which you schedule them. In this case, the solution is to periodically schedule a CONCATENATE job (but be careful, you might encounter HIVE-17280 -> HIVE-17403) or to write your own application with your own logic to do the concatenation.
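For point 1, a minimal sketch of reducing the parallelism before the write (the source and target table names are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-before-write").enableHiveSupport().getOrCreate()

# Hypothetical minibatch result: with the default 200 shuffle partitions this
# could produce up to 200 files per write.
batch_df = spark.table("staging_batch")

# Coalesce to a single partition so each minibatch writes only one file.
batch_df.coalesce(1).write.mode("append").format("orc").saveAsTable("target_table")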
10-19-2017
12:16 PM
Sorry, but I am still unable to reproduce the issue...
10-17-2017
10:02 AM
I am not able to reproduce this. Can you please share the exact steps you performed to get this exception? i.e. the tables' DDL, how you created them, which spark-llap version you are using, the Spark code you run to get it, ... Thanks.
10-17-2017
07:55 AM
You can use this processor to run the command over `ssh` on that Linux machine, as explained here.
10-12-2017
07:40 AM
I would suggest NiFi. It is available as HDF from Hortonworks. It is very easy to use since it is a graphical tool, and it supports real-time solutions and data transformation. It has a built-in connector for Kafka and a built-in JDBC connector you can use to write to DB2.
10-06-2017
07:57 AM
That file is needed only for performance reasons. It works like a cache; otherwise, you would have to upload the jars every time an application starts. Your problem might be that you have a root folder in your tar.gz. In that case, if you list the files in the archive, you should see something like:
./one.jar
./another.jar
...
Instead, there should be no root folder, and listing the files should give:
one.jar
another.jar
...
If this is the case, here are some examples of how to do it: https://stackoverflow.com/questions/939982/how-do-i-tar-a-directory-of-files-and-folders-without-including-the-directory-it. Hope this helps.
10-04-2017
07:49 AM
Are you using impersonation (i.e. hive.server2.enable.doAs=true)? Which version of HDP are you running? Thanks.
10-02-2017
12:55 PM
1 Kudo
Why do you need this functionality in a single processor? If it is for reusability, I'd suggest you create a process group and use it wherever you need it. If it is for locking and synchronization, I'd suggest using something like ZooKeeper to keep the status of the ongoing process (here you can find some processors to interact with ZooKeeper). Anyway, if you still need to write your own custom processor, then please check this tutorial: https://community.hortonworks.com/articles/4318/build-custom-nifi-processor.html
09-07-2017
09:02 AM
Spark Streaming reads only new files. You should not change existing files; otherwise, Spark Streaming is not expected to work properly.
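A minimal PySpark sketch of that behaviour (the monitored directory is made up): only files that appear in the directory after the stream starts are processed.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="file-stream")
ssc = StreamingContext(sc, batchDuration=30)

# Only files moved or written into this directory after the stream starts are
# picked up; modifying files that are already there is not supported.
ssc.textFileStream("hdfs:///tmp/incoming").count().pprint()

ssc.start()
ssc.awaitTermination()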
09-01-2017
07:01 PM
There is no problem in this log; it's only a warning. It occurs because you have another Spark application running, so port 4040 (the default port for the Spark monitoring UI) is already in use. Spark then tries port 4041, and so on, until it finds a free port, as you can see in the log that follows. The only problem I can see is that you are specifying an invalid port number (100015 is not a valid port number; I guess you meant 10015).
08-21-2017
08:15 AM
You can retrieve the value of the aggregate query like this: aggr_value = df.select("your query").collect()[0][0] Then you can use it in the following queries like any other variable.
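A minimal PySpark sketch of that pattern (the table and column names are made up):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scalar-from-aggregate").getOrCreate()
df = spark.table("sales")  # hypothetical table with an "amount" column

# Pull the aggregate back to the driver as a plain Python value.
avg_amount = df.select(F.avg("amount")).collect()[0][0]

# Reuse it like any other variable in later queries.
df.filter(F.col("amount") > avg_amount).show()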
08-08-2017
09:20 AM
1 Kudo
I guess there are some errors in your DDL. The first one I can see is that location should be:
array<struct<x: double, y: double>>
Please try with this change and see whether it works or whether there are other problems.
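For illustration only, a sketch of a DDL using that type, run through Spark SQL (the table name and the other column are made up, since I don't have your full DDL):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nested-ddl").enableHiveSupport().getOrCreate()

# Hypothetical table showing the corrected type for "location": an array of
# structs, each with two double fields x and y.
spark.sql("""
    CREATE TABLE IF NOT EXISTS places (
        name STRING,
        location ARRAY<STRUCT<x: DOUBLE, y: DOUBLE>>
    )
    STORED AS ORC
""")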
08-08-2017
09:15 AM
You can run multiple Spark applications simultaneously if your cluster has enough resources. If you are not exploiting all the resources you have allocated, you should just reduce the allocated resources; in this way you can run multiple applications.
08-08-2017
08:40 AM
First of all, are you using all the resources of your cluster? i.e. is your Spark application using all the resources? If so, are they really used by your Spark process? If not, you can scale horizontally, launching the model creation for multiple organizations at the same time... If the resources are not enough for you, you can always scale up your cluster size...
07-27-2017
03:35 PM
It depends on which queries you want to run against your data. If you have simple queries on the PKs, for instance, Phoenix+HBase might be the right choice. Presto and Vertica are not meant for interactive queries, AFAIK. Thus, I'd definitely recommend an RDBMS for interactive queries.
07-27-2017
11:33 AM
What you can do is:
- list all tables in the schema: \dt YOUR_SCHEMA.*
- then get the create table statement for each of them via: \d+ table_name
07-26-2017
02:09 PM
Hi, have you tried with \d table_name
from the psql command line?