Member since: 06-06-2017
Posts: 45
Kudos Received: 8
Solutions: 8
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2599 | 11-03-2017 09:04 AM |
| | 5722 | 10-06-2017 07:57 AM |
| | 3029 | 09-01-2017 07:01 PM |
| | 6677 | 08-08-2017 09:20 AM |
| | 1468 | 07-26-2017 02:05 PM |
02-08-2018
03:35 PM
For 1, you can enable checkpointing: https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing. Be careful: in older Spark versions (1.x and early 2.x), checkpointing works only if the code is not changed, i.e. if you change your code, you have to delete the checkpoint directory before re-submitting the application with the new code (which means you will lose data, exactly as you are experiencing now). For 2, you have to do that on your own: in your Spark application you can collect the names of the files you have processed and then delete them. Be careful to delete them only when you are sure that Spark has actually processed them. A sketch of the checkpointing setup is shown below.
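For point 1, here is a minimal sketch of how checkpointing is typically wired in; the app name, batch interval and paths are hypothetical, and the pattern follows the programming guide linked above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Hypothetical checkpoint location; any HDFS path your user can write to works.
  val checkpointDir = "hdfs:///user/myuser/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)

    // Hypothetical monitored input directory: each new file lands in a minibatch.
    val lines = ssc.textFileStream("hdfs:///user/myuser/input")
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); after a failure it rebuilds
    // the context (and the pending minibatches) from the checkpoint directory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Remember that, as said above, on older versions you need to wipe the checkpoint directory whenever the application code changes.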
11-03-2017
09:04 AM
1 Kudo
The merge is not happening because you are writing with Spark, not through Hive, so all these configurations don't apply. There are two likely reasons for the large number of files: 1 - Spark has a default parallelism of 200 and writes one file per partition, so each Spark minibatch writes 200 files. This is easily solved, especially if you are not writing a lot of data at each minibatch, by reducing the parallelism before writing with `coalesce` (possibly down to 1 to write a single file per minibatch), as in the sketch below. 2 - Spark will in any case write (at least) one file per minibatch, and how many files you accumulate depends on how frequently you schedule them. In this case, the solution is to periodically schedule a CONCATENATE job (but be careful, you might encounter HIVE-17280 -> HIVE-17403) or to write your own application with your own concatenation logic.
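For point 1, a minimal sketch of what writing a single minibatch could look like; the paths, the format and the staging read that stands in for your minibatch DataFrame are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceBeforeWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CoalesceBeforeWrite").getOrCreate()

    // Stand-in for the DataFrame produced by one minibatch; read back from a
    // hypothetical staging path only to keep the example self-contained.
    val df = spark.read.orc("/tmp/staging/minibatch")

    // coalesce(1) collapses the default 200 shuffle partitions into one,
    // so a single file is written for this minibatch instead of 200.
    df.coalesce(1)
      .write
      .mode("append")
      .format("orc")                                  // match the format of your Hive table
      .save("/apps/hive/warehouse/mydb.db/mytable")   // hypothetical table location

    spark.stop()
  }
}
```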
10-17-2017
07:55 AM
You can use this processor to run the command on that Linux machine through `ssh`, as explained here.
10-06-2017
07:57 AM
That file is needed only for performance reasons: it works like a cache. Without it, the jars have to be uploaded every time an application starts. Your problem might be that you have a root folder inside your tar.gz. In that case, listing the files in the archive shows something like:
./one.jar
./another.jar
...
Instead, there should be no root folder, so the listing should look like:
one.jar
another.jar
...
If this is the case, here are some examples of how to build such an archive: https://stackoverflow.com/questions/939982/how-do-i-tar-a-directory-of-files-and-folders-without-including-the-directory-it. Hope this helps.
10-04-2017
07:49 AM
Are you using impersonation (i.e. hive.server2.enable.doAs=true)? Which version of HDP are you running? Thanks.
09-01-2017
07:01 PM
There is no problem in this log, it's only a warning: you have another Spark application running, so port 4040 (the default port for the Spark monitoring UI) is already in use. Spark then tries port 4041 and so on, until it finds a free port, as you can see in the following log. The only real problem I can see is that you are specifying an invalid port number (100015 is not a valid port number, I guess you meant 10015).
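If you prefer to avoid the warning altogether, you can pin the UI of each application to a port you know is free via the `spark.ui.port` setting; a minimal sketch, where the port value 4050 is just a hypothetical choice:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PinnedUiPort {
  def main(args: Array[String]): Unit = {
    // Pin this application's monitoring UI to a known free port (4050 is hypothetical)
    // instead of letting Spark probe 4040, 4041, ... and log the warning.
    val conf = new SparkConf()
      .setAppName("PinnedUiPort")
      .set("spark.ui.port", "4050")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(spark.version)
    spark.stop()
  }
}
```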
08-08-2017
09:20 AM
1 Kudo
I guess there are some errors in your DDL. The first one I can see is that `location` should be declared as: array<struct<x: double, y: double>>. Please try with this change and see whether it works or whether there are other problems.
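For reference, a minimal sketch of a table declaring a column with that type; everything except the type of `location` is hypothetical, since I cannot see your full DDL, and the same CREATE TABLE statement can be run directly from Beeline without the Spark wrapper:

```scala
import org.apache.spark.sql.SparkSession

object LocationArrayDdl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LocationArrayDdl")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical table: only the type of `location` reflects the fix above.
    spark.sql(
      """
        |CREATE TABLE IF NOT EXISTS demo_points (
        |  id STRING,
        |  location ARRAY<STRUCT<x: DOUBLE, y: DOUBLE>>
        |)
        |STORED AS ORC
      """.stripMargin)

    spark.stop()
  }
}
```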
07-27-2017
03:35 PM
It depends on which queries you want to run against your data. If you have simple queries on the PKs, for instance, Phoenix+HBase might be the right choice. Presto and Vertica are not meant for interactive queries, AFAIK. Thus, I'd definitely recommend an RDBMS for interactive queries.
07-26-2017
02:05 PM
If you read the exception message carefully, it says: Row is not a valid JSON Object - JSONException: Unterminated string at 237 [character 238 line 1]. This is likely caused by some CR-LF characters inside your "msg" object. Hive interprets each line as a complete JSON record, so if your JSON contains newlines, Hive cannot parse it. Thus you have to clean/reformat your data before you can analyze it with Hive; one possible way is sketched below.
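A minimal sketch of one way to do the cleaning with Spark, assuming Spark 2.2+ (where the `multiLine` JSON option is available) and that each input file holds a single JSON document or a top-level JSON array; all paths are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object FlattenJsonRecords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenJsonRecords").getOrCreate()

    // Read JSON that may span several lines (embedded CR-LF inside "msg", etc.).
    // Note: multiLine expects one JSON document or a top-level array per file.
    val df = spark.read
      .option("multiLine", "true")
      .json("/data/raw/events")        // hypothetical input path

    // write.json emits exactly one JSON record per line, which is what the
    // Hive JSON SerDe expects.
    df.write
      .mode("overwrite")
      .json("/data/clean/events")      // hypothetical output path; point the Hive table here

    spark.stop()
  }
}
```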
07-25-2017
02:10 PM
Hi, the best option is to push the ETL output to a structured DB for interactive reports; Spark has high latency compared to such databases. You might also try SparkSQL, caching the tables that have to be queried, but I would not recommend this option. Spark Structured Streaming would be helpful for the ETL: I guess you are going to read the data from your source RDBMS as a stream, perform some transformations, and write the result to a new RDBMS for reporting, and that ETL application can be written with Spark Structured Streaming. However, at the moment Spark 2 is still not supported in the current HDP releases and Spark Structured Streaming is not mature yet. So, if you have to start your project now, I would suggest writing a simple SparkSQL application which you can run on Spark 1.6 and later on Spark 2 (once it is supported) with very few changes; a sketch of such a job is shown below. Thanks, Marco
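A minimal sketch of such a SparkSQL batch job, written against the Spark 1.6 API (`SQLContext`) so that it also runs on Spark 2 with minimal changes; all JDBC URLs, credentials, table and column names are hypothetical:

```scala
import java.util.Properties

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReportingEtl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReportingEtl"))
    val sqlContext = new SQLContext(sc) // on Spark 2.x you can switch to SparkSession

    // Hypothetical credentials shared by both databases.
    val props = new Properties()
    props.setProperty("user", "etl")
    props.setProperty("password", "secret")

    // Read the source table from the operational RDBMS (hypothetical URL/table).
    val orders = sqlContext.read
      .jdbc("jdbc:postgresql://source-db:5432/sales", "public.orders", props)

    // A toy transformation: total amount per order date.
    val dailyTotals = orders
      .groupBy("order_date")
      .sum("amount")
      .withColumnRenamed("sum(amount)", "total_amount")

    // Write the result to the reporting RDBMS that serves the interactive reports.
    dailyTotals.write
      .mode("overwrite")
      .jdbc("jdbc:postgresql://reporting-db:5432/reports", "public.daily_totals", props)
  }
}
```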