About mgaido1

mgaido1 · ‎02-08-2018

For 1, you can enable checkpointing: https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing. Be careful since in older spark versions (1.X and early 2.X), checkpointing works only if the code is not changed: ie. if you change your code, before re-submitting the application with the new code you have to delete the checkpoint directory (which means that you will loose data exactly as you are experiencing now). For 2, you have to do that on your own. In your Spark application you can collect the names of the files you have processed and then you can delete them. Be careful to delete them only when you are sure that Spark has actually processed them.

mgaido1 · ‎11-03-2017

Merge is not happening because you are writing with Spark, not through Hive, thus all these configurations don't apply. Here you might have two reasons causing the big amount of files: 1 - Spark has a default parallelism of 200 and it writes one file per partition, thus each Spark minibatch will write 200 files. This can be easily solved, especially if you are not writing a lot of data at each minibatch reducing the parallelism before writing using `coalesce` (eventually using 1 to write only 1 file per minibatch). 2 - Spark will anyway write (at least) one file per minibatch and this depends on the frequency you are scheduling them. In this case, the solution is to schedule periodically a CONCATENATE job (but be careful you might encounter HIVE-17280->HIVE-17403) or you can write your own application with your logic to do the concatenation.

mgaido1 · ‎10-17-2017

You can use this processor to run the command through `ssh` command to that linux machine, like explained here.

mgaido1 · ‎10-06-2017

That file is needed only for performance reason. It works like a cache. Otherwise, you have to upload the jars everytime an application starts. Your problem might be that you have a root folder in your tar.gz. In this case, if you list your files in the archive, you should see something like ./one.jar ./another.jar ... Instead, you should have no root folder, and listing the files should be: one.jar another.jar ... If this is the case, here you have some examples how to do it: https://stackoverflow.com/questions/939982/how-do-i-tar-a-directory-of-files-and-folders-without-including-the-directory-it. Hope this helps.

mgaido1 · ‎10-04-2017

Are you using impersonation (ie. hive.server2.enable.doAs=true)? Which version of HDP are you running? Thanks.

mgaido1 · ‎09-01-2017

Here you have no problem in the log. It's only a warning. And this is because you have another Spark application running and then the port 4040 (which is the default port for the Spark Monitoring UI) is already in use. But then it will try on port 4041 and so on, until it finds a free port, as you can see in the following log, The only problem I can see is that you are specifying an invalid port number (100015 is not a valid port number, I guess you meant 10015).

mgaido1 · ‎08-08-2017

I guess there are some errors in your DDL. The first one I can see is that location should be: array<struct<x: double, y: double>> Please try with this change and see whether it works or there are other problems.

mgaido1 · ‎07-27-2017

It depends on which query do you want to run against your data. If you have simple queries on the PKs, for instance, Phoenix+HBase might be the right choice. Presto and Vertica are not meant for interactive queries, AFAIK. Thus, I'd definitely recommend a RDBMS for interactive queries.

mgaido1 · ‎07-26-2017

If you read carefully the exception message, it says: Row is not a valid JSON Object - JSONException: Unterminated string at 237 [character 238 line 1] This is likely to be caused by some CR-LF characters inside your "msg" object. Hive interprets each line as a full JSON. If you JSON contains newlines, Hive cannot parse it. Thus you have to clean/reformat you data before being able to analyze it with Hive.

mgaido1 · ‎07-25-2017

Hi, the best option is to get the ETL output to a structured DB for interactive reports. Spark has a high latency compared to them. You might also try to use SparkSQL caching the tables which have to be queried, but I'd not recommend this option. Spark Structured Streaming would be helpful for you for the ETL. I guess you are going to read in streaming the data from your source RDBMS and perform some transformation on the data and output it to a new RDBMS for your reporting purpose: this ETL application can be written using Spark Structured Streaming. Anyway, at the moment, Spark2 is still not supported in current HDP releases and Spark Structured Streaming is not perfect yet. So, if you have to start your project now, I would suggest you to write a simple SparkSQL application which you can run on Spark 1.6 and later on Spark 2 (when it will be supported) with very few changes. Thanks, Marco

Online	Offline
Last Visited	‎07-25-2018 03:28 PM

Member Since	‎06-06-2017 08:07 AM
Last Visited	‎07-25-2018 03:28 PM
Posts	45
Kudos received	8

Cloudera Community

Re: Spark Streaming Creating Small files in Hive

Re: Spark2 - Getting 'Could not find or load main ...

Re: cant start the thrift server

Re: Creating a HIVE table with structs inside stru...

Re: tweets sentinment hive 2.0 problem when I try ...

Re: In Spark Streaming how to process old data and...

Re: Spark Streaming Creating Small files in Hive

Re: How to execute the .sh script on a remote serv...

Re: Spark2 - Getting 'Could not find or load main ...

Re: Spark Thrift job problem HTTP ERROR 500

Re: cant start the thrift server

Re: Creating a HIVE table with structs inside stru...

Re: Can Apache Spark be used for interactive repor...

Re: tweets sentinment hive 2.0 problem when I try ...

Re: Can Apache Spark be used for interactive repor...