Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5431 | 08-12-2016 01:02 PM |
| | 2204 | 08-08-2016 10:00 AM |
| | 2613 | 08-03-2016 04:44 PM |
| | 5519 | 08-03-2016 02:53 PM |
| | 1430 | 08-01-2016 02:38 PM |
06-09-2016
05:01 PM
I think you should post this as a new question. I am unfortunately an Eclipse guy 🙂
06-09-2016
10:55 AM
And I learned some new things as well. I never knew that Hadoop could go directly to LDAP. The static mapping is interesting too.
06-09-2016
08:59 AM
Aah, I didn't check which class was throwing the error. So it's the Thrift plugin that was compiled with JDK 8 and he is trying to run it with JDK 7, not the JDBC driver. Makes sense.
06-09-2016
08:55 AM
1 Kudo
An "unsupported major.minor version" error means the library cannot be executed because it was compiled for a newer Java version than the JVM running it; here it was compiled with Java 8 and your JDK is lower. However, it is very odd, because the normal Hive JDBC clients definitely work with Java 7. Where did you get it from? For reference, the class file major versions are:

J2SE 8 = 52
J2SE 7 = 51
J2SE 6.0 = 50
J2SE 5.0 = 49
JDK 1.4 = 48
JDK 1.3 = 47
JDK 1.2 = 46
JDK 1.1 = 45
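If you want to double-check which version a jar was built for, here is a minimal sketch that reads the major version straight out of a .class file extracted from the jar (the class name and file path are placeholders):

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ClassVersionCheck {
    public static void main(String[] args) throws IOException {
        // Pass the path of a .class file extracted from the suspect jar.
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            int magic = in.readInt();            // always 0xCAFEBABE for class files
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort();  // 52 = Java 8, 51 = Java 7, ...
            System.out.printf("magic=%08X major=%d minor=%d%n", magic, major, minor);
        }
    }
}
```

`javap -verbose` prints the same information if you have a JDK at hand.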
06-08-2016
02:58 PM
4 Kudos
spark-submit provides the --files flag to ship files to the executors' working directories, which works well if you have small files that do not change. Alternatively, as the others have suggested, put them in HDFS.
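A minimal sketch of the --files approach, assuming a file named lookup.csv shipped alongside the job (the file name and the trivial print logic are made up for illustration):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadShippedFile {
    public static void main(String[] args) throws Exception {
        // Submitted with: spark-submit --files lookup.csv --class ReadShippedFile app.jar
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("read-shipped-file"));
        // SparkFiles.get resolves the local path of a file distributed via --files
        String path = SparkFiles.get("lookup.csv");
        Files.lines(Paths.get(path)).forEach(System.out::println);
        sc.stop();
    }
}
```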
06-08-2016
01:24 PM
The only reason I could see for using both would be to run automated modeling tasks in Spark and push the models to Storm for scoring and event-driven prediction. It sounds like your graph would have different inputs and outputs at different speeds, which sounds more like Storm. However, if you CAN do it all with Spark Streaming it would be nice of course. I am just dubious regarding latency and the event model.
06-08-2016
01:02 PM
@SparkRocks Both can be made event driven, but Storm is much easier for this since it is not based on relatively synchronous mini-batches like Spark is. It is more natural to have a control-event spout in Storm IMO. I think your use case sounds more like Storm, but Spark Streaming might be made to work too, I suppose. For your use case I would look at Storm, perhaps coupled with some standard webservices.

Regarding integration of ML with Storm:
- I have seen an R integration in Storm: https://github.com/allenday/R-Storm
- I have seen the use of a PMML library to run models in Storm that were created, for example, in Spark (a subset of models can be exported as PMML): http://henning.kropponline.de/2015/09/06/jpmml-example-random-forest/ — a minimal sketch of the bolt pattern follows below.
- I have seen a demo where a Spark model context was instantiated in Storm to score a model. @Vadim in this community could help.

IMO JPMML would be the cleanest way to do it, though there are some limitations, since Spark for example only exports a subset of models to PMML and other tools also have limited support for this standard.
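For illustration, here is a minimal sketch of the load-once, score-per-tuple bolt pattern. The hardcoded linear model is a stand-in for a real JPMML evaluator (the actual JPMML API calls are omitted), and the tuple field name and weights are assumptions:

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import java.util.Map;

public class ScoringBolt extends BaseRichBolt {
    private transient OutputCollector collector;
    private transient double[] weights; // placeholder for a loaded model

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // In a real topology you would build a JPMML evaluator here from the
        // exported .pmml file, once per worker. A fixed linear model stands in.
        this.weights = new double[] {0.4, -1.2, 0.7};
    }

    @Override
    public void execute(Tuple input) {
        double[] features = (double[]) input.getValueByField("features"); // assumed field
        double score = 0.0;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        collector.emit(input, new Values(score));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("score"));
    }
}
```

The key design point is loading the model in prepare(), not per tuple, so scoring stays cheap enough for an event-driven pipeline.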
06-08-2016
11:40 AM
2 Kudos
In general, when you use date-organized tables, partitioning makes sense. If you are for example sure that data has been inserted in the last days, you can use this to restrict the amount of data that needs to be read, for example by partitioning by day and adding a WHERE condition at the end that only reads the last two days or so.

I assume you already use ORC? That data format has automatic indexing which should skip the majority of data blocks of your data set, assuming you can give a range the timestamp can be in (like WHERE timestamp > one week ago). Since Hive 1.2.0 you also have the ability to read footers first, which makes even better use of this feature. (By default, tasks are started but will close immediately; with the setting below, the files are opened for a quick peek into the footer before tasks are even assigned.)

hive.exec.orc.split.strategy=ETL

Finally, Hive can use statistics for query acceleration, but you need to be sure to have updated statistics, so autogather needs to be true. It can also lead to wrong results if you sideload data etc.

hive.compute.query.using.stats=true
hive.stats.autogather=true

So in summary: use partitioning, use ORC, and have a look at the parameters above and try them out. A sketch of how this comes together is below.
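For illustration, a minimal sketch over JDBC, assuming a table named events partitioned by a day column with a ts timestamp column (the table, columns, and connection details are all made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrunedQuery {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement st = con.createStatement()) {
            // Apply the settings discussed above for this session.
            st.execute("SET hive.exec.orc.split.strategy=ETL");
            st.execute("SET hive.compute.query.using.stats=true");
            st.execute("SET hive.stats.autogather=true");
            // The condition on the partition column (day) prunes partitions,
            // and the timestamp range lets ORC skip stripes via its indexes.
            try (ResultSet rs = st.executeQuery(
                    "SELECT count(*) FROM events " +
                    "WHERE day >= date_sub(current_date, 2) " +
                    "AND ts > date_sub(current_date, 7)")) {
                while (rs.next()) System.out.println(rs.getLong(1));
            }
        }
    }
}
```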
06-08-2016
09:42 AM
2 Kudos
I think you are framing it wrongly; you should give us your requirements (sub-1s or sub-5-second queries, data volumes queried, total data volume, number of requests per sec/min, reading only? daily updates or lots of writers?) and we can give an answer based on that. Benchmarks always change very much depending on requirements. As far as I am concerned they are pretty much meaningless for a specific use case.

Hive: Hive is an analytical aggregation tool and it is hard to get queries below 5s. You can tune it for up to a hundred parallel queries, but it will not work for sub-second queries. However, we will soon have LLAP, a persistent process that should bring down the minimum time significantly (to perhaps 1-2s). The good thing is that after that it scales to huge data volumes with very predictable behavior. So if you need to potentially aggregate billions of rows this would be the way to go.

Phoenix/HBase: Phoenix/HBase is much better for sub-second queries that scan a small amount of data (up to millions of rows). In these cases you will also get higher concurrency, so it would be a good fit for a webservice; however, you need to be careful in the data modeling, and it will not scale as predictably as Hive. It is also the only tool that I am aware of where you could have small updates as well.

Presto: I don't know it too well, but as I understood it, its main advantage is that you can run probabilistic queries that provide approximate results very, very quickly. Someone else might chime in.

Drill: Similar to Presto, I don't know too much about it. It seems to be quite impressive for a variety of use cases, but you don't have the wide support of Hive and Phoenix.

Kylin: OLAP engine for VERY fast queries of aggregated data that can be precomputed into a cube.

Druid: Almost forgot one. Druid sounds really interesting; they have an OLAP engine based on inverted indexes which might work great to aggregate large, but not huge, subsets of data.
06-07-2016
09:08 PM
Personally I like it off. It binds extra resources in the cluster, and the second query will be fast anyway. You also need to know how many sessions you want in advance, since it will redistribute queries to the pre-created sessions. If you don't care about the first query on a cold system being slow, keeping it off is the safer choice IMO.