Member since: 09-29-2015
Posts: 67
Kudos Received: 45
Solutions: 10

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1990 | 05-25-2016 10:24 AM
 | 11976 | 05-19-2016 11:24 AM
 | 8445 | 05-13-2016 10:09 AM
 | 3106 | 05-13-2016 06:41 AM
 | 9044 | 03-25-2016 09:15 AM
01-28-2016
10:48 PM
Actually, I upgraded my Sandbox to the latest version of HDP. When I run a "locate" on my Sandbox, I no longer find a reference to any spark-1.4.1 jar, only 1.5.2 jars.
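For reference, this is roughly how I checked; the jar name pattern and the /usr/hdp path are what I'd expect on an HDP sandbox, so adapt as needed:

```bash
# Rebuild the locate index, then look for Spark assembly jars
sudo updatedb
locate spark-assembly | grep -E "1\.4\.1|1\.5\.2"

# Alternatively, search the HDP install tree directly
find /usr/hdp -name "spark-assembly*.jar" 2>/dev/null
```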
01-28-2016
10:17 PM
Interesting, SPARK_HOME is not defined in that file. I only have this comment: "# export SPARK_HOME # (required) When it is defined, load it instead of Zeppelin embedded Spark libraries". Would that explain why the Spark libraries embedded in the Zeppelin jar are used instead of the one defined in spark.yarn.jar?
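If you want to test that theory, defining the variable in conf/zeppelin-env.sh would look something like this (the path is an assumption based on a standard HDP layout):

```bash
# conf/zeppelin-env.sh
# Uncommenting and pointing SPARK_HOME at the cluster's Spark install
# makes Zeppelin load those libraries instead of its embedded ones
# (the path below assumes a standard HDP layout).
export SPARK_HOME=/usr/hdp/current/spark-client
```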
01-28-2016
10:13 PM
I have the same behaviour on my Sandbox (HDP 2.3.4). This seems strange because the version numbers in spark.yarn.jar and in spark.home seem to be totally bypassed. If you look at the jar zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar inside <ZEPPELIN-HOME>/interpreter/spark and extract the file META-INF/maven/org.apache.zeppelin/zeppelin-spark/pom.xml, you'll see this: <spark.version>1.4.1</spark.version>
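To check it yourself without unpacking the whole jar, something like this should do (assuming unzip is available and $ZEPPELIN_HOME points at your install):

```bash
# Print the pom.xml embedded in the interpreter jar to confirm which
# Spark version it was compiled against ($ZEPPELIN_HOME is a placeholder).
unzip -p "$ZEPPELIN_HOME/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar" \
  META-INF/maven/org.apache.zeppelin/zeppelin-spark/pom.xml \
  | grep spark.version
```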
12-03-2015
05:50 AM
1 Kudo
To be able to use both S3 and HDFS for your Hive table, you could use an external table with partitions pointing to different locations. Look for the passage that starts with "An interesting benefit of this flexibility is that we can archive old data on inexpensive storage" in this link: Hive def guide. To automate the process you could use cron, but I guess Falcon should also be possible.
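The idea in a minimal sketch (the table, column, paths and bucket name are made up for illustration, and s3a assumes the S3 connector is configured on your cluster):

```bash
# One external table whose partitions live on different storage tiers.
hive -e "
CREATE EXTERNAL TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

-- recent data stays on HDFS
ALTER TABLE logs ADD PARTITION (dt='2015-12-01')
  LOCATION 'hdfs:///data/logs/dt=2015-12-01';

-- archived data points at S3
ALTER TABLE logs ADD PARTITION (dt='2014-01-01')
  LOCATION 's3a://my-archive-bucket/logs/dt=2014-01-01';
"
```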
12-01-2015
02:34 PM
3 Kudos
A few months ago I asked a similar question and got this reply: https://issues.apache.org/jira/browse/HIVE-11937 So I don't think you can use the stats in Hive 0.14 for the kind of query you want to run; maybe with the next Hive version. A possible workaround would be to get the names of all the partitions in that table and have a script (in Python, Bash or Java) that generates a query for each partition, along the lines of the sketch below. Not sure it works, but you might give it a try.
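A rough Bash version of that workaround (database, table and column names are placeholders; it assumes a single partition key, since SHOW PARTITIONS prints lines like "dt=2015-11-01"):

```bash
# Emit one MAX() query per partition, then run them all.
hive -e "SHOW PARTITIONS mydb.mytable" | while read -r part; do
  key="${part%%=*}"
  val="${part#*=}"
  echo "SELECT MAX(value_col) FROM mydb.mytable WHERE ${key}='${val}';"
done > per_partition_queries.sql
hive -f per_partition_queries.sql
```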
11-18-2015
02:18 PM
1 Kudo
I confirm that DbVisualizer works fine with HiveServer2. We use it quite a lot in one big ETL-on-Hive project we have. Be careful: the free version has some limitations (for instance, you can't have two identical tabs open at the same time), and depending on how you use the tool you might want the commercial version.
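For anyone setting it up: DbVisualizer just needs the Hive JDBC driver and the HiveServer2 URL, which you can sanity-check from the shell first (host and user below are placeholders; 10000 is HiveServer2's default port):

```bash
# Smoke-test the same JDBC URL DbVisualizer will use, via beeline.
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n myuser
```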
11-18-2015
12:26 AM
In my previous company we developed a rules engine/CEP based on Hadoop. I don't remember the exact reasons, but we discarded Drools (the other software on the market did not match our needs either), and Hive was definitely not an option because its latency was too high (take those last two sentences with a grain of salt: those design decisions were made 3 years ago, a lot has changed since, and you might reconsider those choices today).

The first implementation of the CEP was done with MapReduce and HBase (to maintain the states); the rules were loaded from a MySQL database and applied by the MapReduce job. Since we still had latency issues (due to MR), we started to move the code to Spark Streaming, still keeping HBase as the backend. Using HBase coprocessors was also an idea, but I can't say much because I left the company before seeing that change in production. The front-end was a graphical web drag-and-drop UI, so users could quickly implement their business logic without our help. I'm not sure my answer is exactly what you were looking for. If you find some good open-source CEP projects that suit you, please let me know; I'm still curious about it.
11-16-2015
08:32 PM
1 Kudo
In order to avoid, or at least reduce, the risk of the dangers mentioned above, a couple of recommendations:
- write the results to a temporary HDFS directory (on the same HDFS volume as the target directory)
- use a DFS command to move (instead of copy) the files into the target directory

Doing so, the move is essentially a metadata operation, so it is close to atomic and the risk of race conditions is quite low, as in the sketch below.
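```bash
# Sketch of the staging-then-move pattern (paths are placeholders).
TMP=/data/out/.staging_$$
TARGET=/data/out/current
hdfs dfs -mkdir -p "$TMP"
# ... the job writes its results into $TMP ...
# A move within the same HDFS namespace is a metadata-only rename,
# so the files appear in the target directory near-atomically.
hdfs dfs -mv "$TMP/*" "$TARGET/"
```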
11-11-2015
03:58 PM
1 Kudo
According to the documentation, the multiport_syslogtcp source is faster than syslogtcp. Does anybody have benchmarks that back that up? Please share your experience.
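For context, this is roughly how the multiport source is declared: a single source listening on several ports instead of one syslogtcp source per port (agent name, channel and ports below are illustrative):

```bash
# Write a minimal Flume agent config using the multiport syslog TCP source.
cat > /etc/flume/conf/agent.properties <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
EOF
```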
Labels:
- Apache Flume
11-11-2015
03:46 PM
1 Kudo
Instead of spending time writing a new SerDe, wouldn't it be possible to use the following approach?
1) Use a Regex SerDe (https://hive.apache.org/javadocs/r1.2.1/api/org/apache/hadoop/hive/serde2/RegexSerDe.html) to load into a first temporary table the 8 "key" columns and the last (String) dynamic column.
2) With a CTAS, insert the data into an ORC table, using the str_to_map() UDF to transform the dynamic string column into a map. This step would also put your data in a more performant format. A sketch of both steps follows.
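A rough sketch of both steps (the regex, table and column names are illustrative and assume space-separated fields; adapt them to your actual record layout):

```bash
# Write the two-step HiveQL script, then run it.
cat > dynamic_cols.sql <<'EOF'
-- 1) staging table: the 8 "key" columns captured by RegexSerDe groups,
--    plus the trailing free-form column (RegexSerDe requires STRING columns)
CREATE TABLE raw_tmp (
  k1 STRING, k2 STRING, k3 STRING, k4 STRING,
  k5 STRING, k6 STRING, k7 STRING, k8 STRING,
  dynamic STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '(\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (.*)'
);

-- 2) CTAS into ORC, converting the dynamic column into a map
--    (pairs separated by ',', key and value separated by '=')
CREATE TABLE events STORED AS ORC AS
SELECT k1, k2, k3, k4, k5, k6, k7, k8,
       str_to_map(dynamic, ',', '=') AS attrs
FROM raw_tmp;
EOF
hive -f dynamic_cols.sql
```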