Member since: 09-29-2015
Posts: 67
Kudos Received: 45
Solutions: 10

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1990 | 05-25-2016 10:24 AM
 | 11976 | 05-19-2016 11:24 AM
 | 8445 | 05-13-2016 10:09 AM
 | 3106 | 05-13-2016 06:41 AM
 | 9044 | 03-25-2016 09:15 AM
01-28-2016
10:48 PM
Actually, I upgraded my Sandbox to the latest version of HDP. When I run a "locate" on my Sandbox, I no longer find a reference to any spark-1.4.1 jar, only 1.5.2 jars.
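For reference, this is roughly how I checked; the jar name pattern and the /usr/hdp path are what I'd expect on an HDP sandbox, so adapt as needed:

```bash
# Rebuild the locate index, then look for Spark assembly jars
sudo updatedb
locate spark-assembly | grep -E "1\.4\.1|1\.5\.2"

# Alternatively, search the HDP install tree directly
find /usr/hdp -name "spark-assembly*.jar" 2>/dev/null
```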
01-28-2016
10:17 PM
Interesting, SPARK_HOME is not defined in that file. I only have this comment: "# export SPARK_HOME # (required) When it is defined, load it instead of Zeppelin embedded Spark libraries". Would that explain why the Spark libraries embedded in the Zeppelin jar are used instead of the one defined in spark.yarn.jar?
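If you want to test that theory, defining the variable in conf/zeppelin-env.sh would look something like this (the path is an assumption based on a standard HDP layout):

```bash
# conf/zeppelin-env.sh
# Uncommenting and pointing SPARK_HOME at the cluster's Spark install
# makes Zeppelin load those libraries instead of its embedded ones
# (the path below assumes a standard HDP layout).
export SPARK_HOME=/usr/hdp/current/spark-client
```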
01-28-2016
10:13 PM
I have the same behaviour on my Sandbox (HDP 2.3.4). This seems strange because the version numbers in spark.yarn.jar and in spark.home seem to be totally bypassed. If you look at the jar zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar inside <ZEPPELIN-HOME>/interpreter/spark and extract the file META-INF/maven/org.apache.zeppelin/zeppelin-spark/pom.xml, you'll see this: <spark.version>1.4.1</spark.version>
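To check it yourself without unpacking the whole jar, something like this should do (assuming unzip is available and $ZEPPELIN_HOME points at your install):

```bash
# Print the pom.xml embedded in the interpreter jar to confirm which
# Spark version it was compiled against ($ZEPPELIN_HOME is a placeholder).
unzip -p "$ZEPPELIN_HOME/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar" \
  META-INF/maven/org.apache.zeppelin/zeppelin-spark/pom.xml \
  | grep spark.version
```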
12-03-2015
05:50 AM
1 Kudo
To be able to use both S3 and HDFS for your Hive table, you could use an external table with partitions pointing to different locations. Look for the passage that starts with "An interesting benefit of this flexibility is that we can archive old data on inexpensive storage" in this link: Hive def guide. To automate the process you could use cron, but I guess Falcon should also be possible.
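The idea in a minimal sketch (the table, column, paths and bucket name are made up for illustration, and s3a assumes the S3 connector is configured on your cluster):

```bash
# One external table whose partitions live on different storage tiers.
hive -e "
CREATE EXTERNAL TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

-- recent data stays on HDFS
ALTER TABLE logs ADD PARTITION (dt='2015-12-01')
  LOCATION 'hdfs:///data/logs/dt=2015-12-01';

-- archived data points at S3
ALTER TABLE logs ADD PARTITION (dt='2014-01-01')
  LOCATION 's3a://my-archive-bucket/logs/dt=2014-01-01';
"
```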
12-01-2015
02:34 PM
3 Kudos
A few months ago I asked a similar question and got this reply: https://issues.apache.org/jira/browse/HIVE-11937 So I don't think you can use the stats in Hive 0.14 for the kind of query you want to run; maybe with the next Hive version. A possible workaround would be to get the names of all the partitions in that table and have a script (in Python, Bash or Java) that generates a query for each partition, along the lines of the sketch below. Not sure it works, but you might give it a try.
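A rough Bash version of that workaround (database, table and column names are placeholders; it assumes a single partition key, since SHOW PARTITIONS prints lines like "dt=2015-11-01"):

```bash
# Emit one MAX() query per partition, then run them all.
hive -e "SHOW PARTITIONS mydb.mytable" | while read -r part; do
  key="${part%%=*}"
  val="${part#*=}"
  echo "SELECT MAX(value_col) FROM mydb.mytable WHERE ${key}='${val}';"
done > per_partition_queries.sql
hive -f per_partition_queries.sql
```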
11-18-2015
02:18 PM
1 Kudo
I confirm that DbVisualizer works fine with HiveServer2. We use it quite a lot in one big ETL-on-Hive project we have. Be careful: the free version has some limitations (for instance, you can't have two identical tabs open at the same time), and depending on how you use the tool you might want the commercial version.
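For anyone setting it up: DbVisualizer just needs the Hive JDBC driver and the HiveServer2 URL, which you can sanity-check from the shell first (host and user below are placeholders; 10000 is HiveServer2's default port):

```bash
# Smoke-test the same JDBC URL DbVisualizer will use, via beeline.
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -n myuser
```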
11-18-2015
12:26 AM
In my previous company we developed a rules engine/CEP based on Hadoop. I don't remember the exact reasons, but we discarded Drools (the other software on the market did not match our needs either), and Hive was definitely not an option because its latency was too high (take those last two sentences with a grain of salt: those design decisions were made 3 years ago, a lot has changed since, and you might reconsider those choices today).

The first implementation of the CEP was done with MapReduce and HBase (to maintain the states); the rules were loaded from a MySQL database and applied by the MapReduce job. Since we still had latency issues (due to MR), we started to move the code to Spark Streaming, still keeping HBase as the backend. Using HBase coprocessors was also an idea, but I can't say much because I left the company before seeing that change in production. The front-end was a graphical web drag-and-drop UI, so users could quickly implement their business logic without our help. I'm not sure my answer is exactly what you were looking for. If you find some good open-source CEP projects that suit you, please let me know; I'm still curious about it.
11-16-2015
08:32 PM
1 Kudo
In order to avoid, or at least reduce, the risk of the dangers mentioned above, a couple of recommendations:
- write the results to a temporary HDFS directory (on the same HDFS volume as the target directory)
- use a DFS command to move (instead of copy) the files into the target directory

Doing so, the move is essentially a metadata operation, so it is close to atomic and the risk of race conditions is quite low, as in the sketch below.
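```bash
# Sketch of the staging-then-move pattern (paths are placeholders).
TMP=/data/out/.staging_$$
TARGET=/data/out/current
hdfs dfs -mkdir -p "$TMP"
# ... the job writes its results into $TMP ...
# A move within the same HDFS namespace is a metadata-only rename,
# so the files appear in the target directory near-atomically.
hdfs dfs -mv "$TMP/*" "$TARGET/"
```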
11-11-2015
03:58 PM
1 Kudo
According to the documentation, the multiport_syslogtcp source is faster than syslogtcp. Does anybody have benchmarks that back that up? Please share your experience.
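For context, this is roughly how the multiport source is declared: a single source listening on several ports instead of one syslogtcp source per port (agent name, channel and ports below are illustrative):

```bash
# Write a minimal Flume agent config using the multiport syslog TCP source.
cat > /etc/flume/conf/agent.properties <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
EOF
```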
Labels:
- Apache Flume
11-11-2015
03:46 PM
1 Kudo
Instead of spending time writing a new SerDe, wouldn't it be possible to use the following approach?
1) Use a Regex SerDe (https://hive.apache.org/javadocs/r1.2.1/api/org/apache/hadoop/hive/serde2/RegexSerDe.html) to load into a first temporary table the 8 "key" columns and the last (String) dynamic column.
2) With a CTAS, insert the data into an ORC table, using the str_to_map() UDF to transform the dynamic string column into a map. This step would also put your data in a more performant format. A sketch of both steps follows.
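A rough sketch of both steps (the regex, table and column names are illustrative and assume space-separated fields; adapt them to your actual record layout):

```bash
# Write the two-step HiveQL script, then run it.
cat > dynamic_cols.sql <<'EOF'
-- 1) staging table: the 8 "key" columns captured by RegexSerDe groups,
--    plus the trailing free-form column (RegexSerDe requires STRING columns)
CREATE TABLE raw_tmp (
  k1 STRING, k2 STRING, k3 STRING, k4 STRING,
  k5 STRING, k6 STRING, k7 STRING, k8 STRING,
  dynamic STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '(\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (.*)'
);

-- 2) CTAS into ORC, converting the dynamic column into a map
--    (pairs separated by ',', key and value separated by '=')
CREATE TABLE events STORED AS ORC AS
SELECT k1, k2, k3, k4, k5, k6, k7, k8,
       str_to_map(dynamic, ',', '=') AS attrs
FROM raw_tmp;
EOF
hive -f dynamic_cols.sql
```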