Member since: 09-25-2015
Posts: 230
Kudos Received: 276
Solutions: 39
11-15-2015
01:09 AM
2 Kudos
The easiest way to change the number of mappers to the desired number is:
set tez.grouping.split-count=YOUR-NUMBER-OF-TASKS;
As pointed out by Andrew Grande, this is documented here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
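As a hedged illustration (the split count and table name below are made up), the setting can be issued directly in a Hive session before running a query:
set tez.grouping.split-count=20; -- ask Tez to group input splits into roughly 20 mapper tasks
select count(*) from my_table; -- the next Tez job should launch about that many mappers
The actual task count can still vary a little, since Tez treats the value as a grouping hint rather than a hard limit.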
11-13-2015
03:06 AM
1 Kudo
@Scott Shaw, @Sourygna Luangsay I created a "minimum-viable SerDe" implementing what you described. See if it is what you need. PS: I'm assuming your last column will be a map<string,string>; I haven't done data type handling for the last column yet. For the key columns, it will respect the data types you declare when creating the table. From the shell:
wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
echo "a,b,c,adsfa,adfa" > /tmp/testserde.txt
echo "1,2,3,asdfasdf,sdfasd" >> /tmp/testserde.txt
echo "4,5,6,adfas,adf,d" >> /tmp/testserde.txt
hadoop fs -mkdir /tmp/testserde/
hadoop fs -put -f /tmp/testserde.txt /tmp/testserde/
hive
From hive:
add jar /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar;
drop table testserde;
create external table testserde (
field1 string,
field2 int,
field3 double,
maps map<string,string>
)
ROW FORMAT SERDE 'com.github.gbraccialli.hive.serde.NKeys_MapKeyValue'
WITH SERDEPROPERTIES (
"delimiter" = ","
)
LOCATION '/tmp/testserde/';
select * from testserde;
Source code is here: https://github.com/gbraccialli/HiveUtils and https://github.com/gbraccialli/HiveUtils/blob/master/src/main/java/com/github/gbraccialli/hive/serde/NKeys_MapKeyValue.java PS2: there are still lots of TODOs.
11-12-2015
07:33 PM
@hrongali@hortonworks.com I think a Hive UDF could implement the same logic, and it would be easier to consume than a map-reduce program. I think this UDF from Brickhouse does this: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/hbase/CachedGetUDF.java
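A hedged sketch of how it could be wired up in Hive (the jar path and function name are placeholders; check the Brickhouse docs for the UDF's exact arguments):
add jar /tmp/brickhouse-with-dependencies.jar; -- local path to the Brickhouse jar (assumption)
create temporary function hbase_cached_get as 'brickhouse.hbase.CachedGetUDF'; -- register under a hypothetical name
Once registered, the function can be called from a regular select, which is what makes it easier to consume than a standalone map-reduce job.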
11-11-2015
06:08 PM
1 Kudo
@azeltov@hortonworks.com I think the issues are: 1- you have to use file:// for local files; 2- using pyspark, you have to use print to see output. See the example below (working for me):
%pyspark
base_rdd = sc.textFile("file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
print base_rdd.count()
print base_rdd.take(3)
11-11-2015
02:40 PM
1 Kudo
4- Execute SQL, using the sql interpreter:
%sql
select geohash_encode(1.11,1.11,3) from sample_07 limit 10
It fails with the sql interpreter + Zeppelin:
java.lang.ClassNotFoundException: com.github.gbraccialli.hive.udf.UDFGeohashEncode
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
...
11-11-2015
02:40 PM
Second, the same with Zeppelin: 1- Restart the interpreter. 2- Load dependencies:
%dep
z.reset()
z.load("com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT")
3- Execute SQL, using the same Scala code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);
It worked with Scala code + Zeppelin!
11-11-2015
02:39 PM
2- Run spark-shell with the dependency:
spark-shell --master yarn-client --packages "com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT"
3- Run the Spark code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);
spark-shell worked fine!
11-11-2015
02:39 PM
1 Kudo
@azeltov@hortonworks.com See what I tried. First, with spark-shell: 1- Download the jar and register it to the local Maven repository:
su - zeppelin
wget https://raw.githubusercontent.com/gbraccialli/HiveUtils/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
mvn org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file \
-Dfile=/tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar \
-DgroupId=com.github.gbraccialli \
-DartifactId=HiveUtils \
-Dversion=1.0-SNAPSHOT \
-Dpackaging=jar
11-11-2015
02:23 PM
@Neeraj I needed to add the credential to hive-site for WASB to work inside Hive. Did it work for you with only hdfs-site?
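For reference, a minimal sketch of the kind of property involved (the storage account name and key are placeholders; whether it needs to live in hive-site.xml in addition to core-site.xml is exactly the question above):
<property>
  <name>fs.azure.account.key.YOUR_ACCOUNT.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>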
11-11-2015
02:59 AM
See this JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-150