Member since: 09-25-2015
Posts: 230
Kudos Received: 276
Solutions: 39
11-15-2015
01:09 AM
2 Kudos
The easiest way to change the number of mappers to the desired number is:
set tez.grouping.split-count=YOUR-NUMBER-OF-TASKS;
As pointed out by Andrew Grande, this is documented here: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
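As a hedged illustration (the split count and table name below are made up), the setting can be issued directly in a Hive session before running a query:
set tez.grouping.split-count=20; -- ask Tez to group input splits into roughly 20 mapper tasks
select count(*) from my_table; -- the next Tez job should launch about that many mappers
The actual task count can still vary a little, since Tez treats the value as a grouping hint rather than a hard limit.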
11-13-2015
03:06 AM
1 Kudo
@Scott Shaw, @Sourygna Luangsay I created a "minimum-viable SerDe" implementing what you described. See if it is what you need. PS: I'm assuming your last column will be a map<string,string>; I haven't done data type handling for the last column yet. For the key columns, it will respect the data types you declare when creating the table. From the shell:
wget https://github.com/gbraccialli/HiveUtils/raw/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
echo "a,b,c,adsfa,adfa" > /tmp/testserde.txt
echo "1,2,3,asdfasdf,sdfasd" >> /tmp/testserde.txt
echo "4,5,6,adfas,adf,d" >> /tmp/testserde.txt
hadoop fs -mkdir /tmp/testserde/
hadoop fs -put -f /tmp/testserde.txt /tmp/testserde/
hive
From hive:
add jar /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar;
drop table testserde;
create external table testserde (
field1 string,
field2 int,
field3 double,
maps map<string,string>
)
ROW FORMAT SERDE 'com.github.gbraccialli.hive.serde.NKeys_MapKeyValue'
WITH SERDEPROPERTIES (
"delimiter" = ","
)
LOCATION '/tmp/testserde/';
select * from testserde;
Source code is here: https://github.com/gbraccialli/HiveUtils and https://github.com/gbraccialli/HiveUtils/blob/master/src/main/java/com/github/gbraccialli/hive/serde/NKeys_MapKeyValue.java PS2: there are still lots of TODOs.
11-12-2015
07:33 PM
@hrongali@hortonworks.com I think a Hive UDF could implement the same logic, and it would be easier to consume than a map-reduce program. I think this UDF from Brickhouse does this: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/hbase/CachedGetUDF.java
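A hedged sketch of how it could be wired up in Hive (the jar path and function name are placeholders; check the Brickhouse docs for the UDF's exact arguments):
add jar /tmp/brickhouse-with-dependencies.jar; -- local path to the Brickhouse jar (assumption)
create temporary function hbase_cached_get as 'brickhouse.hbase.CachedGetUDF'; -- register under a hypothetical name
Once registered, the function can be called from a regular select, which is what makes it easier to consume than a standalone map-reduce job.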
11-11-2015
06:08 PM
1 Kudo
@azeltov@hortonworks.com I think the issues are: 1- you have to use file:// for local files; 2- using pyspark, you have to use print to see output. See the example below (working for me):
%pyspark
base_rdd = sc.textFile("file:///usr/hdp/current/spark-client/data/mllib/sample_libsvm_data.txt")
print base_rdd.count()
print base_rdd.take(3)
11-11-2015
02:40 PM
1 Kudo
4- Execute SQL, using the sql interpreter:
%sql
select geohash_encode(1.11,1.11,3) from sample_07 limit 10
It fails with the sql interpreter + Zeppelin:
java.lang.ClassNotFoundException: com.github.gbraccialli.hive.udf.UDFGeohashEncode
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
...
11-11-2015
02:40 PM
Second, the same with Zeppelin: 1- Restart the interpreter. 2- Load dependencies:
%dep
z.reset()
z.load("com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT")
3- Execute SQL, using the same Scala code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);
It worked with Scala code + Zeppelin!
11-11-2015
02:39 PM
2- Run spark-shell with the dependency:
spark-shell --master yarn-client --packages "com.github.gbraccialli:HiveUtils:1.0-SNAPSHOT"
3- Run the Spark code:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
sqlContext.sql("""create temporary function geohash_encode as 'com.github.gbraccialli.hive.udf.UDFGeohashEncode'""");
sqlContext.sql("""select geohash_encode(1.11,1.11,3) from sample_07 limit 10""").collect().foreach(println);
spark-shell worked fine!
11-11-2015
02:39 PM
1 Kudo
@azeltov@hortonworks.com See what I tried. First, with spark-shell: 1- Download the jar and register it to the local Maven repository:
su - zeppelin
wget https://raw.githubusercontent.com/gbraccialli/HiveUtils/master/target/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar -O /tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar
mvn org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file \
-Dfile=/tmp/HiveUtils-1.0-SNAPSHOT-jar-with-dependencies.jar \
-DgroupId=com.github.gbraccialli \
-DartifactId=HiveUtils \
-Dversion=1.0-SNAPSHOT \
-Dpackaging=jar
11-11-2015
02:23 PM
@Neeraj I needed to add the credential to hive-site for WASB to work inside Hive. Did it work for you with only hdfs-site?
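For reference, a minimal sketch of the kind of property involved (the storage account name and key are placeholders; whether it needs to live in hive-site.xml in addition to core-site.xml is exactly the question above):
<property>
  <name>fs.azure.account.key.YOUR_ACCOUNT.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>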
11-11-2015
02:59 AM
See this JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-150