Explorer
Posts: 12
Registered: 10-01-2015

UDF not working with CDH 5.9.0

We recently installed and configured CDH 5.9.0 on four high-memory CentOS Linux nodes on Google's Compute Cloud.  For the most part, CDH (Impala and Hive) is working as expected, with the exception of UDFs.

 

Following the instructions here (as much as possible):

http://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html

 

I successfully created a UDF (as per the section "Using Hive UDFs with Impala"):

 

>> create function udfs.myDayOfMonth(string) returns string location '/tmp/hive-udf2.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFDayOfMonth';

 

However, when I attempt:

>> select udfs.myDayOfMonth("2015-03-05");

it returns this error:

  • ImpalaRuntimeException: Unable to find evaluate function with the correct signature: org.apache.hadoop.hive.ql.udf.UDFDayOfMonth.evaluate(STRING) UDF contains: public org.apache.hadoop.io.IntWritable org.apache.hadoop.hive.ql.udf.UDFDayOfMonth.evaluate(org.apache.hadoop.hive.serde2.io.TimestampWritable) public org.apache.hadoop.io.IntWritable org.apache.hadoop.hive.ql.udf.UDFDayOfMonth.evaluate(org.apache.hadoop.hive.serde2.io.DateWritable) public org.apache.hadoop.io.IntWritable org.apache.hadoop.hive.ql.udf.UDFDayOfMonth.evaluate(org.apache.hadoop.io.Text)

The built-in function works fine:

>> select DayOfMonth("2015-03-05");
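
For reference, the signature Impala has registered for the UDF can be listed with something like the following (same udfs database as above):

>> show functions in udfs;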

 

Before this, I was attempting to create/use a custom C++ UDF that had worked fine in CDH 5.4.7, created as:

 

>> create function if not exists default.getcleanurl (string) returns string location '/user/impala/udfs/libudfibi.so' symbol='GetCleanUrl';

 

>> select default.getcleanurl(' hTtp://www.investopedia.com/a-BB-c/file.asp-more/stuff') as clearurl;

 

However, the select takes about 30 seconds and then returns this error:

  • Could not connect to hadp-inv-ibi-a.c.investopedia-1062.internal:21050 (code THRIFTTRANSPORT): TTransportException('Could not connect to hadp-inv-ibi-a.c.investopedia-1062.internal:21050',)

hadp-inv-ibi-a.c.investopedia-1062.internal is one of my three worker nodes.

 

Any suggestions? These errors don't seem related, except that they both involve UDFs.

Gord
Cloudera Employee
Posts: 246
Registered: 07-29-2015

Re: UDF not working with CDH 5.9.0

Hi Gord,

  The problem with the Java UDF is probably the return type - it's declared as "returns string" but the Java functions all return an IntWritable.

 

The C++ UDF problem looks like it may be crashing Impala somehow. Are you able to share the definition of the UDF?

 

Thanks,

Tim

Explorer
Posts: 12
Registered: 10-01-2015

Re: UDF not working with CDH 5.9.0

[ Edited ]

Thanks Tim - I overlooked the return type when trying a function different from the one in the example.  However, when I tried to execute the following Impala query:

 

create function udfs.myDayOfMonth(string) returns IntWritable location '/tmp/hive-udf2.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFDayOfMonth';

 

It would not run and returned the following error:

 

AnalysisException: Syntax error in line 1: ...ayOfMonth(string) returns IntWritable location '/tmp/h... ^ Encountered: IDENTIFIER Expected: ARRAY, BIGINT, BINARY, BOOLEAN, CHAR, DATE, DATETIME, DECIMAL, REAL, FLOAT, INTEGER, MAP, SMALLINT, STRING, STRUCT, TIMESTAMP, TINYINT, VARCHAR CAUSED BY: Exception: Syntax error

 

The example on the page I was following makes no mention of using only IntWritable return types.  But the page does say:

 

  • Prior to CDH 5.7 / Impala 2.5, the return type must be a "Writable" type such as Text or IntWritable, rather than a Java primitive type such as String or int. Otherwise, the UDF returns NULL. In CDH 5.7 / Impala 2.5 and higher, this restriction is lifted, and both UDF arguments and return values can be Java primitive types.
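
Presumably the statement is supposed to use the matching Impala type (INT for IntWritable) rather than the Java class name - i.e. something along these lines, which I have not re-tested yet:

>> create function udfs.myDayOfMonth(string) returns int location '/tmp/hive-udf2.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFDayOfMonth';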

 

Here is our custom function code:

 

// Copyright 2012 Cloudera Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include "udf-ibi.h"

#include <cctype>
#include <cmath>
#include <string>
#include <algorithm>
#include <vector>
#include <sstream>

// trim from start
static inline std::string &ltrim(std::string &s) {
	s.erase(s.begin(), std::find_if(s.begin(), s.end(), std::not1(std::ptr_fun<int, int>(std::isspace))));
	return s;
}

// trim from end
static inline std::string &rtrim(std::string &s) {
	s.erase(std::find_if(s.rbegin(), s.rend(), std::not1(std::ptr_fun<int, int>(std::isspace))).base(), s.end());
	return s;
}

// trim from both ends
static inline std::string &trim(std::string &s) {
	return ltrim(rtrim(s));
}

// replace strings
void replace_all(std::string& str, const std::string& from, const std::string& to) {
	if (from.empty())
		return;
	size_t start_pos = 0;
	while ((start_pos = str.find(from, start_pos)) != std::string::npos) {
		str.replace(start_pos, from.length(), to);
		start_pos += to.length();
	}
}

// split string by character into vector
std::vector<std::string> split_string(std::string str, char delimiter) {
	std::vector<std::string> internal;
	std::stringstream ss(str);
	std::string tok;
	while (std::getline(ss, tok, delimiter)) {
		internal.push_back(tok);
	}
	return internal;
}

// clean a URL to allow consistent joining across data sources
StringVal GetCleanUrl(FunctionContext* context, const StringVal& arg1) {
	if (arg1.is_null) return StringVal::null();

	std::string raw_url((const char *)arg1.ptr, arg1.len);
	std::string clean_url("");

	// trim whitespace
	clean_url = trim(raw_url);
	
	// lower case
	std::transform(clean_url.begin(), clean_url.end(), clean_url.begin(), ::tolower);
	
	// remove the domain (only keep relative paths)
	int find_domain = clean_url.find("://");
	if (find_domain != std::string::npos) {
		int next_slash = clean_url.find("/", find_domain + 3);
		if (next_slash != std::string::npos) {
			clean_url = clean_url.substr(next_slash, clean_url.length() - next_slash);
		}
	}
	
	// remove text after: ?&#
	clean_url = clean_url.substr(0, clean_url.find("?", 0));
	clean_url = clean_url.substr(0, clean_url.find("&", 0));
	clean_url = clean_url.substr(0, clean_url.find("#", 0));
	
	// ensure starting slash
	clean_url = "/" + clean_url;
	
	// ensure trailing slash if folder path
	std::vector<std::string> url_parts = split_string(clean_url, '/');
	std::string last_url_part = url_parts.back();
	if (last_url_part.find(".") == std::string::npos) {
		clean_url = clean_url + "/";
	}

	// remove all duplicate slashes
	while (clean_url.find("//") != std::string::npos) {
		replace_all(clean_url, "//", "/");
	}

	// The modified string is stored in 'clean_url', which is destroyed when this function
	// ends. We need to make a string val and copy the contents.
	// NB: Only the version of the StringVal constructor that takes a context object allocates new memory.
	StringVal result(context, clean_url.size());
	memcpy(result.ptr, clean_url.c_str(), clean_url.size());
	return result;
}

// Return part of the URL (scheme, domain, path, query, fragment)
StringVal GetUrlPart(FunctionContext* context, const StringVal& arg1, const StringVal& arg2) {
	if (arg1.is_null) return StringVal::null();
	if (arg2.is_null) return StringVal::null();
	
	// parse input params
	std::string raw_url((const char *)arg1.ptr, arg1.len);
	std::string part_name((const char *)arg2.ptr, arg2.len);
	
	// declare variables for parts
	std::string url_scheme("");
	std::string url_path("");
	std::string url_domain("");
	std::string url_query("");
	std::string url_fragment("");
	
	// get the querystring
	int pos_querystring = raw_url.find("?");
	if (pos_querystring != std::string::npos) {
		url_query = raw_url.substr(pos_querystring + 1, raw_url.length() - pos_querystring - 1);
		raw_url = raw_url.substr(0, pos_querystring);
	}
	
	// get the fragment (from the query string if it is not empty, otherwise from the url)
	if (!url_query.empty()) {
		int pos_fragment = url_query.find("#");
		if (pos_fragment != std::string::npos) {
			url_fragment = url_query.substr(pos_fragment + 1, url_query.length() - pos_fragment - 1);
			url_query = url_query.substr(0, pos_fragment);
		}
	}
	else {
		int pos_fragment = raw_url.find("#");
		if (pos_fragment != std::string::npos) {
			url_fragment = raw_url.substr(pos_fragment + 1, raw_url.length() - pos_fragment - 1);
			raw_url = raw_url.substr(0, pos_fragment);
		}
	}
	
	// get the scheme
	int pos_scheme = raw_url.find("://");
	if (pos_scheme != std::string::npos) {
		url_scheme = raw_url.substr(0, pos_scheme);
		raw_url = raw_url.substr(pos_scheme + 3, raw_url.length() - pos_scheme - 3);
		
		// get the domain
		int pos_slash = raw_url.find("/");
		if (pos_slash != std::string::npos) {
			url_domain = raw_url.substr(0, pos_slash);
			raw_url = raw_url.substr(pos_slash + 1, raw_url.length() - pos_slash - 1);
		}
	}
	
	// get the path
	url_path = raw_url;
	
	// return part name
	part_name = trim(part_name);
	std::transform(part_name.begin(), part_name.end(), part_name.begin(), ::tolower);
	if (part_name.compare("scheme") == 0) {
		StringVal result(context, url_scheme.size());
		memcpy(result.ptr, url_scheme.c_str(), url_scheme.size());
		return result;
	}
	else if (part_name.compare("domain") == 0) {
		StringVal result(context, url_domain.size());
		memcpy(result.ptr, url_domain.c_str(), url_domain.size());
		return result;
	}
	else if (part_name.compare("path") == 0) {
		StringVal result(context, url_path.size());
		memcpy(result.ptr, url_path.c_str(), url_path.size());
		return result;
	}
	else if (part_name.compare("query") == 0) {
		StringVal result(context, url_query.size());
		memcpy(result.ptr, url_query.c_str(), url_query.size());
		return result;
	}
	else if (part_name.compare("fragment") == 0) {
		StringVal result(context, url_fragment.size());
		memcpy(result.ptr, url_fragment.c_str(), url_fragment.size());
		return result;
	}
	else return StringVal::null();
}
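
For completeness, udf-ibi.h is not shown above - it is essentially just the declarations, roughly along these lines (a sketch assuming the standard /usr/include/impala_udf/udf.h header; the real file may differ slightly):

// udf-ibi.h - minimal sketch of the header included by the code above
#ifndef UDF_IBI_H
#define UDF_IBI_H

#include <impala_udf/udf.h>

using namespace impala_udf;

// Normalize a URL so it can be joined consistently across data sources.
StringVal GetCleanUrl(FunctionContext* context, const StringVal& arg1);

// Return one part of a URL ("scheme", "domain", "path", "query" or "fragment").
StringVal GetUrlPart(FunctionContext* context, const StringVal& arg1, const StringVal& arg2);

#endif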

   

Gord
New Contributor
Posts: 3
Registered: 01-25-2017

Re: UDF not working with CDH 5.9.0

[ Edited ]

Hi Tim.  I work with Gord and have offered to take over this issue as he's busy with other things at the moment.

 

In summary, we are trying to get past a linker error when compiling a C++ UDF.

 

First, I just wanted to clarify a few things:

  • Our original UDF, from when we were on CDH 5.4.7, was written in C++ - it is the code Gord attached in the previous post and is what we actually need to get working. The Java attempt was just to gather information while troubleshooting; our system has been tested and is working against the C++ version, so from this point on we'll only be discussing that one.
  • We are on CentOS v6.8
  • Gord mentioned this above but we have recently upgraded from CDH 5.4.7 to 5.9.0.
  • In our original attempt to get the C++ one working, we had some compile errors and worked around them by messing with the /usr/include/impala_udf/udf.h header file - but we've since stepped back and tried to figure out why we're getting compile errors in the first place.
  • It seems that one of the issues is that the latest version of the UDF development package found here: http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo has been updated to use C++11 features such as "noexcept"; for example, see this commit from May 28, 2016: https://github.com/apache/incubator-impala/commit/5f3996e6d1c53a5255b84f9e32da9324f3f972b3
    • The problem is that the C++ compiler that ships with CentOS v6.8 is gcc 4.4.x, which does not yet support the "noexcept" feature, according to http://en.cppreference.com/w/cpp/compiler_support (see the small illustration after this list). Given this, it seemed the best way to move forward was to upgrade gcc.
  • After upgrading to gcc v4.8.2, I am left with one linker error:
	opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/4.8.2/ld: /usr/lib/../lib64/libImpalaUdf.a(udf.cc.o)(.text+0x3): unresolvable R_X86_64_NONE relocation against symbol `_ZNSs4_Rep20_S_empty_rep_storageE@@GLIBCXX_3.4'
	/opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/4.8.2/ld: final link failed: Nonrepresentable section on output
  • I was able to reproduce this problem against the UDF sample project so I'll be posting the steps for that below; hopefully anyone would then be able to reproduce the problem and suggest a solution.
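
To illustrate the kind of construct involved (a simplified sketch only, not copied from udf.h): a declaration like the one below compiles with gcc 4.8 and -std=c++11 but is a syntax error under gcc 4.4, which predates the keyword.

// Simplified illustration - not the actual udf.h contents.
// gcc 4.4 stops at the C++11 'noexcept' specifier (it reports something like
// "expected ';' before 'noexcept'"); gcc 4.8 with -std=c++11 accepts it.
struct ExampleVal {
  ~ExampleVal() noexcept;
};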

Here are the steps:

  1. Create a fresh CentOS 6.8 VM.  We have tried two ways; both produce the same result:
    • Use Vagrant to create one hosted in VirtualBox with
      vagrant init kaorimatz/centos-6.8-x86_64; vagrant up --provider virtualbox
    • Google cloud compute image named "CentOS 6".
  2. Install gcc v4.8.2:
    sudo yum install wget
    sudo wget http://people.centos.org/tru/devtools-2/devtools-2.repo -O /etc/yum.repos.d/devtools-2.repo
    sudo yum upgrade
    sudo yum install devtoolset-2-gcc devtoolset-2-binutils devtoolset-2-gcc-c++
  3. Install cmake and boost:
    sudo yum install cmake boost-devel
  4. Install UDF Development Package:
    sudo wget http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo -O /etc/yum.repos.d/cloudera-cdh5.repo
    sudo yum install impala-udf-devel
  5. Download UDF sample code from https://github.com/cloudera/impala-udf-samples/archive/master.zip and extract to some folder.
  6. Inside the sample code directory, edit CMakeLists.txt and change this line to enable C++11 features:
    #old:
    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g")
    
    #new:
    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -ggdb -std=c++11")
  7. Run this command to enable gcc v4.8.2 for this bash session:
    scl enable devtoolset-2 bash
  8. run cmake and make:
    cmake .
    make
  9. At this point, I have one linker error:
    opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/4.8.2/ld: /usr/lib/../lib64/libImpalaUdf.a(udf.cc.o)(.text+0x3): unresolvable R_X86_64_NONE relocation against symbol `_ZNSs4_Rep20_S_empty_rep_storageE@@GLIBCXX_3.4'
    /opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-redhat-linux/4.8.2/ld: final link failed: Nonrepresentable section on output

 

Again, this is on a fresh CentOS 6.8 VM with the Impala UDF sample code, so hopefully it's easy for anyone to reproduce this issue.  Thanks!

 

Update: I have just tried this on a CentOS v7.3.1611 VM using:

vagrant init centos/7; vagrant up --provider virtualbox

and installing the standard C++ compiler via:

sudo yum install gcc-c++ cmake boost-devel

I get the same linker error there.

Cloudera Employee
Posts: 246
Registered: 07-29-2015

Re: UDF not working with CDH 5.9.0

Sorry for the slow reply. It looks like we made a mistake by including the "noexcept" specifiers in the UDF SDK when we switched to building Impala with C++11 support. The UDF SDK should still be built with C++11 disabled.

 

If you use an older udf.h, or manually remove the noexcept specifiers from udf.h, I think that may solve your problem.
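
If hand-editing the header is a pain, a one-liner along these lines should strip them (untested sketch - it assumes the header is at /usr/include/impala_udf/udf.h, as in your repro steps, and it keeps a .bak backup):

sudo sed -i.bak 's/ noexcept//g' /usr/include/impala_udf/udf.h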

New Contributor
Posts: 3
Registered: 01-25-2017

Re: UDF not working with CDH 5.9.0

[ Edited ]

Hi Tim. I tried going back to g++ v4.4.7 and commenting out all the 'noexcept' keywords in /usr/include/impala_udf/udf.h.

When running make (against the UDF sample code), I get a bunch more errors.
It's a bit large so I put them in a pastebin: http://pastebin.com/qv9EbS5h


Any other ideas?

Cloudera Employee
Posts: 3
Registered: 09-06-2016

Re: UDF not working with CDH 5.9.0

We are still investigating the linking error. In the meantime, would you mind giving an older version of the UDF SDK a try? It should be mostly compatible with 5.9.0.

 

Sorry for the trouble.
