About Tim Armstrong

Tim Armstrong · ‎09-25-2017

I'd expect ORDER BY trunc(ts, "DD") to work. E.g. on my system this works:

[localhost:21000] > select timestamp_col, tinyint_col from functional_hbase.alltypestiny order by trunc(timestamp_col, 'DD'), tinyint_col desc;
+---------------------+-------------+
| timestamp_col       | tinyint_col |
+---------------------+-------------+
| 2009-01-01 00:01:00 | 1           |
| 2009-01-01 00:00:00 | 0           |
| 2009-02-01 00:01:00 | 1           |
| 2009-02-01 00:00:00 | 0           |
| 2009-03-01 00:01:00 | 1           |
| 2009-03-01 00:00:00 | 0           |
| 2009-04-01 00:01:00 | 1           |
| 2009-04-01 00:00:00 | 0           |
+---------------------+-------------+
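To make the ordering in that output easier to reason about, here is a minimal C++ sketch of the same two-key sort: day ascending, then tinyint_col descending. It is only an illustration outside Impala. The Row struct, TruncToDay() and SortByDayThenTinyDesc() are hypothetical names, and timestamps are assumed to be seconds since the Unix epoch in UTC, which is not how Impala stores them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical row: a timestamp as seconds since the Unix epoch (assumed UTC)
// and a small integer column.
struct Row {
  int64_t ts_seconds;
  int tinyint_col;
};

// Floors a timestamp to midnight of the same day. This approximates the
// effect of trunc(ts, 'DD'); it is not Impala's implementation.
int64_t TruncToDay(int64_t ts_seconds) {
  const int64_t kSecondsPerDay = 86400;
  int64_t rem = ts_seconds % kSecondsPerDay;
  if (rem < 0) rem += kSecondsPerDay;  // keep flooring correct before 1970
  return ts_seconds - rem;
}

// Sorts by truncated day ascending, then by tinyint_col descending.
void SortByDayThenTinyDesc(std::vector<Row>* rows) {
  assert(rows != nullptr);
  std::sort(rows->begin(), rows->end(), [](const Row& a, const Row& b) {
    return std::make_tuple(TruncToDay(a.ts_seconds), -a.tinyint_col) <
           std::make_tuple(TruncToDay(b.ts_seconds), -b.tinyint_col);
  });
}
```

Because both rows of a given day truncate to the same midnight value, the second key alone decides their order, which is why the 00:01:00 row (tinyint_col 1) precedes the 00:00:00 row (tinyint_col 0) within each day.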

Tim Armstrong · ‎09-25-2017

If you want to implement a C++ UDF though, I'd recommend starting with the docs here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html. There are some examples of string manipulation UDFs on that page.
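If you do go the C++ route, the core masking logic could be sketched roughly as below. This is plain C++ and deliberately leaves out the Impala UDF API; wiring it into a real UDF should follow the docs linked above. MaskMiddle() and its keep parameter are hypothetical, and the rule it applies (keep the first and last three characters, replace every other non-space character with '*') is an assumption about the intended masking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the Impala UDF API: keeps the first and
// last `keep` characters and replaces every non-space character in between
// with '*'.
std::string MaskMiddle(const std::string& in, std::size_t keep = 3) {
  // Strings too short to have a middle are returned unchanged.
  if (in.size() <= 2 * keep) return in;
  std::string out = in;
  for (std::size_t i = keep; i < in.size() - keep; ++i) {
    if (out[i] != ' ') out[i] = '*';
  }
  // Masking replaces characters in place, so the length never changes.
  assert(out.size() == in.size());
  return out;
}
```

For example, MaskMiddle("abcdef ghi jkl") returns "abc*** *** jkl", and a string of six characters or fewer comes back unmodified.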

Tim Armstrong · ‎09-25-2017

[localhost:21000] > select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5;
Query: select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5
Query submitted at: 2017-09-25 07:38:09 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=34f1a993e3cb99a:51d89bf800000000
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| egu*** ****** ***** *** the                                                                                                                                  |
| ly ***** ************* ***** **** old                                                                                                                        |
| rio***** ******** ******* *** dep                                                                                                                            |
| lit*** ******** **** **n de                                                                                                                                  |
| pe***** ****** ***** **y re                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

I'd recommend doing it with builtin functions since it will be easier to maintain. I included an example above of how you might do it using regexp_replace. I expect it will be quite fast - Impala's query compilation can inline functions like length() and substring() so those are essentially free in Impala (unlike many other SQL engines). The main cost is regexp_replace() but I'd expect that to be quite fast too.

Tim Armstrong · ‎09-25-2017

Impala doesn't have a date data type, but it does have a timestamp type (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html). to_date() converts a timestamp to a string (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_datetime_functions.html).

If you want to remove the time portion of the timestamp, you can use trunc(ts, "DD") to get the timestamp of midnight on that day (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_datetime_functions.html).

I don't understand your question about "order by". ORDER BY should work for all scalar data types in Impala (timestamp, double, etc.). What do you want it to do, and what is it doing now?

Tim Armstrong · ‎09-20-2017

If you're starting Impala from the command line like that, you can configure flags and environment variables in the /etc/default/impala file: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_processes.html#starting_via_cmdline . The relevant variable in that file is IMPALA_SERVER_ARGS. (If anyone else reads this: if you're using Cloudera Manager, you can configure the scratch directories through the UI. You probably won't have to, since CM does a pretty good job of autoconfiguring scratch directories.)

Tim Armstrong · ‎09-19-2017

The user that Impala is running under needs to be able to remove and recreate the scratch directory at startup (i.e. /tmp/impala-scratch). This is done to ensure that the directory is free of old files and that Impala has ownership of the directory. Based on the log message, that user doesn't have the required permissions to do that. I suspect that if you just delete that directory and let Impala create it at startup, that will solve your problem.

Tim Armstrong · ‎09-18-2017

We've seen this before when a bug caused a zombie impalad process to get stuck listening on port 22000. It's worth seeing if one is still hanging around and, if so, running kill -9 on it.

Tim Armstrong · ‎09-18-2017

The Impala daemon wasn't able to set up the scratch directories during startup. The reason will be logged in one of the impalad*.WARNING logs, probably one of the first messages in there.

Tim Armstrong · ‎09-14-2017

Very astute questions! The version of StringConcatUpdate() in impala-udf-samples is correct. The use of the "local" allocation in the second version of StringConcatUpdate() is incorrect. I filed a bug to correct that: https://issues.apache.org/jira/browse/IMPALA-5939. The problem is that the StringVal() constructor and StringVal::CopyFrom() use AllocateLocal() behind the scenes. Your UDA does not own the memory returned by AllocateLocal(), and it will be automatically cleaned up by Impala at some point after your Update function returns.

It's a bit unfortunate that the two sets of examples have diverged. I recommend looking at https://github.com/cloudera/impala-udf-samples/ because that's intended to be the public-facing version and I think it is more up to date. You may also be interested in this PR, https://github.com/cloudera/impala-udf-samples/pull/18, which improves the UDF examples to better handle failed memory allocations.

With regard to 2): that is a builtin aggregate function that uses some internal functionality that we added recently. Some builtin functions only require a fixed-size intermediate value, so there's a way to declare this and have it preallocated by the Impala runtime. That functionality isn't exposed to UDAs for now.
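The ownership rule described above can be illustrated with a toy sketch: an Update-style function should copy incoming bytes into a buffer that the intermediate value owns, and should check the allocation before using it. Nothing below is the Impala UDF API. ToyString, ToyConcatUpdate() and ToyConcatFree() are hypothetical names, and std::realloc merely stands in for an allocator whose memory stays valid across Update calls; the samples repository shows the real calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Toy stand-in for a string intermediate value: a pointer plus a length.
// This is NOT impala_udf::StringVal; it only mimics its shape.
struct ToyString {
  uint8_t* ptr = nullptr;
  int len = 0;
};

// Appends `len` bytes from `src` to `state`, growing a buffer that `state`
// owns. Returns false and leaves `state` untouched if the allocation fails.
bool ToyConcatUpdate(ToyString* state, const uint8_t* src, int len) {
  assert(state != nullptr && len >= 0);
  if (len == 0) return true;
  void* grown = std::realloc(
      state->ptr, static_cast<std::size_t>(state->len) + len);
  if (grown == nullptr) return false;  // old buffer is still valid
  state->ptr = static_cast<uint8_t*>(grown);
  std::memcpy(state->ptr + state->len, src, len);
  state->len += len;
  return true;
}

// Releases the owned buffer, as a Finalize-style step would.
void ToyConcatFree(ToyString* state) {
  assert(state != nullptr);
  std::free(state->ptr);
  state->ptr = nullptr;
  state->len = 0;
}
```

The point of the sketch is that the state never holds a pointer into memory someone else may reclaim: every byte it references lives in a buffer it allocated and will free itself.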

Tim Armstrong · ‎08-24-2017

I'm not sure there are risks specifically. The best practice is to use Cloudera Manager to configure memory limits for different services, so this is the right way to configure things. Cloudera Manager does have support to help set up memory limits for applications: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_autoconfig.html#concept_xjy_vb3_rn . For a production system, it's important to put thought into how much memory your system needs and how it's allocated between different services. E.g. as an earlier poster saw, 256MB is not enough memory to do much of interest with Impala.

Member Since	‎07-29-2015 04:07 PM
Last Visited	‎02-11-2021 06:07 PM
Posts	535
Kudos received	141

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: Impala - Timestamp - order by works?

Re: masking UFD function for impala

Re: masking UFD function for impala

Re: Impala - Timestamp - order by works?

Re: Getting Error : Spilling has been disabled due...

Re: Getting Error : Spilling has been disabled due...

Re: Impala Thrift Server

Re: Getting Error : Spilling has been disabled due...

Re: Memory handling in Impala UDA functions

Re: "Memory Limit Exceeded" error on Impala when i...