Member since: 03-22-2019
Posts: 24
Kudos Received: 4
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 910 | 02-24-2020 10:37 AM |
|  | 1556 | 03-25-2019 09:20 AM |
|  | 2675 | 03-24-2019 09:57 AM |
02-24-2020 10:37 AM
Something like this should work. It should just be a matter of using the correct string manipulation functions: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_string_functions.html

create table test1 (col1 string);

insert into table test1 values
  ("IT Strategy& Architecture BDC India [MITX 999]"),
  ("Corporate & IC Solution Delivery [SVII]"),
  ("Operations Solution Delivery [SVIA]"),
  ("Mainframe Service [MLEM]"),
  ("Strategy & Architecture [MLEL]");

select * from test1;
+------------------------------------------------+
| col1                                           |
+------------------------------------------------+
| IT Strategy& Architecture BDC India [MITX 999] |
| Corporate & IC Solution Delivery [SVII]        |
| Operations Solution Delivery [SVIA]            |
| Mainframe Service [MLEM]                       |
| Strategy & Architecture [MLEL]                 |
+------------------------------------------------+

create table test2 as
  select trim(split_part(col1, ' [', 1)),
         trim(concat(' [', split_part(col1, ' [', 2)))
  from test1;

select * from test2;
+-------------------------------------+------------+
| _c0                                 | _c1        |
+-------------------------------------+------------+
| IT Strategy& Architecture BDC India | [MITX 999] |
| Corporate & IC Solution Delivery    | [SVII]     |
| Operations Solution Delivery        | [SVIA]     |
| Mainframe Service                   | [MLEM]     |
| Strategy & Architecture             | [MLEL]     |
+-------------------------------------+------------+
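As a side note, the same split can also be done with regexp_extract; this is just a sketch (not part of the original answer) that assumes the bracketed code always comes last in the string:

select trim(regexp_extract(col1, '(.*)\\[', 1)) as name,  -- everything before the last '['
       regexp_extract(col1, '(\\[.*\\])', 1) as code      -- the bracketed code itself
from test1;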
01-28-2020 08:15 AM
1 Kudo
https://impala.apache.org/docs/build/html/topics/impala_misc_functions.html#misc_functions__get_json_object has some good documentation on how to use the get_json_object function. What Impala version are you using? What is the type of the column that contains the JSON data?

I hit a couple of issues when parsing the JSON you posted. I believe the JSON standard does not allow single quotes, and standard online JSON parsers have trouble with the 'u' character as well. I was able to get the following to work on Impala master:

[localhost:21000] default> select get_json_object("{\"CH4: NO2 (we-aux)\": {\"unit\": \"mV\", \"value\": 4.852294921875}, \"CH6: Ox concentration (we-aux)\": {\"unit\": \"ppb\", \"value\": -84.73094995471016}}", '$.*');
Query: select get_json_object("{\"CH4: NO2 (we-aux)\": {\"unit\": \"mV\", \"value\": 4.852294921875}, \"CH6: Ox concentration (we-aux)\": {\"unit\": \"ppb\", \"value\": -84.73094995471016}}", '$.*')
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| get_json_object('{"ch4: no2 (we-aux)": {"unit": "mv", "value": 4.852294921875}, "ch6: ox concentration (we-aux)": {"unit": "ppb", "value": -84.73094995471016}}', '$.*')  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"unit":"mV","value":4.852294921875},{"unit":"ppb","value":-84.73094995471016}]                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
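For reference, a minimal sketch of applying the same function to a table column (the table and column names here are hypothetical, not from the original question):

-- Assumes a table "readings" with a STRING column "json_col" that holds
-- JSON documents like the example above.
select get_json_object(json_col, '$.*') as channels
from readings;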
01-28-2020 07:59 AM
Offset means the byte offset into the actual CSV file. In this case, that is the 2432696320th byte of the file foo_042019.csv. There are multiple tools that will let you open the file and seek to the desired offset. For example, you could open the file in vim and run :goto 2432696320, which moves the cursor to the 2432696320th byte of the file, and thus to the offending row.
01-17-2020 03:48 PM
The error "Error converting column: 35 to TIMESTAMP" means there was an error when converting column 35 to the TIMESTAMP type. The error "Error parsing row: file: hdfs://blabla/foo_042019.csv, before offset: 2432696320" means there was an error while parsing the row at file offset 2432696320, in the file foo_042019.csv. So it looks like there are several rows in your dataset where certain fields cannot be converted to TIMESTAMPs. You should be able to open up the file, and seek to the specified offset to find the rows that are corrupted. I believe, Hive does not throw an exception when given the same dataset, instead it converts the corrupted rows to NULL. The same behavior can be emulated in Impala by setting 'abort_on_error=false'. However, be warned that setting this option can mask data corruption issues. See https://impala.apache.org/docs/build/html/topics/impala_abort_on_error.html for details.
09-24-2019 03:05 PM
It depends on how you are updating your partitions. Are you creating completely new partitions, or adding files to existing partitions? ALTER TABLE RECOVER PARTITIONS is specifically used for adding newly created partition directories to a partitioned table - https://impala.apache.org/docs/build/html/topics/impala_alter_table.html
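As a rough illustration (the table name is hypothetical), the two cases map to different statements:

-- New partition directories were created in HDFS outside of Impala:
alter table my_table recover partitions;
-- New files were added to partitions Impala already knows about:
refresh my_table;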
09-24-2019 12:10 PM
Queries are in the "waiting to be closed" stage if they are in the EXCEPTION state or if all the rows from the query have been read. In either case, the query needs to be explicitly closed for it to be "completed". https://community.cloudera.com/t5/Support-Questions/Query-Cancel-and-idle-query-timeout-is-not-working/td-p/58104 might be useful as well.
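If clients are leaving queries open, one possible mitigation (a sketch, depending on your setup) is the QUERY_TIMEOUT_S query option, which cancels queries that sit idle for longer than the given number of seconds; note that a cancelled query may still need to be closed by the client:

-- Cancel queries that have been idle for more than 10 minutes.
set query_timeout_s=600;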
09-23-2019 07:52 AM
1 Kudo
The "TransmitData() to X.X.X.X:27000 failed" portion of the error message is thrown by the Impala RPC code. the "Connection timed out (error 110)" is a TCP error. TCP error code 110 corresponds to the error "Connection timed out". So, as the error message states, there was a TCP connection timeout between two Impala processes. It's hard to debug without more information. What query was being run? Can you post the full log files? What were the two processes that were trying to communicate with each other? In all likelihood, this looks like a network issue, does this happen consistently?
03-25-2019 09:20 AM
1 Kudo
What version of Impala are you using? I suspect the meaning of "duration" might have changed in IMPALA-1575 / IMPALA-5397. In general, it's possible that your definition of duration is different from Impala's. Depending on the version, Impala might include the time taken until the query has actually been closed (which would include fetching rows and releasing all resources). I *think* the waiting time is the difference between the current time and the time the query was last actively being processed. So this value can be high if the query has completed but the client has not closed it (which is why "waiting time" shows up in the section "waiting to be closed").
03-24-2019 09:57 AM
1 Kudo
Impala scanners internally have a RowBatch queue that allows Impala to decouple I/O from CPU processing. The I/O threads read data into RowBatches and put them into the queue, and the CPU threads asynchronously fetch RowBatches from the queue and process them. RowBatchQueueGetWaitTime is the amount of time the CPU threads spent waiting for data to arrive in the queue. Essentially, a high value means the CPU threads were waiting a long time for the I/O threads to read the data.
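To see this counter for a specific query, one option is the query profile; a sketch in impala-shell (the query itself is hypothetical):

-- Run the query, then print its profile and look for RowBatchQueueGetWaitTime
-- under the scan nodes.
select count(*) from big_table;
profile;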
03-24-2019 09:16 AM
Is the stack trace the same every time this crashes? If so, there is likely a bug somewhere in your UDAF. It's hard for me to say exactly where the bug is, but I would recommend trying to reproduce the issue outside Impala (perhaps by running the UDAF in a test harness). If you can reproduce the crash outside Impala, standard C++ debugging tools should help you pin down the issue.