Member since: 03-22-2019
Posts: 24
Kudos Received: 4
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 910 | 02-24-2020 10:37 AM |
|  | 1556 | 03-25-2019 09:20 AM |
|  | 2675 | 03-24-2019 09:57 AM |
02-24-2020 10:37 AM
Something like this should work. It should just be a matter of using the correct string manipulation functions: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_string_functions.html

create table test1 (col1 string);

insert into table test1 values
  ("IT Strategy& Architecture BDC India [MITX 999]"),
  ("Corporate & IC Solution Delivery [SVII]"),
  ("Operations Solution Delivery [SVIA]"),
  ("Mainframe Service [MLEM]"),
  ("Strategy & Architecture [MLEL]");

select * from test1;
+------------------------------------------------+
| col1                                           |
+------------------------------------------------+
| IT Strategy& Architecture BDC India [MITX 999] |
| Corporate & IC Solution Delivery [SVII]        |
| Operations Solution Delivery [SVIA]            |
| Mainframe Service [MLEM]                       |
| Strategy & Architecture [MLEL]                 |
+------------------------------------------------+

create table test2 as
  select trim(split_part(col1, ' [', 1)),
         trim(concat(' [', split_part(col1, ' [', 2)))
  from test1;

select * from test2;
+-------------------------------------+------------+
| _c0                                 | _c1        |
+-------------------------------------+------------+
| IT Strategy& Architecture BDC India | [MITX 999] |
| Corporate & IC Solution Delivery    | [SVII]     |
| Operations Solution Delivery        | [SVIA]     |
| Mainframe Service                   | [MLEM]     |
| Strategy & Architecture             | [MLEL]     |
+-------------------------------------+------------+
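As a side note, the same split can also be done with regexp_extract; this is just a sketch (not part of the original answer) that assumes the bracketed code always comes last in the string:

select trim(regexp_extract(col1, '(.*)\\[', 1)) as name,  -- everything before the last '['
       regexp_extract(col1, '(\\[.*\\])', 1) as code      -- the bracketed code itself
from test1;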
01-28-2020 08:15 AM
1 Kudo
https://impala.apache.org/docs/build/html/topics/impala_misc_functions.html#misc_functions__get_json_object has some good documentation on how to use the get_json_object function. What Impala version are you using? What is the type of the column that contains the JSON data?

I hit a couple of issues when parsing the JSON you posted. I believe the JSON standard does not allow single quotes, and standard online JSON parsers have trouble with the 'u' character as well. I was able to get the following to work on Impala master:

[localhost:21000] default> select get_json_object("{\"CH4: NO2 (we-aux)\": {\"unit\": \"mV\", \"value\": 4.852294921875}, \"CH6: Ox concentration (we-aux)\": {\"unit\": \"ppb\", \"value\": -84.73094995471016}}", '$.*');
Query: select get_json_object("{\"CH4: NO2 (we-aux)\": {\"unit\": \"mV\", \"value\": 4.852294921875}, \"CH6: Ox concentration (we-aux)\": {\"unit\": \"ppb\", \"value\": -84.73094995471016}}", '$.*')
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| get_json_object('{"ch4: no2 (we-aux)": {"unit": "mv", "value": 4.852294921875}, "ch6: ox concentration (we-aux)": {"unit": "ppb", "value": -84.73094995471016}}', '$.*')  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"unit":"mV","value":4.852294921875},{"unit":"ppb","value":-84.73094995471016}]                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
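For reference, a minimal sketch of applying the same function to a table column (the table and column names here are hypothetical, not from the original question):

-- Assumes a table "readings" with a STRING column "json_col" that holds
-- JSON documents like the example above.
select get_json_object(json_col, '$.*') as channels
from readings;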
01-28-2020 07:59 AM
Offset means the byte offset into the actual CSV file. In this case, that is the 2432696320th byte of the file foo_042019.csv. There are multiple tools that will let you open the file and seek to the desired offset. For example, you could open the file in vim and run :goto 2432696320, which moves the cursor to the 2432696320th byte of the file, and thus to the offending row.
01-17-2020 03:48 PM
The error "Error converting column: 35 to TIMESTAMP" means there was an error when converting column 35 to the TIMESTAMP type. The error "Error parsing row: file: hdfs://blabla/foo_042019.csv, before offset: 2432696320" means there was an error while parsing the row at file offset 2432696320, in the file foo_042019.csv. So it looks like there are several rows in your dataset where certain fields cannot be converted to TIMESTAMPs. You should be able to open up the file, and seek to the specified offset to find the rows that are corrupted. I believe, Hive does not throw an exception when given the same dataset, instead it converts the corrupted rows to NULL. The same behavior can be emulated in Impala by setting 'abort_on_error=false'. However, be warned that setting this option can mask data corruption issues. See https://impala.apache.org/docs/build/html/topics/impala_abort_on_error.html for details.
09-24-2019 03:05 PM
It depends on how you are updating your partitions. Are you creating completely new partitions, or adding files to existing partitions? ALTER TABLE RECOVER PARTITIONS is specifically used for adding newly created partition directories to a partitioned table - https://impala.apache.org/docs/build/html/topics/impala_alter_table.html
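As a rough illustration (the table name is hypothetical), the two cases map to different statements:

-- New partition directories were created in HDFS outside of Impala:
alter table my_table recover partitions;
-- New files were added to partitions Impala already knows about:
refresh my_table;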
09-24-2019 12:10 PM
Queries are in the "waiting to be closed" stage if they are in the EXCEPTION state or if all the rows from the query have been read. In either case, the query needs to be explicitly closed for it to be "completed". https://community.cloudera.com/t5/Support-Questions/Query-Cancel-and-idle-query-timeout-is-not-working/td-p/58104 might be useful as well.
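If clients are leaving queries open, one possible mitigation (a sketch, depending on your setup) is the QUERY_TIMEOUT_S query option, which cancels queries that sit idle for longer than the given number of seconds; note that a cancelled query may still need to be closed by the client:

-- Cancel queries that have been idle for more than 10 minutes.
set query_timeout_s=600;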
09-23-2019 07:52 AM
1 Kudo
The "TransmitData() to X.X.X.X:27000 failed" portion of the error message is thrown by the Impala RPC code. the "Connection timed out (error 110)" is a TCP error. TCP error code 110 corresponds to the error "Connection timed out". So, as the error message states, there was a TCP connection timeout between two Impala processes. It's hard to debug without more information. What query was being run? Can you post the full log files? What were the two processes that were trying to communicate with each other? In all likelihood, this looks like a network issue, does this happen consistently?
03-25-2019 09:20 AM
1 Kudo
What version of Impala are you using? I suspect the meaning of "duration" might have changed in IMPALA-1575 / IMPALA-5397. In general, it's possible that your definition of duration is different from Impala's. Depending on the version, Impala might include the time taken until the query has actually been closed (which would include fetching rows and releasing all resources). I *think* the waiting time is the difference between the current time and the time the query was last actively being processed. So this value can be high if the query has completed but the client has not closed it (which is why "waiting time" shows up in the section "waiting to be closed").
03-24-2019 09:57 AM
1 Kudo
Impala scanners internally have a RowBatch queue that allows Impala to decouple I/O from CPU processing. The I/O threads read data into RowBatches and put them into the queue, and the CPU threads asynchronously fetch RowBatches from the queue and process them. RowBatchQueueGetWaitTime is the amount of time the CPU threads spent waiting for data to arrive in the queue. Essentially, a high value means the CPU threads were waiting a long time for the I/O threads to read the data.
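To see this counter for a specific query, one option is the query profile; a sketch in impala-shell (the query itself is hypothetical):

-- Run the query, then print its profile and look for RowBatchQueueGetWaitTime
-- under the scan nodes.
select count(*) from big_table;
profile;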
03-24-2019 09:16 AM
Is the stack trace the same every time this crashes? If so, there is likely a bug somewhere in your UDAF. It's hard for me to say exactly where the bug is, but I would recommend trying to reproduce the issue outside Impala (perhaps by running the UDAF in a test harness). If you can reproduce the crash outside Impala, standard C++ debugging tools should help you pin down the issue.