Support Questions

thierry_vernhet · ‎06-06-2018

Hi,

I would like to create an external table with an Hive regex expression by selecting lines containing "ARMING" (in uppercase).

The records HDFS look like this

2018-06-06T11:28:54+02:00 sazqye named[980]: ARMING trigger on (action:LOG) (T6-Recursive-attacks recursive-time: 1283)

2018-06-06T11:20:27+02:00 sazqyd named[92960]: client (1.debian)

...

My request :

CREATE EXTERNAL TABLE my_arming_table ( dc_syslog_date STRING, dc_syslog_hostname STRING, dc_syslog_process STRING, dc_logtype STRING, dc_message STRING) PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES( 'input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)') STORED AS TEXTFILE;

The result is KO :

> select * from my_arming_table limit 2 ;

OK

NULL NULL NULL NULL NULL 0 0 0

And if I try this request (with client in lowercase)

CREATE EXTERNAL TABLE my_client_table ( dc_syslog_date STRING, dc_syslog_hostname STRING, dc_syslog_process STRING, dc_logtype STRING, dc_message STRING) PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES( 'input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(client)\\s(.*)') STORED AS TEXTFILE;

The result is OK

> select * from my_client_table limit 2 ;

OK

2018-06-06T11:12:55+02:00 sazqyd named[92960]: client (swza6z) 0 0 0

2018-06-06T11:13:10+02:00 sazqyd named[92960]: client (osce01) 0 0 0

Does anybody knows why it doesn't work with uppercase in the regex expression ?

Thanks

vmurakami · ‎06-07-2018

Hey @Thierry Vernhet!

Hmm, I'm not sure if you can do this only using RegexSerde. Probably you will need to

- Filter those NULL values in a query
or
- Construct an ETL by passing the NULL values and grabbing the ARMING values to other table
or
- Create a View
or
- Clean the data before putting into HDFS.

I made a research at the code and here's some parts of the code that explain it:
https://github.com/apache/hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/Re...

 @Override  public Object deserialize(Writable blob) throws SerDeException {
    if (inputPattern == null) {      throw new SerDeException(          "This table does not have serde property \"input.regex\"!");    }    Text rowText = (Text) blob;
    Matcher m = inputPattern.matcher(rowText.toString());
    // If do not match, ignore the line, return a row with all nulls.    if (!m.matches()) {      unmatchedRows++;      if (unmatchedRows >= nextUnmatchedRows) {        nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows);        // Report the row        LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText);      }      return null;    }
    // Otherwise, return the row.    for (int c = 0; c < numColumns; c++) {      try {        row.set(c, m.group(c + 1));      } catch (RuntimeException e) {        partialMatchedRows++;        if (partialMatchedRows >= nextPartialMatchedRows) {          nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows);          // Report the row          LOG.warn("" + partialMatchedRows              + " partially unmatched rows are found, " + " cannot find group "              + c + ": " + rowText);        }        row.set(c, null);      }    }    return row;  }

Hope this helps! 🙂

View solution in original post

vmurakami · ‎06-06-2018

Hi @Thierry Vernhet
Do you have more examples from your HDFS records? 🙂

I made a simple test here, and its seems to be working your REGEX with the Uppercase
Here's the output

CREATE EXTERNAL TABLE my_arming_table 
( dc_syslog_date STRING, 
dc_syslog_hostname STRING, 
dc_syslog_process STRING, 
dc_logtype STRING, 
dc_message STRING) 
PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES( 'input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)') STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/my_arming_table/';
ALTER TABLE my_arming_table ADD PARTITION(yearbrut=2018,monthbrut=06,daybrut=06);
#Command to put some data in the table hdfs dfs -put regex_hcc  /user/hive/warehouse/my_arming_table/yearbrut=2018/monthbrut=6/daybrut=6/
#And here's a quick describe from my tablehive> desc formatted my_arming_table;
OK
# col_name            	data_type           	comment             
	 	 
dc_syslog_date      	string              	                    
dc_syslog_hostname  	string              	                    
dc_syslog_process   	string              	                    
dc_logtype          	string              	                    
dc_message          	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
yearbrut            	int                 	                    
monthbrut           	int                 	                    
daybrut             	int                 	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	hive                	 
CreateTime:         	Wed Jun 06 18:00:02 UTC 2018	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://vini-horton/user/hive/warehouse/my_arming_table	 
Table Type:         	EXTERNAL_TABLE      	 
Table Parameters:	 	 
	EXTERNAL            	TRUE                
	transient_lastDdlTime	1528308002          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.RegexSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	input.regex         	^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)
	serialization.format	1   
 
#Finally the select outputhive>select * from my_arming_table;
OK
2018-06-06T11:28:54+02:00	sazqye	named[980]:	ARMING	trigger on (action:LOG) (T6-Recursive-attacks recursive-time: 1283)	2018	6	6
NULL	NULL	NULL	NULL	NULL	2018	6	6
Time taken: 0.175 seconds, Fetched: 2 row(s)

thierry_vernhet · ‎06-07-2018

Hi @Vinicius Higa Murakami

Thanks Vinicius. You're right. I didn't see the "ARMING" records with the "select...limit 2;" because the's a lot of "client" records and few "ARMING" records.

But now, how exclude other records and have only "ARMING" rows in the final table and no row "NULL" ?

vmurakami · ‎06-07-2018

Hey @Thierry Vernhet!

Hmm, I'm not sure if you can do this only using RegexSerde. Probably you will need to

- Filter those NULL values in a query
or
- Construct an ETL by passing the NULL values and grabbing the ARMING values to other table
or
- Create a View
or
- Clean the data before putting into HDFS.

I made a research at the code and here's some parts of the code that explain it:
https://github.com/apache/hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/Re...

 @Override  public Object deserialize(Writable blob) throws SerDeException {
    if (inputPattern == null) {      throw new SerDeException(          "This table does not have serde property \"input.regex\"!");    }    Text rowText = (Text) blob;
    Matcher m = inputPattern.matcher(rowText.toString());
    // If do not match, ignore the line, return a row with all nulls.    if (!m.matches()) {      unmatchedRows++;      if (unmatchedRows >= nextUnmatchedRows) {        nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows);        // Report the row        LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText);      }      return null;    }
    // Otherwise, return the row.    for (int c = 0; c < numColumns; c++) {      try {        row.set(c, m.group(c + 1));      } catch (RuntimeException e) {        partialMatchedRows++;        if (partialMatchedRows >= nextPartialMatchedRows) {          nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows);          // Report the row          LOG.warn("" + partialMatchedRows              + " partially unmatched rows are found, " + " cannot find group "              + c + ": " + rowText);        }        row.set(c, null);      }    }    return row;  }

Hope this helps! 🙂

thierry_vernhet · ‎06-08-2018

OK

Thanks for all

Cloudera Community

Support Questions

HIVE Regex with Uppercase