HIVE Regex with Uppercase

Hi,

I would like to create an external table with a Hive regex expression that selects only the lines containing "ARMING" (in uppercase).

The records in HDFS look like this:

2018-06-06T11:28:54+02:00 sazqye named[980]: ARMING trigger on (action:LOG) (T6-Recursive-attacks recursive-time: 1283)

2018-06-06T11:20:27+02:00 sazqyd named[92960]: client (1.debian)

...

My query:

CREATE EXTERNAL TABLE my_arming_table (
  dc_syslog_date STRING,
  dc_syslog_hostname STRING,
  dc_syslog_process STRING,
  dc_logtype STRING,
  dc_message STRING)
PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)')
STORED AS TEXTFILE;

The result is not OK:

> select * from my_arming_table limit 2 ;

OK

NULL NULL NULL NULL NULL 0 0 0

NULL NULL NULL NULL NULL 0 0 0

And if I try this query (with "client" in lowercase):

CREATE EXTERNAL TABLE my_client_table (
  dc_syslog_date STRING,
  dc_syslog_hostname STRING,
  dc_syslog_process STRING,
  dc_logtype STRING,
  dc_message STRING)
PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(client)\\s(.*)')
STORED AS TEXTFILE;

The result is OK

> select * from my_client_table limit 2 ;

OK

2018-06-06T11:12:55+02:00 sazqyd named[92960]: client (swza6z) 0 0 0

2018-06-06T11:13:10+02:00 sazqyd named[92960]: client (osce01) 0 0 0

Does anybody know why it doesn't work with uppercase in the regex expression?

Thanks


4 REPLIES

Hi @Thierry Vernhet
Do you have more examples from your HDFS records? 🙂

I made a simple test here, and your regex with the uppercase "ARMING" seems to be working.
Here's the output:

CREATE EXTERNAL TABLE my_arming_table 
( dc_syslog_date STRING, 
dc_syslog_hostname STRING, 
dc_syslog_process STRING, 
dc_logtype STRING, 
dc_message STRING) 
PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES( 'input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)') STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/my_arming_table/';
ALTER TABLE my_arming_table ADD PARTITION(yearbrut=2018,monthbrut=06,daybrut=06);
# Command to put some data into the table
hdfs dfs -put regex_hcc /user/hive/warehouse/my_arming_table/yearbrut=2018/monthbrut=6/daybrut=6/

# And here's a quick describe of my table
hive> desc formatted my_arming_table;
OK
# col_name            	data_type           	comment             
	 	 
dc_syslog_date      	string              	                    
dc_syslog_hostname  	string              	                    
dc_syslog_process   	string              	                    
dc_logtype          	string              	                    
dc_message          	string              	                    
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
yearbrut            	int                 	                    
monthbrut           	int                 	                    
daybrut             	int                 	                    
	 	 
# Detailed Table Information	 	 
Database:           	default             	 
Owner:              	hive                	 
CreateTime:         	Wed Jun 06 18:00:02 UTC 2018	 
LastAccessTime:     	UNKNOWN             	 
Protect Mode:       	None                	 
Retention:          	0                   	 
Location:           	hdfs://vini-horton/user/hive/warehouse/my_arming_table	 
Table Type:         	EXTERNAL_TABLE      	 
Table Parameters:	 	 
	EXTERNAL            	TRUE                
	transient_lastDdlTime	1528308002          
	 	 
# Storage Information	 	 
SerDe Library:      	org.apache.hadoop.hive.serde2.RegexSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat	 
Compressed:         	No                  	 
Num Buckets:        	-1                  	 
Bucket Columns:     	[]                  	 
Sort Columns:       	[]                  	 
Storage Desc Params:	 	 
	input.regex         	^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)
	serialization.format	1   
 
# Finally, the select output
hive> select * from my_arming_table;
OK
2018-06-06T11:28:54+02:00	sazqye	named[980]:	ARMING	trigger on (action:LOG) (T6-Recursive-attacks recursive-time: 1283)	2018	6	6
NULL	NULL	NULL	NULL	NULL	2018	6	6
Time taken: 0.175 seconds, Fetched: 2 row(s)	

Hi @Vinicius Higa Murakami

Thanks Vinicius. You're right. I didn't see the "ARMING" records with "select ... limit 2;" because there are a lot of "client" records and only a few "ARMING" records.

But now, how do I exclude the other records so that the final table contains only "ARMING" rows and no all-NULL rows?

Hey @Thierry Vernhet!

Hmm, I'm not sure you can do this with RegexSerDe alone. You will probably need to do one of the following:

- Filter out those NULL rows in a query (see the sketch after this list), or
- Build an ETL that skips the NULL rows and loads the ARMING rows into another table, or
- Create a view, or
- Clean the data before putting it into HDFS.
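
For example, here's a minimal sketch of the first two ideas (the names my_arming_view and my_arming_clean are just placeholders I made up):

-- A view that hides the unmatched (all-NULL) rows
CREATE VIEW my_arming_view AS
SELECT * FROM my_arming_table
WHERE dc_logtype = 'ARMING';

-- Or copy only the matched rows into a separate clean table
CREATE TABLE my_arming_clean STORED AS ORC AS
SELECT * FROM my_arming_table
WHERE dc_logtype = 'ARMING';

Then you would query my_arming_view (or my_arming_clean) instead of my_arming_table.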


I did some research in the code, and here are the parts that explain this behavior:
https://github.com/apache/hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/Re...

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    if (inputPattern == null) {
      throw new SerDeException(
          "This table does not have serde property \"input.regex\"!");
    }
    Text rowText = (Text) blob;

    Matcher m = inputPattern.matcher(rowText.toString());

    // If do not match, ignore the line, return a row with all nulls.
    if (!m.matches()) {
      unmatchedRows++;
      if (unmatchedRows >= nextUnmatchedRows) {
        nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows);
        // Report the row
        LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText);
      }
      return null;
    }

    // Otherwise, return the row.
    for (int c = 0; c < numColumns; c++) {
      try {
        row.set(c, m.group(c + 1));
      } catch (RuntimeException e) {
        partialMatchedRows++;
        if (partialMatchedRows >= nextPartialMatchedRows) {
          nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows);
          // Report the row
          LOG.warn("" + partialMatchedRows
              + " partially unmatched rows are found, " + " cannot find group "
              + c + ": " + rowText);
        }
        row.set(c, null);
      }
    }
    return row;
  }

Hope this helps! 🙂

OK

Thanks for everything.
