Created 06-06-2018 10:26 AM
Hi,
I would like to create an external table with an Hive regex expression by selecting lines containing "ARMING" (in uppercase).
The records HDFS look like this
2018-06-06T11:28:54+02:00 sazqye named[980]: ARMING trigger on (action:LOG) (T6-Recursive-attacks recursive-time: 1283)
2018-06-06T11:20:27+02:00 sazqyd named[92960]: client (1.debian)
...
My request :
CREATE EXTERNAL TABLE my_arming_table ( dc_syslog_date STRING, dc_syslog_hostname STRING, dc_syslog_process STRING, dc_logtype STRING, dc_message STRING) PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES( 'input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)') STORED AS TEXTFILE;
The result is KO :
> select * from my_arming_table limit 2 ;
OK
NULL NULL NULL NULL NULL 0 0 0
NULL NULL NULL NULL NULL 0 0 0
And if I try this request (with client in lowercase)
CREATE EXTERNAL TABLE my_client_table ( dc_syslog_date STRING, dc_syslog_hostname STRING, dc_syslog_process STRING, dc_logtype STRING, dc_message STRING) PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES( 'input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(client)\\s(.*)') STORED AS TEXTFILE;
The result is OK
> select * from my_client_table limit 2 ;
OK
2018-06-06T11:12:55+02:00 sazqyd named[92960]: client (swza6z) 0 0 0
2018-06-06T11:13:10+02:00 sazqyd named[92960]: client (osce01) 0 0 0
Does anybody knows why it doesn't work with uppercase in the regex expression ?
Thanks
Created 06-07-2018 03:46 PM
Hey @Thierry Vernhet!
Hmm, I'm not sure if you can do this only using RegexSerde. Probably you will need to
- Filter those NULL values in a query
or
- Construct an ETL by passing the NULL values and grabbing the ARMING values to other table
or
- Create a View
or
- Clean the data before putting into HDFS.
I made a research at the code and here's some parts of the code that explain it:
https://github.com/apache/hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/Re...
@Override public Object deserialize(Writable blob) throws SerDeException { if (inputPattern == null) { throw new SerDeException( "This table does not have serde property \"input.regex\"!"); } Text rowText = (Text) blob; Matcher m = inputPattern.matcher(rowText.toString()); // If do not match, ignore the line, return a row with all nulls. if (!m.matches()) { unmatchedRows++; if (unmatchedRows >= nextUnmatchedRows) { nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows); // Report the row LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText); } return null; } // Otherwise, return the row. for (int c = 0; c < numColumns; c++) { try { row.set(c, m.group(c + 1)); } catch (RuntimeException e) { partialMatchedRows++; if (partialMatchedRows >= nextPartialMatchedRows) { nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows); // Report the row LOG.warn("" + partialMatchedRows + " partially unmatched rows are found, " + " cannot find group " + c + ": " + rowText); } row.set(c, null); } } return row; }
Hope this helps! 🙂
Created 06-06-2018 06:27 PM
Hi @Thierry Vernhet
Do you have more examples from your HDFS records? 🙂
I made a simple test here, and its seems to be working your REGEX with the Uppercase
Here's the output
CREATE EXTERNAL TABLE my_arming_table ( dc_syslog_date STRING, dc_syslog_hostname STRING, dc_syslog_process STRING, dc_logtype STRING, dc_message STRING) PARTITIONED BY (yearbrut INT, monthbrut INT, daybrut INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES( 'input.regex'='^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*)') STORED AS TEXTFILE LOCATION '/user/hive/warehouse/my_arming_table/'; ALTER TABLE my_arming_table ADD PARTITION(yearbrut=2018,monthbrut=06,daybrut=06); #Command to put some data in the table hdfs dfs -put regex_hcc /user/hive/warehouse/my_arming_table/yearbrut=2018/monthbrut=6/daybrut=6/ #And here's a quick describe from my tablehive> desc formatted my_arming_table; OK # col_name data_type comment dc_syslog_date string dc_syslog_hostname string dc_syslog_process string dc_logtype string dc_message string # Partition Information # col_name data_type comment yearbrut int monthbrut int daybrut int # Detailed Table Information Database: default Owner: hive CreateTime: Wed Jun 06 18:00:02 UTC 2018 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://vini-horton/user/hive/warehouse/my_arming_table Table Type: EXTERNAL_TABLE Table Parameters: EXTERNAL TRUE transient_lastDdlTime 1528308002 # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.RegexSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Compressed: No Num Buckets: -1 Bucket Columns: [] Sort Columns: [] Storage Desc Params: input.regex ^(\\S+)\\s(\\S+)\\s(\\S+)\\s(ARMING)\\s(.*) serialization.format 1 #Finally the select outputhive>select * from my_arming_table; OK 2018-06-06T11:28:54+02:00 sazqye named[980]: ARMING trigger on (action:LOG) (T6-Recursive-attacks recursive-time: 1283) 2018 6 6 NULL NULL NULL NULL NULL 2018 6 6 Time taken: 0.175 seconds, Fetched: 2 row(s)
Created 06-07-2018 08:07 AM
Thanks Vinicius. You're right. I didn't see the "ARMING" records with the "select...limit 2;" because the's a lot of "client" records and few "ARMING" records.
But now, how exclude other records and have only "ARMING" rows in the final table and no row "NULL" ?
Created 06-07-2018 03:46 PM
Hey @Thierry Vernhet!
Hmm, I'm not sure if you can do this only using RegexSerde. Probably you will need to
- Filter those NULL values in a query
or
- Construct an ETL by passing the NULL values and grabbing the ARMING values to other table
or
- Create a View
or
- Clean the data before putting into HDFS.
I made a research at the code and here's some parts of the code that explain it:
https://github.com/apache/hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/Re...
@Override public Object deserialize(Writable blob) throws SerDeException { if (inputPattern == null) { throw new SerDeException( "This table does not have serde property \"input.regex\"!"); } Text rowText = (Text) blob; Matcher m = inputPattern.matcher(rowText.toString()); // If do not match, ignore the line, return a row with all nulls. if (!m.matches()) { unmatchedRows++; if (unmatchedRows >= nextUnmatchedRows) { nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows); // Report the row LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText); } return null; } // Otherwise, return the row. for (int c = 0; c < numColumns; c++) { try { row.set(c, m.group(c + 1)); } catch (RuntimeException e) { partialMatchedRows++; if (partialMatchedRows >= nextPartialMatchedRows) { nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows); // Report the row LOG.warn("" + partialMatchedRows + " partially unmatched rows are found, " + " cannot find group " + c + ": " + rowText); } row.set(c, null); } } return row; }
Hope this helps! 🙂
Created 06-08-2018 05:43 AM
OK
Thanks for all