I'm still a noob in this world, so bare with me if there are any misconceptions.
What have I done so far is, I've setup Hadoop and Hive in my linux enviroment and I'm trying to extract data from an HDFS file into my Hive table.
This is how the log line looks like:
apache2013-10-09T14:04:32Zphp18.104.22.168/EKEE.php20019705910Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24
This my Hive query:
CREATE EXTERNAL TABLE LogParserSample( logtype STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' = '^([\w\-]+)' ) STORED AS TEXTFILE LOCATION '/mypath/';
When I did execute the above the table got created successfully. But then when I tried to query the data from that table using a SELECT statement, I'm getting null instead of the word apache. For this instance, I'm trying to extract the word apache from the above log line. But then I tried executing the same regex in rubular, the outcome was correct as I expected. I'm unable to figure out why!
Have I missed anything trivial above? Any help could be appreciated. Thanks!
It seems that you regexp pattern is not correct. Try this one ^[^0-9]+
If you need more help with regular expressions I recommend this site https://regex101.com/
This pattern starts from the beginning of the string and then is looking for any character except numbers. With your example row the result is string "apache".
Thank you @Andres Koitmäe for your quick reply.
I updated my query with the regex you mentioned, but still, I'm getting it as null. Is that the normal behavior when I search in with a SELECT query after executing the hive query?
To make sure that I've made the correct changes I'm pasting the hive query here again:
CREATE EXTERNAL TABLE LogParserSample( logtype STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' = '^[^0-9]+' ) STORED AS TEXTFILE LOCATION '/mypath/';