<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Extract text and Replace text processors regex in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Extract-text-and-Replace-text-processors-regex/m-p/211192#M66012</link>
    <description>&lt;P&gt;I have sorted it out. Closing the question.&lt;/P&gt;</description>
    <pubDate>Fri, 24 Nov 2017 00:19:47 GMT</pubDate>
    <dc:creator>mark_hadoop</dc:creator>
    <dc:date>2017-11-24T00:19:47Z</dc:date>
    <item>
      <title>Extract text and Replace text processors regex</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Extract-text-and-Replace-text-processors-regex/m-p/211191#M66011</link>
      <description>&lt;P&gt;I have the data below in HDFS:&lt;/P&gt;&lt;P&gt;a="alphabet_123_a" b="alphabetb" c="alphabet"c" is third one"&lt;/P&gt;&lt;P&gt;b="newb" d="alphabet@/d" a="new a"&lt;/P&gt;&lt;P&gt;a="changed a", b="changed b" c="changed c" e="alphabet e"&lt;/P&gt;&lt;P&gt;My idea is:&lt;/P&gt;&lt;P&gt;1. Create a Hive table stored as ORC, with columns a, b, c, d, e.&lt;/P&gt;&lt;P&gt;2. Extract the attributes from the data above.&lt;/P&gt;&lt;P&gt;3. Map the attributes to the matching Hive column names and store them in Hive.&lt;/P&gt;&lt;P&gt;4. The first line has a, b, c; the second line has b, d, a; the third line has a, b, c, e.&lt;/P&gt;&lt;P&gt;5. After all the lines are extracted and stored in Hive, any value that is not present in a line (e.g. the first line doesn't have "d" and "e"; the second line doesn't have "c" and "e"; the third line doesn't have "d") should be stored as NULL.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Approach&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1. The table "details" is created with columns a, b, c, d, e.&lt;/P&gt;&lt;P&gt;2. The Extract text processor is configured with these custom properties:&lt;/P&gt;&lt;P&gt;    (a=)(.*?(?=\s\w+=|$))   --- [This will extract "alphabet_123_a" in line 1, along with the quotes (") at the beginning and end of the value]&lt;/P&gt;&lt;P&gt;    (b=)(.*?(?=\s\w+=|$))   --- [This will extract "alphabetb" in line 1, along with the quotes]&lt;/P&gt;&lt;P&gt;3. &lt;U&gt;I am confused about the Replace text&lt;/U&gt; processor: &lt;/P&gt;&lt;P&gt;      1. How do I remove the double quotes?&lt;/P&gt;&lt;P&gt;      2. How do I insert NULL values if the corresponding column name is missing from the line?&lt;/P&gt;&lt;P&gt;      3. How do I generalize the search value in Replace text?&lt;/P&gt;&lt;P&gt;Also, please let me know how I can change the regex in the Extract text processor (if necessary).&lt;/P&gt;&lt;P&gt;Please help me.&lt;/P&gt;&lt;P&gt;Thanks &lt;/P&gt;</description>
      <pubDate>Fri, 04 Aug 2017 05:58:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Extract-text-and-Replace-text-processors-regex/m-p/211191#M66011</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2017-08-04T05:58:12Z</dc:date>
    </item>
    <item>
      <title>Re: Extract text and Replace text processors regex</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Extract-text-and-Replace-text-processors-regex/m-p/211192#M66012</link>
      <description>&lt;P&gt;I have sorted it out. Closing the question.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Nov 2017 00:19:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Extract-text-and-Replace-text-processors-regex/m-p/211192#M66012</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2017-11-24T00:19:47Z</dc:date>
    </item>
  </channel>
</rss>

