HbaseExtract and solrCell integration issues


New Contributor

I have the requirement of using the ks_indexer to index an HBase table that contains information for a given crawled URL. The table has one column family, data, with several qualifiers for information, such as a business id, category, source, etc., as well as the main HTML source from a web crawler. The Solr index needs to have indexed fields for all the meta info (business id, source, etc.) as well as the parsed HTML via Solr extraction (i.e. the links, content, keywords, and meta fields that Solr extraction automatically generates when parsing HTML).
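To make the intended mapping concrete, here is a small sketch in plain Python (not morphline code) of how each cell in the data column family should be routed. The qualifier names match the config further down; the row values are made up for illustration.

```python
# Hypothetical HBase row: one "data" column family whose qualifiers hold
# the metadata cells plus the raw crawled HTML (values are invented).
row = {
    "data:url": b"www.myTestUrl.com",
    "data:cid": b"test",
    "data:cat": b"retail",
    "data:src": b"test",
    "data:dom": b"example.com",
    "data:htm": b"<html><body>My first paragraph.</body></html>",
}

# extractHBaseCells-style routing: each metadata cell goes to a named Solr
# field, and the HTML bytes go to _attachment_body for SolrCell/Tika parsing.
field_map = {
    "url": "WEB_PAGE_URL",
    "cid": "custID",
    "cat": "category",
    "src": "source",
    "dom": "domain",
    "htm": "_attachment_body",
}

record = {}
for column, value in row.items():
    qualifier = column.split(":", 1)[1]  # drop the "data:" family prefix
    record[field_map[qualifier]] = value

# record now holds the five ready-to-index string fields plus the HTML
# attachment body, which is what the morphline below is meant to produce.
```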

 

Having laid that all out, I am using extractHBaseCells with solrCell morphlines to try and perform this:

 

SOLR_LOCATOR : {
  # Name of solr collection
  collection : morphTest
  
  # ZooKeeper ensemble
  zkHost : "$ZK_HOST" 
}

morphlines : [
    {
    id : morphline
    importCommands : ["com.cloudera.**", "com.ngdata.**","org.apache.solr.**"]

    commands : [                    
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "data:url"
              outputField : "WEB_PAGE_URL" 
              type : string
              source : value
            }
            {
              inputColumn : "data:cid"
              outputField : "custID"
              type : string
              source : value
            } 
            {
              inputColumn : "data:cat"
              outputField : "category"
              type : string
              source : value
            } 
            {
              inputColumn : "data:src"
              outputField : "source"
              type : string
              source : value
            } 
            {
              inputColumn : "data:dom"
              outputField : "domain"
              type : string
              source : value
            }                                    
            {
              inputColumn : "data:htm"
              outputField : "_attachment_body" 
              type : "byte[]"
              source : value
            }
          ]
        }
      }

      {
        # used for auto-detection if MIME type isn't explicitly supplied
        detectMimeType {
          includeDefaultMimeTypes : true
          #mimeTypesFiles : [target/test-classes/custom-mimetypes.xml]
        }
      }

      { 
        solrCell {
          solrLocator : ${SOLR_LOCATOR}

            # extract some fields
            capture : [content, title,links]

            # rename HTML-derived fields: anchors to links, head elements
            # to head_* fields, and Last-Modified to last_modified
            fmap : { a : links, style : head_style, base : head_base, link : head_links, script : head_script, noscript : head_noscript, Last-Modified : last_modified }
            uprefix: ignored_
            captureAttr: true
            lowernames: true
 

            # xpath : "/xhtml:html/xhtml:body/xhtml:div/descendant:node()"

            parsers : [ # one or more nested Tika parsers
              { parser : org.apache.tika.parser.html.HtmlParser }
            ]
          }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

 

Though I am having an issue where the HTML is parsed, but the other information is not indexed into its appropriate fields. It is instead parsed into the meta field of the extracted HTML:

 

"docs": [
      {
        "links": [
          "rect",
          "www.google.com"
        ],
        "meta": [
          "_attachment_mimetype",
          "text/html",
          "source",
          "test",
          "WEB_PAGE_URL",
          "www.myTestUrl.com",
          "Content-Encoding",
          "ISO-8859-1",
          "_attachment_body",
          "[B@41aa7618",
          "Content-Type",
          "text/html; charset=ISO-8859-1",
          "dc:title",
          "My second web page",
          "custID",
          "test"
        ],
        "WEB_PAGE_HEAD": [
          "_attachment_mimetype",
          "text/html",
          "source",
          "test",
          "WEB_PAGE_URL",
          "www.myTestUrl.com",
          "Content-Encoding",
          "ISO-8859-1",
          "_attachment_body",
          "[B@41aa7618",
          "Content-Type",
          "text/html; charset=ISO-8859-1",
          "dc:title",
          "My second web page",
          "custID",
          "test"
        ],
        "id": "r22333",
        "content": [
          "My Second Heading  My first paragraph.  this is a link"
        ],
        "WEB_PAGE_BODY": "My Second Heading  My first paragraph.  this is a link",
        "title": [
          "My second web page",
          "My second web page"
        ],
        "WEB_PAGE_TITLE": [
          "My second web page",
          "My second web page"
        ],
        "source": [
          "test"
        ],
        "content_encoding": [
          "ISO-8859-1"
        ],
        "content_type": [
          "text/html; charset=ISO-8859-1"
        ],
        "_version_": 1472639721215623200,
        "timestamp": "2014-07-03T20:17:59.442Z"
      }
    ]
  }
}

Any idea why this is the case? Is it possible to extract one field out of many and parse its contents?
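For illustration, the interleaved meta array in the output above can be reproduced with a short sketch. This is a guess at the mechanism, not confirmed from the SolrCell source: if the fields already on the record are handed to the Tika parser as document metadata, and all parser metadata is then flattened into one alternating name/value list, the HBase-derived fields ride along with Tika's own Content-Type and dc:title entries.

```python
# Illustrative only: simulate record fields being merged into the parser's
# metadata and flattened into a single alternating name/value "meta" list,
# matching the shape of the "meta" array in the query output above.
record_fields = {"WEB_PAGE_URL": "www.myTestUrl.com", "custID": "test"}
tika_fields = {
    "Content-Type": "text/html; charset=ISO-8859-1",
    "dc:title": "My second web page",
}

meta = []
for name, value in {**record_fields, **tika_fields}.items():
    meta.extend([name, value])

# meta now alternates names and values, e.g.
# ["WEB_PAGE_URL", "www.myTestUrl.com", "custID", "test", ...]
```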


Re: HbaseExtract and solrCell integration issues

Expert Contributor
The right place for conversations like this is the search-user@cloudera.org mailing list (archive here: https://groups.google.com/a/cloudera.org/forum/#!forum/search-user).


Re: HbaseExtract and solrCell integration issues

Master Collaborator

I'd also like to point out that we have a very active Discussion Board right here in the community that is dedicated to Cloudera Search:

 

http://community.cloudera.com/t5/Cloudera-Search/bd-p/Search