New Contributor
Posts: 5
Registered: ‎07-06-2018

How to read DECIMAL datatype present parquet file using Morphlines config



I want to read parquet files using Morphlines.

Reference: https://medium.com/@bkvarda/index-parquet-with-morphlines-and-solr-20671cd93a41

My Parquet file has DECIMAL datatypes. I cannot find any documentation on how to handle the DECIMAL datatype in Morphlines. I am using the code below in my conf file, but it is not working.

===============================================================================


SOLR_LOCATOR : {

# Name of solr collection
#collection : citiscreening
collection : icttdnee_ttsd_collection
#solrHomeDir : ${HOME}/solr_citiscreening_configs
# ZooKeeper ensemble -- edit this for your cluster's Zk hostname(s)
zkHost : "bdgtr018x01h2.nam.nsroot.net:2181,bdgtr013x03h2.nam.nsroot.net:2181,bdgtr015x02h2.nam.nsroot.net:2181/solr"

#bdgtr013x04h2:9983/solr

# The maximum number of documents to send to Solr per network batch (throughput knob)
# batchSize : 1000
}

morphlines : [
{
# Name used to identify a morphline. E.g. used if there are multiple
# morphlines in a morphline config file
id : solrTest

# Import all morphline commands in these java packages and their
# subpackages. Other commands that may be present on the classpath are
# not visible to this morphline.
importCommands : ["org.kitesdk.**", "com.cloudera.**", "org.apache.solr.**"]

commands : [

# Read the Parquet data

{ readAvroParquetFile {
# For Parquet files that were not written with the parquet.avro package
# (e.g. Impala Parquet files) there is no Avro write schema stored in
# the Parquet file metadata. To read such files using the
# readAvroParquetFile command you must either provide an Avro reader
# schema via the readerSchemaFile parameter, or a default Avro schema
# will be derived using the standard mapping specification.

# Optionally, use this Avro schema in JSON format inline for projection:
readerSchemaString:"""{ "type": "record"
,"name": "my_record"
,"fields": [

{"name": "audit_internal_id","type":["bytes","null"],"logicalType":"decimal","precision":38,"scale":10,"default":0 }
,{"name": "alert_id","type":["bytes","null"],"logicalType":"decimal","precision":38,"scale":10,"default":0 }
,{"name": "created_date", "type":["null","string"]}
,{"name": "event", "type":["null","string"]}
,{"name": "comments", "type":["null","string"]}
,{"name": "user_identifier", "type":["null","string"]}
,{"name": "user_role", "type":["null","string"]}
,{"name": "status", "type":["null","string"]}
,{"name": "step_identifier", "type":["null","string"]}
,{"name": "attachment_internal_id","type":["bytes","null"],"logicalType":"decimal","precision":38,"scale":10,"default":0 }
,{"name": "note_internal_id","type":["bytes","null"],"logicalType":"decimal","precision":38,"scale":10,"default":0 }
,{"name": "owner", "type":["null","string"]}

]
}"""

}
}


{ logDebug { format : "output record {}", args : ["@{}"] } }


{ extractAvroPaths {
flatten : true
paths : {

audit_internal_id : /audit_internal_id
alert_id : /alert_id
created_date : /created_date
event : /event
comments : /comments
user_identifier : /user_identifier
user_role : /user_role
status : /status
step_identifier : /step_identifier
attachment_internal_id: /attachment_internal_id
note_internal_id : /note_internal_id
owner : /owner

}
}
}

{ sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

# load the record into a Solr server or MapReduce Reducer.
{ loadSolr { solrLocator : ${SOLR_LOCATOR} } }

]
}
]
==================================
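One likely problem with the schema above: in Avro, `logicalType`, `precision`, and `scale` must annotate the `"bytes"` type itself, not sit as siblings of the union at the field level, where they are silently ignored. Also, a union's default value must match the union's first branch, so `"default": 0` with `["bytes","null"]` is invalid. A hedged sketch of how one such field could be declared (whether `readAvroParquetFile` then surfaces a decimal value rather than raw bytes still depends on the Avro and Parquet versions on the classpath):

```json
{"name": "audit_internal_id",
 "type": ["null",
          {"type": "bytes",
           "logicalType": "decimal",
           "precision": 38,
           "scale": 10}],
 "default": null}
```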

Data in logs:

DEBUG org.kitesdk.morphline.stdlib.LogDebugBuilder$LogDebug - output record [{_attachment_body=[{"audit_internal_id": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0002#\u0014I��\u0000", "alert_id": "\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0001\u000F\u001A(��\u0000", "created_date": "2018-03-19", "event": "ALERT_REVIEWED", "comments": "Alert Reviewed, and Submitted by User:LV1234"}]
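The escaped bytes in the log are each decimal's unscaled value stored as a big-endian two's-complement integer. As a sanity check outside Morphlines, such bytes can be decoded by hand; a minimal Python sketch (the helper name and sample bytes are illustrative; scale 10 is what the schema above declares):

```python
from decimal import Decimal

def decode_avro_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode an Avro/Parquet DECIMAL: big-endian two's-complement
    unscaled integer, shifted right by `scale` decimal digits."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# Illustrative input: b"\x01\x00" is unscaled 256; with scale 2 -> 2.56
print(decode_avro_decimal(b"\x01\x00", 2))  # 2.56
```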


Re: How to read DECIMAL datatype present parquet file using Morphlines config

Expected: it should read the actual values of the DECIMAL fields, not raw bytes.
