// This query:
sqlContext.sql("select * from retail_invoice").show
// gives this output (column headers only, no rows):
+---------+---------+-----------+--------+-----------+---------+----------+-------+
|invoiceno|stockcode|description|quantity|invoicedate|unitprice|customerid|country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+
// The Hive DDL for the table in HiveView 2.0:
CREATE TABLE `retail_invoice`(
`invoiceno` string,
`stockcode` string,
`description` string,
`quantity` int,
`invoicedate` string,
`unitprice` double,
`customerid` string,
`country` string)
CLUSTERED BY (
stockcode)
INTO 2 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://hadoopsilon2.zdwinsqlad.local:8020/apps/hive/warehouse/retail_invoice'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"country\":\"true\",\"quantity\":\"true\",\"customerid\":\"true\",\"description\":\"true\",\"invoiceno\":\"true\",\"unitprice\":\"true\",\"invoicedate\":\"true\",\"stockcode\":\"true\"}}',
'numFiles'='2',
'numRows'='541909',
'orc.bloom.filter.columns'='StockCode, InvoiceDate, Country',
'rawDataSize'='333815944',
'totalSize'='5642889',
'transactional'='true',
'transient_lastDdlTime'='1517516006')
I can query the data in Hive without any problems. The data is inserted from NiFi using the PutHiveStreaming processor.
We have tried recreating the table, but the same problem occurs. I haven't found any odd-looking configuration settings.
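For reference, here is a minimal sketch of how the table can be compared between Spark and the Hive metadata above. It assumes the same spark-shell session as the query at the top, where `sqlContext` is predefined, and it assumes the Spark build in use accepts `show tblproperties`. It is only a diagnostic, not a fix.

```scala
// Assumes the same spark-shell session as above, where `sqlContext` is predefined.

// Table properties as Spark sees them; given the DDL above, the list is
// expected to include 'transactional'='true' and 'numRows'='541909'.
sqlContext.sql("show tblproperties retail_invoice").show(100, false)

// Row count as Spark sees it, to compare against the numRows statistic
// that Hive reports for the same table.
sqlContext.sql("select count(*) from retail_invoice").show
```

If the count from Spark differs from the `numRows` statistic while the properties match, that would suggest Spark resolves the right table but does not read its data files.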
Any ideas on what could be going on here?