Member since 01-31-2018 · 2 Posts · 0 Kudos Received · 0 Solutions
08-09-2018 06:02 PM
I am tuning a Spark application and noticed discrepancies between the job's metrics shown on the Spark History Server UI and the YARN ResourceManager UI. I've specified the following properties on my Zeppelin notebook's Spark interpreter:

master yarn-client
spark.app.name Zeppelin
spark.cores.max
spark.driver.memory 3g
spark.executor.cores 3
spark.executor.instances 2
spark.executor.memory 4g

When I look at the YARN ResourceManager UI, I see no evidence that the executors' containers are getting 3 cores each; each one shows only 1 vcore. Yet when I check the Spark History Server, it describes each running executor as having 3 cores and reflects all the properties I've specified. What's up with this? Which of these should I be looking at?

YARN 3.1.0, Zeppelin 0.8.0, Spark2 2.3.1
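One way to cross-check what the two UIs report is to ask the running application itself. This is a minimal sketch for a Zeppelin %spark paragraph, assuming the interpreter exposes the usual `sc` SparkContext; it reads back the configuration Spark is actually using, independent of either UI:

```scala
// Run in a Zeppelin %spark paragraph; `sc` is the SparkContext provided
// by the interpreter. getConf returns the effective runtime settings.
println(sc.getConf.get("spark.executor.cores"))
println(sc.getConf.get("spark.executor.instances"))
println(sc.getConf.get("spark.executor.memory"))

// List the executors the driver actually registered (the driver itself
// appears in this map as well).
sc.getExecutorMemoryStatus.keys.foreach(println)
```

If Spark reports 3 cores per executor here, the discrepancy is likely only in how the ResourceManager UI accounts for vcores, not in the resources the executors actually received.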
Labels:
- Apache Spark
- Apache YARN
- Apache Zeppelin
02-02-2018 09:28 PM
// This query:
sqlContext.sql("select * from retail_invoice").show
// gives this output:
+---------+---------+-----------+--------+-----------+---------+----------+-------+
|invoiceno|stockcode|description|quantity|invoicedate|unitprice|customerid|country|
+---------+---------+-----------+--------+-----------+---------+----------+-------+
+---------+---------+-----------+--------+-----------+---------+----------+-------+
// The Hive DDL for the table in HiveView 2.0:
CREATE TABLE `retail_invoice`(
`invoiceno` string,
`stockcode` string,
`description` string,
`quantity` int,
`invoicedate` string,
`unitprice` double,
`customerid` string,
`country` string)
CLUSTERED BY (
stockcode)
INTO 2 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://hadoopsilon2.zdwinsqlad.local:8020/apps/hive/warehouse/retail_invoice'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"country\":\"true\",\"quantity\":\"true\",\"customerid\":\"true\",\"description\":\"true\",\"invoiceno\":\"true\",\"unitprice\":\"true\",\"invoicedate\":\"true\",\"stockcode\":\"true\"}}',
'numFiles'='2',
'numRows'='541909',
'orc.bloom.filter.columns'='StockCode, InvoiceDate, Country',
'rawDataSize'='333815944',
'totalSize'='5642889',
'transactional'='true',
'transient_lastDdlTime'='1517516006')

I can query the data in Hive just fine. The data is inserted from NiFi using the PutHiveStreaming processor. We have tried recreating the table, but the same problem arises, and I haven't found any odd-looking configurations. Any ideas on what could be going on here?
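Since the DDL shows `'transactional'='true'` and the rows arrive via Hive Streaming, one thing worth checking is what actually sits in the table directory. Streaming ingest writes ACID `delta_*` subdirectories, and to my knowledge plain Spark 2.x SQL readers do not merge those deltas, which could explain an empty result even though Hive reads the rows fine. A sketch, using the warehouse path from the DDL above and the same Spark session:

```scala
import org.apache.hadoop.fs.Path

// Inspect the table's warehouse directory. If everything is in
// delta_* directories (no base files), Spark's reader may see no data.
val tableDir = new Path("/apps/hive/warehouse/retail_invoice")
val fs = tableDir.getFileSystem(sc.hadoopConfiguration)
fs.listStatus(tableDir).foreach(s => println(s.getPath.getName))
```

If the listing is mostly `delta_*` entries, running a Hive compaction (or reading through a connector that understands ACID tables) would be the direction to investigate.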
Labels:
- Apache Hive
- Apache NiFi
- Apache Spark