Member since: 04-24-2017
Posts: 106
Kudos Received: 13
Solutions: 7
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1420 | 11-25-2019 12:49 AM
 | 2506 | 11-14-2018 10:45 AM
 | 2258 | 10-15-2018 03:44 PM
 | 2126 | 09-25-2018 01:54 PM
 | 1948 | 08-03-2018 09:47 AM
09-11-2018
02:06 PM
Here is the extended explain output of the DataFrame (LLAP is not enabled in our cluster):
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [importtime: timestamp, tester: string ... 21 more fields]
== Parsed Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
+- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- SubqueryAlias mytable
+- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Analyzed Logical Plan ==
importtime: timestamp, tester: string, testerhead: string, matchcode: string, revision: string, usertext: string, teststufe: string, temp: string, filedat: date, starttime: timestamp, endtime: timestamp, lotnrc: string, wafernr: string, testname: string, testnumber: string, testtype: string, unit: string, highl: string, lowl: string, highs: string, lows: string, valuelist: string, hashvalue: int
GlobalLimit 1
+- LocalLimit 1
+- Project [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- SubqueryAlias mytable
+- Relation[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] orc
== Optimized Logical Plan ==
InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 1
+- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
== Physical Plan ==
InMemoryTableScan [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22]
+- InMemoryRelation [importtime#0, tester#1, testerhead#2, matchcode#3, revision#4, usertext#5, teststufe#6, temp#7, filedat#8, starttime#9, endtime#10, lotnrc#11, wafernr#12, testname#13, testnumber#14, testtype#15, unit#16, highl#17, lowl#18, highs#19, lows#20, valuelist#21, hashvalue#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 1
+- *FileScan orc mydb.mytable[importtime#0,tester#1,testerhead#2,matchcode#3,revision#4,usertext#5,teststufe#6,temp#7,filedat#8,starttime#9,endtime#10,lotnrc#11,wafernr#12,testname#13,testnumber#14,testtype#15,unit#16,highl#17,lowl#18,highs#19,lows#20,valuelist#21,hashvalue#22] Batched: false, Format: ORC, Location: CatalogFileIndex[hdfs://hdp-m-01:8020/apps/hive/warehouse/mydb.db/mytab..., PartitionCount: 1000, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<importtime:timestamp,tester:string,testerhead:string,matchcode:string,revision:string,user...
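As a note on how I plan to double-check this (just a sketch; it assumes filedat is the partition column, which the plan above does not actually confirm): if partition pruning works, the FileScan line should show a non-empty PartitionFilters list instead of PartitionFilters: [].
%spark
// Hypothetical check: filedat is only assumed to be the partition column here.
// Compare the PartitionFilters entry in the physical plan with and without the
// predicate on the partition column.
val dfCheck = sqlContext.sql("SELECT * FROM mydb.mytable WHERE filedat = cast('2018-09-01' as date) LIMIT 1")
dfCheck.explain(true)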
09-11-2018
12:01 PM
I have a huge Hive table (ORC) and I want to select just a few of its rows (in Zeppelin):
%spark
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val df1 = sqlContext.sql("SELECT * FROM mydb.myhugetable LIMIT 1") // Takes 10 mins
val df2 = sqlContext.sql("SELECT * FROM mydb.myhugetable").limit(1) // Takes 10 mins
Neither the LIMIT clause in the SQL statement nor the corresponding DataFrame method df.limit(1) helps; the query still takes far too long. It seems to read the whole table first and only then return the n rows. How can I get the limit applied while the query runs, so that it finishes faster? Shouldn't the filter pushdown help here? I can't see any difference between setting spark.sql.orc.filterPushdown to true or false.
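One workaround I am considering (only a sketch, untested; it assumes the table is partitioned and that filedat is the partition column, neither of which is confirmed above): add a predicate on the partition column so that Spark only lists and scans a single partition before applying the limit.
%spark
// Assumption: filedat is the partition column of mydb.myhugetable (hypothetical).
// With partition pruning, only that partition's ORC files are read, and the
// LIMIT then only touches a handful of rows.
val dfOnePartition = sqlContext.sql("SELECT * FROM mydb.myhugetable WHERE filedat = cast('2018-09-01' as date) LIMIT 1")
dfOnePartition.show()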
Thank you!
09-03-2018
07:15 AM
Thank you! Can you give me some details about this or do you have some helpful links?
08-15-2018
07:35 AM
I have a huge Hive table that I want to process in Spark. To avoid a full table scan for my test purposes, I first run this code:
%spark
val df1 = sqlContext.sql("select * from mydb.mytable limit 10")
df1.show()
That works very fast (1 sec). Now I added another query to "transform" the result a little bit, and now Hive seems to do a full table scan, as this execution takes more than 10 minutes!
%spark
val df2 = sqlContext.sql("SELECT concat_ws(\":\", lpad(cast((hashvalue % 200) as String), 3, \"0\"), concat_ws(\"_\", name, number, test, step, cast(starttime as STRING))) as key, hashvalue, valuelist from (select * from mydb.mytable limit 10) a")
df2.show()
1. Is this normal behavior?
2. Is there a way to avoid full table scans when using sub-queries?
Thank you!
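A workaround sketch I might try (untested; it assumes Spark 2.x, where createOrReplaceTempView is available - on Spark 1.6, registerTempTable would be the equivalent): materialize the 10-row sample first, then run the transformation against the cached result, so that the concat/lpad projection never touches the full table.
%spark
// Materialize the small sample once and cache it.
val sample = sqlContext.sql("select * from mydb.mytable limit 10").cache()
sample.count()  // forces the limit to execute and populates the cache
sample.createOrReplaceTempView("mytable_sample")  // hypothetical view name
// The transformation now only reads the 10 cached rows.
val df2b = sqlContext.sql("SELECT concat_ws(\":\", lpad(cast((hashvalue % 200) as String), 3, \"0\"), concat_ws(\"_\", name, number, test, step, cast(starttime as STRING))) as key, hashvalue, valuelist FROM mytable_sample")
df2b.show()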
Labels: Apache Hive, Apache Spark, Apache Zeppelin
08-08-2018
08:40 AM
1 Kudo
I created a Hive table that accesses an HBase table. Something like this:
create external table hbase_50mio_200regions (rowkey String, valuelist String)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:valueList")
TBLPROPERTIES ('hbase.table.name' = 'hbase_50mio_200regions')
The access works, and HiveQL queries are fast against small HBase tables. The problem is queries against huge HBase tables: even a simple count takes very long (more than an hour on a 50 million row HBase table with about 1300 GB of HFiles)! Is there a way to pass a start and stop key to my Hive SELECT or COUNT queries? As I read in this answer https://stackoverflow.com/a/40729818, WHERE conditions won't avoid a full table scan, because they only filter the rows after iterating over all rows of the HBase table. Is this right, and/or is there another way to avoid full table scans when accessing my HBase table through Hive?
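For comparison (not a Hive-level solution, just a sketch of what a bounded read against the same table looks like through the HBase Java API from Scala, e.g. in spark-shell; the key values are made up and the HBase client jars are assumed to be on the classpath): passing a start and stop row to a Scan avoids the full table scan entirely.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val hbaseConf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(TableName.valueOf("hbase_50mio_200regions"))

val scan = new Scan()
scan.setStartRow(Bytes.toBytes("rowkey_start"))  // inclusive start key (example value)
scan.setStopRow(Bytes.toBytes("rowkey_stop"))    // exclusive stop key (example value)

val scanner = table.getScanner(scan)
try {
  var count = 0L
  scanner.asScala.foreach(_ => count += 1)
  println(s"rows in key range: $count")
} finally {
  scanner.close()
  table.close()
  conn.close()
}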
Labels: Apache HBase, Apache Hive
08-07-2018
02:14 PM
Is there a possibility to query a salted HBase table using the HBase Shell? E.g. my HBase table row keys look like this:
000:rowkey0
000:rowkey2
...
001:rowkey1
001:rowkey3
...
How can I use my HBase Shell to query this table without doing a full table scan, e.g. when I want to read out the row with rowkey1? Thank you!
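Not the shell, but the same idea against the HBase Java API from Scala (a sketch with made-up names): if the salt prefix cannot be recomputed on the client, issue one Get per possible salt bucket in a single multi-get instead of scanning the whole table; if the salt is something like hash % N, the single correct prefix can be computed directly.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("my_salted_table"))  // hypothetical table name

// One Get per possible salt bucket (example: prefixes 000..199), sent as one multi-get.
val gets = (0 until 200).map(b => new Get(Bytes.toBytes(f"$b%03d:rowkey1"))).asJava
val results = table.get(gets)  // Table.get(java.util.List[Get]): no full scan involved
results.filterNot(_.isEmpty).foreach(r => println(Bytes.toString(r.getRow)))

table.close()
conn.close()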
Labels: Apache HBase
08-06-2018
12:40 PM
@Josh Elser Thank you for that information. I changed my HBase table creation to the following command:
create 'hbase_1m_10r', {NAME => 'cf'}, {SPLITS => ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}
When running the following query:
INSERT OVERWRITE TABLE dmueller.hbase_1m_10r SELECT concat_ws(":", cast((hashvalue % 10) as String), concat_ws("_", name, number, test, step, cast(starttime as STRING))) as k, valuelist from (select * from testdb.test_orc limit 1000000) a distribute by split(k, ":")[0] sort by k
I still get only 1 reducer... Any idea why?
08-03-2018
09:47 AM
Found the following article: http://www.openkb.info/2017/05/hive-on-tez-how-to-control-number-of.html It helped me. I now set the number of tasks to a fixed amount.
08-03-2018
07:34 AM
I found the following article about how to fill an HBase table with data from Hive: https://community.hortonworks.com/articles/2745/creating-hbase-hfiles-from-an-existing-hive-table.html I followed the steps myself, and it seems to work. The problem appears when I run the following HiveQL:
set hfile.family.path=/tmp/test_hbase/cf
set hive.hbase.generatehfiles=true
INSERT OVERWRITE TABLE testdb.test_hbase SELECT distinct concat_ws("_", name, number, test, step, cast(starttime as STRING)) as k, hashValue, valuelist from testdb.test_orc order by k, hashvalue limit 1000
I need to combine 4 columns to get a unique row key for my HBase table. Another complication is that my valueList column can contain huge strings, between 0 and 1 MB. When I run the query, Tez creates 100 containers for the map tasks. This takes a few minutes to complete, which is also slow for 1000 rows, but OK. After the map step, a reduce step follows, and this is probably the problem: there is only 1 reducer for this huge amount of data. That seems too few, as the job now takes hours (and still hasn't completed!). My questions here:
1. What are the map and reduce steps doing in this scenario?
2. Why is there only 1 reducer?
3. Can I somehow change this behavior (e.g. disable reducing or use more reducers)?
Thank you!
08-01-2018
08:12 PM
Good question. I'm just running the insert into ... select * from ... command, as I could do e.g. in Beeline or the Ambari Hive View (JDBC). Is this running in single-insert or batch mode?
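(To clarify for myself what "batch mode" would mean on the client side, here is a sketch of explicitly batched writes through the HBase Java API from Scala - all names are made up, and this is not necessarily what Hive does internally.)
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
// A BufferedMutator buffers Puts client-side and flushes them to the region
// servers in batches, instead of issuing one RPC per row ("single insert").
val mutator = conn.getBufferedMutator(TableName.valueOf("hbase_1m_10r"))
(1 to 1000).foreach { i =>
  val put = new Put(Bytes.toBytes(s"rowkey$i"))  // example key
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("valueList"), Bytes.toBytes(s"value$i"))
  mutator.mutate(put)
}
mutator.flush()  // sends any remaining buffered mutations
mutator.close()
conn.close()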