Member since: 04-24-2017
Posts: 106
Kudos Received: 13
Solutions: 7

My Accepted Solutions
Title | Views | Posted
---|---|---
| 1420 | 11-25-2019 12:49 AM
| 2508 | 11-14-2018 10:45 AM
| 2258 | 10-15-2018 03:44 PM
| 2126 | 09-25-2018 01:54 PM
| 1948 | 08-03-2018 09:47 AM
08-01-2018
07:05 PM
@Josh Elser Thank you for your support. I updated my question with some additional information to answer a few of the questions. Where can I find information in the logs? Can you tell me which file(s) on which server (HBase Master, HBase RegionServer) are helpful? I will do some more "benchmarking" and log searching tomorrow. Any hints or assumptions yet?
08-01-2018
02:13 PM
Formatting of code is not saved, sorry!
08-01-2018
02:11 PM
I'm playing around with HBase tables that are managed by Hive. For that I run the following commands in Zeppelin:

%hbase
# Create table with 1 CF, pre-split with 10 split points
create 'my_test', {NAME => 'cf1', VERSIONS => 3}, {SPLITS => ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09']}

After creating the HBase table, I create an external Hive table with the HBaseStorageHandler:

%hive
-- CREATE AN EXTERNAL HIVE TABLE ON TOP OF THE HBASE TABLE
CREATE EXTERNAL TABLE dmueller.my_test(
  key String,
  hashvalue int,
  valuelist String
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:hashValue,cf1:valueList")
TBLPROPERTIES("hbase.table.name" = "my_test", "hbase.mapred.output.outputtable" = "my_test")

Then I fill the HBase table by reading data (1000 rows) from another Hive table into this external Hive table:

%hive
-- INSERT DATA INTO THE HBASE-BACKED TABLE
INSERT OVERWRITE TABLE dmueller.my_test
SELECT concat_ws("_", testname, lotnrc, testnumber, teststufe) as key, hashvalue, valuelist
FROM dmueller.hivetable
LIMIT 1000

The problem is that the INSERT statement is far too slow: it takes around 4 minutes to insert the 1000 rows (6 columns read per row). dmueller.hivetable is a partitioned ORC table with around 60 million rows and ~500 GB of ORC files in HDFS. How can this INSERT statement be made faster? What am I doing wrong?

Update - some more information:
- I'm using Hive with Tez.
- Writing less data takes less time, but even a single row already takes ~30 sec, so a large part of the runtime seems to be fixed overhead rather than proportional to the row count.
- Reading the same data (e.g. the 1000 rows) directly via the Zeppelin Hive (JDBC) interpreter takes ~2 sec.
- Re-running the same INSERT statement (e.g. with 1000 rows) always takes roughly the same time (+/- a few seconds, of course).
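For comparison, here is a rough sketch (an editorial addition, not part of the original question) that writes 1000 dummy rows straight through the HBase client API from a %spark paragraph, assuming the HBase client jars and hbase-site.xml are available to the interpreter; the row keys and values are made up. If this finishes in seconds, the 4 minutes are mostly Hive/Tez job overhead rather than the HBase write path itself:

%spark
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Connect using the cluster's hbase-site.xml
val hbaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hbaseConf)
val table = connection.getTable(TableName.valueOf("my_test"))

// 1000 dummy rows; the two-digit prefix matches the 00..09 split points above
val puts = (1 to 1000).map { i =>
  val put = new Put(Bytes.toBytes(f"${i % 10}%02d_testrow_$i"))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("hashValue"), Bytes.toBytes(i))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("valueList"), Bytes.toBytes(s"value_$i"))
  put
}
table.put(puts.asJava)   // single batched client-side write

table.close()
connection.close()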
Labels:
- Apache HBase
- Apache Hive
07-18-2018
08:59 AM
I created an ORC table in Hive (stored in the HDFS path /apps/hive/warehouse/mydb.db/mytable). As I sometimes need to add a few rows manually, I call the INSERT statement. This creates many small files in the table directory in HDFS, which is the expected behavior. Then I run the command ALTER TABLE mydb.mytable CONCATENATE; to merge these small files into bigger ones. What I'm observing is that sometimes all small files are merged into one big file (~80 MB), and sometimes the big file is left alongside some small files (a few KB each) that do not seem to get merged. Is this normal behavior of the CONCATENATE command? Is there a way to influence this behavior (to avoid being left with these small files after the CONCATENATE command)? Thank you!
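As a quick way to see what actually remains on disk (an editorial sketch, run e.g. from a %spark paragraph with the cluster's Hadoop configuration; the path is the one from the question), the table directory can be listed with file sizes, so the small files that survive the CONCATENATE are easy to spot:

%spark
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List every file in the ORC table directory together with its size
val fs = FileSystem.get(new Configuration())
val tableDir = new Path("/apps/hive/warehouse/mydb.db/mytable")
fs.listStatus(tableDir)
  .filter(_.isFile)
  .foreach(s => println(f"${s.getLen}%12d bytes   ${s.getPath.getName}"))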
Labels:
- Apache Hive
07-10-2018
01:54 PM
Yes, I'm familiar with Spark. What I wondered about was the caching behavior. Spark really seems to know which HiveQL statement the cached data belongs to, and re-uses it automatically when the same query comes in:
val df1_1 = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
val df1_2 = df1_1.cache()
df1_2.count()
// This re-uses the cached object, as the request is the same as before => very fast!
val df2_1 = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
val df2_2= df2_1.cache()
df2_2.count()
// This caches the data, because the request is different (another limit clause) => takes some time!
val df3_1 = sqlContext.sql("SELECT a, b FROM db.table limit 10")
val df3_2= df3_1.cache()
df3_2.count()
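One way to confirm this re-use (an editorial note, not part of the original reply): when a matching cached plan exists, the physical plan of a newly created DataFrame contains an InMemoryTableScan over an InMemoryRelation instead of a fresh Hive table scan, which explain makes visible:

// Same query text as df2_1 above; the printed plan should contain InMemoryTableScan
val df2_explained = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
df2_explained.explain(true)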
Thanks for your help @Felix Albani
07-10-2018
01:04 PM
Thank you for the fast answer. The problem seems to be the cache method, as it always takes long in the first iteration and is faster in all following iterations. This behavior is independent of which paragraphs I run before... It's clear to me that the first action launches the application (and therefore starts the executors etc.). But it's not clear to me why Spark "knows" that it can re-use the cached dataframe from the previous iterations, because I overwrite the variables (or even use different ones).
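A minimal sketch of why the variable names don't matter (an editorial addition; the table and column names are only placeholders): the cache is keyed by the analyzed query plan, so any DataFrame that resolves to the same plan picks up the cached data, and dropping the cache through either reference affects both:

val first = sqlContext.sql("SELECT a, b FROM db.table limit 1000000").cache()
first.count()      // slow: materializes the cached blocks

val second = sqlContext.sql("SELECT a, b FROM db.table limit 1000000")
second.count()     // fast: identical plan, the cached data is re-used

first.unpersist()  // drop the cache via the first reference
second.count()     // slow again: the shared cache entry is gone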
07-10-2018
12:34 PM
I wrote a small Zeppelin paragraph that experiments with caching (reading Hive tables). Here is my code:

%spark
// Caching test #4: df.cache with 10 partitions
val input1 = sqlContext.sql("SELECT * FROM db.table").repartition(10)
val input2 = input1.cache()
input2.count()
The first time I run this paragraph it takes about 10 minutes to finish. When I run it a few more times, it always needs between 0.5 and 1 second. I added another paragraph to "initialize" the Spark interpreter after restarting it (I want to avoid these differing run-times for the paragraph above):

%spark
sc.version
sqlContext.sql("select * from db.table limit 1").show()
This paragraph also needs more time in the first iteration and less in the following ones. BUT: the main paragraph (with the caching test) still takes that long in its first run! Am I doing something wrong here? Is Zeppelin re-using the cached elements in the later iterations (even though I "overwrite" the objects by reading the whole table again before calling the cache method)? Is there a difference between the two transformations df.cache() and sqlContext.cacheTable(name)? Thanks for your help!
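On the last question, a hedged sketch (editorial addition, table name taken from the paragraph above): sqlContext.cacheTable registers the whole table by name in Spark SQL's shared cache, while df.cache() registers the result of one specific query plan; both are materialized by the first action that scans them, and both can be dropped explicitly:

%spark
// Cache the table by name: later queries that scan db.table read the cached data
sqlContext.cacheTable("db.table")
sqlContext.sql("SELECT * FROM db.table").count()   // materializes the cache
println(sqlContext.isCached("db.table"))           // true
sqlContext.uncacheTable("db.table")                // frees the cached blocks

// Cache one specific query result (here: the repartitioned full scan)
val df = sqlContext.sql("SELECT * FROM db.table").repartition(10).cache()
df.count()                                          // materializes this plan's cache
df.unpersist()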
04-26-2018
03:06 PM
Here is my workflow so far:
04-26-2018
03:04 PM
I also tried to "generate" the XML with the GenerateFlowFile processor, but I still get the same problem (I thought it might have something to do with the XML I read in, but that does not seem to be the case).
04-26-2018
02:51 PM
Thanks for your fast answer! I checked the settings and there are definitely no upper/lower case problems. I just saw that the NiFi version is 1.4, not 1.5. Is there a problem with this processor in that version?