About cstanca

cstanca · ‎03-10-2017

@Mohammed El Moumni By default, a queue has a limit of 1 GB in size or 10,000 files. To change these settings, open "Configure" for that queue and go to the Settings tab. See the attached screenshot. If it helps, please vote/accept the response. It is also possible that another queue or processor downstream is stuck because of the same default limit. You have to increase the limit there and let the processor start processing, so that the amount in that queue goes down, before your own queue can start to drain. Imagine the whole flow as a river with all kinds of streams and obstructions...
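To make the "river" picture concrete, here is a minimal sketch of how a full downstream queue holds up the queue in front of it. It is a toy model, not NiFi code: `QueueModel` and `run_processor` are hypothetical names, and the only values taken from the answer above are the default limits of 10,000 files and 1 GB.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class QueueModel:
    """Toy model of a queue with the default limits mentioned above (hypothetical class)."""
    object_threshold: int = 10_000
    size_threshold_bytes: int = 1 * GB
    queued_objects: int = 0
    queued_bytes: int = 0

    def is_full(self) -> bool:
        # The queue counts as full once either limit is reached.
        return (self.queued_objects >= self.object_threshold
                or self.queued_bytes >= self.size_threshold_bytes)

    def offer(self, size_bytes: int) -> bool:
        # Simplification: refuse new items while the queue is full.
        if self.is_full():
            return False
        self.queued_objects += 1
        self.queued_bytes += size_bytes
        return True


def run_processor(source: QueueModel, dest: QueueModel,
                  batch: int, size_bytes: int) -> int:
    """Move up to `batch` equally sized items from source to dest; stop when dest is full."""
    moved = 0
    while moved < batch and source.queued_objects > 0 and dest.offer(size_bytes):
        source.queued_objects -= 1
        source.queued_bytes -= size_bytes
        moved += 1
    return moved


if __name__ == "__main__":
    upstream = QueueModel(queued_objects=5, queued_bytes=5 * 1024)
    downstream = QueueModel(object_threshold=2)  # deliberately tiny limit
    print(run_processor(upstream, downstream, batch=10, size_bytes=1024))
    downstream.object_threshold = 100            # raise the downstream limit
    print(run_processor(upstream, downstream, batch=10, size_bytes=1024))
```

In this model the upstream queue only drains after the downstream limit is raised, which is the situation described in the answer.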

cstanca · ‎03-08-2017

@Ram Ghase You are trying to follow a demo written for Spark 2.1, but your sandbox is at best at 2.0. You should follow tutorials that are supported by the version of Spark deployed on the HDP 2.5 sandbox, Spark 1.6.2. Spark 2.0 is also possible, but I would wait for the HDP 2.6 sandbox, which will probably be released next month. The error is self-explanatory. If you wish to address it, you could add the missing libraries.

cstanca · ‎03-08-2017

@elliot gimple Hive is not like a traditional RDBMS with regard to DML operations because of how Hive leverages HDFS to store data in files. Keep in mind that each partition has a file, each bucket adds another file, and so on. When you perform a DML action against a row, you practically overwrite a file rather than append to it. HDFS has been architected this way for good reasons.
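As a loose analogy only, the sketch below changes one row in a local CSV file by rewriting the whole file and swapping it in. It uses the local file system, not HDFS or Hive, and `update_row` is a hypothetical helper; the point is just that a single-row change costs a full file rewrite.

```python
import csv
import os
import tempfile


def update_row(path: str, key: str, new_value: str) -> int:
    """Change one row of a two-column CSV by rewriting the entire file.

    Returns the number of rows written, which is every row in the file,
    not just the one that changed.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    written = 0
    with os.fdopen(fd, "w", newline="") as tmp, open(path, newline="") as src:
        writer = csv.writer(tmp)
        for row_key, value in csv.reader(src):
            writer.writerow([row_key, new_value if row_key == key else value])
            written += 1
    # Replace the old file with the rewritten copy.
    os.replace(tmp_path, path)
    return written
```

Updating one row of a three-row file with this helper writes all three rows again.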

cstanca · ‎03-08-2017

@Subramaniyam KMV I assume you mean Mosquitto MQTT. Here is an example of installation on CentOS 7: https://www.digitalocean.com/community/tutorials/how-to-install-and-secure-the-mosquitto-mqtt-messaging-broker-on-centos-7 You can probably skip the "secure" part. This is not specific to HDP 2.5; for your case you can assume that the sandbox is just a CentOS VM. What matters is the OS and the availability of resources.

cstanca · ‎03-08-2017

@som You would have to be more explicit about the versions of Hive, Spark, etc., and also explain what "failing" means. Accessing Hive views via Hive context from Spark is no different from accessing tables. Anyhow, check the following: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_spark-hive-access.html https://hortonworks.com/hadoop-tutorial/using-hive-with-orc-from-apache-spark/ Hopefully this helps.

cstanca · ‎03-08-2017

@Ivan Majnaric HDP 2.6, probably next month

cstanca · ‎03-07-2017

It is common to allocate capacity to an interactive queue during the day, when business users are active, and to allocate capacity to a batch queue during the night, when batch workloads are frequently executed. To configure this scenario, schedule-based policies are used. Per the HDP Hive Performance Tuning Guide (8/29/2016), section 3.6.8, this is an alpha Apache feature. Can anyone elaborate on this feature? How would it be used to set up time-based queue capacity (steps, screenshots)? Is it actually available, and if not, when would it be?

cstanca · ‎03-07-2017

@CriCL Got you. I like to use RStudio connected to HDP and R Markdown to generate PDF files: http://rmarkdown.rstudio.com/pdf_document_format.html True, this is more of a tool for data science. On the other hand, any tool that you like and that can use ODBC or JDBC can connect to Hive, or to HBase via Phoenix. Try that approach.

cstanca · ‎03-07-2017

@Lior Hadaya The plan is driven by the CBO (cost-based optimizer) and the statistics collected on your tables. You may have the settings mentioned here set to true: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_performance_tuning/content/hive_perf_best_pract_use_col_stats_cost_base_opt.html As such, the behavior can change over time as the statistics change. You could also force stats collection on a specific table or even a column.
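A minimal sketch of what "settings set to true" and "forcing stats" could look like as HiveQL, generated here from Python. The helper `stats_statements` and the table and column names are hypothetical; the four property names and the `ANALYZE TABLE` syntax are standard Hive, but check them against the documentation for your version. The sketch assumes a non-partitioned table.

```python
from typing import List, Optional


def stats_statements(table: str, columns: Optional[List[str]] = None) -> List[str]:
    """Build HiveQL statements that enable the CBO and collect statistics.

    Assumes a non-partitioned table; a partitioned table would also need
    a PARTITION clause in the ANALYZE statements.
    """
    settings = [
        "hive.cbo.enable",
        "hive.compute.query.using.stats",
        "hive.stats.fetch.column.stats",
        "hive.stats.fetch.partition.stats",
    ]
    statements = [f"SET {name}=true;" for name in settings]
    # Table-level statistics.
    statements.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS;")
    if columns:
        # Column-level statistics for the listed columns only.
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {', '.join(columns)};"
        )
    return statements


if __name__ == "__main__":
    # "sales", "region" and "amount" are placeholder names.
    for statement in stats_statements("sales", ["region", "amount"]):
        print(statement)
```

The generated statements can be pasted into a Hive session (for example Beeline) and run before the query whose plan you are investigating.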

cstanca · ‎03-07-2017

@Sree Kupp The 15 additional seconds are spent outside of Hive. As such, the changes you make to Tez settings will only impact (positively or negatively) the Hive query; they will not address your 15-second lag outside Hive. Your delay is a combination of network latency, ODBC, and Power BI rendering the result in the UI. The focus for tuning should be on tuning ODBC to chunk the data better and increase throughput, addressing the network latency (if any), and caching on the Power BI side. There are a few things that you could do on the Hive side too, but it is a long shot to guess what you did and what you did not. For example: use the ORC format, use Interactive Query (LLAP), increase the use of caching for map-side joins, and improve parallelism by setting a few parameters that allow better chunking of the data.

Last Visited	‎03-22-2019 03:12 AM

Member Since	‎03-16-2016 04:06 PM
Posts	707
Kudos received	1728

Cloudera Community

Re: 5th attempt at getting an answer to this quest...

Re: Trying to reinstall Apache NiFi 1.5 on HDF 3.1

Re: Is it mandatory that we should have exact moun...

Re: Alternate to smartsense

Re: Tracking of Hive tables metadata changes in re...

Re: Issue with Nifi Merge Content : Files stay in ...

Re: : java.lang.ClassNotFoundException: Failed to ...

Re: Appending to hive table gives an error but ove...

Re: mosquitto installation on hdp2.5

Re: How to read hive views using spark with hiveco...

Re: Kafka upgrade on 0.10.1.0+

How to setup time-based queue capacity?

Re: Reporting Tool

Re: How does hive decide on the insert query plan

Re: Time difference between Query results from Hiv...