Member since: 02-08-2015
Posts: 386
Kudos Received: 11
Solutions: 3
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1008 | 07-20-2023 02:34 PM |
| | 9784 | 02-08-2017 01:38 PM |
| | 5091 | 06-17-2016 02:56 PM |
08-04-2023 08:11 AM
Hi @archer2012 , that error output doesn't give us a lot of information about what went wrong, but it looks like the connection wasn't successful. I recommend using Beeline in verbose mode from a command line on a suitable node to troubleshoot the connection independently. Once you have the connection to Impala working separately, then you can come back to Airflow and use the working connection settings.
07-20-2023 02:34 PM
1 Kudo
Hi @KienKim ! Impala has a lot of different configuration options, and increasing concurrency is a broad topic to tackle. If you haven't already, I recommend consulting the documentation for the version of CDP you're using. If you get stuck on a particular configuration property, then providing those specifics here would be a good place to start.
03-27-2023 01:40 PM
Closing the connection while the query is still executing is generally not good practice. Think of it as taking ownership of the query you've executed, sort of like turning the lights off before you leave a room: if you're going to close the session (leave the room), you first need to cancel the query that's consuming resources (turn off the lights you turned on). The normal expectation is that an executing query has an associated active session.
03-21-2023 09:54 AM
@hqbhoho , if the query is executing, it probably makes sense to cancel it before you try to call close.
10-31-2022 12:41 PM
Hi @Emiller ! One thing I notice right away about the format string you're trying to use is that "month" works with a full month name rather than an abbreviation. That is, you have "JAN" in your source data, but the "month" format string would work with "January". I suspect you'll have more luck with something like "DDmonRR". I recommend consulting the documentation on date casts if you need additional help.
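For instance, here's a minimal sketch, assuming the CDP Impala CAST ... FORMAT syntax and a made-up input value (your column or literal will differ):

-- "mon" matches an abbreviated month name such as JAN;
-- "month" would expect the full name, such as January.
SELECT CAST('31OCT22' AS DATE FORMAT 'DDmonRR');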
06-28-2022 09:34 AM
Hi @data_diver! To start with, in CDP Public Cloud you write all your data to the cloud storage service for your platform (such as S3 or ADLS), and after doing that you can read it from a data hub cluster.

Regarding your question about writing a DataFrame from Python, I want to start by clarifying a couple of points. You want to write a DataFrame, which is a Spark object, from Python, but without using PySpark, which is the framework that allows Python to interact with Spark objects such as DataFrames. Is all that correct?

Perhaps you can start by giving us a bit of context. Why do you want to write a DataFrame without using PySpark? How will the DataFrame object exist in your Python program without PySpark in the first place? Any context you can provide for your use case would be helpful.
02-16-2017 01:12 PM
Thanks for asking about this. The max message size for the Hive Metastore should be set to 10% of the Metastore server heap size, up to a maximum of 2,147,483,647 bytes (the 32-bit signed integer limit). For example, with a 12 GiB Metastore heap, that works out to roughly 1.2 GiB, well under the cap. Unfortunately, the values used or displayed by that configuration validator may be incorrect in some cases. Until that's fixed, I recommend checking the actual HMS heap size and configuring the max message size accordingly.
02-08-2017 01:38 PM
1 Kudo
Here's what the most recent version of the CDH Hive documentation says about this:
http://www.cloudera.com/documentation/enterprise/latest/topics/hive.html#hive_transaction_support
"Transaction (ACID) Support in Hive
The CDH distribution of Hive does not support transactions (HIVE-5317). Currently, transaction support in Hive is an experimental feature that only works with the ORC file format. Cloudera recommends using the Parquet file format, which works across many tools. Merge updates in Hive tables using existing functionality, including statements such as INSERT, INSERT OVERWRITE, and CREATE TABLE AS SELECT."
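As a rough sketch of that merge-with-existing-functionality approach (the table and column names here are hypothetical), you could materialize the merged result with CREATE TABLE AS SELECT:

-- Hypothetical tables: base holds current rows, updates holds changed or new rows.
-- The FULL OUTER JOIN keeps unchanged rows, applies updates, and picks up new rows.
CREATE TABLE base_merged AS
SELECT COALESCE(u.id, b.id) AS id,
       COALESCE(u.val, b.val) AS val
FROM base b
FULL OUTER JOIN updates u ON b.id = u.id;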
06-20-2016 06:59 AM
That bit was an attempt at formatting the post; it wasn't supposed to be part of the query, sorry about that. Try adding "as" after "create table tester" and before the nested select query.
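A minimal sketch of the corrected statement, with hypothetical columns and source table:

-- The key fix is the AS between the table name and the nested SELECT.
CREATE TABLE tester AS
SELECT col1, col2
FROM source_table;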
06-17-2016 02:56 PM
2 Kudos
Hi Lucille, for the example you provided, you could get the file names with a query like this:

SELECT hive_magnum.col1,
       hive_magnum.col2,
       hive_magnum.col3,
       hive_magnum.INPUT__FILE__NAME
FROM hive_magnum;

It will actually provide the full HDFS location, which includes the file name. I hope this is helpful.
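If you only want the file name itself rather than the full path, a variant like this should work (a sketch, assuming Hive's regexp_extract function is available in your version):

-- '[^/]+$' matches everything after the last slash, i.e. the bare file name.
SELECT regexp_extract(hive_magnum.INPUT__FILE__NAME, '[^/]+$', 0) AS file_name
FROM hive_magnum;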