Member since: 09-29-2021
14 Posts
1 Kudos Received
0 Solutions
10-02-2023
09:23 AM
Hi, thank you for the answers. @cravani unfortunately Impala is not used in our environment. As for pyODBC, @mszurap, it sounds like the best option to adopt. We will work with it and update you soon. Best regards, Andrea
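For reference, this is the kind of minimal pyODBC connection we plan to try (the DSN name "HiveDSN" is just a placeholder; it would have to be defined in odbc.ini against the Cloudera Hive ODBC driver with Kerberos authentication):

import pyodbc

# connect through an ODBC DSN; autocommit is enabled because Hive does not support transactions
conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchall())
conn.close()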
09-29-2023
07:14 AM
Hi, in a Unix environment I get the following error when connecting to Hive via Kerberos, using the pyHive library in a Python script:

thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-1) SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Server krbtgt/LOCAL.IT@EXAMPLE.IT not found in Kerberos database)'

I am able to authenticate to Kerberos with "kinit -kt user user.keytab", and I can also connect through the Hive ODBC driver. I use the same krb5.conf file, with default realm EXAMPLE.IT. With kinit, I correctly obtain:

Default principal: user@EXAMPLE.IT
Valid starting       Expires              Service principal
09/28/23 11:05:16    09/28/23 11:05:16    krbtgt/EXAMPLE.IT@EXAMPLE.IT

The error occurs only with the pyHive library: in the error message it uses the realm LOCAL.IT instead of the one specified in krb5.conf, which is EXAMPLE.IT. My pyHive connection:

conn = hive.Connection(host="host.domain.it",
                       port=10000,
                       auth="KERBEROS",
                       database="db_123",
                       kerberos_service_name="hive")

Note that LOCAL.IT corresponds to domain.it. Can you help me? Thank you
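In case it helps to frame the question: since the failing principal is krbtgt/LOCAL.IT@EXAMPLE.IT, it looks as if the client maps host.domain.it to the realm LOCAL.IT instead of EXAMPLE.IT. I am wondering whether an explicit mapping along these lines in krb5.conf would be the right direction (the values below are only illustrative, not our exact configuration):

[libdefaults]
  default_realm = EXAMPLE.IT
  rdns = false
  dns_canonicalize_hostname = false

[domain_realm]
  .domain.it = EXAMPLE.IT
  domain.it = EXAMPLE.IT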
Labels:
- Apache Hive
- Kerberos
06-08-2022
03:50 AM
Hi @mszurap, I agree with you about these numbers. Even if 60-100 TB is a large amount of data, the total number of blocks involved is not that high (close to 600k) compared to what each DataNode holds. Each DataNode reports about 9 million blocks, and we found that the problem is related to other directories that contain small files, where the block size is about 2-3 MB. Even though the total size of those directories is not that large, we expect the block count to decrease much more significantly once they are cleaned up. In short, we are facing a small-files problem, which drives the high number of blocks; the directory we deleted had large blocks, which is why the decrease in blocks was barely noticeable. Thank you for the support in the analysis!
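For example, directories with many small files show up clearly in a count report (the path below is only illustrative): a high FILE_COUNT together with a comparatively small CONTENT_SIZE points to the small-files issue.

hdfs dfs -count -v /warehouse/some_small_files_dir
# output columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME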
06-06-2022
05:07 AM
Hi, I'm still analyzing the output: the "fsck" command on the path where the deletions were made now reports just 1 block. Looking at the attached chart, you can see that on May 19th a lot of data (60 TB) was removed from HDFS, and the number of blocks decreased only on a single DataNode (bda1node02), by roughly 600,000 blocks (1 block = 256 MB). On the other DataNodes, the block count remained the same (or increased slightly).
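For completeness, this is roughly how the per-path block count was obtained (the path is illustrative); the fsck summary reports the "Total blocks (validated)" figure and the average block size for that subtree:

hdfs fsck /path/where/data/was/deleted -files -blocks | tail -n 20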
06-01-2022
03:04 AM
1 Kudo
Hi, thank you for the replies. @mszurap no upgrade has been made recently, and there are no pending steps. @Shelton the deleted files were kept in the Trash, but after 24h they were permanently removed from there as well. On the HDFS side, the used capacity has decreased, but the number of blocks is still high (and does not change). Thank you again
05-31-2022
02:09 AM
Hi Miklos, sorry for the typo. I have executed the command hdfs dfs -ls /snapshottable_path/.snapshot and got no output for that directory. The "du" commands ("du -x -h" and "du -h") report the same size. When I click on the block count alert on the HDFS service, I can see the number of blocks, which does not decrease: the DataNode has 8,743,931 blocks, against a critical threshold of 8,000,000 block(s). Thank you again.
05-31-2022
12:38 AM
Hi Miklos, thank you for the detailed answer. I found that the parent of the directory I removed has snapshots enabled, but there are no snapshots in it. The command hdfs dfs -du -x -h -v -s /snapshottable_path returns no lines, and the output of "du" without -x is the same. Should I disable snapshots on the parent directory? Is there any other configuration I should apply? Thank you again.
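For reference, these are the commands I would plan to use to double-check and, if appropriate, disable snapshots on the parent (to be run with HDFS superuser privileges; as far as I know, disallowing snapshots only works once no snapshots exist):

hdfs lsSnapshottableDir                                  # list directories with snapshots enabled
hdfs dfs -ls /snapshottable_path/.snapshot               # list existing snapshots of the parent
hdfs dfsadmin -disallowSnapshot /snapshottable_path      # disable snapshot support on the directory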
05-30-2022
08:36 AM
Hi, after deleting terabytes of data from HDFS (about 1/4 of the total capacity), the block count on the DataNodes did not decrease as expected. It is still over the critical threshold. How can this be resolved? Thank you
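For context, the block count I am referring to is the cluster-wide figure reported by fsck (and mirrored by the Cloudera Manager block count chart and alert):

hdfs fsck / | grep -i "total blocks"
# the fsck summary includes a "Total blocks (validated)" line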
Labels:
- HDFS
05-19-2022
11:32 PM
Hi Alex, thank you for confirming that. I'll proceed as you suggest. Regards, Andrea
05-16-2022
03:09 AM
Hi, is there any documentation on how to install R and SparkR on a gateway node of a DataHub? I have CM 7.5.2 and a CDP Public Cloud subscription. Spark versions currently configured: Spark 2.4.8 and Spark 3.1.2. Thank you, Andrea
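To give an idea of what I have in mind (only a rough sketch: the package name and parcel path below are assumptions for a RHEL/CentOS gateway node with the standard CDH parcel layout, not something I have verified on DataHub):

sudo yum install -y R                                    # install R from the OS repositories (assumed package name)
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark    # assumed Spark 2 parcel location
$SPARK_HOME/bin/sparkR                                   # launches an R shell with the SparkR package preloaded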
Labels:
- Cloudera Data Platform (CDP)