Member since: 12-21-2016
Posts: 83
Kudos Received: 5
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 42957 | 02-08-2017 05:56 AM
 | 6311 | 01-02-2017 11:05 PM
12-21-2020
09:01 AM
We have some background on schema evolution in Parquet in the docs: https://docs.cloudera.com/runtime/7.2.2/impala-reference/topics/impala-parquet.html (see "Schema Evolution for Parquet Tables"). Some of the details are specific to Impala, but the concepts are the same across engines that use Parquet tables, including Hive and Spark. At a high level, you can think of the data files as immutable while the table schema evolves. If you add a new column at the end of the table, for example, that updates the table schema but leaves the Parquet files unchanged. When the table is queried, the table schema and the Parquet file schema are reconciled, and the new column's values will all be NULL. If you want to modify the existing rows and include new non-NULL values, that requires rewriting the data, e.g. with an INSERT OVERWRITE statement for a partition or a CREATE TABLE .. AS SELECT to create an entirely new table. Keep in mind that traditional Parquet tables are not optimized for workloads with updates: Apache Kudu in particular, and also transactional tables in Hive 3+, support row-level updates more conveniently and efficiently. We definitely don't require rewriting the whole table every time you want to add a column; that would be impractical for large tables!
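As a minimal sketch of that workflow (the table, column, and partition names below are hypothetical, just for illustration):

-- Adds the column to the table schema only; the existing Parquet files are untouched,
-- so the new column reads as NULL for all existing rows.
ALTER TABLE sales ADD COLUMNS (discount DECIMAL(5,2));

-- To backfill non-NULL values, rewrite the affected data, e.g. one partition at a time:
INSERT OVERWRITE TABLE sales PARTITION (year=2020)
SELECT id, amount, discount
FROM sales_staging
WHERE year = 2020;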
04-24-2020
09:32 AM
Traceback (most recent call last):
  File "consumer.py", line 8, in <module>
    consumer = KafkaConsumer('test',
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/consumer/group.py", line 355, in __init__
    self._client = KafkaClient(metrics=self._metrics, **self.config)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/client_async.py", line 242, in __init__
    self.config['api_version'] = self.check_version(timeout=check_timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/client_async.py", line 907, in check_version
    version = conn.check_version(timeout=remaining, strict=strict, topics=list(self.config['bootstrap_topics_filter']))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 1228, in check_version
    if not self.connect_blocking(timeout_at - time.time()):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 337, in connect_blocking
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 426, in connect
    if self._try_handshake():
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 505, in _try_handshake
    self._sock.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1108)

I am getting the above error after running the program. Any inputs?
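The error indicates that the client could not verify the broker's TLS certificate because it is signed by a private/self-signed CA. One common way to address this with kafka-python is to point the client at the cluster's CA certificate rather than disabling verification. A minimal sketch, assuming the CA bundle has been exported to a local file (the broker address, Kerberos service name, and file path below are placeholders):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'test',
    bootstrap_servers='XXX.ORG:XXXX',       # placeholder broker:port
    security_protocol='SASL_SSL',
    sasl_mechanism='GSSAPI',
    sasl_kerberos_service_name='kafka',     # assumption: the broker's Kerberos service name
    ssl_cafile='/path/to/cluster-ca.pem',   # CA that signed the broker certificates
    ssl_check_hostname=True,
)

for message in consumer:
    print(message.topic, message.offset, message.value)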
04-23-2020
10:35 AM
WebHDFS is disabled for our cluster. Are there any other options?
04-20-2020
03:17 PM
Hi, I am trying to connect from my local machine to a Kerberized Kafka cluster through Python as a Python client. Could you please let me know which properties to include along with the bootstrap server?

consumer = KafkaConsumer(
    'test',
    bootstrap_servers='XXX.ORG:XXXX',
    # client_id='kafka-python-' + __version__,
    request_timeout_ms=30000,
    connections_max_idle_ms=9 * 60 * 1000,
    reconnect_backoff_ms=50,
    reconnect_backoff_max_ms=1000,
    max_in_flight_requests_per_connection=5,
    receive_buffer_bytes=None,
    send_buffer_bytes=None,
    # socket_options=[(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)],
    sock_chunk_bytes=4096,          # undocumented experimental option
    sock_chunk_buffer_count=1000,   # undocumented experimental option
    retry_backoff_ms=100,
    metadata_max_age_ms=300000,
    security_protocol='SASL_SSL',
    ssl_context=None,
    ssl_check_hostname=True,
    ssl_cafile=None,
    ssl_certfile=None,
    ssl_keyfile=None,
    ssl_password=None,
    ssl_crlfile=None,
    api_version=None,
    api_version_auto_timeout_ms=2000,
    # selector=selectors.DefaultSelector,
    sasl_mechanism='GSSAPI',
    # sasl_plain_username=None,
    # sasl_plain_password='XXXX',
    sasl_kerberos_service_name='XXXX',
    # metrics configs
    metric_reporters=[],
    metrics_num_samples=2,
    metrics_sample_window_ms=30000)

Your help is appreciated. Thanks
04-20-2020
03:07 PM
Hi All, I am trying to connect from my local machine to a Kafka cluster (Kerberized cluster) through Python. Can anyone help with the properties to specify for the krb5.conf file and the other properties? Your help is appreciated.
12-12-2019
10:22 PM
Could you try performing the "Validate Hive Metastore Schema" action from Cloudera Manager -> Hive service, then let us know if you are able to create the same table.
01-12-2018
09:18 AM
2 Kudos
You can use the PURGE option to delete the data files along with the partition metadata, but it works only on INTERNAL/MANAGED tables:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec PURGE;

External tables need a two-step process: drop the partition, then remove the files:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec;
hadoop fs -rm -r <partition file path>
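For illustration, with a hypothetical date-partitioned table (the table names, partition value, and HDFS path below are made up):

-- Managed table: removes both the partition metadata and the data files
ALTER TABLE web_logs DROP IF EXISTS PARTITION (dt='2018-01-01') PURGE;

-- External table: drop the partition metadata, then remove the files yourself
ALTER TABLE ext_web_logs DROP IF EXISTS PARTITION (dt='2018-01-01');
hadoop fs -rm -r /data/ext_web_logs/dt=2018-01-01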
10-02-2017
06:58 PM
Hi Penta, Did it work? Actually I'm facing the same issue, and this is what I have used:

a1.sources.Twitter.consumerKey=XXX
a1.sources.Twitter.consumerSecret=XXX
a1.sources.Twitter.accessToken=XXX
a1.sources.Twitter.accessTokenSecret=XXX

I am trying to run the Flume agent in the Cloudera VM. Please advise if you or anyone knows the solution. Appreciate your suggestions/help!
07-21-2017
08:59 PM
1 Kudo
@PPR Reddy Here goes the solution (you can do it in other ways if you choose to). The column name in table a is c:

hive> select * from a;
OK
y
y
y
n
hive>

Query:

hive> select c, per from (
    >   select x.c c, (x.cc/y.ct)*100 per, (z.cn/y.ct)*100 pern
    >   from (select c, count(*) cc from a group by c) x,
    >        (select count(*) ct from a) y,
    >        (select c, count(*) cn from a where c='n' group by c) z
    > ) f
    > where pern > 20;

Output:
OK
n 25.0
y 75.0

Thanks
04-27-2017
06:24 PM
Replication protects against DataNode failure. When a user deletes data, it is removed from every node on which it is replicated, regardless of how many nodes hold a copy. The deleted data is moved into the trash, and if needed it can be recovered within a certain retention interval.
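For example, with trash enabled (the path below is a placeholder, and the retention window depends on the fs.trash.interval setting in core-site.xml):

hadoop fs -rm -r /user/data/mydir
# The directory is moved to the current user's trash, e.g. /user/<username>/.Trash/Current/user/data/mydir

hadoop fs -mv /user/<username>/.Trash/Current/user/data/mydir /user/data/mydir
# Restores the data, as long as the trash interval has not yet expired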