Member since: 12-21-2016
Posts: 83
Kudos Received: 5
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 42957 | 02-08-2017 05:56 AM
 | 6311 | 01-02-2017 11:05 PM
12-21-2020
09:01 AM
We have some background on schema evolution in Parquet in the docs: https://docs.cloudera.com/runtime/7.2.2/impala-reference/topics/impala-parquet.html (see "Schema Evolution for Parquet Tables"). Some of the details are specific to Impala, but the concepts are the same across engines that use Parquet tables, including Hive and Spark. At a high level, you can think of the data files as immutable while the table schema evolves. If you add a new column at the end of the table, for example, that updates the table schema but leaves the Parquet files unchanged. When the table is queried, the table schema and the Parquet file schema are reconciled, and the new column's values will all be NULL. If you want to modify the existing rows and include new non-NULL values, that requires rewriting the data, e.g. with an INSERT OVERWRITE statement for a partition or a CREATE TABLE .. AS SELECT to create an entirely new table. Keep in mind that traditional Parquet tables are not optimized for workloads with updates: Apache Kudu in particular, and also transactional tables in Hive 3+, support row-level updates more conveniently and efficiently. We definitely don't require rewriting the whole table every time you want to add a column; that would be impractical for large tables!
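As a minimal sketch of that workflow (the table, column, and partition names below are hypothetical, just for illustration):

-- Adds the column to the table schema only; the existing Parquet files are untouched,
-- so the new column reads as NULL for all existing rows.
ALTER TABLE sales ADD COLUMNS (discount DECIMAL(5,2));

-- To backfill non-NULL values, rewrite the affected data, e.g. one partition at a time:
INSERT OVERWRITE TABLE sales PARTITION (year=2020)
SELECT id, amount, discount
FROM sales_staging
WHERE year = 2020;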
04-24-2020
09:32 AM
Traceback (most recent call last):
  File "consumer.py", line 8, in <module>
    consumer = KafkaConsumer('test',
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/consumer/group.py", line 355, in __init__
    self._client = KafkaClient(metrics=self._metrics, **self.config)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/client_async.py", line 242, in __init__
    self.config['api_version'] = self.check_version(timeout=check_timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/client_async.py", line 907, in check_version
    version = conn.check_version(timeout=remaining, strict=strict, topics=list(self.config['bootstrap_topics_filter']))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 1228, in check_version
    if not self.connect_blocking(timeout_at - time.time()):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 337, in connect_blocking
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 426, in connect
    if self._try_handshake():
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/kafka/conn.py", line 505, in _try_handshake
    self._sock.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1108)

I am getting the above error after running the program. Any inputs?
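The error indicates that the client could not verify the broker's TLS certificate because it is signed by a private/self-signed CA. One common way to address this with kafka-python is to point the client at the cluster's CA certificate rather than disabling verification. A minimal sketch, assuming the CA bundle has been exported to a local file (the broker address, Kerberos service name, and file path below are placeholders):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'test',
    bootstrap_servers='XXX.ORG:XXXX',       # placeholder broker:port
    security_protocol='SASL_SSL',
    sasl_mechanism='GSSAPI',
    sasl_kerberos_service_name='kafka',     # assumption: the broker's Kerberos service name
    ssl_cafile='/path/to/cluster-ca.pem',   # CA that signed the broker certificates
    ssl_check_hostname=True,
)

for message in consumer:
    print(message.topic, message.offset, message.value)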
04-23-2020
10:35 AM
WebHDFS is disabled for our cluster. Are there any other options?
04-20-2020
03:17 PM
Hi, I am trying to connect from my local machine to a Kerberized Kafka cluster through Python as a Python client. Could you please let me know which properties to include along with the bootstrap server?

consumer = KafkaConsumer(
    'test',
    bootstrap_servers='XXX.ORG:XXXX',
    # client_id='kafka-python-' + __version__,
    request_timeout_ms=30000,
    connections_max_idle_ms=9 * 60 * 1000,
    reconnect_backoff_ms=50,
    reconnect_backoff_max_ms=1000,
    max_in_flight_requests_per_connection=5,
    receive_buffer_bytes=None,
    send_buffer_bytes=None,
    # socket_options=[(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)],
    sock_chunk_bytes=4096,          # undocumented experimental option
    sock_chunk_buffer_count=1000,   # undocumented experimental option
    retry_backoff_ms=100,
    metadata_max_age_ms=300000,
    security_protocol='SASL_SSL',
    ssl_context=None,
    ssl_check_hostname=True,
    ssl_cafile=None,
    ssl_certfile=None,
    ssl_keyfile=None,
    ssl_password=None,
    ssl_crlfile=None,
    api_version=None,
    api_version_auto_timeout_ms=2000,
    # selector=selectors.DefaultSelector,
    sasl_mechanism='GSSAPI',
    # sasl_plain_username=None,
    # sasl_plain_password='XXXX',
    sasl_kerberos_service_name='XXXX',
    # metrics configs
    metric_reporters=[],
    metrics_num_samples=2,
    metrics_sample_window_ms=30000)

Your help is appreciated. Thanks
04-20-2020
03:07 PM
Hi All, I am trying to connect from my local machine to a Kafka cluster (Kerberized cluster) through Python. Can anyone help with the properties to specify for the krb5.conf file and the other properties? Your help is appreciated.
12-12-2019
10:22 PM
Could you try performing the "Validate Hive Metastore Schema" action from Cloudera Manager -> Hive service, then let us know if you are able to create the same table.
01-12-2018
09:18 AM
2 Kudos
You can use the PURGE option to delete the data files along with the partition metadata, but it works only on INTERNAL/MANAGED tables:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec PURGE;

External tables need a two-step process: drop the partition, then remove the files:

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec;
hadoop fs -rm -r <partition file path>
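For illustration, with a hypothetical date-partitioned table (the table names, partition value, and HDFS path below are made up):

-- Managed table: removes both the partition metadata and the data files
ALTER TABLE web_logs DROP IF EXISTS PARTITION (dt='2018-01-01') PURGE;

-- External table: drop the partition metadata, then remove the files yourself
ALTER TABLE ext_web_logs DROP IF EXISTS PARTITION (dt='2018-01-01');
hadoop fs -rm -r /data/ext_web_logs/dt=2018-01-01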
10-02-2017
06:58 PM
Hi Penta, Did it work? Actually I'm facing the same issue, and this is what I have used:

a1.sources.Twitter.consumerKey=XXX
a1.sources.Twitter.consumerSecret=XXX
a1.sources.Twitter.accessToken=XXX
a1.sources.Twitter.accessTokenSecret=XXX

I am trying to run the Flume agent in the Cloudera VM. Please advise if you or anyone knows the solution. Appreciate your suggestions/help!
07-21-2017
08:59 PM
1 Kudo
@PPR Reddy Here goes the solution (you can do it in other ways if you choose to). The column name in table a is c:

hive> select * from a;
OK
y
y
y
n
hive>

Query:

hive> select c, per from (
    >   select x.c c, (x.cc/y.ct)*100 per, (z.cn/y.ct)*100 pern
    >   from (select c, count(*) cc from a group by c) x,
    >        (select count(*) ct from a) y,
    >        (select c, count(*) cn from a where c='n' group by c) z
    > ) f
    > where pern > 20;

Output:
OK
n 25.0
y 75.0

Thanks
04-27-2017
06:24 PM
Replication protects against DataNode failure. When a user deletes data, it is removed from every node on which it is replicated, regardless of how many nodes hold a copy. The deleted data is moved into the trash, and if needed it can be recovered within a certain retention interval.
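For example, with trash enabled (the path below is a placeholder, and the retention window depends on the fs.trash.interval setting in core-site.xml):

hadoop fs -rm -r /user/data/mydir
# The directory is moved to the current user's trash, e.g. /user/<username>/.Trash/Current/user/data/mydir

hadoop fs -mv /user/<username>/.Trash/Current/user/data/mydir /user/data/mydir
# Restores the data, as long as the trash interval has not yet expired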