Member since
01-19-2017
3676
Posts
632
Kudos Received
372
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 610 | 06-04-2025 11:36 PM |
| | 1177 | 03-23-2025 05:23 AM |
| | 584 | 03-17-2025 10:18 AM |
| | 2186 | 03-05-2025 01:34 PM |
| | 1375 | 03-03-2025 01:09 PM |
01-05-2020
08:40 PM
I'm sorry for not answering earlier; I only now got access to the environment again. I applied solution one and the agent worked, but after a while it went down, and I couldn't figure out why because the logs didn't say anything. While it was up, I tried to start services through Ambari, and it looks like it doesn't initiate a request, since there are no logs about the start action of that service. I then decided to clean everything and start fresh: I deleted ambari-agent, cleaned all the folders as mentioned, and installed it again. The same issue persists when I try to start services. For example, this is the ambari-agent log from when I tried to start the SNameNode, which is hosted on the problematic node:

INFO 2020-01-06 07:35:39,472 __init__.py:82 - Event from server at /user/commands: {u'clusters': {u'2': {u'commands': [{u'commandParams': '...', u'clusterId': u'2', u'clusterName': u'dev', u'commandType': u'EXECUTION_COMMAND', u'roleCommand': u'START', u'serviceName': u'HDFS', u'role': u'SECONDARY_NAMENODE', u'requestId': 424, u'taskId': 7353, u'repositoryFile': '...', u'componentVersionMap': {u'HDFS': {u'SECONDARY_NAMENODE': u'3.1.0.0-78', u'JOURNALNODE': u'3.1.0.0-78', u'HDFS_CLIENT': u'3.1.0.0-78', u'DATANODE': u'3.1.0.0-78', u'NAMENODE': u'3.1.0.0-78', u'NFS_GATEWAY': u'3.1.0.0-78', u'ZKFC': u'3.1.0.0-78'}, u'ZOOKEEPER': {u'ZOOKEEPER_SERVER': u'3.1.0.0-78', u'ZOOKEEPER_CLIENT': u'3.1.0.0-78'}, u'SPARK2': {u'SPARK2_THRIFTSERVER': u'3.1.0.0-78', u'SPARK2_CLIENT': u'3.1.0.0-78', u'LIVY2_SERVER': u'3.1.0.0-78', u'SPARK2_JOBHISTORYSERVER': u'3.1.0.0-78'}, u'SQOOP': {u'SQOOP': u'3.1.0.0-78'}, u'HIVE': {u'HIVE_SERVER': u'3.1.0.0-78', u'HIVE_METASTORE': u'3.1.0.0-78', u'HIVE_SERVER_INTERACTIVE': u'3.1.0.0-78', u'HIVE_CLIENT': u'3.1.0.0-78'}, u'YARN': {u'YARN_REGISTRY_DNS': u'3.1.0.0-78', u'RESOURCEMANAGER': u'3.1.0.0-78', u'YARN_CLIENT': u'3.1.0.0-78', u'TIMELINE_READER': u'3.1.0.0-78', u'APP_TIMELINE_SERVER': u'3.1.0.0-78', u'NODEMANAGER': u'3.1.0.0-78'}, u'PIG': 
{u'PIG': u'3.1.0.0-78'}, u'RANGER': {u'RANGER_TAGSYNC': u'3.1.0.0-78', u'RANGER_ADMIN': u'3.1.0.0-78', u'RANGER_USERSYNC': u'3.1.0.0-78'}, u'TEZ': {u'TEZ_CLIENT': u'3.1.0.0-78'}, u'MAPREDUCE2': {u'MAPREDUCE2_CLIENT': u'3.1.0.0-78', u'HISTORYSERVER': u'3.1.0.0-78'}, u'ZEPPELIN': {u'ZEPPELIN_MASTER': u'3.1.0.0-78'}, u'HBASE': {u'HBASE_MASTER': u'3.1.0.0-78', u'PHOENIX_QUERY_SERVER': u'3.1.0.0-78', u'HBASE_CLIENT': u'3.1.0.0-78', u'HBASE_REGIONSERVER': u'3.1.0.0-78'}, u'KAFKA': {u'KAFKA_BROKER': u'3.1.0.0-78'}, u'KNOX': {u'KNOX_GATEWAY': u'3.1.0.0-78'}, u'RANGER_KMS': {u'RANGER_KMS_SERVER': u'3.1.0.0-78'}}, u'commandId': u'424-0'}]}}, u'requiredConfigTimestamp': 1578284883431}
INFO 2020-01-06 07:35:39,473 ActionQueue.py:79 - Adding EXECUTION_COMMAND for role SECONDARY_NAMENODE for service HDFS of cluster_id 2 to the queue
INFO 2020-01-06 07:35:39,473 security.py:135 - Event to server at /reports/responses (correlation_id=66): {'status': 'OK', 'messageId': '2'}
INFO 2020-01-06 07:35:39,475 __init__.py:82 - Event from server at /user/ (correlation_id=66): {u'status': u'OK'}
INFO 2020-01-06 07:35:40,595 security.py:135 - Event to server at /heartbeat (correlation_id=67): {'id': 44}
INFO 2020-01-06 07:35:40,597 __init__.py:82 - Event from server at /user/ (correlation_id=67): {u'status': u'OK', u'id': 45}
INFO 2020-01-06 07:35:42,317 ComponentStatusExecutor.py:107 - Skipping status command for INFRA_SOLR. Since command for it is running

Also, there are no logs under /var/log/hadoop/hdfs, which makes me think that the ambari-agent on the problematic node didn't actually initiate the call. I'm going to mark your answer as accepted since it solved the issue I originally asked about. Do you think I should create a new post for this?
01-05-2020
04:46 AM
@Shelton could you please repost the solution? I am facing a similar issue and have created a separate thread for it: https://community.cloudera.com/t5/Support-Questions/sqoop-import-of-BLOB-columns-from-oracle-database/m-p/286761#M212633 Thanks in advance
01-03-2020
10:03 PM
@Shaneg For Sqoop export, the parameter "--export-dir" is required; please refer to the doc below: https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_syntax_4 Export is designed to export HDFS data to an RDBMS, not Hive tables to an RDBMS. Hope that helps. Cheers Eric
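As a minimal sketch of how --export-dir fits into an export (the JDBC URL, table name, and HDFS path below are placeholders, not values from this thread):

```shell
# Hypothetical Sqoop export sketch: connection string, credentials, table,
# and HDFS path are placeholders. Note that --export-dir points at an HDFS
# directory; to export a Hive-managed table you would point it at the
# table's warehouse directory.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table daily_totals \
  --export-dir /user/hive/warehouse/daily_totals \
  --input-fields-terminated-by '\001'
```

The '\001' field terminator matches Hive's default delimiter for text-format managed tables; adjust it to whatever format the files in the export directory actually use.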
01-02-2020
10:57 AM
Hi Shelton, I did find a note about adding the partitions in another way. Are you aware of it? If so, do you see any issues with it? Regards ~Suresh D https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive-automate-msck.html

Automate partition discovery and repair

Hive automatically and periodically discovers discrepancies between partition metadata in the Hive metastore and the corresponding directories on the file system, and then performs synchronization. Automating this operation is especially helpful for log data or data in Spark and Hive catalogs.

The discover.partitions table property enables and disables synchronization of the file system with partitions. In external partitioned tables, this property is enabled (true) by default when you create the table using Hive in HDP 3.1.4 and later. For a legacy external table (created using an earlier version of Hive), add discover.partitions to the table properties to enable partition discovery. By default, the discovery and synchronization of partitions occurs every 5 minutes, but you can configure the frequency as shown in this task.

Assuming you have an external table created using a version of Hive that does not support partition discovery, enable partition discovery for the table:

ALTER TABLE exttbl SET TBLPROPERTIES ('discover.partitions' = 'true');

Set synchronization of partitions to occur every 10 minutes, expressed in seconds: in Ambari > Hive > Configs, set metastore.partition.management.task.frequency to 600.
01-01-2020
11:51 AM
@pra_big The hbase user is the admin user of HBase. You connect to a running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install. (The version information printed when you start the HBase shell is omitted here.) The HBase shell prompt ends with a > character.

As the hbase user:

$ ./bin/hbase shell
hbase(main):001:0>

All of the methods below will give you access to the HBase shell as the admin user [hbase].

If you have root access:
# su - hbase

If you have sudo privileges (this gives you the same result as above):
# sudo su hbase -l

I don't see the reason for changing to bash, or did I misunderstand your question?
01-01-2020
11:04 AM
@saivenkatg55 You didn't respond to this answer. Do you still need help, or was the issue resolved? If resolved, please accept the answer and close the thread.
12-29-2019
08:41 PM
@Cl0ck Each host's name is stored in CM's backend database with a UUID attached; please refer to the HOSTS table. Example below:

HOST_ID: 12
OPTIMISTIC_LOCK_VERSION: 148
HOST_IDENTIFIER: bfaf4b71-01e2-4157-b46f-d1c13566b69a
NAME: host-xxx-xxx.xxx
IP_ADDRESS: xx.xx.xx.xx
RACK_ID: /default
STATUS: NA
CONFIG_CONTAINER_ID: 1
MAINTENANCE_COUNT: 0
DECOMMISSION_COUNT: 0
CLUSTER_ID: 1
NUM_CORES: 1
TOTAL_PHYS_MEM_BYTES: 1929342976
PUBLIC_NAME: NULL
PUBLIC_IP_ADDRESS: NULL
CLOUD_PROVIDER: NULL

Here HOST_IDENTIFIER is the UUID, which is also stored under /var/lib/cloudera-scm-agent/uuid on each host. Maybe you can try updating the NAME field in this table and see if that helps? Cheers Eric
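As a hedged sketch of that update (the hostname and HOST_ID below are placeholders; the exact schema may vary by CM version, and you should stop Cloudera Manager and back up its backend database before editing it directly):

```sql
-- Hypothetical sketch: rename one host in CM's backend database.
-- Column names follow the example record above; the new name and
-- HOST_ID value are placeholders. Back up the database first.
UPDATE HOSTS
SET NAME = 'new-host-name.example.com'
WHERE HOST_ID = 12;
```

Afterwards, verify that /var/lib/cloudera-scm-agent/uuid on the host still matches the HOST_IDENTIFIER row, so the agent re-registers as the same host rather than creating a duplicate entry.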
12-28-2019
02:43 AM
@sheelstera There is a great YARN tuning spreadsheet here that will help you calculate your YARN settings correctly. It applies to YARN clusters only and describes how to tune and optimize YARN for your cluster. Please revert
12-27-2019
07:52 AM
@Prakashcit There is a Jira, https://issues.apache.org/jira/browse/HIVE-16575, last updated on 05/Dec/19. Hive does not enforce foreign keys to refer to primary keys or unique keys.

In your previous thread, I explained what a NOVALIDATE constraint is: "A NOVALIDATE constraint is basically a constraint that can be enabled but for which Hive will not check the existing data to determine whether there might be data that currently violates the constraint."

The difference between a UNIQUE constraint and a primary key is that per table you may have only one primary key, but you may define more than one UNIQUE constraint. Primary key constraints are not nullable; UNIQUE constraints may be nullable. Oracle also implements the NOVALIDATE constraint; there is a write-up on it by Richard Foote.

When you create a UNIQUE constraint, the database automatically creates a UNIQUE index. For RDBMS databases, a PRIMARY KEY will generate a unique CLUSTERED INDEX, while a UNIQUE constraint will generate a unique NON-CLUSTERED INDEX.
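To make the NOVALIDATE behaviour concrete, here is a minimal sketch of Hive's informational constraints (the table and column names are hypothetical):

```sql
-- Hypothetical sketch: Hive constraints are informational only.
-- DISABLE NOVALIDATE means Hive neither enforces the constraint nor
-- checks existing rows against it; adding RELY tells the optimizer
-- it may still use the constraint as a planning hint.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  PRIMARY KEY (order_id) DISABLE NOVALIDATE,
  CONSTRAINT fk_customer FOREIGN KEY (customer_id)
    REFERENCES customers(customer_id) DISABLE NOVALIDATE RELY
);
```

Because the constraints are never enforced, an INSERT with a customer_id that has no match in customers would still succeed; the metadata exists for tools and the optimizer, not for data integrity.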
12-26-2019
12:30 PM
@Prakashcit To ensure data from multiple data sources is ingested so that business insights can be discovered at a later stage, we usually dump everything. Comparing the source data with the ingested data simply validates that all the data has been pushed, and verifies that the correct data files were generated and loaded into HDFS in the desired location.

A smart data lake ingestion tool or solution like Kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture:

/landing_Zone/Raw_data/ [corresponding to stage 1]
/landing_Zone/Raw_data/refined [corresponding to stage 2]
/landing_Zone/Raw_data/refined/Trusted Data [corresponding to stage 3]
/landing_Zone/Raw_data/refined/Trusted Data/sandbox [corresponding to stage 4]

The data lake can also be used to feed upstream systems for real-time monitoring, or for long-term storage such as HDFS or Hive for analytics.

Data quality is often seen as the unglamorous component of working with data. Ironically, it usually makes up the majority of a data engineer's time. Data quality might very well be the single most important component of a data pipeline, since, without a level of confidence and reliability in your data, the dashboards and analyses generated from it are useless. The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct; it is a continuous data engineering task as more data sources are incorporated into the pipeline.

Typically, Hive is plugged in at stage 3 and tables are created after the data validation of stage 2. This ensures that data scientists have cleansed data to run their models on, and that analysts using BI tools do as well. At least, these have been my tasks across many projects. HTH