Member since: 07-31-2013
Posts: 1924
Kudos Received: 460
Solutions: 311

My Accepted Solutions
Views | Posted
---|---
956 | 07-09-2019 12:53 AM
4118 | 06-23-2019 08:37 PM
5459 | 06-18-2019 11:28 PM
5491 | 05-23-2019 08:46 PM
1927 | 05-20-2019 01:14 AM
05-07-2019
06:01 PM
Could you share your CM agent log snippets from right after the parcel was activated and the host inspector showed the missing components/users? The users are typically created (if they do not pre-exist) by the Cloudera Manager agent when the parcel is activated for the first time. It is possible something went wrong at that step, so having the agent logs will be helpful for troubleshooting.
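If it helps, something like the following should capture the relevant window (a minimal sketch, assuming the default agent log location on the affected host):

```
# Default CM agent log path; adjust if your installation relocated it
grep -i parcel /var/log/cloudera-scm-agent/cloudera-scm-agent.log | tail -n 200
```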
05-07-2019
05:58 PM
> my question is is there a way to change that to good health ?
> if not , does that affect my work or the working of any of the services ?

This greatly depends on what the health failures specifically are. From your screenshot, it appears that your hosts are also showing bad health states, so I would begin there (visit the Hosts tab, click your host, and check out the alerts shown in the top-left panes of the UI) and then move on to the services (repeating the same procedure for each service).

Given that your HDFS metrics are flowing normally, the alerts may just be due to a broad host issue such as lack of disk space, use of swap memory, etc., which can be addressed independently to resolve the health state. With most host-level health issues, your services may appear to work fine, but the early warning indicates that a problem may develop in them soon, and taking preventive action is recommended.

Please share your health alert texts or screenshots if you'd like more clarification on how to resolve some of them.
05-07-2019
05:51 PM
Please add more details or the command output of your error:
- What exactly fails with the OutOfMemoryError message - a job map task, your HBase command, your Java app, etc.?
- How wide is your TSV file per line, and how many columns are in each line?
- Could you share the output of the Linux commands 'file your-file.tsv' and 'wc your-file.tsv'?
- How much heap are you providing the HBase RegionServers (search 'regionserver heap' on the Cloudera Manager -> HBase -> Configuration page)?
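For reference, a quick sketch of the sort of diagnostics that would help (the filename is the placeholder from above):

```
file your-file.tsv                                                    # file type / encoding
wc your-file.tsv                                                      # line, word and byte counts
head -n 1 your-file.tsv | awk -F'\t' '{print NF}'                     # columns in the first line
awk '{ if (length > m) m = length } END { print m }' your-file.tsv    # longest line in bytes
```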
05-07-2019
05:48 PM
HDFS only stores two time points in its INode data structures/persistence: the modification time and the access time [1]. For files, the mtime is effectively the time the file was last closed (such as when originally written and closed, or when reopened for append and closed). In general use, this does not change much for most files you'll place on HDFS and can serve as a "good enough" creation time. Is there a specific use case you have in mind that requires preserving the original creation time?

[1] https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeAttributes.java#L61-L65
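As a quick way to inspect these timestamps for a given path (a sketch; the path is a placeholder, and the %x access-time specifier is only available on newer Hadoop 3 based releases):

```
hdfs dfs -stat "name: %n  mtime: %y" /user/alice/data/file.txt
# On newer releases, %x additionally prints the access time:
hdfs dfs -stat "name: %n  mtime: %y  atime: %x" /user/alice/data/file.txt
```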
05-07-2019
05:33 PM
Cloudera CDH provides backwards compatibility within minor release upgrades (5.x to 5.y, 6.x to 6.y, etc.). Major release upgrades (5.x to 6.x) carry a large number of differences and do not guarantee backwards compatibility. Please see our upgrade impact assessment docs for more: https://www.cloudera.com/documentation/enterprise/upgrade/topics/ug_overview.html#concept_ch1_xw1_yw You'll need to use a 6.x client with a 6.x server. For JDBC connectors compatible with 6.x, see https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-2.html
05-07-2019
05:24 PM
The simplest way is through Cloudera Hue. See http://gethue.com/new-apache-oozie-workflow-coordinator-bundle-editors/ That said, if you've attempted something and have run into issues, please add more details so the community can help you on specific topics.
05-07-2019
05:21 PM
It would help if you added some description of what you have found or attempted, instead of just a broad question. Which load balancer are you choosing to use? We have some sample HAProxy configs for Impala at https://www.cloudera.com/documentation/enterprise/latest/topics/impala_proxy.html#tut_proxy that can be repurposed for other components. Hue also offers its own pre-optimized load balancer role in Cloudera Manager that you can add and have set up automatically: https://www.cloudera.com/documentation/enterprise/latest/topics/hue_perf_tuning.html
05-07-2019
05:18 PM
CDH offers the Lily HBase Indexer (the Key-Value Store Indexer service) for this purpose, as documented at https://www.cloudera.com/documentation/enterprise/latest/topics/search_index.html (see the Near Real-Time Indexing sections). Are you running into specific issues when attempting this?
05-05-2019
08:58 PM
> So If i want to fetch all defined mapreduce properties, can i use this Api or it does have any pre-requisites?

Yes, you can. The default role config group almost always exists even if role instances do not, but if not (such as in a heavily API-driven install) you can create one before you fetch.

> Also does it require any privileges to access this api?

A read-only user should also be able to fetch configs as a GET call over the API. However, if there are configs marked as secured (such as configs that carry passwords, etc.), then retrieving the values requires admin privileges; they will otherwise appear redacted.
05-05-2019
06:08 PM
@priyanka2,

> But in the yarn, there is no role of type Gateway for my cluster.
> So is there any other way to fetch mapreduce properties?

There may still be a role config group for it. You can use the roleConfigGroups endpoint to access its configs, with something like `curl -u auth:props -v http://cm-host.com:7180/api/v15/clusters/MyClusterName/services/YARN-1/roleConfigGroups/YARN-1-GATEWAY-BASE/config?view=full`

> Could you please explain what could be the reason for that?

The NodeManagers do not require MR client-side properties, just properties related to services they may need to contact and the MR shuffle service plugin configs. The NM is not involved in the MR app-side framework execution, so its mapred-site.xml only carries a subset, as you've observed.

@mikefisch, IIUC, you are looking for a way to assign roles to specific hosts? Use the POST call described here, for each service endpoint: https://cloudera.github.io/cm_api/apidocs/v19/path__clusters_-clusterName-_services_-serviceName-_roles.html Specifically, the roles list needs a structure that also requires a host reference ID, which you can grab from the cluster hosts endpoint prior to this step. There's a simpler auto-assign feature also available: https://cloudera.github.io/cm_api/apidocs/v19/path__clusters_-clusterName-_autoAssignRoles.html
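For the manual route, the two calls look roughly like the below (a sketch only; the credentials and the NODEMANAGER role type are placeholders, and the hostId value comes from the first response):

```
# 1. List hosts and note the "hostId" of the target host
curl -u admin:admin 'http://cm-host.com:7180/api/v19/hosts'

# 2. Create a role of the desired type on that host
curl -u admin:admin -X POST -H 'Content-Type: application/json' \
  -d '{"items":[{"type":"NODEMANAGER","hostRef":{"hostId":"<hostId-from-step-1>"}}]}' \
  'http://cm-host.com:7180/api/v19/clusters/MyClusterName/services/YARN-1/roles'
```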
04-18-2019
12:09 AM
What command-line arguments are you passing to DistCp? The split feature is a new one that is activated only if you pass a positive integer via the -blocksperchunk flag.
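For reference, enabling it looks roughly like this (hosts, paths and the chunk size are placeholders):

```
hadoop distcp -blocksperchunk 128 \
  hdfs://source-nn:8020/data/largefile \
  hdfs://dest-nn:8020/data/largefile
```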
04-10-2019
10:15 PM
As @Tomas79 explains, there will be no consequence whatsoever from making that change (for your described problem), as these files are not deleted by the writer (in the way regular service log files are). You'll need to delete older log files on your own, regardless of what you specify as the maximum file size for each rolled log. You can consider using something like logrotate on Linux to automate this.
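A minimal sketch of such a cleanup (the directory, file pattern and 30-day retention are assumptions to adjust for your environment; a proper logrotate config achieves the same):

```
# Delete rolled files older than 30 days; path and pattern are illustrative only
find /var/log/my-service -name '*.log.*' -mtime +30 -delete
```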
04-10-2019
12:31 AM
1 Kudo
One possibility could be the fetch size (combined with some unexpectedly wide rows). Does lowering the result fetch size help? From http://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#idp774390917888 :

--fetch-size: Number of entries to read from the database at once.

Also, do you always see it fail with the YARN memory kill (due to pmem exhaustion), or do you also observe an actual java.lang.OutOfMemoryError occasionally? If it is always the former, then another suspect would be some off-heap memory use by the JDBC driver in use, although I've not come across such a problem.
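For reference, a hedged sketch of lowering the fetch size on an import (connection string, credentials, table and target directory are placeholders):

```
sqoop import \
  --connect jdbc:mysql://db-host/mydb \
  --username etl -P \
  --table wide_table \
  --fetch-size 100 \
  --target-dir /user/etl/wide_table
```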
04-09-2019
10:00 PM
1 Kudo
To add on: if you will not require audits or lineage at all for your cluster, you can also choose to disable their creation:

- Impala -> Configuration -> "Enable Impala Lineage Generation" (uncheck)
- Impala -> Configuration -> "Enable Impala Audit Event Generation" (uncheck)

If you are using Navigator with Cloudera Enterprise, then these audit and lineage files should be sent automatically to the Navigator services. If they are not passing through, it may indicate a problem in the pipeline; please raise a support case if this is true.
04-03-2019
07:01 PM
You could try running the mount with the debug option enabled (-d for hadoop-fuse-dfs). I'd recommend switching over to the NFS mount instead, if possible - that's seen more improvement work over the past few years than the fuse approach. See https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hdfs_nfsgateway.html
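For reference, a debug-mode mount would look roughly like this (the NameNode host/port and mount point are placeholders):

```
# -d keeps the process in the foreground and prints FUSE debug output
hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs -d
```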
04-03-2019
06:53 PM
Is the job submitted to the source cluster, or the destination? The DistCp job should only need to contact the NodeManagers of the cluster it runs on, but if it is submitted to the remote cluster then those ports may need to be opened. The HDFS transfer part does not involve YARN service communication at all, so it is not expected to contact a NodeManager. It would be helpful if you could share some more logs leading up to the observed failure.
04-03-2019
06:35 PM
This may not help you directly, as there isn't enough evidence in the post to investigate, but when you have a straggler task that takes several times longer than the others in its phase, it is more likely to be due to key-based skew. If your source table's primary key set, or the selected split-by column, is not high in cardinality, the division of work across tasks will be skewed. It is worth inspecting this (a COUNT over a GROUP BY of the candidate column will help tell) and adjusting the import query parameters accordingly to gain maximum parallelism. One way to confirm this is to compare the actual task record counters, while the tasks are running, between those that complete quickly and the straggler.
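One hedged way to check the key distribution directly on the source database (connection details, column and table names are placeholders) is with sqoop eval:

```
sqoop eval \
  --connect jdbc:mysql://db-host/mydb --username etl -P \
  --query "SELECT split_col, COUNT(*) AS cnt FROM my_table GROUP BY split_col ORDER BY cnt DESC"
```

A heavily lopsided count on the split column would point to skew as the cause of the straggler.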
04-03-2019
02:10 AM
For CDH / CDK Kafka users, the command is already in your PATH as "kafka-consumer-groups".
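For example (the broker address and group name are placeholders):

```
kafka-consumer-groups --bootstrap-server broker-1:9092 --list
kafka-consumer-groups --bootstrap-server broker-1:9092 --describe --group my-consumer-group
```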
04-01-2019
06:55 PM
1 Kudo
Could you share the full log from this failure, both from the Oozie server for the action ID and from the action launcher job's map task logs? The 8042 port is the NodeManager HTTP port, used to serve logs of live containers among other status details over REST. It is not directly used by DistCp in its functions, but MapReduce and Oozie diagnostics might be invoking it as part of a response to a failure, so it is a secondary symptom. Note, though, that running DistCp via Oozie requires you to provide appropriate configs that ensure delegation tokens for both kerberized clusters are acquired. Use "mapreduce.job.hdfs-servers" with a value such as "hdfs://namenode-cluster-1,hdfs://namenode-cluster-2" to influence this during the Oozie server's delegation token acquisition phase. This is only relevant if you use Kerberos on both clusters.
03-25-2019
09:25 AM
Use HBase exclusively for data that dominantly requires random access to individual records or small ranges of sequentially related records. For everything else, including tables you primarily use for building regular reports, you're better off using Kudu+Impala with appropriate partitioning. Check out https://kudu.apache.org/docs/schema_design.html for a good reference on this. Queries that perform large scans (hundreds of thousands of rows or more, for example), including full table scans, are not the type of workload HBase is designed for.
03-21-2019
05:48 PM
The search-based HDFS find tool has been removed and is superseded in C6 by the native "hdfs dfs -find" command, documented here: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#find
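For example (the path and name pattern are placeholders):

```
hdfs dfs -find /user/alice -name '*.csv' -print
```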
03-20-2019
04:12 AM
Flume scripts need to be run under a Bash shell environment, but it appears that you are trying to run them with PowerShell on Windows.
03-17-2019
06:49 PM
1 Kudo
Thank you for confirming the details.

- Does the subject part of your klist output match the username added to the HBase Superusers configuration precisely?
- If your user is in a different realm than the cluster services, is the realm name present under HDFS -> Configuration -> 'Trusted Realms'?
- Are all commands run as the superuser failing? Which HBase shell command/operation specifically is leading to your quoted error?

As to adding groups, it can be done in the same field, except you need to add an '@' prefix to the name. For example, if your group is cluster_administrators, then add it as '@cluster_administrators' in the HBase Superusers config. When using usernames, the @ must not be specified. Both approaches should work, though.

P.S. If you'll be relying on groups, ensure all cluster hosts return consistent group lookup output for 'id <user>' commands, as the authorization check is distributed across the cluster roles for HBase.
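A simple sketch of that consistency check (the hostnames and username are placeholders):

```
# Compare group lookup output for the same user across hosts
for h in worker-1 worker-2 worker-3; do
  echo "== $h"
  ssh "$h" id alice
done
```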
03-16-2019
06:15 PM
1 Kudo
Recent upstream Hive versions do support that keyword at table creation, but they do not enforce it on the data side. They leave it up to you to maintain it, so I wouldn't rely on its use with Sqoop. The feature currently exists as a way to explicitly declare such a data relationship (externally maintained, of course, as also mentioned on the quoted HQL doc link) to the query optimizer and planner so it can make better decisions. See https://issues.apache.org/jira/browse/HIVE-13076 for more details on the design and scope. I'd recommend using Kudu instead for its ease of use in this area (it enforces primary keys).
03-14-2019
12:39 AM
The --verbose flag may help you see the queries Sqoop attempts to generate. Have you tried using --boundary-query '<your-statement>' instead of --split-by '<colname>'?
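A hedged sketch of what that could look like (the connection string, credentials, table, column and target directory are placeholders):

```
sqoop import \
  --connect jdbc:mysql://db-host/mydb --username etl -P \
  --table my_table \
  --boundary-query 'SELECT MIN(id), MAX(id) FROM my_table' \
  --target-dir /user/etl/my_table \
  --verbose
```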
03-14-2019
12:20 AM
1 Kudo
The cache reports form the basis of the NameNode's awareness of cached block locations. A report is basically a list of block IDs that are currently cached by the DataNode. Delaying it will impact the freshness of the cached block locations the NameNode serves to its clients when the state changes due to cache modification (add/remove/timers/etc.). Since changes to the block cache are mostly done asynchronously, this should not impact any specific commands, but it can result in delayed or missed benefits for clients seeking cached locations of recently cached/uncached blocks, depending on how far you delay the reports (the default is every 10 seconds). The regular DataNode heartbeats only send cache capacity statistics, not the actual block ID information. The cache report should typically be a small list - an encoded array of block ID integers - and shouldn't impact the NameNode in any significant way unless you have very large caches. Are you observing otherwise?
03-13-2019
11:58 PM
The release notes area of Cloudera Enterprise 6 documentation carries an equivalent matrix section: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_product_compatibility.html
03-13-2019
07:00 PM
Could you elaborate on 'does not recognize it'? If it's just an options problem, but the files appear OK when you try loading them manually into Hive, then you can import to HDFS as Parquet and run a simple LOAD DATA INPATH statement to move them into your Hive table as-is. If there's a deeper inconsistency that requires transformation from text to get it right, then both options seem fine.
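A rough sketch of that flow (all hostnames, credentials, paths and table names are placeholders):

```
# Import as Parquet into an HDFS staging directory
sqoop import \
  --connect jdbc:mysql://db-host/mydb --username etl -P \
  --table my_table \
  --as-parquetfile \
  --target-dir /user/etl/staging/my_table

# Move the files into the Hive table's location as-is
beeline -u 'jdbc:hive2://hs2-host:10000/default' \
  -e "LOAD DATA INPATH '/user/etl/staging/my_table' INTO TABLE my_table;"
```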
03-13-2019
06:52 PM
Please create a new thread for distinct questions, instead of bumping an older, resolved thread.

As to your question, the error is clear, as is the documentation, quoted below:

"""
Spooling Directory Source

This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

- If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop processing.
- If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
"""
- https://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#spooling-directory-source

It appears that you can get around this by using ExecSource with a script or command that reads the files, but you'll have to sacrifice reliability. It may instead be worth investing in an approach that makes filenames unique (`uuidgen`-named softlinks in another folder, etc.), as sketched below.
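A literal sketch of that softlink idea (paths are placeholders; the source files must already be fully written, and you should verify the spooling source handles links in your environment):

```
# Link each finished input file into the spooling directory under a unique name
for f in /data/incoming/*.log; do
  ln -s "$f" "/data/flume-spool/$(uuidgen).log"
done
```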
03-13-2019
06:46 PM
Could you share your CDH version? I'm unable to reproduce this on recent CDH 6.x releases with a username added (without the '@' character prefix) to the config you've mentioned. By 're-deployed', did you mean a restart? I had to restart the service for all hosts to see the new superuser config.