Member since: 07-31-2013
1924 Posts
460 Kudos Received
311 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 956 | 07-09-2019 12:53 AM |
 | 4118 | 06-23-2019 08:37 PM |
 | 5459 | 06-18-2019 11:28 PM |
 | 5489 | 05-23-2019 08:46 PM |
 | 1927 | 05-20-2019 01:14 AM |
05-12-2019
07:52 PM
What have you specified as your Node Labels storage directory in your global yarn-site.xml file, via the property "yarn.node-labels.fs-store.root-dir"? The default for this property is a generated directory under the local filesystem /tmp/, which can be cleaned up periodically or at boot, and which is not kept in sync between HA ResourceManager hosts. Try specifying an HDFS location for it and perform your modification tests again, as described at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeLabel.html#Configuration
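As a rough sketch of that change (the HDFS path and nameservice below are placeholders, not values from your cluster):

hdfs dfs -mkdir -p /yarn/node-labels
hdfs dfs -chown -R yarn:hadoop /yarn/node-labels    # match the user your ResourceManagers run as

Then set the following in yarn-site.xml on both ResourceManager hosts and restart them before re-testing:
yarn.node-labels.fs-store.root-dir = hdfs://<your-nameservice>/yarn/node-labels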
05-12-2019
07:47 PM
Thank you for sharing that output. The jar does appear to carry one set of classes in the right package directory, but it also carries another set under a different directory. Perhaps there is a versioning/work-in-progress issue here, where the incorrect build is the one that ends up running. Can you try building your jar again from a clean working directory? If the right driver class is the one running, you should not see the following log:
> 19/05/11 02:43:49 WARN mapreduce.JobResourceUploader: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
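For example, a clean rebuild could look like this (the paths and package names are illustrative, not taken from your project):

rm -rf classes/ wordcount.jar
mkdir classes
javac -classpath "$(hadoop classpath)" -d classes src/com/example/wordcount/*.java
jar cf wordcount.jar -C classes/ .
jar tf wordcount.jar    # verify each class appears exactly once, under the expected package path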
05-10-2019
02:25 AM
Does the created table carry partition columns? If yes, are the HDFS files under the appropriate partition directories, and do those partitions appear when you run a SHOW PARTITIONS query? Can you add the output of 'SHOW CREATE TABLE' for your tablename here? Also, by "Table count in Hue" do you mean some statistic presented on its UI, or does performing a literal "select count(*) from table;" query also result in 0?
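A quick way to gather these from a shell (the connection URL, database/table names, and warehouse path below are placeholders):

beeline -u "jdbc:hive2://<hs2-host>:10000" -e "SHOW CREATE TABLE mydb.mytable;"
beeline -u "jdbc:hive2://<hs2-host>:10000" -e "SHOW PARTITIONS mydb.mytable;"
beeline -u "jdbc:hive2://<hs2-host>:10000" -e "SELECT count(*) FROM mydb.mytable;"
hadoop fs -ls -R /user/hive/warehouse/mydb.db/mytable | head    # assuming the default warehouse location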
05-09-2019
09:17 PM
Can you describe the steps used for building the jar from your compiled program? Use the 'jar tf' command to check if all 3 of your class files are within it, and not just the WordCountDriver.class file.
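For example (the mapper/reducer class names below are only illustrative; yours may differ):

jar tf wordcount.jar
# The output should include all three classes under their package path, e.g.:
#   com/example/WordCountDriver.class
#   com/example/WordCountMapper.class
#   com/example/WordCountReducer.class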
05-09-2019
02:39 AM
1 Kudo
Spark running on YARN will use the temporary storage presented to it by the NodeManagers where the containers run. These directory path lists are configured via Cloudera Manager -> YARN -> Configuration -> "NodeManager Local Directories" and "NodeManager Log Directories". You can change these values to point to your new, larger volume, and YARN will stop using your root partition. FWIW, the same applies for HDFS if you use it. Also see: https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html
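For instance, assuming the new volume is mounted at /data1 (the path and ownership below are illustrative; match what your existing NodeManager directories use):

mkdir -p /data1/yarn/nm /data1/yarn/container-logs
chown -R yarn:hadoop /data1/yarn

Then point "NodeManager Local Directories" at /data1/yarn/nm and "NodeManager Log Directories" at /data1/yarn/container-logs in Cloudera Manager, and restart the NodeManagers.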
05-09-2019
02:09 AM
Quoted from the documentation about using Avro files at https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_avro_usage.html#topic_26_2
"""
Hive (…) To enable Snappy compression on output [avro] files, run the following before writing to the table:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
"""
Please try this out. You are missing only the second property mentioned here, which appears specific to Avro serialization in Hive. The default compression codec for Avro is deflate, which explains the behaviour you observe without it.
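For example, a minimal end-to-end sketch of a session (table names and the output file path are placeholders):

beeline -u "jdbc:hive2://<hs2-host>:10000" -e "
  SET hive.exec.compress.output=true;
  SET avro.output.codec=snappy;
  INSERT OVERWRITE TABLE mydb.my_avro_table SELECT * FROM mydb.my_staging_table;
"
# The codec is recorded in each Avro file header; a rough check on one output file:
hadoop fs -cat /user/hive/warehouse/mydb.db/my_avro_table/000000_0 | head -c 300 | strings | grep codec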
05-09-2019
01:52 AM
The primary change appears to be that the Impala JDBC 2.6 drivers began to shade in slf4j and other dependencies instead of offering them as separate jars as in 2.5. Do you perhaps have older slf4j-log4j12*.jar and slf4j-api-*.jar files left over from before you upgraded the driver jar? Try removing them to move past this, as the new driver jar does not require them to be independently present. Apache NiFi uses the log4j-over-slf4j*.jar, so that one likely should not be removed.
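For example (the lib directory below is a placeholder for wherever your driver jars are deployed):

cd /opt/nifi/custom-jdbc-drivers    # illustrative path
ls slf4j-log4j12*.jar slf4j-api-*.jar 2>/dev/null
mkdir -p /tmp/jdbc-jar-backup
mv slf4j-log4j12*.jar slf4j-api-*.jar /tmp/jdbc-jar-backup/    # leave log4j-over-slf4j*.jar in place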
05-09-2019
01:33 AM
Are all of your processes connecting to the same Impala Daemon, or are you using a load balancer / varying connection options? Each Impala Daemon can only accept a finite total number of active client connections, which is likely what you are running into. Typically, for concurrent access to a DB, it is better to use a connection pooling pattern with a finite number of connections shared between threads of a single application. This avoids overloading a target server. While I haven't used it, pyodbc may support connection pooling and reuse, which you could use from threads in Python instead of creating separate processes. Alternatively, spread the connections around, either by introducing a load balancer or by varying the target options for each spawned process. See https://www.cloudera.com/documentation/enterprise/latest/topics/impala_dedicated_coordinator.html and http://www.cloudera.com/documentation/other/reference-architecture/PDF/Impala-HA-with-F5-BIG-IP.pdf for further guidance and examples on this.
05-08-2019
09:56 PM
I replied on your other thread about the return code: https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/Sparktask-execution-error/m-p/86476#M3059 It appears you will require permissions to the logs, or a copy of them, to investigate further. You can also have your administrative team search the logs for keywords such as your query text or query ID and share the relevant snippets.
05-08-2019
09:41 PM
The OIV tool is documented at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html and includes some examples. Try its Delimited processor options on a copy of your HDFS fsimage file and check out the result.
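For example (the fsimage file name below is illustrative; use whichever image -fetchImage downloads for you):

hdfs dfsadmin -fetchImage /tmp/fsimage-dir
hdfs oiv -p Delimited -delimiter ',' -i /tmp/fsimage-dir/fsimage_0000000000012345678 -o /tmp/fsimage.csv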
05-08-2019
09:38 PM
As far as the SparkTask goes, return code 3 may indicate _any/all_ exception caught during job closure [1]. The potential scenarios for this are numerous and cannot be narrowed down from this message alone. The true exception, with all its details, should be printed in the logs of the target server your JDBC client connects to (the HiveServer2 logs). I'd recommend beginning your investigation there. [1] - https://github.com/apache/hive/blob/rel/release-3.1.1/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTask.java#L286-L297
05-08-2019
09:28 PM
There is currently no way to do this with ALTER TABLE ADD PARTITION. Its implementation requires an explicit definition of every partition in the partition_spec [1]. Please file a feature request at https://issues.apache.org/jira/browse/HIVE for the Apache Hive team to consider the feasibility of adding this in future. In the meantime, if you are able to script out your statements, you could use SHOW PARTITIONS and an HDFS listing to generate all the explicit statements dynamically (a rough sketch follows below). [1] - https://github.com/apache/hive/blob/rel/release-3.1.1/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L3446-L3479 (final arg to call) and https://github.com/apache/hive/blob/rel/release-3.1.1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L400-L403
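As a rough sketch of that scripting approach (the table, partition column, warehouse path, and connection URL are all placeholders):

hadoop fs -ls /user/hive/warehouse/mydb.db/mytable | awk -F'dt=' '/dt=/ {print $2}' | while read dt; do
  echo "ALTER TABLE mydb.mytable ADD IF NOT EXISTS PARTITION (dt='$dt');"
done > add_partitions.sql
beeline -u "jdbc:hive2://<hs2-host>:10000" -f add_partitions.sql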
05-08-2019
08:48 PM
What error do you get, specifically? Is your installation secure (with Kerberos, with TLS, etc.)? A sample HAProxy configuration to use with HS2 + Load Balancer is available at https://www.cloudera.com/documentation/enterprise/latest/topics/admin_ha_hiveserver2.html, along with caveats for secured clusters.
05-08-2019
08:45 PM
Like most other languages, Perl offers a function to execute a command on the host OS: https://perldoc.perl.org/functions/system.html You can use Perl to prepare your Hive CLI command lines / query text, and then execute the "hive" command with its arguments from within the script via the `system(…)` call.
05-08-2019
07:34 PM
Configure your cluster with the steps described at https://www.cloudera.com/documentation/enterprise/latest/topics/admin_adls2_config.html and then you can use 'hadoop fs' commands with abfs:// or abfss:// URLs in the same way as you do with hdfs:// or s3a:// URLs. Edit: Apologies, I missed addressing the ACL-specific part of your question. There is support for ACL management in the connector, so using 'hadoop fs -getfacl/-setfacl' with the ADLSv2 URLs should work: https://github.com/cloudera/hadoop-common/blob/cdh6.2.0-release/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystem.java#L638-L672
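For example (the account, container, path, and user names below are placeholders):

hadoop fs -getfacl abfss://mycontainer@myaccount.dfs.core.windows.net/data/raw
hadoop fs -setfacl -m user:etl_user:rwx abfss://mycontainer@myaccount.dfs.core.windows.net/data/raw
hadoop fs -getfacl abfss://mycontainer@myaccount.dfs.core.windows.net/data/raw    # verify the new entry appears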
05-08-2019
07:33 PM
1 Kudo
Are you looking for a sequentially growing ID or just a universally unique ID? For the former, you can use Curator over ZooKeeper with this recipe: https://curator.apache.org/curator-recipes/distributed-atomic-long.html For the latter, a UUID generator may suffice. For a more 'distributed' solution, check out Twitter's Snowflake: https://github.com/twitter-archive/snowflake/tree/snowflake-2010
05-08-2019
07:21 PM
This sounds like a network configuration issue. If you can connect to all of your VM's service ports via telnet/nc from the client host, you should be able to access it in its entirety. To access HDFS directly from another host on a network, the client must have full access to every cluster host and service port (including the NameNode ports, the DataNode ports, etc.), because the actual data download is done by connecting to a DataNode. Check your firewall rules to see whether the DataNode port is blocked for remote access. Also check whether your DataNodes are serving their ports on addresses that are remotely accessible (if the VM has multiple addresses).
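A quick reachability test from the client host could look like this (the hostname is a placeholder and the ports shown are common CDH 5 defaults; substitute yours):

nc -vz quickstart.cloudera 8020     # NameNode RPC
nc -vz quickstart.cloudera 50010    # DataNode data transfer
nc -vz quickstart.cloudera 50075    # DataNode HTTP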
05-08-2019
07:15 PM
There's no 'single' query tracking in HBase because of its distributed nature (your scan range may map onto several different regions, hosted and served independently by several different nodes). Access to data is audited if you enable TRACE-level logging on the AccessController class, or if you use the Cloudera Navigator Audit Service in your cluster. The audit information will capture the requestor and the kind of request, but not the parameters of the request. If it is the parameters of your request (such as row ranges, filters, etc.) you're interested in, could you explain what the use-case is for recording them?
05-08-2019
07:12 PM
Please add more details:
- A listing (hadoop fs -ls) that shows the anomaly
- Details on how the file was originally created, how many files are in this state, and whether the issue is isolated to a few directories
- Whether there is a 'setTimes' user operation in the HDFS audit logs (see the example below)
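For that last check, something like this on the NameNode host can help (the audit log path shown is the CDH default and may differ in your setup):

grep 'cmd=setTimes' /var/log/hadoop-hdfs/hdfs-audit.log* | head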
05-08-2019
07:08 PM
There are a few options.
You can grab the fsimage periodically with the 'hdfs dfsadmin -fetchImage' command and analyze its delimited or XML outputs via the 'hdfs oiv' tool. The metadata will carry file lengths and ownership information that can help you aggregate it into a report with your record processing software of choice (a rough sketch follows below).
Cloudera Enterprise Reports Manager carries summary reports of watched directories: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_dg_reports.html
Cloudera Enterprise Navigator carries HDFS analytics that help show how your HDFS is being used: https://www.cloudera.com/documentation/enterprise/latest/topics/navigator_dashboard.html#concept_cnv_dwt_5x
Cloudera Enterprise Workload eXperience Manager (WXM) includes a small files reporting feature: https://www.cloudera.com/documentation/wxm/latest/topics/wxm_file_size_reporting.html
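A minimal sketch of the fsimage approach (file names are illustrative; check the header line of the delimited output for the exact column order before aggregating):

hdfs dfsadmin -fetchImage /tmp/fsimage-dir
hdfs oiv -p Delimited -i /tmp/fsimage-dir/fsimage_0000000000012345678 -o /tmp/fsimage.tsv
# e.g. file count per owner (UserName is typically the 11th tab-separated field)
awk -F'\t' 'NR>1 {files[$11]++} END {for (u in files) print files[u], u}' /tmp/fsimage.tsv | sort -nr | head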
05-08-2019
06:42 PM
1 Kudo
Running over a public IP may not be a good idea if it is open to the internet. Consider using a VPC? That said, you can point the HBase Master and RegionServer to use the address from a specific interface name (eth0, eth1, etc.) and/or a specific DNS resolver (an IP or name that can answer a dns:// resolving call) via the advanced config properties:
hbase.master.dns.interface
hbase.master.dns.nameserver
hbase.regionserver.dns.interface
hbase.regionserver.dns.nameserver
By default the services will use whatever the host's default name and resolved address is (i.e. what `getent hosts $(hostname -f)` returns) and publish this to clients.
05-08-2019
06:25 PM
Yes, unfortunately, this is due to your case-sensitive configuration. Sqoop hard-codes the catalog names in upper-case for its SQL Server queries [1]. Could you please log a JIRA over at https://issues.apache.org/jira/projects/SQOOP describing the issue? That could help get traction toward changing the SQL Server manager queries to lower-case, which should work in both case-sensitive and case-insensitive modes (I think). As a workaround to 'list-databases', you can try the 'eval' sub-command instead, which executes arbitrary SQL from user input: https://archive.cloudera.com/cdh5/cdh/5/sqoop/SqoopUserGuide.html#_literal_sqoop_eval_literal [1] - https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/manager/SQLServerManager.java#L232-L235
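For example (the connection details below are placeholders):

sqoop eval \
  --connect "jdbc:sqlserver://<host>:1433;databaseName=master" \
  --username <user> -P \
  --query "SELECT name FROM sys.databases"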
05-07-2019
09:58 PM
Depends on what you mean by 'storage locations'. If you mean "can other apps use HDFS?" then the answer is yes, as HDFS is an independent system unrelated to YARN and has its own access and control mechanisms not governed by a YARN scheduler. If you mean "can other apps use the scratch space on NM nodes" then the answer is no, as only local containers get to use that. If you're looking to strictly split both storage and compute, as opposed to just some form of compute, then it may be better to divide up the cluster entirely.
05-07-2019
08:56 PM
1 Kudo
Have you gone over the Kafka documentation? Are there specific parts or scenarios beyond the ones mentioned there that you have these questions about?
> data loss
Kafka provides topic partition replication.
> data duplication
Kafka does not do anything specific for deduplicating data. Assuming you're asking about exactly-once processing semantics, it depends on your application and how it leverages Kafka. One good write-up on this is at https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
05-07-2019
07:29 PM
You may find success in using an unpacked parcel (tar xf file.parcel) if you pre-place the config files it relies on at the right locations (/etc/hadoop/conf, /etc/spark2/conf, etc. directories and contents, obtainable from an existing CM-managed gateway host) and place the PARCEL_EXTRACT_ROOT/bin/ on the global PATH. Once set up, try running spark2-shell from the SPARK2-PARCEL-EXTRACT-ROOT/bin/ level. Note also that the Spark2 parcel relies on the presence of the CDH parcel, so both will need to be present.
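Roughly, a sketch of that setup (the parcel file names and paths below are illustrative):

mkdir -p /opt/parcels && cd /opt/parcels
tar xf /path/to/CDH-<version>.parcel
tar xf /path/to/SPARK2-<version>.parcel
# copy /etc/hadoop/conf and /etc/spark2/conf from an existing CM-managed gateway host onto this machine
export PATH=/opt/parcels/SPARK2-<version>/bin:/opt/parcels/CDH-<version>/bin:$PATH
spark2-shell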
05-07-2019
07:21 PM
I am not seeing a question in your description posted here, so I assume your ask is how to get rid of this alert?
The alert exists to warn that you may face job failures due to exhaustion of the available transient storage space in YARN. Typically, HDFS and YARN worker roles share disks between them. If you're observing this alert, it could be that one of them is hogging most of the space, leaving little for the other.
For HDFS, you can check the DataNode usage reports on the NameNode Web UI, or run du over the HDFS DataNode directories on the reported host(s). A stale snapshot is often the cause of excessive, unaccounted-for space use by HDFS.
For YARN, the space is typically cleared soon after the container using it ends, but if you face this even when there are no jobs running, try investigating the contents of the directories shown in the alert.
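A quick way to see which side is consuming the disk on the reported host (the paths below are common CDH-style defaults; substitute the directories from your configuration and the alert):

df -h /data/1                       # the mount hosting both sets of directories
du -sh /data/1/dfs/dn               # HDFS DataNode data directory
du -sh /data/1/yarn/nm              # YARN NodeManager local directory
hdfs lsSnapshottableDir             # directories that may be holding stale snapshots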
05-07-2019
07:16 PM
The VM offered typically includes all required items pertaining to questions asked. I'm not certain about the exact version of Spark2 offered, but the included version will suffice in achieving the presented goals.
05-07-2019
07:11 PM
I was able to reproduce this with the stock Docker container memory settings of 2 GiB. Upon investigation (via logs and dmesg), it appears that the Linux kernel is invoking its OOM killer on the largest memory-consuming process, which here is the NameNode. The NameNode is killed by the kernel, and thereby all other services fail to connect to port 8020 for HDFS work. Retry on a new run, but after reconfiguring Docker to give additional memory to the container. Adding 1 GiB to the existing value should help with this. Check https://stackoverflow.com/questions/44533319/how-to-assign-more-memory-to-docker-container if you need help with this.
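If you launch the container yourself with docker run, the limit can be raised directly (the image name, arguments, and the 4g value below are illustrative):

docker run -d -m 4g --memory-swap 4g <your-image> <your-usual-args>
# On Docker Desktop (Mac/Windows), the VM-wide memory limit is instead raised under Preferences -> Resources.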
05-07-2019
06:28 PM
The metric collection and query features are documented at https://www.cloudera.com/documentation/enterprise/latest/topics/cm_metrics.html - check out specifically the links within it about Metric Aggregation and the tsquery language. As to interpreting each graph, that may require additional knowledge of service/host architectures. If you have a few specific metric names you'd like more clarification on, please post back with details.
05-07-2019
06:25 PM
Our Isilon doc page covers some of your asks, including the differences in security features (as of this posting, the Isilon solution did not support HDFS ACLs or transparent encryption, but it does support Kerberos authentication): https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_isilon_service.html
> extending an existing CDH HDFS cluster with Isilon
If by extending you mean "merging" the storage under a common namespace, that is not currently possible (in 5.x/6.x).
> using Isilon as a backup of an existing CDH HDFS cluster
Cloudera Enterprise BDR (Backup and Disaster Recovery) features support replicating to/from Isilon in addition to HDFS, so this is doable: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_pcm_bdr.html#supported_replication_isilon