Member since
02-27-2020
157 Posts
38 Kudos Received
43 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 200 | 05-20-2022 09:46 AM |
| 111 | 05-17-2022 08:42 PM |
| 172 | 05-06-2022 06:50 AM |
| 185 | 04-18-2022 07:53 AM |
| 147 | 04-12-2022 11:17 AM |
05-25-2021
10:43 AM
Hello, Hive compaction runs first, then a cleaner thread waits for all readers to finish reading the old base/delta files. Only when it determines that nobody is reading the old files anymore does it delete them. Please run SHOW COMPACTIONS and look at the state of the compaction in question. It would also help to know which version of CDH/HDP/CDP you are on, and if CDP, whether it is public or private cloud. Hope this helps, Alex
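For illustration, something like this from the command line (the JDBC URL is a placeholder; substitute your own HiveServer2 host and auth settings):

```bash
# Minimal sketch, assuming a Kerberized HiveServer2; the URL is a placeholder.
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
        -e "SHOW COMPACTIONS;"
# In the output, find the row for your table and check the State column
# (e.g. "initiated", "working", "ready for cleaning", "succeeded", "failed").
```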
05-25-2021
10:12 AM
Hello, I haven't used Flume myself, but there is some mention of the serializer.delimiter parameter in the Flume documentation. It would be helpful to know what the source of the data is (e.g. a file on HDFS) and what the destination is (e.g. Hive). Also, you should know that Flume is no longer a supported component in Cloudera Data Platform. If you are just starting to learn it, I would recommend saving yourself some time and exploring NiFi, Kafka, and Flink (good starter blog post). Regards, Alex
05-25-2021
10:06 AM
Hi Nuno, I've sent you a private message to get some more information. Please respond there and we'll figure out how to help. Regards, Alex
01-21-2021
09:11 PM
Hi Igor, You can define what users can and cannot do in Atlas by way of defining authorization policies in Ranger. Details on how to do that can be found here: https://docs.cloudera.com/runtime/7.2.6/atlas-securing/topics/atlas-configure-ranger-authorization.html What you refer to as Bookmarks can potentially be done via Saved Searches (see here), depending on what you want to achieve. As for the popularity score, this could be made a metadata attribute that can be updated by users. There is no automation to derive this score with Atlas out-of-the-box. Hope this helps, Alex
01-15-2021
04:25 PM
That is an odd behaviour. Two things to try: 1. In Cloudera Manager, go to HDFS -> Configuration -> Enable Log and Query Redaction and switch it to false. 2. Go to HBase -> Configuration -> Enable Audit Collection and turn it off. Then restart the HBase service and run your Sqoop job again to see if that helps with performance.
01-15-2021
10:11 AM
Two things to check: 1. Does your nifi service user account have permissions on the table and the HDFS location where it is trying to do the insert? 2. Your Hive SQL statement here looks a bit off to me:

insert into transformed_db.tbl_airbnb_listing_transformed
select a.*, 20210113 partition_date_id
from staging_db.etbl_raw_airbnb_listing a

Is 20210113 a column name? Are you missing a comma between that and partition_date_id? Is your source staging table partitioned? If you are trying to select only a specific date, then the syntax to do that is different.
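For illustration only, here is one way the statement could look if the target table is partitioned by partition_date_id and 20210113 is meant to be the partition value rather than a column; the beeline URL is a placeholder:

```bash
# Sketch only: assumes tbl_airbnb_listing_transformed is partitioned by
# partition_date_id and that 20210113 is the desired partition value.
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -e "
  INSERT INTO transformed_db.tbl_airbnb_listing_transformed
  PARTITION (partition_date_id = 20210113)
  SELECT a.*
  FROM staging_db.etbl_raw_airbnb_listing a;
"
```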
01-14-2021
09:48 AM
There could be a problem with running the SQL in your Hive cluster. I would suggest checking the Hive logs for any relevant errors when the NiFi flow is triggered. Another thing to check is the FlowFile itself, to make sure it has the data that the Hive table expects (i.e. the schema matches). You can do this by forwarding the failed FlowFiles to a rejected flow.
12-18-2020
11:56 AM
Settings look fine. _HOST gets replaced by the actual FQDN of the host at runtime. One thing to check is that reverse DNS lookup works on all hosts.
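A few quick commands you could run on each host to verify forward and reverse DNS; the hostname and IP below are placeholders for your own:

```bash
hostname -f                       # should print this host's FQDN
dig +short worker01.example.com   # forward lookup should return the host's IP
dig +short -x 10.0.0.12           # reverse lookup should return the same FQDN
```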
12-18-2020
10:35 AM
Out-of-the-box, Hue can't properly parse this format. There are some potential solutions in this thread: https://stackoverflow.com/questions/13628658/hive-load-csv-with-commas-in-quoted-fields and it depends on what you are comfortable with: pre-processing the file to reformat the input, or using a different SerDe in Hive. Hope that helps, Alex
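As an illustration of the SerDe route, a table definition using OpenCSVSerde might look roughly like this; the database, table, columns, and location are placeholders:

```bash
# Sketch only: OpenCSVSerde handles commas inside quoted fields.
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" <<'EOF'
CREATE EXTERNAL TABLE my_db.my_csv_table (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE
LOCATION '/user/myuser/csv_data/';
EOF
```

Keep in mind that OpenCSVSerde reads every column as STRING, so you may need to cast in queries.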
12-16-2020
11:03 PM
This could be as simple as a typo somewhere in your configs, but it's hard to tell where. It looks like the username was somehow set to "host", and HDFS can't authenticate that as a user. Without more context it's hard to say more.
12-16-2020
10:57 PM
You'll need to look through the Region Server log files to find the root cause of the problem. The error message you shared is not enough information to go on.
12-16-2020
10:52 PM
Can you show the first couple of lines of your file exactly as they appear in the file? You can open the CSV with a simple text editor of your choice and show the output in a comment here. When you are in the upload screen in Hue, note that under the Extras section there are additional parameters you might need to adjust to fit your file's formatting.
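If it's easier than opening an editor, a quick way to grab the first few lines from the shell (the file name is a placeholder):

```bash
head -n 5 my_data.csv    # print the first five lines exactly as stored
```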
12-15-2020
10:05 AM
1 Kudo
If you just execute SET hive.auto.convert.join=true; in your Hive session, it will apply for the duration of that session. Keep in mind, though, that this setting has defaulted to true since Hive 0.11.0. Regards, Alex
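For reference, one way to set and verify the property for a single session (the beeline URL is a placeholder):

```bash
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -e "
  SET hive.auto.convert.join=true;
  SET hive.auto.convert.join;   -- with no value, prints the current setting
  -- ... your join query here ...
"
```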
12-15-2020
09:56 AM
I was able to reproduce this error, and it looks like the problem is the identical column name in your tableA and tableB. Namely, DateColumn is referenced in the subquery, and Hive interprets this as a reference to the parent query, which is not allowed (per the limitation listed here). Essentially, Hive is confused about what you mean by this query because of the overloaded column name. To solve this, you can explicitly specify table names when referring to columns:

UPDATE tableA
SET tableA.ColA = "Value"
WHERE year(tableA.DateColumn) >= (
  select (max(year(tableB.DateColumn)) - 1)
  from tableB
)

Let me know if this works. Regards, Alex
12-15-2020
09:16 AM
1 Kudo
Hi Bhushan, The best way to approach this is to reach out to your account team as they will have a better idea of your environment and nuances. At a high level, an in-place upgrade from HDP/HDF 3 to CDP will be available early 2021. Regards, Alex
12-08-2020
01:53 PM
The reason these operations fail as the cloudbreak user is that it is a service user, intended only for accessing the cluster's machines and performing admin tasks on them. This user does not have access to the data (no Kerberos principal and no IDBroker mapping). Instead, you can SSH to your cluster's EC2 machines with your own username and workload password. That way you will have a working Kerberos principal. Another thing to check is that your user has an IDBroker mapping to access S3 resources, and potentially DynamoDB resources as well, since S3Guard relies on DynamoDB. Hope this helps, Alex
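A rough outline of that flow; the gateway host, user name, and S3 path are placeholders for your own environment:

```bash
# Placeholders throughout: use your workload user, a cluster gateway node,
# and an S3 path your IDBroker mapping allows.
ssh my_workload_user@gateway-node.example.com   # log in with your workload password

# Then, on the cluster node:
kinit my_workload_user                 # obtain a Kerberos ticket (workload password)
klist                                  # confirm the ticket is valid
hdfs dfs -ls s3a://my-bucket/data/     # works only if your IDBroker mapping allows it
```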
12-08-2020
01:28 PM
1 Kudo
This could be a good start: https://community.cloudera.com/t5/Support-Questions/Using-NiFi-to-load-data-from-localFS-to-HDFS/td-p/212124
12-08-2020
01:23 PM
1 Kudo
I haven't been able to try this with distcp, but a similar thing happens with hdfs dfs commands. What I found is that if your target folder already exists (e.g. hdfs dfs -mkdir /e/f/), then copying into that folder will give you all of your CSVs as separate files. If you don't have /e/f/ created ahead of time, then Hadoop will create it for you and rename your source CSV to "f". Hope that makes sense and helps.
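To make the distinction concrete, a small sketch (the CSV file names are placeholders; the /e/f path is from the example above):

```bash
# Target directory exists: each CSV keeps its own name under /e/f/.
hdfs dfs -mkdir -p /e/f/
hdfs dfs -put part1.csv part2.csv /e/f/

# Target directory does not exist: copying a single file to /e/f creates a FILE named f.
hdfs dfs -put part1.csv /e/f
```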
11-20-2020
10:57 AM
There is a way to express the frequency inside coordinator.xml using cron-like syntax, which allows you to specify the day of week. See here for details: https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_oozie_cron.html
11-20-2020
09:47 AM
You also want to check the structure of your raw data. Specifically, look for any instances where there are extra delimiters (e.g. a string field that includes commas as part of the string).
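One quick way to spot such rows is to count the fields per line and look for outliers; the file name and delimiter below are placeholders:

```bash
# Lines whose field count differs from the rest likely contain embedded delimiters.
awk -F',' '{ print NF }' raw_data.csv | sort -n | uniq -c
```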
11-20-2020
09:42 AM
Ok, fair enough. There is also a CDH specific connector for Teradata (available here https://www.cloudera.com/downloads/connectors/sqoop/teradata/1-7c6.html). Try that. The installation and usage guide is here: https://docs.cloudera.com/documentation/other/connectors/teradata/1-x/PDF/Cloudera-Connector-for-Teradata.pdf
11-19-2020
09:52 AM
Your connection string says "teradat" instead of "teradata" - this can also lead to a parsing error. Otherwise, have you tried exporting the entire table, without giving specific columns?
11-18-2020
09:09 PM
The issue is not actually with the --columns parameter. The problem is that Sqoop can't parse the command because it expects "-m" instead of "--m". Remember: when using a short-form, single-letter parameter, use a single dash; otherwise use a double dash. Hope this helps!
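To illustrate the single-dash point, a sketch of a Sqoop command; the host, database, credentials, columns, and table are placeholders, not from the original thread:

```bash
# Note "-m" with a single dash; "--m" would fail to parse.
sqoop import \
  --connect "jdbc:teradata://td-host.example.com/DATABASE=my_db" \
  --username my_user -P \
  --table MY_TABLE \
  --columns "COL_A,COL_B" \
  --target-dir /user/my_user/my_table \
  -m 4
```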
11-13-2020
01:00 PM
With your original approach, each query can filter out whole partitions of the table based on its WHERE clause (that is, if your table is partitioned and at least some of the columns in the clause match the partition columns). However, if your WHERE clauses are quite different/unique, then you will be scanning a big portion of the table for every one of your 100+ queries. With the suggested approach, there is only one scan of the table, but more processing happens for each row. The best way to see which performs better is simply to test both and go with the winner.
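As a sketch of what a single-scan version often looks like, conditional aggregation moves each WHERE condition into the select list so the table is read once; table and column names below are placeholders:

```bash
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -e "
  SELECT
    SUM(CASE WHEN col_a = 'x' AND col_b > 10 THEN 1 ELSE 0 END) AS query1_count,
    SUM(CASE WHEN col_c IN ('y', 'z')         THEN 1 ELSE 0 END) AS query2_count
    -- ... one expression per original WHERE clause ...
  FROM my_db.MyTable;
"
```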
11-12-2020
10:25 AM
Do your WHERE conditions rely on different columns in MyTable, or all the same columns with just different filter criteria? If it's the latter, then the answer is partitioning your Hive table on those key columns. Also, if MyTable is not too big, it would be most efficient to run your 100 queries in memory with something like SparkSQL rather than Hive.
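A rough sketch of the partitioning idea, assuming the filters all hit a single key column; the names are placeholders:

```bash
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -e "
  CREATE TABLE my_db.MyTable_part (
    col_a STRING,
    col_b INT
  )
  PARTITIONED BY (key_col STRING)
  STORED AS ORC;
  -- Queries that filter on key_col then read only the matching partitions:
  -- SELECT ... FROM my_db.MyTable_part WHERE key_col = 'some_value';
"
```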
11-03-2020
09:57 AM
1 Kudo
The error likely indicates that some AWS resources were not reachable from the CDP control plane. Double-check your security policy settings and any proxy settings. Reach out to support, as they can look at your particular environment setup and assist better. Regarding the logs, if the CM instance was stood up in your Data Lake, you can search the logs by clicking "Command logs" or "Service logs" in the Data Lake tab of your environment.
11-03-2020
09:38 AM
The error indicates an issue with Kerberos. If things ran fine last week, perhaps your Kerberos ticket expired and needs to be renewed.
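A quick way to check and renew the ticket; the principal and keytab path are placeholders:

```bash
klist                                      # shows whether the current ticket has expired
kinit my_user@EXAMPLE.COM                  # interactive renewal with your password
# or, for a service account with a keytab:
kinit -kt /etc/security/keytabs/my_user.keytab my_user@EXAMPLE.COM
```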
10-21-2020
09:02 PM
Try adding these options in your sqoop command:

--relaxed-isolation
--metadata-transaction-isolation-level TRANSACTION_READ_UNCOMMITTED

More info here: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.3/bk_data-movement-and-integration/content/controlling_trans_isol.html and here: https://www.tutorialspoint.com/java-connection-settransactionisolation-method-with-example Hope this helps. Regards, Alex
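For context, the two options shown inside a full Sqoop command; the connection details, table, and target directory are placeholders, not from the original thread:

```bash
sqoop import \
  --connect "jdbc:sqlserver://db-host.example.com:1433;databaseName=my_db" \
  --username my_user -P \
  --table MY_TABLE \
  --target-dir /user/my_user/my_table \
  --relaxed-isolation \
  --metadata-transaction-isolation-level TRANSACTION_READ_UNCOMMITTED \
  -m 1
```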
10-21-2020
05:48 PM
Can you share a little more about your use case? Are you only appending to the table or also updating records (UPSERT)? Will there be duplicate records? Also, what version of Hive are you working with?