Member since: 02-27-2020
Posts: 157
Kudos Received: 38
Solutions: 43
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 200 | 05-20-2022 09:46 AM |
| | 111 | 05-17-2022 08:42 PM |
| | 172 | 05-06-2022 06:50 AM |
| | 185 | 04-18-2022 07:53 AM |
| | 147 | 04-12-2022 11:17 AM |
06-29-2020
09:29 AM
1 Kudo
This problem is typically solved by (a) clearing cookies and restarting your browser, and/or (b) logging out and back into CDSW. Let me know if that works for you.
06-27-2020
10:48 PM
1 Kudo
Hi Guy, please try adjusting your command to the following:

ozone sh volume create --quota=1TB --user=hdfs o3://ozone1/tests

Note that the documentation states the last parameter is a URI in the format <prefix>://<Service ID>/<path>. The Service ID is what you found in ozone-site.xml.
06-12-2020
11:28 AM
Glad you solved it, @Anshul99. Mind sharing what the solution was for others' sake?
06-12-2020
09:48 AM
Hi @Maria_pl, generally speaking the approach is as follows:
1. Generate a dummy flow file to trigger the flow (GenerateFlowFile processor).
2. Use an UpdateAttribute processor to set the start date and end date as attributes on the flow file.
3. Use an ExecuteScript processor next. This can be a Python script, or whichever language you prefer, that uses the start and end attributes to list out all the dates in between (a sketch of this step is below).
4. If your script produces a single output of dates, you can then use a SplitText processor to cut each row into its own flow file, and from there each flow file will have its own unique date in your range.
Hope that makes sense.
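To make step 3 concrete, here is a minimal sketch of the date-listing logic, assuming the start and end attributes arrive as yyyy-MM-dd strings (the NiFi session and flow file handling is left out):

from datetime import datetime, timedelta

def dates_between(start_str, end_str, fmt="%Y-%m-%d"):
    """Return every date from start to end (inclusive) as formatted strings."""
    start = datetime.strptime(start_str, fmt)
    end = datetime.strptime(end_str, fmt)
    return [(start + timedelta(days=i)).strftime(fmt)
            for i in range((end - start).days + 1)]

# One date per line, ready for SplitText to break into individual flow files
print("\n".join(dates_between("2020-06-01", "2020-06-05")))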
06-11-2020
09:56 PM
Can you post your NiFi flow for clarity? Also show how your PutParquet processor is configured, along with the schema controller service. My understanding is that each time you run the ExecuteSQL processor, it creates one and only one flow file that contains all of the output from the SELECT statement in Avro format. So, where are you getting multiple flow files?
06-11-2020
01:57 PM
This should do the trick: https://community.cloudera.com/t5/Support-Questions/Nifi-Could-not-rename-file/td-p/231662 The main idea is to overwrite the flow file names with a known value and then set Overwrite Files = True in the PutParquet processor. Hope this helps.
05-27-2020
03:25 PM
In a typical CDH setup, Tableau would connect through either Hive or Impala. The former is better suited for recurring reports, while the latter is more suitable for interactive data exploration. CDH is also compatible with other visualization tools (e.g. through JDBC and ODBC drivers). Here's the documentation for setting up a connection to Impala from Tableau: https://help.tableau.com/current/pro/desktop/en-us/examples_impala.htm And here's the documentation for connecting to Hive from Tableau: https://help.tableau.com/current/pro/desktop/en-us/examples_hadoop.htm Hope this helps and is an acceptable solution for your use case.
05-26-2020
07:48 PM
Ok, so regarding single quotes vs. double quotes: you have to use double quotes in the shell every time. Text in single quotes is treated as a literal (see p. 271 of the HBase Definitive Guide). After some more research I came across this post, which seems to describe your problem exactly, along with two solutions for modifying your Java code. To summarize, the Java client for HBase expects row keys in human-readable format, not their hexadecimal representation. The solution is to read your args as a Double type, not String. Hope that finally resolves it.
05-26-2020
03:02 PM
This thread may be relevant: https://community.cloudera.com/t5/Support-Questions/HDFS-is-almost-full-90-but-data-node-disks-are-around-50/td-p/180860
05-26-2020
02:52 PM
Perhaps it's something about how Java interprets the args you pass to it when you run your code? It may be different from how the shell client interprets them (relevant discussion here). Can you show the command that executes your Java code, complete with the arguments passed to it? Also, include printed arguments (e.g. System.out.println(rowId)) in your code, and execute the code for the same key as you did in the shell (i.e. \x00\x0A@E\xFFn[\x18\x9F\xD4-1447846881#5241968320).
05-26-2020
02:16 PM
The issue is that the DROP TABLE statement doesn't seem to remove the data from HDFS. This is usually caused by the table being an external table, which doesn't allow Hive to perform all operations on it. Another thing you can try is what's suggested in this thread (i.e. before you drop the table, change its property to EXTERNAL=FALSE). Does that work for you?
05-26-2020
11:59 AM
Ok, so to be able to purge the table created by sqoop (because it is external), you'll need to add the following to your command: --hcatalog-storage-stanza 'stored as parquet TBLPROPERTIES("external.table.purge"="true")' Then, when you load the data for the first time, purging will be enabled on that table. Executing the purge command you have will then remove both the metadata and the data in the external table. Let me know if that works and if the solution is acceptable.
05-22-2020
12:51 PM
1 Kudo
Sqoop can only insert into a single Hive partition at a time. To accomplish what you are trying to do, you can run two separate sqoop commands:
1. sqoop with --query ... where year(EventTime)=2019 (remove year(EventTime)=2020) and set --hive-partition-value 2019 (not 2020)
2. sqoop with --query ... where year(EventTime)=2020 (remove year(EventTime)=2019) and set --hive-partition-value 2020 (not 2019)
This way each sqoop run will write into the one partition you want. Since this is a one-time import, the solution should work just fine. Let me know if this works and accept the answer if it makes sense.
05-22-2020
12:09 PM
Hi Heri, after you execute drop table test purge; can you check that the data is actually deleted? Query the table first, but also check with hdfs dfs to see whether the underlying files have been deleted from Hadoop (they should be). Let me know what you see. You may be right that for an EXTERNAL table the data does not get deleted, only the metadata. That's why I'm asking you to check for data with hdfs dfs. Now, to be able to drop the EXTERNAL table (both metadata and data) you'd need to follow the steps here: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_drop_external_table_data.html Hope that helps.
05-22-2020
11:28 AM
Some additional information you could provide to help the community answer the question:
- Are there any errors that Java returns when querying HBase, or does it just silently not show any rows?
- Is the same user executing both tasks (through the shell and Java)?
- Can any other rows be retrieved from Java?
05-22-2020
11:25 AM
Hi Balu, Zeppelin is not shipped with CDP-DC 7.0.3. It will be part of the CDP-DC 7.1 release, coming soon. However, depending on what you are trying to do with Zeppelin, you may want to try a CDSW trial, which provides an interface for scripting and data exploration over a CDP-DC cluster. Hope that helps.
05-13-2020
01:43 PM
Try sending a private message to @cjervis as he's been able to get this done in the past. He's a Community Manager. Let me know if you can't reach him.
05-11-2020
02:58 PM
Thanks for clarifying the question, but I'm afraid I still don't know what you are trying to achieve. Based on your example, I understand you have 10K records/documents with phone number P1 and 20K records/documents with phone number P2. Are you retrieving all 10K documents in a single query? And you want the performance of the 10K-row P1 query to be the same as a 10K-row P2 query, is that right? Solr was never built to retrieve a large number of objects at one time. It's meant for faceted search that returns a humanly consumable number of records in the result set (see pagination). Are you doing this for UI display or for integration purposes? There is some useful documentation here on getting a large number of records from Solr. It would be helpful if you shared your query, your data structure, and the use case. That way the community can better understand the problem and provide a potential solution.
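If you do end up needing to pull a large result set, here is a minimal sketch of cursor-based deep paging with the requests library; the endpoint, collection name, and uniqueKey field ("id") are assumptions, so adjust them to your schema:

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # assumed endpoint/collection

def fetch_all(query, page_size=1000):
    """Stream all matching documents using Solr's cursorMark deep paging."""
    cursor = "*"
    while True:
        params = {
            "q": query,
            "rows": page_size,
            "sort": "id asc",        # the sort must include the uniqueKey field
            "cursorMark": cursor,
            "wt": "json",
        }
        body = requests.get(SOLR, params=params).json()
        for doc in body["response"]["docs"]:
            yield doc
        if body["nextCursorMark"] == cursor:   # cursor stops advancing when exhausted
            break
        cursor = body["nextCursorMark"]

# Example: docs = list(fetch_all("phone:P1"))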
05-08-2020
01:10 PM
A Solr query can have a filter clause that will ensure only the documents from the last 7 days are fetched. What you are looking for is the filter query (fq) parameter. For example, you could add this to your query: &fq=createdate:[NOW-7DAY/DAY TO NOW] You can read more about filtering in the documentation here. If this is helpful, please don't forget to give kudos and/or accept the solution.
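As an illustration, a minimal query with that fq parameter might look like the following; the endpoint, collection name, and createdate field are assumptions based on your description:

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # assumed endpoint/collection

params = {
    "q": "*:*",
    "fq": "createdate:[NOW-7DAY/DAY TO NOW]",  # only documents from the last 7 days
    "wt": "json",
}
result = requests.get(SOLR, params=params).json()
print(result["response"]["numFound"])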
04-29-2020
01:54 PM
I just ran the following query through Hive and it worked as expected:

select
  col1,
  col2,
  case when col1 = "Female" and col2 = "Yes" then "Data Received" end
from table_name limit 100;

Can you provide some steps to reproduce?
04-17-2020
07:58 AM
1 Kudo
Glad things are moving forward for you, Heri. Examining your sqoop command, I notice the following:
- --check-column EventTime tells sqoop to use this column as the timestamp column for its select logic.
- --incremental lastmodified tells sqoop that your source SQL table can have records both added to it AND updated in it. Sqoop assumes that when a record is updated or added, its EventTime is set to the current timestamp.
When you run this job for the first time, sqoop will pick up ALL available records (the initial load). It will then print out a --last-value timestampX. This timestamp is the cutoff point for the next run of the job (i.e. the next time you run the job with --exec incjob, it will set --last-value timestampX). So, to answer your question, it looks like sqoop is treating your job as an incremental load on the first run: [EventTime] < '2020-04-17 08:51:00.54'. When this job is kicked off again, it should pick up records from where it left off automatically. If you want, you can provide a manual --last-value timestamp for the initial load, but make sure you don't use it on subsequent incremental loads. For more details, please review sections 7.2.7 and 11.4 of the Sqoop documentation. If this is helpful, don't forget to give kudos and accept the solution. Thank you!
04-16-2020
02:29 PM
On the surface it looks like your second command should work too, since you've provided the --update-mode allowinsert parameter. Can you execute the second command with the additional --verbose option and provide the output here?
04-16-2020
07:51 AM
I'm not aware of a native way of doing this in NiFi; maybe someone else will be able to chime in. A couple pieces of information that would help here:
- How big are the CSV files?
- Do the CSV files have any common column to join on, or is the order of records always ensured?
A couple of alternatives for you:
1. Are the two CSVs being generated from the same database? If so, why not use a single ExecuteSQL processor that selects all the columns for you.
2. If there are common IDs and the CSV tables are on the small side, you can load one of them into a Lookup Service and do a lookup with LookupRecord.
3. The final option is to write a shell or Python script to perform the operation and execute it from NiFi with ExecuteProcess (a sketch of that approach is below).
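Here's a minimal sketch of that script approach using pandas; the file names and the shared "id" column are assumptions, so adjust them to your data:

import pandas as pd

# Read the two CSVs produced upstream (file names are placeholders)
left = pd.read_csv("file_a.csv")
right = pd.read_csv("file_b.csv")

# Join on the assumed common "id" column; use how="outer" if rows can be missing on either side
merged = left.merge(right, on="id", how="inner")
merged.to_csv("merged.csv", index=False)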
04-15-2020
03:13 PM
You'll need to create a Kafka producer that reads the tail of the Ranger log files and pushes that to a Kafka topic. Here is an example of an implementation in Python.
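A minimal sketch of that idea using the kafka-python client is below; the broker address, topic name, and Ranger log path are assumptions for illustration:

import time
from kafka import KafkaProducer

# Assumed broker address; adjust to your cluster
producer = KafkaProducer(bootstrap_servers="broker1:9092")

def tail_to_kafka(path, topic):
    """Follow a log file (like tail -f) and publish each new line to a Kafka topic."""
    with open(path) as f:
        f.seek(0, 2)                 # start at the current end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)      # wait for new log lines to appear
                continue
            producer.send(topic, line.encode("utf-8"))

# Assumed log location and topic name
tail_to_kafka("/var/log/ranger/admin/access.log", "ranger-audit")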
04-15-2020
12:55 PM
2 Kudos
You will need to get both the year and the month of the original date (because of leap year considerations). In terms of how to accomplish this in NiFi, one way is to use the ExecuteScript processor and Python's calendar library, whose calendar.monthrange(year, month) function returns the number of days in the month (i.e. the last day of the month) as the second element of its result.
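For example, a quick sketch of that call:

import calendar
from datetime import date

year, month = 2020, 2
# monthrange returns (weekday of the first day, number of days in the month)
_, last_day = calendar.monthrange(year, month)
print(date(year, month, last_day))   # 2020-02-29 (leap year handled for you)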
04-15-2020
08:16 AM
1 Kudo
You can try the ReplaceText NiFi processor with the approach described here. That will be a clean way of doing what you want without much scripting.
04-15-2020
08:10 AM
1 Kudo
This is the offending code:

new = year+month+day+hour
# Considering date is in mm/dd/yyyy format
# converting the appended list to strings instead of ints
b = [str(x) for x in new]
# joining all the data without adding
b = '/'.join(b)
# convert to unix
dt_object2 = datetime.strptime(b, "%Y/%m/%d/%H")

It looks like at some point the values of year, month, day, and hour are set to the strings "Year", "month", "DOY", "Hour". Then when new = year+month+day+hour is called, the strings get concatenated into "YearmonthDOYHour". You then split and join that string, which is why you see a '/' character between each character in the Python error message. I'll leave it to you to debug this, as I've lost track of all the code changes at this point. Also note that the incoming data may be providing you with Day of Year (DOY) instead of day of month, which is what %d expects. You may need to use %j to parse that out, with zero padding (see the documentation here). If this is helpful, don't forget to give kudos or accept the solution.
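For reference, a small sketch of building the timestamp from separate string values using %j for Day of Year; the sample values are made up for illustration:

from datetime import datetime

# Assumed incoming values as strings: Year, Day of Year (DOY), Hour
year, doy, hour = "2020", "106", "8"

# Join with '/' once, then parse; %j is day of year (001-366), so zero-pad it
b = "/".join([year, doy.zfill(3), hour.zfill(2)])
dt_object2 = datetime.strptime(b, "%Y/%j/%H")
print(dt_object2)   # 2020-04-15 08:00:00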
04-15-2020
07:37 AM
1 Kudo
Once you call IOUtils.toString, you get the text variable containing your message(s). Then it is appropriate to call json.loads on that text variable, as that is the function that converts the JSON text structure into a Python object (e.g. a dict). It should be something like this:

text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
json_data = json.loads(text)

After this you should be able to access the elements of the JSON with json_data['Year']. Let me know if that works.
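Outside of NiFi, the same parsing step looks like this; the sample payload is made up to mirror your field names:

import json

text = '{"Year": 2020, "month": 4, "DOY": 106, "Hour": 8}'   # assumed sample payload
json_data = json.loads(text)      # JSON text -> Python dict
print(json_data['Year'])          # 2020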
04-14-2020
05:28 PM
Can you show the output of the print statement for your dataframe df? That way we can tell how Spark interprets the read and whether the problem is with the read or the write operation.
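If it helps, here is a minimal way to print that information; the input path and format are placeholders for whatever your read step actually uses:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-read").getOrCreate()

# Placeholder read; substitute your actual source and options
df = spark.read.option("header", "true").csv("/tmp/input.csv")

df.printSchema()               # column names and types as Spark inferred them
df.show(5, truncate=False)     # first few rows exactly as Spark parsed them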