Member since: 12-09-2015
Posts: 35
Kudos Received: 13
Solutions: 1

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 1874 | 06-27-2016 01:05 AM |
04-10-2017
10:38 PM
Thanks for clearing it up @Bryan Bende, much appreciated.
04-10-2017
10:36 PM
Thanks @Matt Clarke, much appreciated.
04-10-2017
04:15 AM
1 Kudo
Hi @Srini Nalluri. I am not 100% sure I have understood your question correctly, but are you mostly interested in manipulating the attributes of the data you are passing through as flowfiles, rather than the data itself? If so, have you seen the NiFi Expression Language Guide at https://nifi.apache.org/docs.html (under General -> Expression Language Guide)? It describes the ways in which we can manipulate attributes.

For 1, you could use an UpdateAttribute processor and the jsonPath function to pull individual values out of your JSON attribute and assign them to new attributes; there are some good examples in the language guide.

For 2, you could set the Put Response Body In Attribute property in your InvokeHTTP processor to store the response as an attribute, and then use the jsonPath expression language function in an UpdateAttribute processor to evaluate it.

For 3, in an UpdateAttribute processor you could use the append/prepend functions to merge or concatenate two attribute values together (where attribute1 and attribute2 have previously been set).

Or, there is an AttributesToJSON processor https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.AttributesToJSON/index.html if you need to get your new attributes back into JSON format. Hope it helps!
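To give a rough, untested illustration of those functions (the attribute names here are made up, not from your flow): for 1 or 2, an UpdateAttribute property value such as ${restResponse:jsonPath('$.id')} would pull the id field out of a JSON attribute named restResponse, and for 3, ${attribute1:append(${attribute2})} would concatenate the two previously-set attributes into a single value.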
04-10-2017
01:34 AM
1 Kudo
Currently, any NiFi templates that are created are not stored anywhere outside of flow.xml.gz unless they are explicitly downloaded using the NiFi UI. I can see them within my flow.xml.gz file and can still download and import them using the UI, so I am not experiencing any issues with the template functionality itself. However, it was my understanding that active templates would automatically be persisted in the conf/templates directory, or otherwise in a custom location set via the nifi.templates.directory property. Is this the correct behaviour? Or is the templates directory more of a convenience location for manually storing downloaded templates? I am on NiFi version 1.1.2. Thanks!
Labels:
- Apache NiFi
10-17-2016
12:51 AM
Hi @Ashish Vishnoi. In the sqoop-export doco, it says:

"The --input-null-string and --input-null-non-string arguments are optional. If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns. Note that, the empty string will be always interpreted as null for non-string columns, in addition to other string if specified by --input-null-non-string"

There is another discussion here around a similar issue. If it doesn't work even for string columns, it may be that a workaround of some kind is needed, e.g. converting blanks to another character (or set of characters that wouldn't normally be part of your data set) prior to export, then converting them back to blanks once in Teradata. Hope this helps.
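As a rough sketch of how those arguments might be applied on the export side (the connection details, table and paths below are made up, not taken from your job):

```bash
# Hypothetical sqoop export: treat the literal string "\N" and empty strings
# as SQL NULLs when loading the HDFS extract into the target Teradata table.
sqoop export \
  --connect "$TERADATA_JDBC_URL" \
  --username "$DB_USER" -P \
  --table TARGET_TABLE \
  --export-dir /data/export/target_table \
  --input-fields-terminated-by ',' \
  --input-null-string '\\N' \
  --input-null-non-string '\\N'
```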
10-10-2016
12:22 AM
1 Kudo
When specifying fully-qualified paths to copy data between two HA clusters with DistCp, e.g. hdfs://nn1:8020/foo/bar, is the address of nn1 really referring to where the active HDFS NameNode is, or is it looking for the active ResourceManager? Thanks!
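For reference, the full command I have in mind looks something like this (host names and paths are made up):

```bash
# DistCp with fully-qualified HDFS URIs on both sides; nn1 and nn2 stand for
# the NameNode addresses of the source and destination HA clusters.
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/foo/bar
```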
Labels:
- Apache Hadoop
- Cloudera Manager
08-18-2016
05:32 AM
Hi @mqureshi. Thanks for your response. Personally I have no motivation to use Federation; I am just curious about it, as I see it mentioned occasionally and hadn't really come across a concrete example of its practical application and how that would work.
08-15-2016
12:24 AM
Hi @mqureshi. How are the clients divided up between the NameNodes? Can the whole cluster still interact fully? E.g. if Pig and Hive connect to different nameservices/NameNodes, can they still operate on the same data in HDFS?
07-29-2016
03:04 AM
Hi @Sridharan Govindaraj. Did running your command from the Unix shell instead of the HBase shell solve this issue? You might also want to fully qualify your HDFS file path:

$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns="HBASE_ROW_KEY,id,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor hdfs:///user/hbase.csv
07-14-2016
01:34 AM
Hi @Sunile Manjee, @Steven O'Neill's solution above fixed the issue. Originally, trying to retrieve information for a cell which should be null looked like this:
- REST API (going to the HBase URL): Internet Explorer would download a 0 KB file
- HBase shell get command: 1 row returned, "timestamp=14856365476857, value="

After doing a Pig filter and store for each individual column, trying to retrieve the same cell looks like this:
- REST: HTTP error 404
- HBase shell get: 0 rows returned

So the empty cells were not being automatically dropped, and HBase was storing the key for cells with no value.
07-05-2016
05:36 AM
@Sunile Manjee thank you, yes I believe so. I have read both that every cell stores a full row key, and that empty cells do not exist at all in HBase. So I think I expect the behaviour of a get command for a specified cell to bring back either a key and a non-empty value, or nothing (no key) as it "does not exist". If I request a cell that I know does not exist e.g. a real key, real column family, fake column qualifier, I get no errors but 0 rows returned. However for a cell that "doesn't exist" based on a null value, I get one row returned (the key and empty value). How does the get command know to return the key if it isn't stored as a part of that cell? Does this come from metadata rather than the cell itself? In any case, I hope this explains my confusion.
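For reference, the kind of single-cell lookup I'm describing looks roughly like this from the HBase shell (the table, row key and column names here are only illustrative):

```bash
# Fetch a single cell; per the behaviour described above, a fake qualifier
# returns 0 rows, while a "null" cell comes back as a key with an empty value.
echo "get 'mytable', 'row1', 'i:i2'" | hbase shell
```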
07-05-2016
04:36 AM
We currently have a Pig script that just loads in image (blob) data using AvroStorage with a predefined Avro schema, then stores into HBase with HBaseStorage specifying which columns to use. Each row from the original DB consists of a row ID and 5 image columns, although any number of the image columns could be empty/NULL, e.g.:

| KEY | IMG1 | IMG2 | IMG3 |
|---|---|---|---|
| 1 | null | blob | blob |
| 2 | blob | null | blob |

In the HBase table, the column family for the images is i, with column names i1, i2, etc. It was my understanding that any cells containing a NULL value would automatically not be stored in HBase; however, those cells are being stored as a key-value pair with a full key and an empty value '', rather than not existing.

Is this the normal/expected behaviour? If not, what is the best way to get around it? Do I have to project the key with each of the five columns, filter out nulls and store them individually (i.e. store 5 times)? Or could it be related to a difference in the way Avro, Pig and HBase all represent null values? Is there a simple type conversion I could do on all the image columns so that they would automatically not be stored if they are empty?

Versions: we are running plain HDP 2.1, soon to be upgraded to 2.4.
Labels:
- Apache HBase
- Apache Pig
06-30-2016
12:21 AM
I haven't tested it, but I believe using -tagFile will prepend the file name, which will place it at position 0 instead of 1, with your data columns shifted along accordingly, i.e.:

GENERATE
    (chararray)$0 AS Filename,
    (chararray)$1 AS ID,
    etc.

Hope this solves it!
06-28-2016
12:21 AM
1 Kudo
Hi @João Souza, no problem. Yes, you should still be able to use SPLIT, just with IF (date=='2016-06-23') comparing a string type instead of a date type. Hope this helps!
06-27-2016
01:05 AM
3 Kudos
There's currently no mechanism to force the name of MapReduce output files. Once you've loaded all the data and added the extra column, you can split your alias into one per date, then store each one in a different directory, e.g.:

SPLIT Src INTO
    Src23 IF date==ToDate('2016-06-23', 'yyyy-MM-dd'),
    Src24 IF date==ToDate('2016-06-24', 'yyyy-MM-dd'),
    Src25 IF date==ToDate('2016-06-25', 'yyyy-MM-dd');
STORE Src23 INTO '/data/Src/2016-06-23' using PigStorage(' ');

This way, you could merge the output files in each date directory using -getmerge (and specify the resulting file name), and then copy them back onto HDFS.

Another option is to force a reduce job to occur (yours is map-only) and set PARALLEL 1. It will be a slower job, but you will get one output file. E.g.:

Ordered23 = ORDER Src23 BY somecolumn PARALLEL 1;
STORE Ordered23 INTO '/data/Src/2016-06-23' using PigStorage(' ');

You would still have to rename the files outside of this process.
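A rough sketch of the -getmerge step mentioned above (the merged file name and target directory are made up):

```bash
# Merge the part files for one date into a single, explicitly named CSV on the
# local file system, then copy the merged file back onto HDFS.
hdfs dfs -getmerge /data/Src/2016-06-23 /tmp/src_2016-06-23.csv
hdfs dfs -put /tmp/src_2016-06-23.csv /data/merged/src_2016-06-23.csv
```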
04-28-2016
11:51 PM
Nice solution @Predrag Minovic. Simple and neat, thanks! +1
03-18-2016
12:13 AM
1 Kudo
Thanks @Rushikesh Deshmukh for your response. Using these, how would you recommend 'correcting' an existing store of data - compactions reduce the number of files per region, but how would you reduce the number of existing regions? Is this possible with the current status of merge tools?
03-15-2016
05:33 AM
3 Kudos
http://hbase.apache.org/0.94/book/important_configurations.html suggests manually managing HBase region splits. Do others in the community do this? If so:
- Do you have a use case or example regarding the steps required (including setting hbase.hregion.max.filesize), and how/what you use to implement them?
- Have you found it worthwhile in terms of effort vs benefits?
Thanks
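For context, a rough, hedged sketch of the kind of manual intervention I mean (the table name and split point below are hypothetical): set hbase.hregion.max.filesize high enough in hbase-site.xml that automatic splits effectively never trigger, then issue splits yourself from the HBase shell at row keys of your choosing:

```bash
# Manually split a region of 'my_table' at a chosen row key via the HBase shell.
echo "split 'my_table', 'rowkey_split_point'" | hbase shell
```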
Labels:
- Apache HBase
12-16-2015
12:21 AM
Hi @Chris Nauroth, thanks for the confirmation, and great to know the option has been suggested 🙂
12-16-2015
12:18 AM
Hi @Neeraj Sabharwal, thank you for the script line - looks like I will be adding that in!
12-14-2015
04:28 AM
1 Kudo
When copying files from HDFS to a local file system:

hdfs dfs -copyToLocal <source> <dest>

you have the options -crc and -ignoreCrc to turn the checksum files on/off. I am merging/copying out to local using:

hdfs dfs -getmerge <sourceDir> <destFile>

and end up with a hidden .destFile.crc file for each destFile. Is there an equivalent way to turn this function off, or otherwise automatically remove the .destFile.crc if the corresponding destFile is deleted (from the local file system)? Thank you!
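For illustration, this is where the hidden checksum file ends up and the manual clean-up I would like to avoid (paths and file names here are made up):

```bash
# -getmerge writes the merged file plus a hidden .<name>.crc alongside it;
# deleting that .crc by hand is the step I'd like to switch off or automate.
hdfs dfs -getmerge /hdfs/output/path /local/dest/extract.csv
rm -f /local/dest/.extract.csv.crc
```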
Labels:
- Apache Hadoop
12-11-2015
05:37 AM
Thanks to @Deepesh for the workaround. Also wanted to add (for info) that these steps will not be required after the HDP upgrade. We will use

ALTER TABLE activeTable CONCATENATE;

to combine the many smaller ORC files into fewer larger ones (possible from Hive 0.14+). https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate
12-11-2015
12:28 AM
Hi Deepesh, gave this a try - worked perfectly! Thank you!
12-10-2015
11:31 PM
Hi Scott, it's HDP 2.1.11 (Hive 0.13.1), and the data type is "timestamp". The DDLs are identical. I am trying to avoid storing the data as a different type, but can do this until an upgrade if necessary.
12-10-2015
06:11 AM
1 Kudo
AIM: To grab a daily extract of data stored in HDFS/Hive, process it using Pig, then make the results available externally as a single CSV file (automated using a bash script).

OPTIONS:

1. Force the output from the Pig script to be stored as one file using 'PARALLEL 1', and then copy it out using '-copyToLocal':

extractAlias = ORDER stuff BY something ASC;
STORE extractAlias INTO '/hdfs/output/path' USING CSVExcelStorage() PARALLEL 1;

2. Allow default parallelism during the Pig STORE and use '-getmerge' when copying out the extract results:

hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'

QUESTION: Which way is more efficient/practical and why? Are there any other ways?
Labels:
- Apache Hadoop
- Apache Pig
12-10-2015
04:36 AM
We have experienced an issue where (re)processing data in Hive overwrites timestamp data.
This occurs with HDP 2.1, but not 2.3.
We are using Hive to run an ad hoc 'reorg' or 'reprocess' on existing Hive tables to reduce the number of files stored - improving query performance and reducing pressure on the cluster (found a nice explanation from @david.streever here
https://community.hortonworks.com/questions/4024/how-many-files-is-too-many-on-a-modern-hdp-cluster.html).
The active Hive table is added to daily, creating at least one ORC file per day. The schema contains several timestamp columns (e.g. created_timestamp for when each record was originally created on the source system).
We then create a reorgTable with a schema identical to activeTable and copy the data from activeTable into reorgTable, which combines many of the smaller daily files and reduces the overall number.
However, this process edits/overwrites timestamp data (and does not touch other columns):
1. Contents of activeTable:

| ID | created_timestamp |
|---|---|
| 01 | 2000-01-01 13:08:21.110 |
| 02 | 1970-01-01 01:02:03.450 |
| 03 | 1990-10-08 03:09:02.780 |

2. Copy data from activeTable to reorgTable:

INSERT INTO TABLE reorgTable SELECT * FROM activeTable;

3. Contents of reorgTable:

| ID | created_timestamp |
|---|---|
| 01 | 1990-10-08 03:09:02.780 |
| 02 | 1990-10-08 03:09:02.780 |
| 03 | 1990-10-08 03:09:02.780 |
Has anyone else experienced this? Is there a solution other than upgrading?
Or an alternative way to reprocess the data that might not have the same effect?
Thank you!
Labels:
- Apache Hive