Member since: 02-09-2016
Posts: 559
Kudos Received: 422
Solutions: 98
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2106 | 03-02-2018 01:19 AM
 | 3425 | 03-02-2018 01:04 AM
 | 2338 | 08-02-2017 05:40 PM
 | 2334 | 07-17-2017 05:35 PM
 | 1693 | 07-10-2017 02:49 PM
08-09-2016
02:15 PM
@Johnny Fugers In many scenarios, Hive is used much like an RDBMS, but with better scalability and flexibility. Hive scales to petabytes of data, which is difficult for a typical RDBMS. One of the big benefits Hive provides is a low barrier to entry for end users: they can use standard SQL to interact with the data. One of the most common use cases is to offload many of the data processing tasks done in a typical RDBMS and have them done in Hive instead, which frees up resources on those systems for more time-sensitive work.
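For example, a reporting query that might otherwise run on the RDBMS can be written in plain HiveQL. The table and column names below are made up purely for illustration:
-- Hypothetical aggregation that could be offloaded from an RDBMS to Hive
SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS lifetime_value
FROM orders
WHERE order_date >= '2016-01-01'
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 100;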
08-09-2016
01:32 PM
If you want to delete files and have them not go in the trash, use: hdfs dfs -rm -r -skipTrash <directory name>
08-08-2016
04:41 PM
1 Kudo
@Pedro Rodgers Pig won't automatically interpret the header line of your file, so you need to specify the "AS (field1:type, field2:type)" definition. If you just load the file, you will get the header line as a row of data, which you don't want. There are a couple of ways to deal with that, but using the CSVExcelStorage loader from PiggyBank lets you skip the header row:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER') AS (field1:int, field2:chararray);
DUMP A;
Another way is to load everything, rank the rows, and filter out the header:
input_file = LOAD 'input' USING PigStorage(',') AS (row1:chararray, row2:chararray);
ranked = RANK input_file;
NoHeader = FILTER ranked BY (rank_input_file > 1);
New_input_file = FOREACH NoHeader GENERATE row1, row2;
08-05-2016
07:11 PM
I'm not aware of an existing script in HDP that does this for you. However, I did run across this: https://github.com/nmilford/clean-hadoop-tmp Note that the script is written in Ruby; you could follow the same logic and write it in Python, Perl, or Bash.
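If you just want the basic idea, here is a rough Bash sketch of that kind of cleanup. The two-day cutoff and the /tmp path are assumptions, and it relies on the standard "hdfs dfs -ls" output format; test it against a non-critical directory before using it for real:
#!/bin/bash
# Sketch: delete HDFS /tmp entries whose modification date is older than CUTOFF_DAYS.
CUTOFF_DAYS=2
CUTOFF_DATE=$(date -d "-${CUTOFF_DAYS} days" +%Y-%m-%d)
hdfs dfs -ls /tmp | tail -n +2 | while read -r perms repl owner group size mdate mtime path; do
  # The listing's date column is YYYY-MM-DD, so a string comparison works for "older than".
  if [[ "$mdate" < "$CUTOFF_DATE" ]]; then
    echo "Deleting $path (last modified $mdate)"
    hdfs dfs -rm -r -skipTrash "$path"
  fi
done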
08-05-2016
02:26 PM
1 Kudo
@Avijeet Dash You may find this HCC Article helpful: https://community.hortonworks.com/articles/49505/how-to-correctly-setup-the-hdfs-encryption-using-r.html
08-04-2016
03:12 PM
@sankar rao The files stored in /tmp should be automatically removed when a job finishes. However, if the job does not finish properly (due to an error or some other problem), the files may not always be deleted. See here: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind. The configuration details are as follows:
On the HDFS cluster, this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir.
On the client machine, this is hardcoded to /tmp/<username>.
Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases, whether tables are stored in HDFS (the normal case) or in file systems like S3 or even NFS.
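If you need to move the scratch space off /tmp, hive.exec.scratchdir can be overridden in hive-site.xml (or through Ambari). The path below is only an example:
<property>
  <name>hive.exec.scratchdir</name>
  <value>/apps/hive/scratch</value>  <!-- example path; pick one your users can write to -->
</property>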
08-04-2016
03:05 PM
@Gnanasekaran G As @Benjamin Leonhardi said, there is very little overhead to using an external table to do this. The only thing stored in the Hive Metastore is the schema for the CSV and a pointer to where the data lives on HDFS; the data itself is left where you put it. Using an external table is a very common way of solving this problem.
Having said that, you can use Pig to load CSV data directly from HDFS. You have to define the schema for the CSV within the Pig script, and you can write the data to a Hive ORC table. Be aware that the Hive ORC table must be created before you can write to it with Pig. Here is a tutorial that covers this: http://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/
Here is an example of loading CSV data via Pig:
STOCK_A = LOAD '/user/maria_dev/NYSE_daily_prices_A.csv' USING PigStorage(',')
    AS (exchange:chararray, symbol:chararray, date:chararray,
        open:float, high:float, low:float, close:float, volume:int, adj_close:float);
DESCRIBE STOCK_A;
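To complete the flow described above, you would then store the relation into the pre-created Hive ORC table. The table name below is just a placeholder, and because this uses HCatalog's storer the script has to be run with pig -useHCatalog:
-- Assumes a Hive ORC table named stock_a already exists with a matching schema
STORE STOCK_A INTO 'stock_a' USING org.apache.hive.hcatalog.pig.HCatStorer();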
08-04-2016
02:02 PM
@sankar rao The files in /tmp are used as a temporary staging location while jobs are running. In my experience, if all of your jobs have completed and the files are more than a day or two old, you can delete them without issue.
08-04-2016
01:51 PM
@Prasanna Kulkarni It looks like there are JIRAs for this, but they are not resolved and there hasn't been any recent activity: https://issues.apache.org/jira/browse/HIVE-6897 https://issues.apache.org/jira/browse/HCATALOG-551
08-02-2016
06:05 PM
According to the documentation, you can't update partitioning or bucketing columns: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
Partitioning columns cannot be updated.
Bucketing columns cannot be updated.
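As a concrete illustration (the table and column names here are made up), an UPDATE on a regular column of an ACID table is fine, while the same statement against the partitioning column is rejected:
-- customers is assumed to be a transactional (ACID) table partitioned by country
UPDATE customers SET status = 'inactive' WHERE id = 42;  -- allowed: status is a regular column
UPDATE customers SET country = 'US' WHERE id = 42;       -- rejected: country is a partitioning column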