Member since: 02-09-2016
Posts: 559
Kudos Received: 422
Solutions: 98
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2106 | 03-02-2018 01:19 AM
 | 3425 | 03-02-2018 01:04 AM
 | 2338 | 08-02-2017 05:40 PM
 | 2334 | 07-17-2017 05:35 PM
 | 1693 | 07-10-2017 02:49 PM
08-09-2016
02:15 PM
@Johnny Fugers In many scenarios, Hive is used much like an RDBMS, but with better scalability and flexibility. Hive scales to petabytes of data, which is difficult for a typical RDBMS. One of the big benefits Hive provides is a low barrier to entry for end users: they can use standard SQL to interact with the data. One of the most common use cases is to offload many of the data processing tasks done in a typical RDBMS and have them done in Hive instead, which frees up resources on those systems for more time-sensitive work.
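For example, a reporting query that might otherwise run on the RDBMS can be written in plain HiveQL. The table and column names below are made up purely for illustration:
-- Hypothetical aggregation that could be offloaded from an RDBMS to Hive
SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS lifetime_value
FROM orders
WHERE order_date >= '2016-01-01'
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 100;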
08-09-2016
01:32 PM
If you want to delete files and have them not go in the trash, use: hdfs dfs -rm -r -skipTrash <directory name>
08-08-2016
04:41 PM
1 Kudo
@Pedro Rodgers Pig won't automatically interpret the header line of your file, so you need to specify the "AS (field1:type, field2:type)" definition. If you just load the file, you will get the header line as a row of data, which you don't want. There are a couple of ways to deal with that, but using the CSVExcelStorage loader from PiggyBank lets you skip the header row:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER') AS (field1:int, field2:chararray);
DUMP A;
Another way is to load everything, rank the rows, and filter out the header:
input_file = LOAD 'input' USING PigStorage(',') AS (row1:chararray, row2:chararray);
ranked = RANK input_file;
NoHeader = FILTER ranked BY (rank_input_file > 1);
New_input_file = FOREACH NoHeader GENERATE row1, row2;
08-05-2016
07:11 PM
I'm not aware of an existing script in HDP that does this for you. However, I did run across this: https://github.com/nmilford/clean-hadoop-tmp Note that the script is written in Ruby; you could follow the same logic and write it in Python, Perl, or Bash.
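If you just want the basic idea, here is a rough Bash sketch of that kind of cleanup. The two-day cutoff and the /tmp path are assumptions, and it relies on the standard "hdfs dfs -ls" output format; test it against a non-critical directory before using it for real:
#!/bin/bash
# Sketch: delete HDFS /tmp entries whose modification date is older than CUTOFF_DAYS.
CUTOFF_DAYS=2
CUTOFF_DATE=$(date -d "-${CUTOFF_DAYS} days" +%Y-%m-%d)
hdfs dfs -ls /tmp | tail -n +2 | while read -r perms repl owner group size mdate mtime path; do
  # The listing's date column is YYYY-MM-DD, so a string comparison works for "older than".
  if [[ "$mdate" < "$CUTOFF_DATE" ]]; then
    echo "Deleting $path (last modified $mdate)"
    hdfs dfs -rm -r -skipTrash "$path"
  fi
done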
08-05-2016
02:26 PM
1 Kudo
@Avijeet Dash You may find this HCC Article helpful: https://community.hortonworks.com/articles/49505/how-to-correctly-setup-the-hdfs-encryption-using-r.html
08-04-2016
03:12 PM
@sankar rao The files stored in /tmp should be automatically removed when a job finishes. However, if the job does not finish properly (due to an error or some other problem), the files may not always be deleted. See here: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind. The configuration details are as follows:
On the HDFS cluster, this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir.
On the client machine, this is hardcoded to /tmp/<username>.
Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases, whether tables are stored in HDFS (the normal case) or in file systems like S3 or even NFS.
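If you need to move the scratch space off /tmp, hive.exec.scratchdir can be overridden in hive-site.xml (or through Ambari). The path below is only an example:
<property>
  <name>hive.exec.scratchdir</name>
  <value>/apps/hive/scratch</value>  <!-- example path; pick one your users can write to -->
</property>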
08-04-2016
03:05 PM
@Gnanasekaran G As @Benjamin Leonhardi said, there is very little overhead to using an external table to do this. The only thing stored in the Hive Metastore is the schema for the CSV and a pointer to where the data lives on HDFS; the data itself is left where you put it. Using an external table is a very common way of solving this problem.
Having said that, you can use Pig to load CSV data directly from HDFS. You have to define the schema for the CSV within the Pig script, and you can write the data to a Hive ORC table. Be aware that the Hive ORC table must be created before you can write to it with Pig. Here is a tutorial that covers this: http://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/
Here is an example of loading CSV data via Pig:
STOCK_A = LOAD '/user/maria_dev/NYSE_daily_prices_A.csv' USING PigStorage(',')
    AS (exchange:chararray, symbol:chararray, date:chararray,
        open:float, high:float, low:float, close:float, volume:int, adj_close:float);
DESCRIBE STOCK_A;
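To complete the flow described above, you would then store the relation into the pre-created Hive ORC table. The table name below is just a placeholder, and because this uses HCatalog's storer the script has to be run with pig -useHCatalog:
-- Assumes a Hive ORC table named stock_a already exists with a matching schema
STORE STOCK_A INTO 'stock_a' USING org.apache.hive.hcatalog.pig.HCatStorer();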
08-04-2016
02:02 PM
@sankar rao The files in /tmp are used as a temporary staging location while jobs are running. In my experience, if all of your jobs have completed and the files are more than a day or two old, you can delete them without issue.
08-04-2016
01:51 PM
@Prasanna Kulkarni It looks like there are JIRAs for this, but they are not resolved and there hasn't been any recent activity: https://issues.apache.org/jira/browse/HIVE-6897 https://issues.apache.org/jira/browse/HCATALOG-551
08-02-2016
06:05 PM
According to the documentation, you can't update partitioning or bucketing columns: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
Partitioning columns cannot be updated.
Bucketing columns cannot be updated.
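As a concrete illustration (the table and column names here are made up), an UPDATE on a regular column of an ACID table is fine, while the same statement against the partitioning column is rejected:
-- customers is assumed to be a transactional (ACID) table partitioned by country
UPDATE customers SET status = 'inactive' WHERE id = 42;  -- allowed: status is a regular column
UPDATE customers SET country = 'US' WHERE id = 42;       -- rejected: country is a partitioning column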