Member since: 02-04-2016
Posts: 189
Kudos Received: 70
Solutions: 9
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3895 | 07-12-2018 01:58 PM
| 8087 | 03-08-2018 10:44 AM
| 3949 | 06-24-2017 11:18 AM
| 23671 | 02-10-2017 04:54 PM
| 2367 | 01-19-2017 01:41 PM
02-27-2018
09:21 PM
Here's my scenario: I have an S3 bucket full of partitioned production data:

    data_day=01-01-2017/000000_0
    data_day=01-01-2017/000000_1
    data_day=01-02-2017/000000_0
    data_day=01-02-2017/000000_1
    ... etc

I spin up an EMR cluster, pull down some dirty data, and clean it up, including de-duplicating it against the prod data. Now, on my cluster, in HDFS, I have maybe:

    data_day=01-01-2017/000000_0
    data_day=01-02-2017/000000_0

This represents new data. I know that I can create a table, point its "location" at the bucket described above, and do an "insert into" or an "insert overwrite", but this is very slow: it uses one reducer that copies ALL the new data. Instead, I want to use s3-dist-cp, which updates the data much more quickly. However, my 000000_0 chunks will then overwrite the old ones.

I have a script that renames the chunks (000000_0 -> BCF704E2-B8A7-4F71-8747-A68AD52E50B7), but it takes about 3 seconds per partition, which adds up to over an hour. So, here's my question: is there an HDFS setting to change the way the chunks are named? For example, can I force the chunks to be named using the date or a GUID? Thanks in advance
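I don't know of an HDFS setting that renames completed chunks, but the hour-long serial rename can likely be collapsed by running the renames in parallel with `xargs -P`. Below is a minimal local-filesystem sketch of the pattern; all paths are hypothetical, and `uuidgen` (from util-linux) is assumed to be installed. On a real cluster you would swap `find` for `hdfs dfs -ls -R` and `mv` for `hdfs dfs -mv`.

```shell
# Local-filesystem sketch of parallel GUID renaming; paths are hypothetical.
rm -rf /tmp/rename_demo
mkdir -p '/tmp/rename_demo/data_day=01-01-2017' '/tmp/rename_demo/data_day=01-02-2017'
touch '/tmp/rename_demo/data_day=01-01-2017/000000_0' '/tmp/rename_demo/data_day=01-02-2017/000000_0'

# Rename every chunk to a GUID, 16 renames at a time instead of one by one.
find /tmp/rename_demo -type f -name '000000_*' -print0 | \
  xargs -0 -P 16 -I{} sh -c 'f="$1"; mv "$f" "${f%/*}/$(uuidgen)"' _ {}
```

Since each rename is an independent metadata operation, running 16 at once should cut the wall-clock time roughly 16-fold compared with the 3-seconds-per-partition serial loop.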
Labels:
- Apache Hadoop
01-04-2018
11:54 AM
Hey everyone, I have a somewhat similar question, which I posted here: https://community.hortonworks.com/questions/155681/how-to-defragment-hdfs-data.html I would really appreciate any ideas. cc @Lester Martin @Jagatheesh Ramakrishnan @rbiswas
01-03-2018
07:59 PM
Suppose a scenario with a Hive table that is partitioned by day ("day=2017-12-12"), and suppose some process pushes data to the file store behind this table (new data under "day=2017-12-12", "day=2017-12-13", etc). The "msck repair table" command updates the metastore to recognize all the new "chunks", and the data correctly shows up in queries.

But suppose these chunks are mostly very small: is there a simple command to consolidate them, so that instead of 100 small files under a partition I get 2 well-sized ones? I recognize that I can create a copy of the table and accomplish this that way, but that seems pretty clumsy. Is there some kind of HDFS command to "defrag" the data?

FWIW, I'm using EMR with data in S3. Thanks in advance.
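For what it's worth, a sketch of one approach that avoids a second table: have Hive rewrite the partitions onto themselves and merge small files on the way out via its merge settings. The table and partition-column names below are placeholders, and the size thresholds (~256 MB target, ~128 MB average trigger) are illustrative.

```shell
# Sketch only: rewrites my_table onto itself with dynamic partitioning,
# letting Hive's merge step combine small files into larger outputs.
hive -e "
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  set hive.merge.tezfiles=true;
  set hive.merge.size.per.task=268435456;
  set hive.merge.smallfiles.avgsize=134217728;
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  insert overwrite table my_table partition (day)
  select * from my_table;
"
```

If the table is stored as ORC, `ALTER TABLE my_table PARTITION (day='2017-12-12') CONCATENATE;` can merge one partition's small files without a full rewrite.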
Labels:
- Apache Hive
11-11-2017
11:42 AM
I have a similar question. In my case, I need to connect to Hive using a SAS tool that only provides me with the following fields:

- Host(s)
- Port
- Database

There is also a tool to add "server side properties", which creates a list of key/value pairs. Can anyone tell me what server-side properties I can use to force this connection to always use a specific queue? Or is there a way to associate this connection with a user, and associate that user with a key/value pair?
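I can't speak for the SAS side, but the queue is normally picked up from execution-engine properties, so here is a hedged guess at what to put in that "server side properties" list. The queue name `my_queue` is a placeholder; which key matters depends on whether Hive runs on Tez or classic MapReduce.

```properties
# If Hive runs on Tez:
tez.queue.name=my_queue
# If Hive runs on classic MapReduce:
mapreduce.job.queuename=my_queue
```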
08-01-2017
10:52 AM
I often need to export data from Hive to CSV files so that I can share them with folks; usually they will ultimately import the CSV data into some sort of standard DB. Currently, I use a CLI command like this:

    hive -e 'set hive.cli.print.header=true; select * from blah where condition' | sed 's/[\t]/,/g' > myfile.csv

However, when I do it this way, null values actually get printed as "NULL". For example, an output row might be:

    0,true,NULL,1,0,'my string',NULL,etc

So, my question: what can I add to my command to replace those NULL entries with just an empty string? In other words, how do I instead get this:

    0,true,,1,0,'my string',,etc
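One shell-side option: replace the sed with awk and blank out any field that is exactly NULL. (A blanket `sed 's/NULL//g'` would also mangle strings that merely contain the word NULL.) A sketch using a printf stand-in for the Hive output; in the real pipeline the printf line becomes the `hive -e '...'` command:

```shell
# Stand-in for tab-separated Hive CLI output; swap in `hive -e '...'` for real use.
printf '0\ttrue\tNULL\t1\t0\tmy string\tNULL\n' | \
  awk -F'\t' 'BEGIN{OFS=","} {for (i=1; i<=NF; i++) if ($i == "NULL") $i=""; $1=$1; print}'
# → 0,true,,1,0,my string,
```

The `$1=$1` assignment forces awk to rebuild the record with the comma output separator even on rows that contain no NULLs.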
Labels:
- Apache Hive
07-11-2017
10:29 AM
Suppose I have a Hive query like this:

    insert into table my_table select 1, ${hiveconf:my_variable} from some_other_table;

Is there a config setting so that the value of "my_variable" will be displayed in the logging or in the verbose output of the query? All I ever see is "${hiveconf:my_variable}". Thanks!
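I'm not aware of a logging switch that prints substituted values, but one low-tech sketch is to ask Hive to echo the variable before the statement runs; `set <name>;` prints that property's current value to the console. The variable name and value below are placeholders.

```shell
hive --hiveconf my_variable=42 -e '
  set hiveconf:my_variable;
  insert into table my_table select 1, ${hiveconf:my_variable} from some_other_table;
'
```

The `set hiveconf:my_variable;` line should make the substituted value (here 42) show up in the CLI output alongside the query.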
Labels:
- Apache Hive
06-29-2017
11:24 AM
Right. I didn't mean to imply a relationship between Druid and Slider. Just meant: "Also want to understand how to assign processes to servers during installation of Druid." Thanks!