Member since: 02-04-2016
Posts: 189
Kudos Received: 70
Solutions: 9
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3895 | 07-12-2018 01:58 PM
| 8087 | 03-08-2018 10:44 AM
| 3949 | 06-24-2017 11:18 AM
| 23671 | 02-10-2017 04:54 PM
| 2367 | 01-19-2017 01:41 PM
02-27-2018
09:21 PM
Here's my scenario: I have an S3 bucket full of partitioned production data:

    data_day=01-01-2017/000000_0
    data_day=01-01-2017/000000_1
    data_day=01-02-2017/000000_0
    data_day=01-02-2017/000000_1
    ... etc

I spin up an EMR cluster, pull down some dirty data, and clean it up, including de-duplicating it against the prod data. Now, on my cluster, in HDFS, I have maybe:

    data_day=01-01-2017/000000_0
    data_day=01-02-2017/000000_0

This represents new data. I know that I can create a table, point its "location" at the bucket described above, and do an "insert into" or an "insert overwrite", but this is very slow: it uses one reducer that copies ALL the new data. Instead, I want to use s3-dist-cp, which updates the data much more quickly. However, my 000000_0 chunks will then overwrite the old ones.

I have a script that renames the chunks (000000_0 -> BCF704E2-B8A7-4F71-8747-A68AD52E50B7), but it takes about 3 seconds per partition, which adds up to over an hour. So, here's my question: is there an HDFS setting to change the way the chunks are named? For example, can I force the chunks to be named using the date or a GUID? Thanks in advance
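I don't know of an HDFS setting that renames completed chunks, but the hour-long serial rename can likely be collapsed by running the renames in parallel with `xargs -P`. Below is a minimal local-filesystem sketch of the pattern; all paths are hypothetical, and `uuidgen` (from util-linux) is assumed to be installed. On a real cluster you would swap `find` for `hdfs dfs -ls -R` and `mv` for `hdfs dfs -mv`.

```shell
# Local-filesystem sketch of parallel GUID renaming; paths are hypothetical.
rm -rf /tmp/rename_demo
mkdir -p '/tmp/rename_demo/data_day=01-01-2017' '/tmp/rename_demo/data_day=01-02-2017'
touch '/tmp/rename_demo/data_day=01-01-2017/000000_0' '/tmp/rename_demo/data_day=01-02-2017/000000_0'

# Rename every chunk to a GUID, 16 renames at a time instead of one by one.
find /tmp/rename_demo -type f -name '000000_*' -print0 | \
  xargs -0 -P 16 -I{} sh -c 'f="$1"; mv "$f" "${f%/*}/$(uuidgen)"' _ {}
```

Since each rename is an independent metadata operation, running 16 at once should cut the wall-clock time roughly 16-fold compared with the 3-seconds-per-partition serial loop.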
Labels:
- Apache Hadoop
01-04-2018
11:54 AM
Hey everyone, I have a somewhat similar question, which I posted here: https://community.hortonworks.com/questions/155681/how-to-defragment-hdfs-data.html I would really appreciate any ideas. cc @Lester Martin @Jagatheesh Ramakrishnan @rbiswas
01-03-2018
07:59 PM
Suppose a scenario with a Hive table that is partitioned by day ("day=2017-12-12"), and suppose some process pushes data to the file store behind this table (new data under "day=2017-12-12", "day=2017-12-13", etc). The "msck repair table" command updates the metastore to recognize all the new "chunks", and the data correctly shows up in queries.

But suppose these chunks are mostly very small: is there a simple command to consolidate them, so that instead of 100 small files under a partition I get 2 well-sized ones? I recognize that I can create a copy of the table and accomplish this that way, but that seems pretty clumsy. Is there some kind of HDFS command to "defrag" the data?

FWIW, I'm using EMR with data in S3. Thanks in advance.
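For what it's worth, a sketch of one approach that avoids a second table: have Hive rewrite the partitions onto themselves and merge small files on the way out via its merge settings. The table and partition-column names below are placeholders, and the size thresholds (~256 MB target, ~128 MB average trigger) are illustrative.

```shell
# Sketch only: rewrites my_table onto itself with dynamic partitioning,
# letting Hive's merge step combine small files into larger outputs.
hive -e "
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  set hive.merge.tezfiles=true;
  set hive.merge.size.per.task=268435456;
  set hive.merge.smallfiles.avgsize=134217728;
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  insert overwrite table my_table partition (day)
  select * from my_table;
"
```

If the table is stored as ORC, `ALTER TABLE my_table PARTITION (day='2017-12-12') CONCATENATE;` can merge one partition's small files without a full rewrite.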
Labels:
- Apache Hive
11-11-2017
11:42 AM
I have a similar question. In my case, I need to connect to Hive using a SAS tool that only provides me with the following fields:

- Host(s)
- Port
- Database

There is also a tool to add "server side properties", which creates a list of key/value pairs. Can anyone tell me what server-side properties I can use to force this connection to always use a specific queue? Or is there a way to associate this connection with a user, and associate that user with a key/value pair?
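I can't speak for the SAS side, but the queue is normally picked up from execution-engine properties, so here is a hedged guess at what to put in that "server side properties" list. The queue name `my_queue` is a placeholder; which key matters depends on whether Hive runs on Tez or classic MapReduce.

```properties
# If Hive runs on Tez:
tez.queue.name=my_queue
# If Hive runs on classic MapReduce:
mapreduce.job.queuename=my_queue
```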
08-01-2017
10:52 AM
I often need to export data from Hive to CSV files so that I can share them with folks; usually they will ultimately import the CSV data into some sort of standard DB. Currently, I use a CLI command like this:

    hive -e 'set hive.cli.print.header=true; select * from blah where condition' | sed 's/[\t]/,/g' > myfile.csv

However, when I do it this way, null values actually get printed as "NULL". For example, an output row might be:

    0,true,NULL,1,0,'my string',NULL,etc

So, my question: what can I add to my command to replace those NULL entries with just an empty string? In other words, how do I instead get this:

    0,true,,1,0,'my string',,etc
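One shell-side option: replace the sed with awk and blank out any field that is exactly NULL. (A blanket `sed 's/NULL//g'` would also mangle strings that merely contain the word NULL.) A sketch using a printf stand-in for the Hive output; in the real pipeline the printf line becomes the `hive -e '...'` command:

```shell
# Stand-in for tab-separated Hive CLI output; swap in `hive -e '...'` for real use.
printf '0\ttrue\tNULL\t1\t0\tmy string\tNULL\n' | \
  awk -F'\t' 'BEGIN{OFS=","} {for (i=1; i<=NF; i++) if ($i == "NULL") $i=""; $1=$1; print}'
# → 0,true,,1,0,my string,
```

The `$1=$1` assignment forces awk to rebuild the record with the comma output separator even on rows that contain no NULLs.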
Labels:
- Apache Hive
07-11-2017
10:29 AM
Suppose I have a Hive query like this:

    insert into table my_table select 1, ${hiveconf:my_variable} from some_other_table;

Is there a config setting so that the value of "my_variable" will be displayed in the logging or in the verbose output of the query? All I ever see is "${hiveconf:my_variable}". Thanks!
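I'm not aware of a logging switch that prints substituted values, but one low-tech sketch is to ask Hive to echo the variable before the statement runs; `set <name>;` prints that property's current value to the console. The variable name and value below are placeholders.

```shell
hive --hiveconf my_variable=42 -e '
  set hiveconf:my_variable;
  insert into table my_table select 1, ${hiveconf:my_variable} from some_other_table;
'
```

The `set hiveconf:my_variable;` line should make the substituted value (here 42) show up in the CLI output alongside the query.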
Labels:
- Apache Hive
06-29-2017
11:24 AM
Right. I didn't mean to imply a relationship between Druid and Slider. Just meant: "Also want to understand how to assign processes to servers during installation of Druid." Thanks!