Member since: 12-09-2015
Posts: 106
Kudos Received: 40
Solutions: 20
My Accepted Solutions
Title | Views | Posted
---|---|---
| 1637 | 12-26-2018 08:07 PM
| 1228 | 08-17-2018 06:12 PM
| 749 | 08-09-2018 08:35 PM
| 7954 | 01-03-2018 12:31 AM
| 472 | 11-07-2017 05:53 PM
09-12-2017
09:47 PM
When choosing the number of buckets I would think about your "largest" write, since the parallelism of the write is limited to the bucket count. Note that update/delete (and merge) are writes. From the read side I would optimize the bucket count for reading a fully (major) compacted table, though since ACID tables require a system-defined sort order they do not support SMB joins, and I'm not sure what else can benefit from bucketing. Since the writes vary greatly, I would not worry much about file sizes in delta directories (meaning I don't think there is a good way to get this right) and instead make sure that compaction runs frequently enough to mitigate this. hive.tez.dynamic.semijoin.reduction should be available in Hive 2 on 2.6.1. The implementation of Merge uses a Right Outer Join (if you have an Insert clause), and semijoin reduction is designed to help this when the inner side (the target of the merge) is large.
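As a purely hypothetical sketch (the table, columns and bucket count are made up for illustration), the two pieces would look something like this:

-- bucket count sized for the largest expected write; table and columns are placeholders
CREATE TABLE target_tbl (id BIGINT, val STRING)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- semijoin reduction to help the Right Outer Join that Merge generates
SET hive.tez.dynamic.semijoin.reduction=true;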
09-12-2017
08:46 PM
1 Kudo
What version are you on? Is hive.tez.dynamic.semijoin.reduction available/enabled? Small files are a transient issue - compaction will merge them into fewer. A side note: hive.merge.cardinality.check=false is probably a bad idea. It should make very little difference for perf but could lead to data corruption if the condition it checks for is violated (i.e. if more than 1 row from the source matches the same row on the target).
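For reference, assuming a recent Hive 2 build, the session-level settings I mean are:

-- check/enable semijoin reduction
SET hive.tez.dynamic.semijoin.reduction=true;
-- keep the cardinality check on; disabling it trades a small perf gain for possible corruption
SET hive.merge.cardinality.check=true;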
09-06-2017
11:07 PM
1 Kudo
ACID tables require a system-determined sort order, so you should not specify Sort By. Also, since ACID tables have to be bucketed, the system determines which rows go to which writer based on the "Clustered By (...) into N buckets" clause of the DDL, so you should not need Distribute By either.
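To illustrate (names are placeholders, not your actual schema): the bucketing lives in the DDL, and the insert needs no Distribute By or Sort By:

CREATE TABLE acid_target (id BIGINT, val STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- no DISTRIBUTE BY / SORT BY: the system routes rows to the bucket writers itself
INSERT INTO TABLE acid_target SELECT id, val FROM staging_src;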
08-18-2017
06:47 PM
1 Kudo
key may be a reserved keyword - have you tried quoting it? You also want to list the columns from the source table explicitly so that you have the same number of projections in the source query and the target table.
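Something along these lines (I'm assuming a simple two-column layout since I don't have your DDL):

-- backticks quote the reserved word; explicit column lists keep source and target aligned
INSERT INTO TABLE target_tbl (`key`, val)
SELECT s.`key`, s.val FROM source_tbl s;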
08-02-2017
06:29 PM
Since you already created directories in the delta_23569546_23569546_0000 format, the compactor can't understand them. If for each X in delta_X_X you only have 1 directory (which should be the case), you can just rename it by stripping the suffix. This should let the compactor proceed. This will interfere with ongoing queries, of course.
08-01-2017
03:54 PM
1 Kudo
This suffix is a feature when you are using LLAP and there is no way to avoid it. Is upgrading to HDP 2.6 an option? The compactor in 2.6 is able to handle it. If you make the target table transactional=false, it won't create any delta directories. If you use transactional=true but don't go through LLAP on 2.5, you won't see this suffix.
07-13-2017
06:19 PM
If the file is not splittable it will be processed by 1 task.
07-10-2017
05:21 PM
1 Kudo
This is not supported. Transactional table data cannot simply be copied from cluster to cluster. Each cluster maintains a global transaction ID sequence which is embedded in the data files and file names of transactional tables, so copying the data files confuses the target system. The only way to do this right now is to copy the data to a non-ACID table on the source cluster using "Insert ... Select ..." and then use import/export to transfer it to the target side.
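A rough sketch of that workaround, with placeholder names and paths:

-- on the source cluster: stage the data in a plain (non-ACID) table
CREATE TABLE plain_copy (id BIGINT, val STRING) STORED AS ORC;
INSERT INTO TABLE plain_copy SELECT * FROM acid_table;
EXPORT TABLE plain_copy TO '/tmp/plain_copy_export';

-- after copying the export directory over, on the target cluster:
IMPORT TABLE plain_copy FROM '/tmp/plain_copy_export';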
07-10-2017
05:16 PM
3 Kudos
ACID in Hive is enabled globally and per table. There is no such thing as enabling it per job. Existing queries will not be affected if they started before ACID was enabled.
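For reference, a minimal sketch of what "globally and per table" means (the table is hypothetical; the settings are the standard ones):

-- global side: enable the transaction manager (hive-site.xml or session)
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- per-table side: the table itself must be declared transactional
CREATE TABLE acid_example (id INT, val STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');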
07-03-2017
08:32 PM
Have you considered the SQL Merge statement? It's designed specifically for this.
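A minimal sketch of what that could look like, with made-up table and column names and an assumed 'op' flag on the source:

MERGE INTO target_tbl AS t
USING updates_src AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE             -- delete flagged rows
WHEN MATCHED THEN UPDATE SET val = s.val            -- update the rest of the matches
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);  -- insert new rows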
06-22-2017
02:29 PM
Is your target table partitioned? If so, have you tried hive.optimize.sort.dynamic.partition? Providing the target table DDL would be useful.
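The setting I have in mind, in case it helps:

-- sorts rows by the dynamic partition key so each writer keeps only one file open at a time
SET hive.optimize.sort.dynamic.partition=true;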
06-16-2017
09:05 PM
Only managed tables can be transactional - not external.
05-23-2017
03:57 PM
1 Kudo
The advantages depend on your particular use case. If you use a single SQL Insert over JDBC to add a large batch once an hour, for example, versus the streaming API doing the same, there won't be any advantage. If you have batches that you want to insert every minute, for example, then streaming will be much better. Generally, if your data is available as a continuous stream, streaming will allow you to land it in the target table at very small time intervals and make it immediately visible to readers. The same can't be done efficiently with JDBC. The streaming API has been integrated with NiFi, Flume and Storm - so there are ready-made tools for ingesting event streams into Hive versus the build-it-yourself JDBC approach.
05-15-2017
06:22 PM
Hive streaming API is a Java library so it should be possible to use it from any Java process.
04-19-2017
07:43 PM
2 Kudos
The short answer is you can ignore these. When you are using Hive Streaming Ingest (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest), which is used by Storm/Flume/NiFi, Hive creates these files for its internal housekeeping to maintain transactional consistency. They should normally be removed as soon as the TransactionBatch is closed (usually once transaction Y of delta_X_Y/ finishes). The flush_length file may remain around if the writer process crashes before the TransactionBatch is closed. They will eventually be cleaned up by the Compactor process.
04-04-2017
03:17 PM
Could you give the exact table DDL and "ls -R" directory listing of the partition?
03-28-2017
04:10 PM
This is not supported yet.
03-21-2017
02:18 AM
1 Kudo
I'm a developer of this system - I believe this is safe (though I've not run a full test to prove it).
03-16-2017
03:51 PM
It may be simpler to modify your table to add a primary key to NEXT_COMPACTION_QUEUE - any synthetic PK will work.
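As a rough sketch only, assuming the metastore is backed by MySQL (the table name is the one from your question; adapt the syntax to your RDBMS and back up the metastore DB first):

-- run against the metastore database, not through Hive
ALTER TABLE NEXT_COMPACTION_QUEUE
  ADD COLUMN SYNTHETIC_PK BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY;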
03-14-2017
06:29 PM
This is not streaming, but the SQL Merge command may be useful here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge
03-06-2017
06:38 PM
Order By produces a total order for the result set. Sort By sorts the output of each reducer independently, so in general this will not produce the same answer.
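A tiny illustration with a hypothetical table t:

-- total order over the whole result set
SELECT id, val FROM t ORDER BY id;

-- each reducer's output is sorted on its own, so the overall result is generally not in order
SELECT id, val FROM t SORT BY id;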
02-23-2017
07:33 PM
2 Kudos
PutHiveStreaming relies on the Streaming API, which has 2 relevant concepts: the number of events per transaction and the number of transactions per batch. Generally, the more events you write per transaction, the faster the ingest. I don't see the first of these properties in the NiFi doc referenced above - perhaps there is some NiFi-specific property that controls this.
02-17-2017
05:03 PM
Can you perhaps change the column names but still provide the types? Without them it's not possible to set up a repro case.
02-16-2017
09:59 PM
@Jeet Shah, could you provide table definition, Hive version and query run please?
02-16-2017
05:08 PM
This _tmp file should be created in the Mapper of the compaction job. Is there anything about it in the job logs?
02-16-2017
04:55 PM
@Davide Ferrari could you post "dfs -lsr" of your table dir please? Are you able to see the _tmp... file?
02-16-2017
04:43 PM
This looks like https://issues.apache.org/jira/browse/HIVE-15309, in which case it can be ignored.
02-16-2017
04:00 PM
1 Kudo
This has been fixed in https://issues.apache.org/jira/browse/HIVE-15181. In the meantime you can set hive.direct.sql.max.query.length=1 and hive.direct.sql.max.elements.in.clause=1000 for the standalone Metastore process(es).
02-15-2017
10:42 PM
@Navendu Garg, could you share your table definition (show create table T)?