Member since 08-14-2017. Posts: 7. Kudos received: 0. Solutions: 0.
02-06-2018 08:22 PM
No, it is not deterministic under Hive 1.x behavior with the default hive.support.concurrency=false. Hive 1.x has a non-ACID, ZooKeeper-based lock manager, but it makes readers wait and is not recommended. The ACID implementation doesn't block readers, but it is not available in the current HDP releases. It may also be worth looking at EXCHANGE PARTITION; however, it is not exactly atomic, it just leaves a smaller window for the non-determinism.
The way it works without locks is that the files are written to HDFS in a new directory, and then the directory is renamed. Without locking, this can lead to a race condition: queries that started on the basis of the old directory could fail (LLAP is an exception to this rule because it uses inodes, not filenames, as references).
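To make the race concrete, here is a minimal sketch that imitates the write-then-rename pattern on a local filesystem with java.nio. It is only an analogy, not Hive or HDFS code: the directory and file names (part, part_staging, part_old, 000000_0) and the object name RenameRaceSketch are made up for illustration, and moving the old directory aside stands in for whatever cleanup the real writer does.

```scala
import java.nio.file.{Files, NoSuchFileException, Path, StandardCopyOption}
import scala.util.{Failure, Try}

object RenameRaceSketch {
  // Returns true if the "reader" fails once the "writer" has swapped the directory.
  def readerFailsAfterSwap(): Boolean = {
    val root: Path = Files.createTempDirectory("warehouse")
    val live: Path = root.resolve("part") // the name readers resolve files under
    Files.createDirectory(live)
    Files.write(live.resolve("000000_0"), "old rows".getBytes("UTF-8"))

    // Reader plans its work: it remembers a file path under the live directory.
    val planned: Path = live.resolve("000000_0")

    // Writer stages the new files in a new directory, then swaps it in by rename.
    val staged: Path = root.resolve("part_staging")
    Files.createDirectory(staged)
    Files.write(staged.resolve("000001_0"), "new rows".getBytes("UTF-8"))
    Files.move(live, root.resolve("part_old")) // old data moved out of the way
    Files.move(staged, live, StandardCopyOption.ATOMIC_MOVE)

    // Reader now opens the path it planned earlier; that file name is gone.
    Try(Files.readAllBytes(planned)) match {
      case Failure(_: NoSuchFileException) => true
      case _                               => false
    }
  }

  def main(args: Array[String]): Unit =
    println(s"Reader failed after swap: ${readerFailsAfterSwap()}")
}
```

A reader that resolves files by name sees the planned file disappear between planning and reading, which is the window a lock would otherwise close.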
02-06-2018 07:20 PM
Try:
// sourceDF = source data set
// joinedDF = source data set joined to the target data set on the key
// Assumes joinedDF holds the source columns first, then the target columns in the same order, with unique column names.
val n = sourceDF.columns.length
val validationColumns = for (i <- 0 until n) yield (joinedDF.col(joinedDF.columns(i)) <=> joinedDF.col(joinedDF.columns(i + n)))
val matchedDF = joinedDF.filter(validationColumns.reduce(_ && _))
// Then...
logger.info("Number of matched records: " + matchedDF.count())
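For anyone who wants to try this end to end, below is a self-contained sketch that builds a small sourceDF and joinedDF and applies the same positional, null-safe comparison. It assumes Spark SQL is on the classpath and runs in local mode. The sample rows, the tgt_ prefix used to keep column names unique after the join, and the object name NullSafeMatchSketch are all assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object NullSafeMatchSketch {
  def countMatches(spark: SparkSession): Long = {
    import spark.implicits._

    // Hypothetical sample data; Option models a nullable column.
    val sourceDF = Seq[(Int, Option[String])]((1, Some("a")), (2, None), (3, Some("c")))
      .toDF("id", "name")
    val targetDF = Seq[(Int, Option[String])]((1, Some("a")), (2, None), (3, Some("x")))
      .toDF("id", "name")

    // Prefix the target columns so every name in the joined frame is unique.
    val renamedTarget = targetDF.columns.foldLeft(targetDF) { (df, c) =>
      df.withColumnRenamed(c, s"tgt_$c")
    }
    // Join output keeps the source columns first, then the target columns.
    val joinedDF = sourceDF.join(renamedTarget, sourceDF("id") === renamedTarget("tgt_id"))

    // Compare source column i with target column i + n using null-safe equality.
    val n = sourceDF.columns.length
    val validationColumns = (0 until n).map { i =>
      joinedDF.col(joinedDF.columns(i)) <=> joinedDF.col(joinedDF.columns(i + n))
    }
    joinedDF.filter(validationColumns.reduce(_ && _)).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-safe-match").getOrCreate()
    try println(s"Number of matched records: ${countMatches(spark)}")
    finally spark.stop()
  }
}
```

With this sample data the count is 2: row 1 matches on both columns, and row 2 matches because <=> treats two nulls as equal. With === instead, the null comparison would evaluate to null and the filter would drop row 2.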
08-15-2017 02:22 PM
@Joe Olson I didn't review the article before posting. My guess is that the advice is to avoid bucketing because you will have much better control over performance with partitioning. Most user queries never use a table's unique id in a WHERE clause. Just be careful not to make partitions too small; 1 GB or more per partition is a good rule of thumb.