Member since: 08-14-2017
Posts: 7
Kudos Received: 0
Solutions: 0
02-06-2018
07:28 PM
Suppose I have a large Hive table, partitioned by date, and I INSERT OVERWRITE a partition while a read query is currently using that table. What happens to the read query? Assume the read query is either a Hive SQL job or a Spark SQL job. Is whatever happens deterministic, i.e., if I repeat the scenario exactly, will I get the same or different results? Thanks!
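For concreteness, here is a minimal sketch of the scenario I mean (table, column, and partition names are hypothetical), with the read and the overwrite issued from two separate Spark sessions:
// Session A: a read that scans the partition about to be replaced.
spark.sql("SELECT COUNT(*) FROM events WHERE dt = '2018-02-01'").show()
// Session B: overwrite the same partition while the read above is still running.
spark.sql(
  """INSERT OVERWRITE TABLE events PARTITION (dt = '2018-02-01')
    |SELECT col1, col2 FROM events_staging WHERE dt = '2018-02-01'""".stripMargin)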
Labels:
- Apache Hive
02-06-2018
07:20 PM
Try:
// sourceDF = source data set
// joinedDF = source data set joined to target data set on the key
// In the joined DataFrame the source columns occupy indexes 0..N-1 and the target columns indexes N..2N-1,
// so build a null-safe equality check between each source column and its corresponding target column.
val validationColumns = for (i <- 0 until sourceDF.columns.length) yield (joinedDF.col(joinedDF.columns(i)) <=> joinedDF.col(joinedDF.columns(i + sourceDF.columns.length)))
// Keep only the rows where every column pair matches.
val matchedDF = joinedDF.filter(validationColumns.reduce(_ && _))
// Then...
logger.info("Number of matched records: " + matchedDF.count())
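If you then want an explicit pass/fail, one possible follow-up (a sketch, reusing the joinedDF/matchedDF names above and assuming the same logger) is to compare the matched count against the total joined count, per your PASS criterion:
val totalCount = joinedDF.count()
val matchedCount = matchedDF.count()
// Every joined row must match on every column pair for the validation to pass.
if (matchedCount == totalCount) logger.info("Validation PASSED")
else logger.error("Validation FAILED: " + (totalCount - matchedCount) + " mismatched rows")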
09-28-2017
11:55 AM
We are migrating a data warehouse from a Teradata environment to Hive. I'm in the process of developing Spark code to perform a full outer join between the source table in Teradata and the target table in Hive once the movement is complete, to prove the movement occurred properly:

val sourceDataFrame = <table from Teradata>
val targetDataFrame = <table from Hive>
val joinedDataFrame = sourceDataFrame
  .join(targetDataFrame, sourceDataFrame.col("source_unique_id") === targetDataFrame.col("target_unique_id"), "fullouter")

Both source and target tables have an identical number of columns, N, so the joined table has 2N columns. Is there a slick way in Scala to validate that the values of each of the source columns equal the values of each of the target columns for the join key WITHOUT using the column labels?

Right now, the only way I know how to check source = target for all columns is:

val filteredCount = joinedDataFrame.filter(($"source_field1" === $"target_field1") && ($"source_field2" === $"target_field2") && ... ).count()
if filteredCount = joinedDataFrame.count() then PASS

Obviously this code can't be re-used, as it is specific to the table it is written for. There must be a way in Scala to soft-code this validation using column indexes rather than column names, and to have it work on an arbitrary number of columns, but I have been struggling to find it. Can anyone help? Thanks in advance!
Labels:
- Apache Hive
- Apache Spark
08-15-2017
01:04 PM
@Scott Shaw Yes, that was blindingly obvious from the quoted article. They even printed it in bold text. But why the disclaimer "We strongly recommend not to use Bucketing"? That is an unusually direct statement, not open to interpretation, and it undermines the entire article. I am assuming @Dinesh Chitlangia is on the right track when he says it is probably due to poor proofing. Bucketing ordered unique keys still results in a search: logarithmic, but a search nonetheless. Is that the best we can do here?
08-14-2017
02:24 PM
Best practices for storing and retrieving unique keys using Hive / ORC: I have a large data set, in the low billions, where every record is uniquely identifiable, and I will only be retrieving / joining on the unique key. I would like to know the best way to store this in Hive / ORC. I would assume this is a case for bucketing. However, from reading Hortonworks' documentation at https://community.hortonworks.com/content/supportkb/49637/hive-bucketing-and-partitioning.html I am concerned by the disclaimer "We strongly recommend not to use Bucketing." Is this an appropriate case for bucketing? Should I be concerned about the disclaimer?
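For reference, a minimal sketch of what a bucketed ORC table clustered on the unique key might look like (plain Hive DDL; the table name, key column, and bucket count are hypothetical, not recommendations):
-- Bucket the table on its unique key so lookups and joins on that key are clustered.
CREATE TABLE keyed_events (unique_id BIGINT, payload STRING)
CLUSTERED BY (unique_id) INTO 64 BUCKETS
STORED AS ORC;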
Labels:
- Apache Hive