Member since: 08-14-2017
Posts: 7
Kudos Received: 0
Solutions: 0
02-06-2018 07:28 PM
Suppose I have a large Hive table, partitioned by date. I INSERT OVERWRITE a partition while a read query is currently using that table. What happens to the read query? Assume the read query is either a Hive SQL job or a Spark SQL job. Is whatever happens deterministic, i.e., if I repeat it exactly, will I get the same or different results? Thanks!
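To pin down the scenario, here is a minimal sketch of the two operations (hypothetical table and column names, assuming a SparkSession named spark; in reality the reader and writer would be separate jobs running at the same time, shown sequentially here just to identify the operations):

import spark.implicits._

// Reader: a Spark SQL job scanning one partition of the date-partitioned table.
val counts = spark.table("warehouse.events")
  .filter($"dt" === "2018-02-01")
  .groupBy($"event_type")
  .count()

// Writer: overwrites the same partition while the read above is still executing.
spark.sql("""
  INSERT OVERWRITE TABLE warehouse.events PARTITION (dt = '2018-02-01')
  SELECT event_type, payload FROM staging.events_20180201
""")

counts.show()  // Does this see the old partition files, the new ones, or fail?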
02-06-2018 07:20 PM
Try:
// sourceDF = source data set
// joinedDF = source joined to target on the key: its first N columns are the source columns,
//            the next N are the target columns
// Compare each source column to its target column by position, using null-safe equality (<=>).
val validationColumns =
  for (i <- 0 until sourceDF.columns.length)
    yield joinedDF.col(joinedDF.columns(i)) <=> joinedDF.col(joinedDF.columns(i + sourceDF.columns.length))
// Keep only the rows where every column pair matches.
val matchedDF = joinedDF.filter(validationColumns.reduce(_ && _))
// Then...
logger.info("Number of matched records: " + matchedDF.count())
11-14-2017 04:14 PM
Is it possible to sort data within a partitioned Hive table without bucketing? The scenario is a monster set of time series messages, and I just need all messages between time T1 and T2. I would partition on a higher-level date (maybe a day, or a half or quarter day) and sort by timestamp within that partition. Using the partition key as the bucket key would be a hack (one bucket per partition?). What about using a day as the partition key and then bucketing by hour? Is that common practice? There are no constraints on the time difference between T1 and T2, although it will most likely be hours, resulting in a few million rows being returned.
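For illustration, roughly what I have in mind (hypothetical table and column names, written as Spark SQL against Hive):

// Day-level partitions, no bucketing; rows inside each partition's files would be kept in timestamp order.
spark.sql("""
  CREATE TABLE IF NOT EXISTS ts.messages (
    msg_ts  TIMESTAMP,
    payload STRING
  )
  PARTITIONED BY (msg_date DATE)
  STORED AS ORC
""")

// The query I need to serve: everything between T1 and T2.
val slice = spark.sql("""
  SELECT msg_ts, payload
  FROM ts.messages
  WHERE msg_date BETWEEN '2017-11-01' AND '2017-11-02'
    AND msg_ts BETWEEN '2017-11-01 06:00:00' AND '2017-11-02 03:00:00'
""")

If the files inside each partition are written in msg_ts order, my understanding is that ORC's stripe-level min/max statistics should let the msg_ts predicate skip most of the data without any bucketing.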
10-30-2017 12:02 PM
I have a time series data stream I'd like to store in Hive. The current flow is: data lands in Kafka, and we then use NiFi to move it to HDFS as ORC. I'd like the stream to land in Hive sorted by timestamp in order to efficiently run 'between time T1 and T2' queries. After researching how the NiFi PutHiveStreaming processor works, I see there is a restriction on the Hive table: it must be bucketed, but not sorted. Is there a best-practice write-up for inserting sorted time series data into Hive using NiFi? Right now, my best option seems to be to land the data in the bucketed, unsorted table using the PutHiveStreaming processor, then insert it again (how? an external Hive job? That defeats the purpose of the stream, having to batch process...) into a sorted table. Any suggestions? Thanks!
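If it comes down to the two-table approach, the periodic re-insert step I'm picturing is roughly this (hypothetical table names; the DISTRIBUTE BY / SORT BY combination is my assumption for getting the ORC files written in timestamp order):

// Compact yesterday's streamed data from the bucketed, unsorted landing table
// into a date-partitioned table whose files are sorted by timestamp.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  INSERT OVERWRITE TABLE ts.messages PARTITION (msg_date)
  SELECT msg_ts, payload, to_date(msg_ts) AS msg_date
  FROM ts.messages_landing
  WHERE to_date(msg_ts) = date_sub(current_date(), 1)
  DISTRIBUTE BY to_date(msg_ts)
  SORT BY msg_ts
""")

But that is exactly the batch step I was hoping to avoid.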
09-28-2017 11:55 AM
We are migrating a data warehouse from a Teradata environment to Hive. I'm in the process of developing Spark code to perform a full outer join between the source table in Teradata and the target table in Hive once the movement is complete, to prove the movement occurred properly:

val sourceDataFrame = <table from Teradata>
val targetDataFrame = <table from Hive>
val joinedDataFrame = sourceDataFrame
  .join(targetDataFrame, sourceDataFrame.col("source_unique_id") === targetDataFrame.col("target_unique_id"), "fullouter")

Both source and target tables have an identical number of columns, N, so the joined table has 2N columns. Is there a slick way in Scala to validate that the values of each of the source columns equal the values of each of the target columns for the join key WITHOUT using the column labels? Right now, the only way I know how to check source = target for all columns is:

val filteredCount = joinedDataFrame.filter(
  ($"source_field1" === $"target_field1") &&
  ($"source_field2" === $"target_field2") &&
  ....
).count()

if filteredCount = joinedDataFrame.count() then PASS

Obviously this code can't be re-used as it is specific to the table it was written for. There must be a way in Scala to soft-code this validation using column indexes rather than column names, and to have it work on an arbitrary number of columns, but I have been struggling to find it. Can anyone help? Thanks in advance!
08-15-2017 01:04 PM
@Scott Shaw Yes, that was blindingly obvious from the quoted article. They even printed it in bold text. But why the disclaimer "We strongly recommend not to use Bucketing"? That is an unusually direct statement, not open to interpretation, and it undermines the entire article. I am assuming @Dinesh Chitlangia is on the right track when he says it is probably due to poor proofing. Bucketing ordered unique keys still results in a search. Logarithmic, but a search nonetheless. Is that the best we can do here?
08-14-2017 02:24 PM
Best practices for storing and retrieving unique keys using Hive / ORC: I have a large data set in the low billions of rows, and every record is uniquely identifiable. I will only be retrieving / joining on the unique key, and I would like to know the best way to store this in Hive / ORC. I would assume this is a case for bucketing. However, from reading Hortonworks' documentation at https://community.hortonworks.com/content/supportkb/49637/hive-bucketing-and-partitioning.html I am concerned by the disclaimer "We strongly recommend not to use Bucketing." Is this an appropriate case for bucketing? Should I be concerned about the disclaimer?
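For reference, the kind of layout I'm considering, with placeholder names and an arbitrary bucket count. The table itself would be created with Hive DDL along these lines, and retrieval would always be a lookup or join on the unique key:

// Hive DDL for the table (held in a string here; I'd run it from beeline / the Hive CLI).
// Bucketing and sorting on the unique key is meant to let lookups and joins on that
// key prune buckets rather than scan the full table.
val entityTableDdl = """
  CREATE TABLE warehouse.entities (
    entity_id BIGINT,
    payload   STRING
  )
  CLUSTERED BY (entity_id) SORTED BY (entity_id) INTO 256 BUCKETS
  STORED AS ORC
"""

// Retrieval pattern, always keyed on entity_id:
val row = spark.sql("SELECT payload FROM warehouse.entities WHERE entity_id = 123456789")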