
Spark SQL DF to HBase table not storing all the records

New Contributor

We are trying to save a Spark SQL DataFrame with 130,000 records to an HBase table, but I can see only around 100,000 records stored in HBase.

We are using a three-node cluster with one region server on each node.

Our row key is a combination of timestamp and transaction ID.
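For context, one common way such a save is wired up is through TableOutputFormat and saveAsNewAPIHadoopDataset. A minimal sketch, assuming a hypothetical table "transactions", column family "cf", and columns ts, txn_id, amount (the actual job may differ in details):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// Hypothetical DataFrame schema: ts (Long), txn_id (String), amount (Double)
val conf = HBaseConfiguration.create()
conf.set(TableOutputFormat.OUTPUT_TABLE, "transactions") // hypothetical table name
val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val puts = df.rdd.map { row =>
  // Composite row key: timestamp + transaction id, as described above.
  val rowKey = Bytes.toBytes(s"${row.getAs[Long]("ts")}_${row.getAs[String]("txn_id")}")
  val put = new Put(rowKey)
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"),
    Bytes.toBytes(row.getAs[Double]("amount")))
  (new ImmutableBytesWritable(rowKey), put)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)

The pair RDD of (ImmutableBytesWritable, Put) is the form TableOutputFormat expects.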

2 Replies

Re: Spark SQL DF to Hbase table not storing all the records

Super Guru
@Sankaraiah Narayanasamy

30K records cannot just be lost like that. Is it possible that you have duplicate records that are being overwritten? Do you see any exceptions in the Spark SQL logs? What about the region server logs?
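One quick way to check this on the Spark side, before the data ever reaches HBase, is to count how many composite row keys occur more than once. A minimal sketch, assuming hypothetical column names ts and txn_id for the key parts:

import org.apache.spark.sql.functions._

// Rebuild the composite key the same way the HBase row key is built.
// "ts" and "txn_id" are hypothetical column names; adjust to your schema.
val dupes = df
  .withColumn("row_key", concat_ws("_", col("ts"), col("txn_id")))
  .groupBy("row_key")
  .count()
  .filter(col("count") > 1)

// Every key counted here maps to one or more overwrites in HBase.
println(s"Duplicate row keys: ${dupes.count()}")

If this prints a number in the neighborhood of 30K, overwritten duplicates would explain the gap.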

Re: Spark SQL DF to Hbase table not storing all the records

Super Collaborator

If you haven't seen any exceptions during the execution, I would suggest checking for duplicate row keys. If that's not possible to do on the source data, you can try it on the HBase table itself. I would recommend using a Ruby function in the HBase shell:

def count_versions(tablename, num, args = {})
  table = @shell.hbase_table(tablename)
  # Open a scanner with the given scan arguments (e.g. RAW, VERSIONS)
  scanner = table._get_scanner(args)
  count = 0
  iter = scanner.iterator
  # Iterate over the results, counting rows that carry more cells than expected
  while iter.hasNext
    row = iter.next
    count += 1 if row.listCells.count > num
  end
  # Return the number of rows with more than 'num' cells
  return count
end

and execute it in the following way:

count_versions 'X', 10, {RAW => true, VERSIONS => 3}

where 'X' is the table name and 10 is the expected number of cells per row, which depends on your data schema. So if a regular row has 10 cells, a row that was overwritten will have 20 (the RAW => true and VERSIONS => 3 options make the scan return older cell versions as well). The function will return the number of such rows.
