Created on 09-24-2017 07:03 AM - edited 09-16-2022 05:17 AM
hi
Need an effieciant impala function to mask/hide some charectors. java/hive is working fine with below.
for (int counter = 4; counter < len - 5; counter++)
if (input[counter] != ' ')
input[counter] = '*';
} else {
for (int counter = 0; counter < len; counter++) {
if (input[counter] != ' ')
input[counter] = '*';
}
}
so i need help with c++/impala
Regards
Created 09-25-2017 07:43 AM
[localhost:21000] > select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5; Query: select concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) from tpch.lineitem limit 5 Query submitted at: 2017-09-25 07:38:09 (Coordinator: http://tarmstrong-box:25000) Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=34f1a993e3cb99a:51d89bf800000000 +--------------------------------------------------------------------------------------------------------------------------------------------------------------+ | concat(substring(l_comment, 1, 3), regexp_replace(substring(l_comment, 4, length(l_comment) - 3), '[^ ]', '*'), substring(l_comment, length(l_comment) - 3)) | +--------------------------------------------------------------------------------------------------------------------------------------------------------------+ | egu*** ****** ***** *** the | | ly ***** ************* ***** **** old | | rio***** ******** ******* *** dep | | lit*** ******** **** **n de | | pe***** ****** ***** **y re | +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
I'd recommend doing it with builtin functions since it will be easier to maintain. I included an example above of how you might do it using regexp_replace. I expect it will be quite fast - Impala's query compilation can inline functions like length() and substring() so those are essentially free in Impala (unlike many other SQL engines). The main cost is regexp_replace() but I'd expect that to be quite fast too.
Created 09-25-2017 07:43 AM
If you want to implement a C++ UDF though, I'd recommend starting with the docs here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_udf.html. There are some examples of string manipulation UDFs on that page.
Created 10-24-2017 12:45 AM
this has been resolve, UDFs are the best.
I wrote java and C++ codes, they check the current user. if user is not classified, data is masked for classified users it appear as is.
--final steps is to create UFS eg hrmask
--create a view with sensitive column preffixed with mask function.
[localhost.localdomain:21000] > create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact;
Query: create view redact_m as select Title,GivenName,Surname,hrmask(CCNumber),hrmask(idNumber) from redact
Fetched 0 row(s) in 0.41s
[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+------------------+---------------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+------------------+---------------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 5163458320525980 | 6345832052598 |
+-------+-----------+---------+------------------+---------------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber
Fetched 2 row(s) in 4.74s
[localhost.localdomain:21000] >
results are clear as i logged as admin, but once log as other user, sensitive columns are masked.
[localhost.localdomain:21000] > select * from redact_m limit 2;
Query: select * from redact_m limit 2
+-------+-----------+---------+----------+----------+
| title | givenname | surname | _c3 | _c4 |
+-------+-----------+---------+----------+----------+
| Title | GivenName | Surname | NULL | NULL |
| Ms. | Eva | Howard | 99999999 | 99999999 |
+-------+-----------+---------+----------+----------+
WARNINGS: Error converting column: 3 TO BIGINT (Data is: CCNumber)
Error converting column: 4 TO BIGINT (Data is: idNumber)
file: hdfs://localhost:8020/test/hive/fake.csv
record: Title,GivenName,Surname,CCNumber,idNumber
Fetched 2 row(s) in 0.15s
[localhost.localdomain:21000] >
I wish if this could be a built in function coming with sentry by default.