Member since
10-03-2020
160
Posts
13
Kudos Received
16
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 490 | 12-15-2021 05:26 PM |
 | 665 | 10-22-2021 10:09 AM |
 | 1633 | 10-20-2021 08:44 AM |
 | 1648 | 10-20-2021 01:01 AM |
 | 1053 | 10-02-2021 04:19 AM |
08-02-2022
03:32 AM
Hello @syedshakir , Please let us know your CDH version.
Case A: If I'm understanding correctly, you have a kerberized cluster but the file is local, not on HDFS, so you don't need Kerberos authentication. Just refer to the Google docs below; there are a few ways to do it: https://cloud.google.com/storage/docs/uploading-objects#upload-object-cli
Case B: To be honest I have never done it, so I would try:
1. Follow the document below to configure Google Cloud Storage with Hadoop: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_gcs_config.html
2. If distcp cannot work, follow this document to configure some properties: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_admin_distcp_secure_insecure.html
3. Save the whole output of distcp and upload it here; I can help you check it. Remember to remove sensitive information (such as hostnames and IPs) from the logs before you upload them. If the distcp output doesn't contain Kerberos-related errors, enable debug logs, re-run the distcp job, and save the new output with debug logs: export HADOOP_ROOT_LOGGER=DEBUG,console; export HADOOP_OPTS="-Dsun.security.krb5.debug=true"
Thanks, Will
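For reference, a minimal distcp invocation from HDFS to a GCS bucket might look like the sketch below, assuming the GCS connector from step 1 is already configured; the bucket name and paths are placeholders:
hadoop distcp hdfs:///user/example/source_dir gs://example-bucket/target_dir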
... View more
04-22-2022
02:44 AM
Hello @arunr307 , What is the CDH version? Could you attach the full output of this command? From the command help menu there are no properties about split size:
# hbase org.apache.hadoop.hbase.mapreduce.Export
ERROR: Wrong number of arguments: 0
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]
Note: -D properties will be applied to the conf used.
For example:
-D mapreduce.output.fileoutputformat.compress=true
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
-D mapreduce.output.fileoutputformat.compress.type=BLOCK
Additionally, the following SCAN properties can be specified to control/limit what is exported..
-D hbase.mapreduce.scan.column.family=<family1>,<family2>, ...
-D hbase.mapreduce.include.deleted.rows=true
-D hbase.mapreduce.scan.row.start=<ROWSTART>
-D hbase.mapreduce.scan.row.stop=<ROWSTOP>
-D hbase.client.scanner.caching=100
-D hbase.export.visibility.labels=<labels>
For tables with very wide rows consider setting the batch size as below:
-D hbase.export.scanner.batch=10
-D hbase.export.scanner.caching=100
-D mapreduce.job.name=jobName - use the specified mapreduce job name for the export
For MR performance consider the following properties:
-D mapreduce.map.speculative=false
-D mapreduce.reduce.speculative=false
Thanks, Will
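As an illustration, an export run that uses a couple of these -D properties could look like the line below; the table name and output path are placeholders:
hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.export.scanner.batch=10 -D hbase.client.scanner.caching=100 my_table /tmp/my_table_export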
... View more
01-18-2022
06:30 AM
Hi @rahul_gaikwad, The issue occurs due to a known limitation. As the error indicates, a single write operation cannot fit into the configured maximum buffer size. Please refer to this KB: https://my.cloudera.com/knowledge/quot-ERROR-Error-applying-Kudu-Op-Incomplete-buffer-size?id=302775 Regards, Will
... View more
01-18-2022
06:26 AM
Hi @naveenks, Please refer to the doc below: https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cdh_admin_distcp_data_cluster_migrate.html Thanks, Will
... View more
12-15-2021
05:26 PM
Hi @ryu
Volume: As described in the HDFS architecture, the NameNode stores metadata while the DataNodes store the actual data content. Each DataNode is a computer that usually consists of multiple disks (in HDFS terminology, volumes). A file in HDFS consists of one or more blocks. A block has one or more copies (called replicas), based on the configured replication factor. A replica is stored on a volume of a DataNode, and different replicas of the same block are stored on different DataNodes. https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/
Directory (we usually don't call them folders in HDFS): like other file systems, an HDFS directory is part of a hierarchical file structure. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Regards, Will
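If you want to see how a file maps to blocks and replicas in practice, fsck can print them; the path below is just an example:
hdfs fsck /user/example/myfile.txt -files -blocks -locations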
... View more
12-15-2021
04:59 PM
I agree with @Nandinin's suggestion. Adding some thoughts on the HDFS side for your reference: 1. Now you know which 3 DataNodes may be slow in the pipeline, and the timestamp. So you can go to each DataNode's log to see whether there are "JvmPauseMonitor" or "Lock held" messages, or other WARN / ERROR entries. 2. Refer to this KB https://my.cloudera.com/knowledge/Diagnosing-Errors-Error-Slow-ReadProcessor-Error-Slow?id=73443 and check the Slow messages from the DN logs around the above timestamp to determine the main cause. Regards, Will
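For example, a quick way to scan a DataNode log for those indicators around the incident might be the line below; the log path and the timestamp filter are placeholders for your environment:
grep -E "JvmPauseMonitor|Lock held|Slow|WARN|ERROR" /var/log/hadoop-hdfs/*DATANODE*.log.out | grep "2021-12-15 04:"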
... View more
11-09-2021
04:02 AM
Hi @loridigia, Based on the error you provided, "org.apache.hadoop.hbase.NotServingRegionException: table XXX is not online on worker04", maybe some regions are not deployed on any RegionServer yet. Please check the following to see whether there are any inconsistencies on this table:
1. sudo -u hbase hbase hbck -details > /tmp/hbck.txt
2. If you see inconsistencies, grep ERROR from hbck.txt and you will see which region has the problem.
3. Then check whether this region's directory is complete in the output of: hdfs dfs -ls -R /hbase
4. Then check in the hbase shell with: scan 'hbase:meta', to see whether this region's info is updated in the hbase:meta table.
5. Based on the type of issue, we need to use the HBCK2 jar to fix the inconsistencies: https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
These are general steps to deal with this kind of problem; there could be more complex issues behind it. We suggest you file a case with Cloudera support. Thanks, Will
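As a sketch of step 5, if a region simply needs to be (re)assigned, an HBCK2 invocation might look like the following; the jar path and the encoded region name are placeholders:
hbase hbck -j /path/to/hbase-hbck2.jar assigns <encoded_region_name>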
... View more
10-28-2021
02:57 AM
Hi @uygg, Please check whether third-party jars such as the Bouncy Castle jars were added. If that is the cause, please remove them and then restart the RM. Thanks, Will
... View more
10-22-2021
10:09 AM
Hi @Rjkoop Visibility labels are not officially supported by Cloudera, please refer to this link: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_620_unsupported_features.html#hbase_c6_unsupported_features Regards, Will
... View more
10-20-2021
08:44 AM
Hi @DA-Ka, SUM and JOIN won't change the timestamp of the underlying file. Example:
create table mytable (i int,j int,k int);
insert into mytable values (1,2,3),(4,5,6),(7,8,9);
create table mytable2 (i int,j int,k int);
insert into mytable2 values (1,2,6),(3,5,7),(4,8,9);
select * from mytable;
+------------+------------+------------+
| mytable.i | mytable.j | mytable.k |
+------------+------------+------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
+------------+------------+------------+
select * from mytable2;
+-------------+-------------+-------------+
| mytable2.i | mytable2.j | mytable2.k |
+-------------+-------------+-------------+
| 1 | 2 | 6 |
| 3 | 5 | 7 |
| 4 | 8 | 9 |
+-------------+-------------+-------------+
# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0
1. Sum, timestamp is unchanged:
select pos+1 as col,sum (val) as sum_col from mytable t lateral view posexplode(array(*)) pe group by pos;
+------+----------+
| col | sum_col |
+------+----------+
| 2 | 15 |
| 1 | 12 |
| 3 | 18 |
+------+----------+
# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
2. Inner Join, timestamp is unchanged:
select * from (select * from mytable)T1 join (select * from mytable2)T2 on T1.i=T2.i
+-------+-------+-------+-------+-------+-------+
| t1.i | t1.j | t1.k | t2.i | t2.j | t2.k |
+-------+-------+-------+-------+-------+-------+
| 1 | 2 | 3 | 1 | 2 | 6 |
| 4 | 5 | 6 | 4 | 8 | 9 |
+-------+-------+-------+-------+-------+-------+
sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0
Regards, Will
... View more
10-20-2021
01:01 AM
Hi @DA-Ka, The example below is inspired by this link.
1) Use -t -R to list files recursively with timestamps:
# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07
drwxrwx---+ - hive hive 0 2021-10-20 06:14 /warehouse/tablespace/managed/hive/sample_07/.hive-staging_hive_2021-10-20_06-13-50_654_7549698524549477159-1
drwxrwx---+ - hive hive 0 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 48464 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0
2) Filter the files older than a timestamp:
# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07 |awk -v dateA="$date" '{if (($6" "$7) <= "2021-10-20 06:13") {print ($6" "$7" "$8)}}'
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0
Regarding your last question, whether sum or join could change the timestamp, I'm not sure; please try it and then use the above commands to check the timestamps.
Regards, Will
If the answer helps, please accept as solution and click thumbs up.
... View more
10-19-2021
04:57 AM
1 Kudo
Hi @kras, From the evidence you provided, the most frequent warning is:
WARN [RpcServer.default.FPBQ.Fifo.handler=10,queue=10,port=16020] regionserver.RSRpcServices: Large batch operation detected (greater than 5000) (HBASE-18023). Requested Number of Rows: 12596 Client: svc-stats//ip first region in multi=table_name,\x09,1541077881948.9bcc8cee00ab92b2402730813923c2f6.
This is logged when an RPC is received from a client that has more than 5000 "actions" (where an "action" is a collection of mutations for a specific row) in a single RPC. Misbehaving clients that send large RPCs to RegionServers can be malicious, causing temporary pauses via garbage collection or denial of service via crashes. The threshold of 5000 actions per RPC is defined by the property "hbase.rpc.rows.warning.threshold" in hbase-site.xml. Please refer to this jira for a detailed explanation: https://issues.apache.org/jira/browse/HBASE-18023
We can identify that the table name is "table_name"; please check which application is writing to / reading from this table. The simplest way is to halt this application and see whether performance improves. If you identify that the latency spike is due to this table, please improve your application logic and control your batch size.
If you have already fixed the "harmful" applications but still see performance issues, I would recommend you read through this article, which covers the most common performance issues and tuning suggestions: https://community.cloudera.com/t5/Community-Articles/Tuning-Hbase-for-optimized-performance-Part-1/ta-p/248137 This article has 5 parts; after reading through it you will have ideas on how to tune your HBase.
This issue looks a little complex; there will be multiple factors impacting your HBase performance. We encourage you to raise support cases with Cloudera.
Regards, Will
If the answer helps, please accept as solution and click thumbs up.
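If you only want to adjust when this warning fires (rather than fixing the client batch size), the threshold can be raised through a safety valve for hbase-site.xml; the value below is only an example:
<property>
<name>hbase.rpc.rows.warning.threshold</name>
<value>10000</value>
</property>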
... View more
10-17-2021
06:45 AM
Hi @dzbeda, The definition of "dfs.balancer.getBlocks.min-block-size" is "Smallest block to consider for moving". What is the version of hadoop? Is it CDH or HDP? What is the version of CDH / HDP? For CDH please refer to: https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hdfs_balancer.html#cmug_topic_5_14__section_lqb_rzp_x2b https://docs.cloudera.com/documentation/enterprise/6/properties/6.1/topics/cm_props_cdh5160_hdfs.html#concept_6.1.x_balancer_props HDFS Balancer and DataNode Space Usage Considerations: https://my.cloudera.com/knowledge/HDFS-Balancer-and-DataNode-Space-Usage-Considerations?id=73869 Regards, Will
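If you need to override this property just for one balancer run, passing it with -D on the command line is one option; the value below (10 MB) is only an example:
hdfs balancer -D dfs.balancer.getBlocks.min-block-size=10485760 -threshold 10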
... View more
10-13-2021
08:00 PM
Hi @kras,
1. Is it CDH or HDP, and what is the version?
2. In the RegionServer logs, are there "responseTooSlow" or "operationTooSlow" or any other WARN/ERROR messages? Please provide log snippets.
3. How is the locality of the regions? (Check locality on the HBase web UI: click on the table, and on the right side there is a column that shows each region's locality.)
4. How many regions are deployed on each RegionServer?
5. Any warnings / errors in the RS log around the spike?
6. Is any job trying to scan every 10 min? Which table contributes the most I/O? Is there any hotspot?
7. Is HDFS healthy? Check the DN logs; are there any slow messages around the spike? Refer to https://my.cloudera.com/knowledge/Diagnosing-Errors-Error-Slow-ReadProcessor-Error-Slow?id=73443
Regards, Will
... View more
10-02-2021
04:19 AM
1 Kudo
@Tamiri , Please click on your avatar and check My settings > SUBSCRIPTIONS & NOTIFICATIONS. Another place is when you reply to a post: at the top right, select "Email me when someone replies". Regards, Will
... View more
10-01-2021
07:01 AM
Hello @rahuledavalath, What HDP version and what CDP version are you using? Regards, Will
... View more
09-29-2021
09:50 AM
1 Kudo
Then the above solutions should meet your needs.
... View more
09-29-2021
09:14 AM
Hi @Visvanath_JP, The question could be more specific: what Hadoop versions are the two clusters, are both clusters secured, and are they CDH/CDP or HDP? Do you only migrate data in the HDFS layer, or other layers as well, for example Hive / HBase / Kudu?
The most common way is using distcp to migrate data between HDFS clusters: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/scaling-namespaces/topics/hdfs-distcp-to-copy-files.html
If you are using CDH/CDP, a BDR job is another choice (distcp integrated): https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/replication-manager/topics/rm-dc-hdfs-replication.html
Distcp guide: https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
Regards, Will
If the answer helps, please accept as solution and click thumbs up.
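For instance, a basic distcp run from the source cluster to the target cluster could look like the line below; the NameNode hostnames and paths are placeholders:
hadoop distcp hdfs://source-nn:8020/user/mydata hdfs://target-nn:8020/user/mydata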
... View more
09-24-2021
10:09 PM
Hello @Clua , Looks like you solved it. If possible, could you please share the code snippets showing how you added gssflags in authGSSClientInit, and which transport function you are using? Thanks, Will
... View more
09-24-2021
08:15 PM
1 Kudo
Hi @drgenious,
1) Where can I run these kinds of queries? In CM -> Charts -> Chart Builder you can run tsquery. Refer to this link: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_dg_chart_time_series_data.html
2) Where can I find the attributes like category and clusterName in Cloudera? In the Chart Builder text bar, write an incomplete query like: SELECT get_file_info_rate Below the text bar there are Facets; click on More and select any facet you want. For example, if you select clusterName, you will see the clusterName show up in the chart's title. Then you can complete your tsquery: SELECT get_file_info_rate where clusterName=xxxxx
If you want to build Impala-related charts, I suggest first reviewing CM > Impala service > Charts Library; many charts are already there for common monitoring purposes. You can open any of the existing charts to learn how the tsquery is constructed and then build your own charts. Another very good place to learn is CM > Charts > Chart Builder: on the right side you will see a "?" button; click on it and you will see many examples you can just try.
Regards, Will
If the answer helps, please accept as solution and click thumbs up.
... View more
09-22-2021
06:30 AM
1 Kudo
Hi @doncucumber , Now you can have a good rest 🙂 Please check this article for the detailed HA concepts: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_hag_hdfs_ha_intro.html The active NN syncs edit logs to a majority of JournalNodes, so the standby NN is capable of reading the edits from the JNs. From your NN log we could see that the reason for the recovery failure was that the recovery time exceeded the 120000ms timeout for a quorum of nodes to respond; that's why I requested the JN logs to check whether the edits sync-up failed, and we found JN2 & JN3's problem. Regards, Will
... View more
09-22-2021
05:28 AM
Hi @doncucumber, From these JN logs we can say edits_inprogress_0000000000010993186 was not the latest edit log; on JN1 the edits_inprogress number is 13015981, but on the problematic JN2 and JN3 it is still 10993186. You could try the following steps:
1. Stop the whole HDFS service, including all NNs/JNs.
2. On JN2/JN3, which both have the same error, move the edits directory (/datos3/dfs/jn/nameservice1/current/) to another location, for example /tmp.
3. Copy the good edits directory (/datos3/dfs/jn/nameservice1/current/) from JN1 to these problematic JN nodes. Now you have manually synced up all the JNs' edits directories.
4. Start HDFS.
Please let us know if this solution helps. Thanks, Will
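A rough sketch of steps 2 and 3 on JN2/JN3 is below; the JN1 hostname and the hdfs:hdfs ownership are assumptions, so adjust them to your environment:
# run on JN2 and JN3
mv /datos3/dfs/jn/nameservice1/current /tmp/jn_current_backup
scp -r jn1.example.com:/datos3/dfs/jn/nameservice1/current /datos3/dfs/jn/nameservice1/
chown -R hdfs:hdfs /datos3/dfs/jn/nameservice1/current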
... View more
09-22-2021
04:42 AM
Hi @doncucumber , Could you please share what the error is in the JN logs at the same time? Thanks, Will
... View more
09-20-2021
01:30 AM
Hi @Tamiri, I think @Shelton has already answered in another post: https://community.cloudera.com/t5/Support-Questions/Hortonworks-HDP-3-0-root-user-password-doesn-t-work/m-p/286034# https://www.cloudera.com/tutorials/learning-the-ropes-of-the-hdp-sandbox.html Please check if it helps? Thanks, Will
... View more
09-19-2021
08:38 PM
1 Kudo
Introduction
Thrift proxy is a modern micro-service framework compared to other existing frameworks such as SOAP, JSON-RPC, and the REST proxy. The Thrift proxy API has higher performance, is more scalable, and supports multiple languages (C++, Java, Python, PHP, Ruby, Perl, C#, Objective-C, JavaScript, Node.js, and others).
The application can interact with HBase via Thrift proxy.
This article will discuss how to use correct libraries and methods to interact with HBase via Thrift proxy.
Outline
The basic concept of Thrift proxy and how the thrift language bindings are generated.
How Python thrift functions align with the correct settings of HBase configurations from Cloudera Manager.
Sample client codes in security disabled/ enabled HBase clusters.
Some known bugs when using TSaslClientTransport with Kerberos enabled in some CDP versions.
The basic concept of Thrift proxy and how the Thrift cross-language bindings are generated
The Apache Thrift library provides cross-language client-server remote procedure calls (RPCs), using Thrift bindings. A Thrift binding is a client code generated by the Apache Thrift Compiler for a target language (such as Python) that allows communication between the Thrift server and clients using that client code. HBase includes an Apache Thrift Proxy API, which allows you to write HBase applications in Python, C, C++, or another language that Thrift supports. The Thrift Proxy API is slower than the Java API and may have fewer features. To use the Thrift Proxy API, you need to configure and run the HBase Thrift server on your cluster. You also need to install the Apache Thrift compiler on your development system.
Image credits: The above figure is copied from Programmer’s Guide to Apache Thrift
The IDL file named Hbase.thrift is in CDP parcels.
find / -name "Hbase.thrift"
The IDL compiler can be installed by following the steps in Building Apache Thrift on CentOS 6.5.
Follow this article to generate Python library bindings (Server stubs). Now, you should be able to import Python libraries into your client code.
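For reference, once the compiler is installed, generating the Python bindings from the IDL file is typically a one-liner; the generated modules land in a gen-py directory that you add to your Python path:
thrift --gen py Hbase.thrift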
How Python functions align with the HBase Configurations from Cloudera Manager
In many examples, you will see several functions to interact with thrift. The concepts of Transport, socket, protocol are described in the book Programmer’s Guide to Apache Thrift.
Image credits: The above figure is copied from Programmer’s Guide to Apache Thrift
We will discuss how the functions work with HBase configurations.
These parameters are taken into consideration:
Is SSL enabled? (Search "SSL" in CM > HBase configuration; usually auto-enabled by CM.) If yes, use TSSLSocket; otherwise use TSocket.
hbase.thrift.security.qop=auth-conf? This means Kerberos is enabled; use TSaslClientTransport.
hbase.regionserver.thrift.compact=true? Use TCompactProtocol; otherwise use TBinaryProtocol.
hbase.regionserver.thrift.framed=true? Use TFramedTransport; otherwise use TBufferedTransport.
hbase.regionserver.thrift.http=true and hbase.thrift.support.proxyuser=true? This means a DoAs implementation is required; the HTTP mode cannot co-exist with framed mode. Use THttpClient.
Sample client codes in security disabled/ enabled HBase clusters
Kerberos enabled / SSL disabled:
Settings:
SSL disabled
hbase.thrift.security.qop=auth-conf
hbase.regionserver.thrift.compact = false
hbase.regionserver.thrift.framed=false
hbase.regionserver.thrift.http=false
hbase.thrift.support.proxyuser=false
from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
from hbase import Hbase
import kerberos
import sasl
from subprocess import call
thrift_host=<thrift host>
thrift_port=9090
# call kinit commands to get the kerberos ticket.
krb_service='hbase'
principal='hbase/<host>'
keytab="/path/to/hbase.keytab"
kinitCommand="kinit"+" "+"-kt"+" "+keytab+" "+principal
call(kinitCommand,shell="True")
socket = TSocket.TSocket(thrift_host, thrift_port)
transport = TTransport.TSaslClientTransport(socket,host=thrift_host,service='hbase',mechanism='GSSAPI')
protocol = TBinaryProtocol.TBinaryProtocol(transport)
transport.open()
client = Hbase.Client(protocol)
print(client.getTableNames())
transport.close()
This works in CDH 6, but does not work in some CDP versions due to a known bug described in the next section.
Kerberos enabled /SSL enabled:
Settings:
SSL enabled
hbase.thrift.security.qop=auth-conf
hbase.regionserver.thrift.compact = false
hbase.regionserver.thrift.framed=false
hbase.regionserver.thrift.http=true
hbase.thrift.support.proxyuser=true
The following code is changed and tested based on @manjilhk 's post here.
from thrift.transport import THttpClient
from thrift.protocol import TBinaryProtocol
from hbase.Hbase import Client
from subprocess import call
import ssl
import kerberos
def kerberos_auth():
call("kdestroy",shell="True")
clientPrincipal='hbase@<DOMAIN.COM>'
# hbase client keytab is copied from /keytabs/hbase.keytab
# you can find the location using “find”
keytab="/path/to/hbase.keytab"
kinitCommand="kinit"+" "+"-kt"+" "+keytab+" "+clientPrincipal
call(kinitCommand,shell="True")
# this is the hbase service principal of HTTP, check with
# klist -kt /var/run/cloudera-scm-agent/process/<latest-thrift-process>/hbase.keytab
hbaseService="HTTP/<host>@<DOMAIN.COM>"
__, krb_context = kerberos.authGSSClientInit(hbaseService)
kerberos.authGSSClientStep(krb_context, "")
negotiate_details = kerberos.authGSSClientResponse(krb_context)
headers = {'Authorization': 'Negotiate ' + negotiate_details,'Content-Type':'application/binary'}
return headers
#cert_file is copied from CDP, use “find” to get the location, scp to your app server.
httpClient = THttpClient.THttpClient('https://< thrift server fqdn>:9090/', cert_file='/root/certs/localhost.crt',key_file='/root/certs/localhost.key', ssl_context=ssl._create_unverified_context())
# if no ssl verification is required
httpClient.setCustomHeaders(headers=kerberos_auth())
protocol = TBinaryProtocol.TBinaryProtocol(httpClient)
httpClient.open()
client = Client(protocol)
tables=client.getTableNames()
print(tables)
httpClient.close()
Nowadays, security (SSL/Kerberos) is very important when applications interact with databases, and many popular services like Knox and Hue interact with HBase via the Thrift server over an HTTP client. So, we recommend using the second method.
Some known bugs when using TSaslClientTransport with Kerberos enabled in some CDP versions
Upstream Jira HBASE-21652 introduced a bug related to Kerberos principal handling.
When the Thrift server was refactored to make the thrift2 server inherit from the thrift1 server, ThriftServerRunner was merged into ThriftServer and the principal switching step was omitted.
Before the refactoring, everything ran in a doAs() block in ThriftServerRunner.run().
References
Programmer’s Guide to Apache Thrift
Python3 connection to Kerberos Hbase thrift HTTPS
Use the Apache Thrift Proxy API
How-to: Use the HBase Thrift Interface, Part 1
How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows
Disclaimer
This article did not test all versions; both methods were tested in Python 2.7.5 and Python 3.6.8.
Change the code according to your needs if you encounter an issue. Posting questions to the Community and raising cases with Cloudera support are recommended.
... View more
09-15-2021
08:23 AM
1 Kudo
Hi @Ellyly , Here is an example.
(1) First, list -R and grep "^d" to show all the subdirectories in your path:
# sudo -u hdfs hdfs dfs -ls -R /folder1/ | grep "^d"
drwxr-xr-x - hdfs supergroup 0 2021-09-15 14:48 /folder1/folder2
drwxr-xr-x - hdfs supergroup 0 2021-09-15 15:01 /folder1/folder2/folder3
drwxr-xr-x - hdfs supergroup 0 2021-09-15 15:01 /folder1/folder2/folder3/folder4
drwxr-xr-x - hdfs supergroup 0 2021-09-11 05:09 /folder1/subfolder1
(2) Then, awk -F\/ '{print NF-1}' to calculate each directory's depth; we actually print the number of fields separated by /. After -F it is a backslash and a forward slash with no space in between; it is not the character "V"! 🙂
# sudo -u hdfs hdfs dfs -ls -R /folder1/ | grep "^d" | awk -F\/ '{print NF-1}'
2
3
4
2
(3) Finally, sort and head:
# sudo -u hdfs hdfs dfs -ls -R /folder1/ | grep "^d" | awk -F\/ '{print NF-1}'|sort -rn|head -1
4
Regards, Will
If the answer helps, please accept as solution and click thumbs up.
... View more
09-12-2021
10:59 PM
1 Kudo
Introduction
Phoenix is a popular solution for providing low-latency OLTP and operational analytics on top of HBase. Hortonworks Data Platform (HDP) and Cloudera Data Platform (CDP) are the most popular platforms for Phoenix to interact with HBase.
Nowadays, many customers choose to migrate to Cloudera Data Platform to better manage their Hadoop clusters and implement the latest solutions in big data.
This article discusses how to migrate Phoenix data/index tables to the newer version of CDP Private Cloud Base.
Environment
Source cluster: HDP 2.6.5, HDP 3.1.5
Target cluster: CDP PvC 7.1.5, CDP PvC 7.1.6, CDP PvC 7.1.7
Migration steps
The SYSTEM tables will be automatically created when phoenix-sqlline initially starts. They contain the metadata of Phoenix tables. In order to show Phoenix data/index tables in the target cluster, we need to migrate the SYSTEM tables from the source cluster as well.
Stop the Phoenix service on the CDP cluster. You can stop the service in Cloudera Manager > Services > Phoenix Service > Stop.
Drop the SYSTEM.% tables on the CDP cluster (from HBase). In the HBase shell, drop all the SYSTEM tables:
hbase:006:0> disable_all "SYSTEM.*"
hbase:006:0> drop_all "SYSTEM.*"
Copy the system, data, and index tables to the CDP cluster. There are many methods to copy data between HBase clusters; I would recommend using snapshots to keep the schema the same. On the source HBase:
Take snapshots of all SYSTEM tables and data tables:
hbase(main):020:0> snapshot "SYSTEM.CATALOG","CATALOG_snap"
ExportSnapshot to the target cluster:
sudo -u hdfs hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot CATALOG_snap -copy-to hdfs://Target_Active_NameNode:8020/hbase -mappers 16 -bandwidth 200
Your HBase directory path may be different; check the HBase configuration in Cloudera Manager for the path.
On the target cluster, the owner may become a different user (whoever triggered the MapReduce job), so we need to change the owner back to the default hbase:hbase:
sudo -u hdfs hdfs dfs -chown -R hbase:hbase /hbase
In the HBase shell, use clone_snapshot to create the new tables:
clone_snapshot "CATALOG_snap","SYSTEM.CATALOG"
When you complete the above steps, you should have all the SYSTEM tables, data tables, and index tables in your target HBase. For example, the following is copied from an HDP 2.6.5 cluster and created in CDP:
hbase:013:0> list
TABLE
SYSTEM.CATALOG
SYSTEM.FUNCTION
SYSTEM.SEQUENCE
SYSTEM.STATS
TEST
Start the Phoenix service, enter phoenix-sqlline, and then check whether you can query the tables.
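For example, a quick sanity check in phoenix-sqlline might look like this (TEST is the sample table from the listing above; your table names will differ):
0: jdbc:phoenix:> !tables
0: jdbc:phoenix:> SELECT * FROM TEST LIMIT 10;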
(Optional) If HDP already enabled NamespaceMapping, we should also set isNamespaceMappingEnabled to true on the CDP cluster in both the client and service hbase-site.xml, and restart the Phoenix service.
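A minimal sketch of that hbase-site.xml entry (added on both the client and service sides) would be:
<property>
<name>phoenix.schema.isNamespaceMappingEnabled</name>
<value>true</value>
</property>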
Known Bug of Migration Process
Starting from Phoenix 5.1.0/ CDP 7.1.6, there is a bug during SYSTEM tables auto-upgrade. The fix will be included in the future CDP release. The customer should raise cases with Cloudera support and apply a hotfix for this bug on top of CDP 7.1.6/ 7.1.7.
Refer to PHOENIX-6534
Disclaimer
This article does not cover all versions of HDP and CDP, and does not test every situation; it only uses the popular or latest versions. If you followed the steps but failed, or ran into a new issue, please feel free to ask in the Community or raise a case with Cloudera support.
... View more
09-11-2021
10:00 PM
1 Kudo
Hi @DanHosier, Just providing a possible solution to bind the NameNode HTTP server to localhost. Add the following property to the service-side advanced hdfs-site.xml (HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml) and restart HDFS:
<property>
<name>dfs.namenode.http-bind-host</name>
<value>127.0.0.1</value>
</property>
Then the property is added into /var/run/cloudera-scm-agent/process/<Latest process of NN>/hdfs-site.xml:
# grep -C2 "dfs.namenode.http-bind-host" hdfs-site.xml
</property>
<property>
<name>dfs.namenode.http-bind-host</name>
<value>127.0.0.1</value>
</property>
And then test with curl:
# curl `hostname -f`:9870
curl: (7) Failed connect to xxxx.xxxx.xxxx.com:9870; Connection refused
# curl localhost:9870
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="REFRESH" content="0;url=dfshealth.html" />
<title>Hadoop Administration</title>
</head>
</html>
Now the web UI is only served on the NameNode's localhost. But you will see this alert in CM, because the Service Monitor cannot reach the NN web UI: "NameNode summary: xxxx.xxxx.xxxx.com (Availability: Unknown, Health: Bad). This health test is bad because the Service Monitor did not find an active NameNode." So this solution has a side effect on the Service Monitor, but HDFS itself is running well.
Regards, Will
If the answer helps, please accept as solution and click thumbs up.
... View more
09-10-2021
10:48 PM
Hi @Ben621 , Please check this community post; it should answer your question. https://community.cloudera.com/t5/Support-Questions/How-are-the-primary-keys-in-Phoenix-are-converted-as-row/td-p/147232 Regards, Will If the answer helps, please accept as solution and click thumbs up.
... View more
09-10-2021
10:30 PM
Hi @clouderaskme Creating the same folder name in the same directory is not allowed. Test:
# sudo -u hdfs hdfs dfs -mkdir /folder1
# sudo -u hdfs hdfs dfs -mkdir /folder1/subfolder1
# sudo -u hdfs hdfs dfs -mkdir /folder1/subfolder1
mkdir: `/folder1/subfolder1': File exists
So if you see two subfolders under folder1 with the same name, it may be because one of the names contains special characters. Can you log into the terminal, execute the following hdfs command to check, and show us the output?
hdfs dfs -ls /folder1 | cat -A
Regards, Will
... View more