About willx

willx · 10-22-2021

Hi @Rjkoop, Visibility labels are not officially supported by Cloudera. Please refer to this link: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_620_unsupported_features.html#hbase_c6_unsupported_features Regards, Will

willx · 10-20-2021

Hi @DA-Ka,

SUM and JOIN won't change the timestamp of the underlying file.

Example:

create table mytable (i int,j int,k int);
insert into mytable values (1,2,3),(4,5,6),(7,8,9);
create table mytable2 (i int,j int,k int);
insert into mytable2 values (1,2,6),(3,5,7),(4,8,9);

select * from mytable;
+------------+------------+------------+
| mytable.i  | mytable.j  | mytable.k  |
+------------+------------+------------+
| 1          | 2          | 3          |
| 4          | 5          | 6          |
| 7          | 8          | 9          |
+------------+------------+------------+

select * from mytable2;
+-------------+-------------+-------------+
| mytable2.i  | mytable2.j  | mytable2.k  |
+-------------+-------------+-------------+
| 1           | 2           | 6           |
| 3           | 5           | 7           |
| 4           | 8           | 9           |
+-------------+-------------+-------------+

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0

1. Sum, timestamp is unchanged

select pos+1 as col,sum (val) as sum_col from mytable t lateral view posexplode(array(*)) pe group by pos;
+------+----------+
| col  | sum_col  |
+------+----------+
| 2    | 15       |
| 1    | 12       |
| 3    | 18       |
+------+----------+

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0

2. Inner Join, timestamp is unchanged

select * from (select * from mytable)T1 join (select * from mytable2)T2 on T1.i=T2.i
+-------+-------+-------+-------+-------+-------+
| t1.i  | t1.j  | t1.k  | t2.i  | t2.j  | t2.k  |
+-------+-------+-------+-------+-------+-------+
| 1     | 2     | 3     | 1     | 2     | 6     |
| 4     | 5     | 6     | 4     | 8     | 9     |
+-------+-------+-------+-------+-------+-------+

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0

# sudo -u hdfs hdfs dfs -ls -R /warehouse/tablespace/managed/hive/mytable2
drwxrwx---+ - hive hive 0 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 742 2021-10-20 15:23 /warehouse/tablespace/managed/hive/mytable2/delta_0000001_0000001_0000/bucket_00000_0

Regards, Will
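The before/after comparison in this post was done by eye. As a rough sketch, the same check can be scripted by saving the output of `hdfs dfs -ls -R` before and after a query and diffing the timestamp columns. The helper names `parse_ls` and `changed_paths` are made up for this illustration, and the column layout is assumed to match the listings shown in the post.

```python
def parse_ls(listing):
    """Map each path in `hdfs dfs -ls -R` output to its (date, time) columns."""
    entries = {}
    for line in listing.splitlines():
        fields = line.split()
        # Assumed columns: perms, replication, owner, group, size, date, time, path.
        # Skip prompts, "Found N items" lines, and anything else that is not an entry.
        if len(fields) < 8 or fields[0][0] not in "-d" or len(fields[0]) < 10:
            continue
        path = " ".join(fields[7:])
        entries[path] = (fields[5], fields[6])
    return entries


def changed_paths(before, after):
    """Return paths whose timestamp differs, or that exist in only one listing."""
    old, new = parse_ls(before), parse_ls(after)
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


# Listing copied from the post (taken before the SUM and JOIN queries).
BEFORE = """\
drwxrwx---+ - hive hive 0 2021-10-20 15:11 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 743 2021-10-20 15:12 /warehouse/tablespace/managed/hive/mytable/delta_0000001_0000001_0000/bucket_00000_0
"""

# The listing after the queries was identical, so nothing is reported as changed.
print(changed_paths(BEFORE, BEFORE))
```

An empty result means no file or directory under the table path was touched between the two listings.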

willx · 10-20-2021

Hi @DA-Ka,

The example below is inspired by this link.

1) Use -t -R to list files recursively, sorted by modification time:

# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07
drwxrwx---+ - hive hive 0 2021-10-20 06:14 /warehouse/tablespace/managed/hive/sample_07/.hive-staging_hive_2021-10-20_06-13-50_654_7549698524549477159-1
drwxrwx---+ - hive hive 0 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 48464 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0

2) Filter the files modified at or before a given timestamp:

# sudo -u hdfs hdfs dfs -ls -t -R /warehouse/tablespace/managed/hive/sample_07 | awk -v dateA="2021-10-20 06:13" '{if (($6" "$7) <= dateA) {print ($6" "$7" "$8)}}'
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0

Regarding your last question, whether SUM or JOIN could change the timestamp: I'm not sure. Please try it and then use the commands above to check the timestamps.

Regards, Will

If the answer helps, please accept as solution and click thumbs up.
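Since the thread asks for files older than N days, the awk filter can also be sketched in Python with a cutoff computed from the current time. This is only an illustration: `older_than` is a hypothetical helper, the column positions are assumed to match the listing in the post, and the listing text would come from saving the `hdfs dfs -ls -R` output.

```python
from datetime import datetime, timedelta


def older_than(listing, cutoff):
    """Return (mtime, path) pairs for entries modified at or before `cutoff`."""
    result = []
    for line in listing.splitlines():
        fields = line.split()
        # Assumed columns: perms, replication, owner, group, size, date, time, path.
        if len(fields) < 8 or fields[0][0] not in "-d" or len(fields[0]) < 10:
            continue
        mtime = datetime.strptime(fields[5] + " " + fields[6], "%Y-%m-%d %H:%M")
        if mtime <= cutoff:
            result.append((mtime, " ".join(fields[7:])))
    return result


# Listing copied from the post.
LISTING = """\
drwxrwx---+ - hive hive 0 2021-10-20 06:14 /warehouse/tablespace/managed/hive/sample_07/.hive-staging_hive_2021-10-20_06-13-50_654_7549698524549477159-1
drwxrwx---+ - hive hive 0 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000
-rw-rw----+ 3 hive hive 48464 2021-10-20 06:13 /warehouse/tablespace/managed/hive/sample_07/delta_0000001_0000001_0000/000000_0
"""

# Fixed cutoff, matching the awk example in the post.
for mtime, path in older_than(LISTING, datetime(2021, 10, 20, 6, 13)):
    print(mtime, path)

# "Older than N days": derive the cutoff from the current time instead.
n_days = 7  # illustrative value
recent_cutoff = datetime.now() - timedelta(days=n_days)
```

Parsing the timestamp into a `datetime` avoids relying on string ordering, although for this fixed `YYYY-MM-DD HH:MM` format the string comparison used in the awk command gives the same result.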

willx · 10-19-2021

Hi @kras,

From the evidence you provided, the most frequent warning is:

WARN [RpcServer.default.FPBQ.Fifo.handler=10,queue=10,port=16020] regionserver.RSRpcServices: Large batch operation detected (greater than 5000) (HBASE-18023). Requested Number of Rows: 12596 Client: svc-stats//ip first region in multi=table_name,\x09,1541077881948.9bcc8cee00ab92b2402730813923c2f6.

This warning is logged when a RegionServer receives a single RPC from a client that contains more than 5000 "actions" (where an "action" is a collection of mutations for a specific row). Misbehaving clients who send large RPCs to RegionServers can be malicious, causing temporary pauses via garbage collection or denial of service via crashes. The threshold of 5000 actions per RPC is defined by the property "hbase.rpc.rows.warning.threshold" in hbase-site.xml. Please refer to this jira for a detailed explanation: https://issues.apache.org/jira/browse/HBASE-18023

The warning shows that the table name is "table_name". Please check which application is writing to or reading from this table. The simplest test is to halt this application and see whether performance improves. If you confirm that the latency spike is caused by this table, please improve your application logic and control your batch size.

If you have already fixed the "harmful" applications but still see performance issues, I recommend reading this article, which covers the most common performance issues and tuning suggestions: https://community.cloudera.com/t5/Community-Articles/Tuning-Hbase-for-optimized-performance-Part-1/ta-p/248137 The article has 5 parts; reading through them should give you ideas for tuning your HBase.

This issue looks a little complex, and multiple factors can affect your HBase performance. We encourage you to raise a support case with Cloudera.

Regards, Will

If the answer helps, please accept as solution and click thumbs up.
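One way to "control your batch size" on the client side is to split a large list of row operations into chunks that stay at or below the warning threshold before sending each chunk as its own request. The sketch below is client-agnostic and does not use any HBase API; `batches` is a hypothetical helper and the send step is left out.

```python
def batches(rows, max_rows=5000):
    """Yield consecutive slices of `rows`, each at most `max_rows` long."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    for start in range(0, len(rows), max_rows):
        yield rows[start:start + max_rows]


# 12596 is the row count reported in the warning above.
rows = list(range(12596))
sizes = [len(chunk) for chunk in batches(rows)]
print(sizes)  # [5000, 5000, 2596]
```

Each chunk would then be submitted separately by whatever client library the application uses, so that no single RPC carries more than 5000 rows.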

willx · 10-17-2021

Hi @dzbeda, The definition of "dfs.balancer.getBlocks.min-block-size" is "Smallest block to consider for moving". Which Hadoop distribution are you using, CDH or HDP, and which version? For CDH, please refer to: https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hdfs_balancer.html#cmug_topic_5_14__section_lqb_rzp_x2b https://docs.cloudera.com/documentation/enterprise/6/properties/6.1/topics/cm_props_cdh5160_hdfs.html#concept_6.1.x_balancer_props HDFS Balancer and DataNode Space Usage Considerations: https://my.cloudera.com/knowledge/HDFS-Balancer-and-DataNode-Space-Usage-Considerations?id=73869 Regards, Will
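To make the effect of "Smallest block to consider for moving" concrete, here is a small sketch of how such a threshold would filter blocks on a cluster with many small files. It is a simplified model, not the balancer's actual code: the 10 MiB threshold, the treatment of a block exactly at the threshold as eligible, and the sample block sizes are all assumptions for illustration, so check the value configured on your cluster.

```python
# Assumed threshold in bytes (10 MiB); the real value comes from the
# dfs.balancer.getBlocks.min-block-size setting on your cluster.
MIN_BLOCK_SIZE = 10 * 1024 * 1024


def movable_blocks(block_sizes, min_block_size=MIN_BLOCK_SIZE):
    """Return the block sizes that meet the minimum size for moving."""
    # Assumption: a block exactly at the threshold counts as eligible.
    return [size for size in block_sizes if size >= min_block_size]


# Hypothetical block sizes in bytes: two small files and one full 128 MiB block.
sample = [743, 48464, 128 * 1024 * 1024]
print(movable_blocks(sample))  # [134217728]
```

Under this model, a DataNode holding mostly small files offers the balancer very few candidate blocks, which is why the setting matters for small-file workloads.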

willx · 10-13-2021

Hi @kras,

1. Is it CDH or HDP, and what is the version?
2. In the RegionServer logs, are there any "responseTooSlow" or "operationTooSlow" messages, or any other WARN/ERROR messages? Please provide log snippets.
3. How is the locality of the regions? (Check locality on the HBase web UI: click on the table; on the right side there is a column that shows each region's locality.)
4. How many regions are deployed on each RegionServer?
5. Are there any warnings or errors in the RS log around the spike?
6. Is any job scanning every 10 minutes? Which table contributes the most I/O? Is there any hotspot?
7. Is HDFS healthy? Check the DN logs: are there any slow messages around the spike? Refer to https://my.cloudera.com/knowledge/Diagnosing-Errors-Error-Slow-ReadProcessor-Error-Slow?id=73443

Regards, Will
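For questions 2 and 5, a quick way to see whether slow-operation messages line up with the 10-minute spikes is to count them per 10-minute window. The sketch below assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp; the sample lines are invented for illustration and are not real RegionServer output, and `slow_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumption: log lines begin with a timestamp such as "2021-10-13 10:00:03,120".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):(\d{2})")
MARKERS = ("responseTooSlow", "operationTooSlow")


def slow_counts(lines, window=10):
    """Count lines containing a slow marker, grouped into `window`-minute buckets."""
    counts = Counter()
    for line in lines:
        if not any(marker in line for marker in MARKERS):
            continue
        match = TIMESTAMP.match(line)
        if not match:
            continue
        hour, minute = match.group(1), int(match.group(2))
        bucket = minute - minute % window
        counts[f"{hour}:{bucket:02d}"] += 1
    return counts


# Invented sample lines; only the marker strings come from the post.
SAMPLE = [
    "2021-10-13 10:00:03,120 WARN example (responseTooSlow): ...",
    "2021-10-13 10:07:41,002 WARN example (operationTooSlow): ...",
    "2021-10-13 10:10:05,554 WARN example (responseTooSlow): ...",
    "2021-10-13 10:10:06,000 INFO example unrelated line",
]

for bucket, count in sorted(slow_counts(SAMPLE).items()):
    print(bucket, count)
```

If the counts cluster in the same windows as the latency spikes, the matching log lines are the snippets worth sharing.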

willx · 10-02-2021

@Tamiri, Please click on your avatar and check My settings > SUBSCRIPTIONS&NOTIFICATIONS. Another option: when you reply to a post, select "Email me when someone replies" at the top right. Regards, Will

willx · 10-01-2021

Hello @rahuledavalath, What HDP version and what CDP version are you using? Regards, Will

willx · 09-29-2021

Then the above solutions meet your needs.

willx · 09-29-2021

Hi @Visvanath_JP,

The question could be more specific: What Hadoop versions are the two clusters running? Are both clusters secured? Are they CDH/CDP or HDP? Are you migrating data only in the HDFS layer, or in other layers as well, for example Hive / HBase / Kudu?

The most common way to migrate data between HDFS clusters is distcp: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/scaling-namespaces/topics/hdfs-distcp-to-copy-files.html

If you are using CDH/CDP, a BDR job (which integrates distcp) is another choice: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/replication-manager/topics/rm-dc-hdfs-replication.html

Distcp guide: https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

Regards, Will

If the answer helps, please accept as solution and click thumbs up.
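As a minimal sketch of the distcp approach, the snippet below only assembles the command line for a cluster-to-cluster copy; it does not run anything. The NameNode hostnames, port, and paths are placeholders, `distcp_command` is a hypothetical helper, and secured clusters or differing Hadoop versions may need additional options described in the guides linked above.

```python
import shlex


def distcp_command(src, dst, update=False):
    """Build the argument list for a basic `hadoop distcp` copy."""
    cmd = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the target.
        cmd.append("-update")
    cmd += [src, dst]
    return cmd


# Placeholder NameNode addresses and paths; replace with your own.
command = distcp_command(
    "hdfs://nn1.example.com:8020/data/source",
    "hdfs://nn2.example.com:8020/data/target",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` on a host that has the Hadoop client installed and network access to both clusters.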

Last Visited	12-13-2024 10:32 PM

Member Since	10-03-2020 06:12 AM
Posts	235
Kudos received	14

Cloudera Community

Re: Services not starting up after Enabling Kerber...

Re: What is the difference between volumes and fol...

Re: Hbase labels table creation

Re: All Hdfs file names older than N days

Re: All Hdfs file names older than N days

Re: Hbase labels table creation

Re: All Hdfs file names older than N days

Re: All Hdfs file names older than N days

Re: HBase latency spikes every 10 minutes

Re: HDFS balancer with small files

Re: HBase latency spikes every 10 minutes

Re: HDP 3.0.1 Sandbox on VirtualBox password admi...

Re: Cloning Phoenix Hbase table snapshot from a HD...

Re: HDFS Data migration from one data center to ot...

Re: HDFS Data migration from one data center to ot...