Member since: 01-19-2017
Posts: 3676
Kudos Received: 632
Solutions: 372
10-29-2019
01:03 PM
1 Kudo
@Elephanta Any updates? Why the high number of block deletes? Are those under-replicated blocks? There are many parameters to evaluate, such as the number of data nodes, the replication factor, and rack awareness! I would be happy to get your feedback.
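As a starting point, here is a minimal sketch of how to check two of those parameters with standard HDFS commands (run as the hdfs user):

# How many live DataNodes does the cluster actually have?
$ hdfs dfsadmin -report | grep -i 'live datanodes'

# What is the cluster-wide default replication factor?
$ hdfs getconf -confKey dfs.replication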
10-29-2019
11:46 AM
1 Kudo
@mike_bronson7 Your plan is doable, and that is the way many companies have deployed their Kafka production clusters if they intend to use ONLY Kafka. You could take it a step further for HA and reliability by orchestrating all of it with Kubernetes and PVCs (Persistent Volume Claims); it's a great idea. Running Kafka as microservices on Kubernetes has become the norm and the path of least resistance.

It is very difficult to allocate physical machines with local disks for Kafka, and companies running on VMs have found that deploying Kafka outside of Kubernetes causes significant organizational headache. Running Kafka on Kubernetes gets your environment allocated faster, so you can spend your time on productive work rather than firefighting. Kafka management also becomes much easier on Kubernetes: scaling up by adding new brokers is a single command or a single line in a configuration file, and it is easier to perform configuration changes, upgrades, and restarts on all brokers and all clusters.

Kafka is a stateful service, and this does make the Kubernetes configuration more complex than it is for stateless microservices. The biggest challenge is configuring storage and network, and you will want to make sure both subsystems deliver consistent low latency; that is where PVCs and shared storage come in. The beauty is that Kafka brokers run as pods: you can configure a fixed number that MUST be running at any time and scale when needed with a single kubectl or Helm command. That is elasticity at play!

Kafka also poses a challenge most stateful services don't: brokers are not interchangeable, and clients need to communicate directly with the broker that holds the lead replica of each partition they produce to or consume from. You can't place all brokers behind a single load balancer address; you must devise a way to route messages to a specific broker. A good read is the "Recommendations for Deploying Apache Kafka on Kubernetes" paper.

Happy hadooping
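To illustrate that "single command" elasticity, here is a rough sketch using a public Helm chart (the chart, release name, and replica counts are assumptions for illustration only, not a recommendation for your environment):

# Add a Kafka Helm chart repository and deploy 3 brokers as a StatefulSet (illustrative chart/values)
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install my-kafka bitnami/kafka --set replicaCount=3

# Scale up to 4 brokers with a single command; each broker pod keeps its own PVC
$ kubectl scale statefulset my-kafka --replicas=4

# Inspect the PersistentVolumeClaims backing the brokers
$ kubectl get pvc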
10-28-2019
10:21 PM
1 Kudo
@Elephanta The NameNode start is usually faster than what you are experiencing; from the logs I am seeing, here is maybe the root cause. During startup the NameNode reads the fsimage, which is the last best cluster image, usually a combination of the fsimage and the latest edits log. You will need to combine the current fsimage and the edits log using the steps below.

The NameNode is in safe mode. When the NameNode is in this state, it is a safety feature that disables any change to the namespace that the NameNode can't register/record in its metadata. Note that all the cluster-wide block locations, state, permissions, ownership, etc. are stored in that metadata, which lives on the NameNode. You will need to do the following steps as the root user while the cluster is hanging during startup.

# su - hdfs

Get the current state:
$ hdfs dfsadmin -safemode get

That will confirm that the NameNode is in safe mode. You will then need to force the creation of a point-in-time HDFS image by saving the namespace, which writes a new fsimage:
$ hdfs dfsadmin -saveNamespace

Leave safe mode:
$ hdfs dfsadmin -safemode leave

Confirm safe mode is now off (this time the output should report OFF), and the next startup should be much faster:
$ hdfs dfsadmin -safemode get

You might also need to tune the memory allocated to the NameNode; it seems the number of files to manage has increased, hence the need to reconfigure the memory. Have a look at Configuring NameNode Heap Size for guidance on estimating the memory required for the NameNode.
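If the heap does need to go up, here is a minimal sketch of the manual route (the 4 GB value is only an example; on an Ambari-managed cluster you would change the NameNode heap through the Ambari UI instead of editing files by hand):

# In hadoop-env.sh: give the NameNode a larger fixed heap (example size only),
# then restart the NameNode so the new heap takes effect
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g ${HADOOP_NAMENODE_OPTS}"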
10-28-2019
10:04 PM
@erkansirin78 That's exactly the output I was getting on my single-node cluster (not the Sandbox), but I didn't know exactly what you wanted. When you start getting errors, you can ping me!
10-28-2019
12:32 PM
@MIkeL The best technical reference before you embark on deploying your cluster is to check the compatibility of the different moving parts of the HDP/Cloudera binaries against the operating system of your choice. The first source of truth is the Cloudera/Hortonworks support matrix tool; please filter all the possible valid options there. Hortonworks and Cloudera run exhaustive tests on a particular operating system before certifying it as production-ready, and from the matrix, RHEL/CentOS 7.7 is not yet certified, so I highly doubt whether RHEL/CentOS 8 is certified. That explains the Python errors you are encountering. HTH
10-28-2019
11:55 AM
@Mnju Ranger and Sentry don't offer data quality; they provide a centralized security framework to manage fine-grained access control and policies across the cluster. Security administrators use Ranger to easily manage policies for access to files, folders, databases, tables, or columns. These policies can be set for individual users or groups and are then enforced consistently across the cluster. The latest version of Ranger, which ships with CDP (now available for AWS and later this year for Azure), manages access and authorization to the following resources using Ranger plugins: HDFS, Hive, Ozone, Atlas, NiFi-Registry, Storm, HBase, Knox, Kafka, YARN, NiFi, and Solr.

Sentry is a granular, role-based authorization module for Hadoop that provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoop cluster. It works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS (limited to Hive table data) and allows you to define authorization rules to validate a user's or application's access requests for Hadoop resources.

Both are security tools built for Hadoop that are usually combined with Kerberos, KMS, and TLS to provide a robust security framework. Data quality is a broad subject of discussion, but to my knowledge no tool manages data quality unless you are talking about catalog tools like Alation or Waterline. Atlas is a tool that provides metadata management, data lineage, and governance capabilities to build a catalog of data assets, with classification and governance across these assets. HTH
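To make the "role-based authorization" idea concrete, here is a minimal Sentry-style sketch run through Beeline (the JDBC URL, role, group, and database names are made-up examples, not from this thread):

# Create a role, map it to an OS/LDAP group, and grant read access on one database
$ beeline -u "jdbc:hive2://hiveserver2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
    -e "CREATE ROLE analyst_role; GRANT ROLE analyst_role TO GROUP analysts; GRANT SELECT ON DATABASE sales_db TO ROLE analyst_role;"

Ranger achieves the same outcome through policies defined in its admin UI or REST API rather than SQL grants.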
10-27-2019
06:34 AM
@erkansirin78 Extract from spark.apache.org: In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. I am not a Spark expert, but I am trying to understand.
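For example, a quick sketch of those two options (the master and the JAR path are placeholders, not from your setup):

# Start the Spark shell against YARN and add an extra JAR to the classpath;
# inside the shell, use the pre-created sc variable instead of building a new SparkContext
$ spark-shell --master yarn --jars /path/to/extra-lib.jar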
10-27-2019
03:40 AM
1 Kudo
@mike_bronson7 Surely you can use hdfs fsck / -delete, but remember the deleted files will be put in the trash!!!
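A small sketch of that flow, checking before deleting (the output paths will be whatever fsck reports on your cluster):

# List the files that have corrupt blocks first
$ hdfs fsck / -list-corruptfileblocks

# Then let fsck delete the corrupted files
$ hdfs fsck / -delete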
10-27-2019
03:11 AM
@erkansirin78 Can you share the steps you executed? Have a look at this spark-shell
10-27-2019
01:36 AM
@mike_bronson7 Regarding under-replicated blocks: HDFS is supposed to recover them automatically (by creating missing copies to fulfil the replication factor), but in your case the cluster-wide replication factor is 3 while the target is 10. The output suggests you have 5 DataNodes while 10 replicas are requested, leading to the under-replication alert!

According to the output you have 2 distinct problems, with 2 different solutions:
(a) Under-replicated blocks: Target Replicas is 10 but found 5 live replica(s) [last 2 lines]
(b) Corrupt blocks

Solution 1: under-replicated blocks

You could force the 2 blocks to align with the cluster-wide replication factor by adjusting them with -setrep:

$ hdfs dfs -setrep -w 3 [File_name]

Validate: you should now see 3 after the file permissions, before the user:group, like below:

$ hdfs dfs -ls [File_name]
-rw-r--r-- 3 analyst hdfs 1068028 2019-10-27 12:30 /flighdata/airports.dat

Then wait for the re-replication to happen, or run the below snippets sequentially:

$ hdfs fsck / | grep 'Under replicated'
$ hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
$ for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 3 $hdfsfile; done

Solution 2: corrupt files

$ hdfs fsck / | egrep -v '^\.+$' | grep -i corrupt

...............Example output............................
/user/analyst/test9: CORRUPT blockpool BP-762603225-192.168.1.2-1480061879099 block blk_1055741378
/user/analyst/data1: CORRUPT blockpool BP-762603225-192.168.1.2-1480061879099 block blk_1056741378
/user/analyst/data2: MISSING 3 blocks of total size 338192920 B.
Status: CORRUPT
CORRUPT FILES: 9
CORRUPT BLOCKS: 18
Corrupt blocks: 18
The filesystem under path '/' is CORRUPT

Locate the corrupted blocks:

$ hdfs fsck / | egrep -v '^\.+$' | grep -i "corrupt blockpool"| awk '{print $1}' |sort |uniq |sed -e 's/://g' > corrupted.flst

Get the block locations for a file listed in corrupted.flst:

$ hdfs fsck /user/analyst/xxxx -locations -blocks -files

Remove the corrupted files listed in corrupted.flst:

$ hdfs dfs -rm /path/to/corrupt_filename

Or skip the trash to permanently delete:

$ hdfs dfs -rm -skipTrash /path/to/corrupt_filename

You should give the cluster some time to re-replicate in the case of under-replicated files.