Member since: 10-28-2015
Posts: 61
Kudos Received: 10
Solutions: 7
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 819 | 09-25-2017 11:22 PM |
 | 3267 | 09-22-2017 08:04 PM |
 | 3347 | 02-03-2017 09:28 PM |
 | 2112 | 05-10-2016 05:04 AM |
 | 552 | 05-04-2016 08:22 PM |
07-07-2018
08:54 PM
In certain Apache Hadoop use cases we want to get the checksum of files stored in HDFS. This is specifically useful when moving data to or from HDFS, to verify that a file was transferred correctly. Earlier there was no easy way to do this comparison, but starting with Apache Hadoop 3.1 (HDFS-13056) we can compare the checksum of a file stored in HDFS with that of a file stored locally.

The default checksum algorithm for HDFS chunks is CRC32C. A client can override it by setting dfs.checksum.type (either CRC32 or CRC32C). This is not a cryptographically strong checksum, but it is good enough for a quick comparison. When we run the checksum command (hdfs dfs -checksum) on an HDFS file, it calculates an MD5 of the MD5s of the checksums of the individual chunks (each chunk is typically 512 bytes long). However, this is not very useful for comparison with a local copy.

Example

For example, the command below computes the checksum of the file hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar stored in HDFS:

hdfs dfs -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar MD5-of-0MD5-of-512CRC32C 000002000000000000000000c16859d1d071c6b1ffc9c8557d4909f1

However, this checksum is not easily comparable to that of a local copy. Instead, we can calculate the CRC32C checksum of the whole file by adding -Ddfs.checksum.combine.mode=COMPOSITE_CRC to the same command:

hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar COMPOSITE-CRC32C 3799db55

The property dfs.checksum.combine.mode=COMPOSITE_CRC tells HDFS to calculate a combined CRC of the individual chunk CRCs instead of the MD5-of-MD5-of-CRCs. It is important to note that we can only calculate a checksum of type CRC32C or CRC32 for an HDFS file depending on how it was originally written. For example, we cannot calculate CRC32 for the file in the example above, because its chunks were originally written with CRC32C checksums. If we want the CRC32 of that file, we need to specify dfs.checksum.type as CRC32 while writing it:

hdfs dfs -Ddfs.checksum.type=CRC32 -put hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar /tmp
hdfs dfs -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar MD5-of-0MD5-of-512CRC32 0000020000000000000000009f26e871c80d4cbd78b8d42897e5b364
hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar COMPOSITE-CRC32 c1ddb422

This checksum can easily be compared with the checksum of the same file on the local file system using the crc32 command:

crc32 hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
c1ddb422
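Putting the two sides together, here is a minimal sketch of the comparison as a shell script. It assumes the output format shown above (path, algorithm, checksum as the third field), that the crc32 utility is installed locally, and that the file was written with a matching dfs.checksum.type:

HDFS_FILE=/tmp/hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar
LOCAL_FILE=hadoop-common-2.7.3.2.6.3.0-SNAPSHOT.jar

# Third field of the -checksum output is the checksum itself, as in the sample output above.
hdfs_crc=$(hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum "$HDFS_FILE" | awk '{print $3}')
local_crc=$(crc32 "$LOCAL_FILE")

if [ "$hdfs_crc" = "$local_crc" ]; then
  echo "Checksums match: $hdfs_crc"
else
  echo "Checksum mismatch: hdfs=$hdfs_crc local=$local_crc"
fi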
- Find more articles tagged with:
- checksum
- compare
- hadoop
- Hadoop Core
- HDFS
- How-To/Tutorial
03-29-2018
10:22 PM
2 Kudos
Ozone is an object store for Hadoop. It is a redundant, distributed object store built by leveraging primitives present in HDFS. Below are some key features of Ozone:
- A Hadoop-compatible file system, called Ozone File System, that allows programs like Hive or Spark to run against Ozone without any modifications.
- Support for RPC and REST APIs for accessing the store.
- Built to support billions of keys in a distributed environment.
- The ability to run concurrently with HDFS.

Like many other object stores, Ozone has a notion of volumes. Only administrators can create volumes. Users create buckets in the volumes, and to store data inside a bucket they create keys. The Ozone File System allows other Hadoop ecosystem applications like Hive and Spark to use Ozone; once a bucket is created, it is trivial to create an Ozone file system on top of it.

A 10-thousand-foot view of Ozone
- OzoneManager (OM) acts as the namespace manager. All Ozone entities like volumes, buckets, and keys are managed by OM. OM talks to an independent block manager (the Storage Container Manager, SCM) to get blocks and passes them on to the Ozone client.
- SCM: the Storage Container Manager is the block and cluster manager for Ozone.
- Block: blocks are similar to blocks in HDFS; they are replicated blocks of data.

These components map very closely to the existing HDFS NameNode and DataNodes. The most significant difference is the presence of a dedicated block manager, SCM.

Using Ozone
The easiest way to run Ozone is to try it out using Docker. To build Ozone from source, check out the Hadoop sources from GitHub, then check out the Ozone branch, HDFS-7240, and build it.
git checkout HDFS-7240
You can build Ozone by running the following command:

mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true -Pdist -Phdsl -Dtar -DskipShade
The skipShade flag just makes compilation faster and is not strictly required.

Running Ozone via Docker
This assumes that you have a working Docker setup on the machine. Run the following commands to see Ozone in action. First, go to the directory where the docker-compose files live:
cd hadoop-dist/target/compose/ozone

Start Ozone:

docker-compose up -d
Log into the datanode container:

docker exec -it ozone_datanode_1 bash
Run the Ozone load generator:

./bin/oz freon

Take a look at the OzoneManager UI at http://localhost:9874/ to see all the requests made by Freon. Congratulations on your first Ozone deployment! In the next part of this tutorial we will cover the oz command shell and look at how to use Ozone to store files.
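If something does not come up, a couple of standard docker-compose commands are handy for checking the cluster; they are not part of the walkthrough above, and the service name passed to logs is an assumption about the compose file:

docker-compose ps                  # confirm the ozone containers are up
docker-compose logs -f datanode    # follow the logs of one service; 'datanode' is an assumed service name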
- Find more articles tagged with:
- FAQ
- Hadoop Core
- hadoop-ecosystem
- HDFS
- storage
10-17-2017
04:47 PM
@John Carter, it will depend on the kind of latency, processing, and data volume you will be handling. The two are different approaches: Sqoop, as you know, runs MapReduce jobs, while the NiFi use case is on the streaming side. Given the right resources, both will work.
09-26-2017
02:36 AM
@Teja Damineni Good to know that your data is safe. I recommend taking regular backups of the NameNode metadata to avoid any future issues.
09-25-2017
11:22 PM
@Teja Damineni
"If I perform a re-installation (of both Ambari server and NameNode) on the same node with a fresh OS installed, will it wipe my data?"
If you have taken a backup of the NameNode metadata, you can use it to restore the NameNode. Without the backup you cannot recover your data.
"What about the services? Do I have to completely remove all services and re-install them?"
If you have a backup of the Ambari database, you can use it to re-initialize Ambari to its old state; otherwise you have to reinstall everything.
09-25-2017
10:57 PM
Hi @Bin Ye, check for the "dfs.namenode.name.dir" entry in the config files. Try to grep for /hadoop/hdfs/namenode/current in the config directory and see if you can locate the config file that is overriding your settings.
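As a concrete sketch of that grep, assuming the usual HDP config directory /etc/hadoop/conf:

grep -rn "dfs.namenode.name.dir" /etc/hadoop/conf/
grep -rn "/hadoop/hdfs/namenode/current" /etc/hadoop/conf/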
09-22-2017
08:04 PM
Hi @Bin Ye, check the NameNode logs and share any error/exception. Some common issues:
- The hostname specified in "fs.default.name" is not valid.
- The port is already used by an existing service.
- Missing file permissions on the directories specified in "dfs.name.dir" and "dfs.data.dir".
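A quick sketch of those checks; the port (8020) and the directory paths below are typical defaults rather than values taken from your configs:

hostname -f                                      # should match the host in fs.default.name
netstat -tlnp | grep 8020                        # is the NameNode port already taken?
ls -ld /hadoop/hdfs/namenode /hadoop/hdfs/data   # owned and writable by the hdfs user?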
09-22-2017
07:36 PM
1 Kudo
@John Carter Using ExecuteStreamCommand will also work. Alternatively, if you want to use Sqoop for all the transfers, you can wrap the Sqoop command in a shell script and use ExecuteProcess. You can decide after weighing the pros and cons of the various approaches. With the actual processing inside NiFi you get built-in fault tolerance and monitoring.
09-14-2017
09:20 PM
@kerra Check whether the HiveServer2 instance you are trying to connect to is configured properly in Knox. Also check that HiveServer2 is set to hive.server2.transport.mode=http. If the ZooKeeper hosts are accessible, I would recommend using service discovery, as it will auto-detect the port, host, and other details. Are you able to connect to Hive via beeline?
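For that last check, hedged examples of the two connection strings to try; the hosts, ports, the Knox topology name (default), and the truststore settings are assumptions that need to match your environment:

# Directly against HiveServer2 in HTTP transport mode (port and httpPath are common defaults):
beeline -u "jdbc:hive2://<hs2-host>:10001/;transportMode=http;httpPath=cliservice"

# Through the Knox gateway (topology 'default' and gateway port 8443 assumed):
beeline -u "jdbc:hive2://<knox-host>:8443/;ssl=true;sslTrustStore=/path/to/gateway.jks;trustStorePassword=<pwd>;transportMode=http;httpPath=gateway/default/hive" -n <user> -p <password>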
09-14-2017
01:39 AM
@John Carter, depending on the actual use case you have a couple of options to choose from in NiFi. In its simplest form, we can read Hive records using "SelectHiveQL", which can output records in either CSV or Avro format. You can pass those records to the "PutDatabaseRecord" processor, which can read data in several formats including Avro and CSV. For this to work we need to configure the following services:
- HiveConnectionPool (for "SelectHiveQL")
- Record Reader (Avro, CSV)
- DBCPConnectionPool
This is one simple example. You can build more complex flows (which may involve filter, join, split, or aggregation) based on the actual requirements.
04-25-2017
06:56 PM
This issue was caused by port forwarding. The WebSocket used by the Zeppelin UI was not able to communicate with the backend.
02-21-2017
11:20 PM
1 Kudo
@Padmanabhan Vijendran Check whether the Oozie sharelib is configured properly and has the right HBase dependencies. If it is not configured properly, you can recreate the sharelib using the commands below:

/usr/hdp/current/oozie/bin/oozie-setup.sh sharelib create -locallib /usr/hdp/<version>/oozie/oozie-sharelib.tar.gz -fs hdfs://<namenode-host>:8020

oozie admin -oozie http://<oozie-host>:11000/oozie -sharelibupdate
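To confirm the sharelib is in place afterwards (the host, port, and HDFS path below are the usual defaults, adjust as needed):

oozie admin -oozie http://<oozie-host>:11000/oozie -shareliblist
hdfs dfs -ls /user/oozie/share/lib/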
02-21-2017
11:15 PM
1 Kudo
@Connor O'Neal The question is not very clear, but it seems you want to manage Kafka offsets from a consumer. You can manage the offsets manually in your code using a low-level consumer client.
02-16-2017
12:53 AM
@Kshitij Badani These two properties were commented out. I made the changes and restarted Zeppelin, but the outcome is the same.
02-15-2017
10:34 PM
screen-shot-2017-02-15-at-23048-pm.png While opening the Zeppelin UI on HDP 2.5 (on AWS) I don't see any welcome page. When I create a new notebook nothing happens, and the "Search your notebook" search box is also disabled. zeppelin.anonymous.allowed is set to true. There is no error/exception in the logs.
Labels:
- Apache Zeppelin
02-15-2017
09:02 PM
@Karthick T It seems the user starting the NameNode (usually hdfs) doesn't have ownership of the VERSION file "/trvapps/hadoop/hdfs/namenode/current/VERSION". Try again after changing the ownership.
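A hedged example of the fix; hdfs:hadoop is the typical HDP owner/group, so match it to whatever user actually starts the NameNode on your cluster:

ls -l /trvapps/hadoop/hdfs/namenode/current/VERSION    # check current ownership
chown -R hdfs:hadoop /trvapps/hadoop/hdfs/namenode     # hand the directory back to the NameNode user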
02-15-2017
09:00 PM
Change the owner of /trvapps/hadoop/hdfs/namenode/current/VERSION to the user starting the NameNode.
02-15-2017
08:55 PM
@shyam gurram The information you provided is insufficient. Please share: what operation are you trying? Is the cluster secured? What is the mode of access?
02-14-2017
04:16 PM
@Karthick T Please paste the logs (/var/log/hadoop/hdfs/hadoop-hdfs-namenode-thdppca0.out/log).
02-14-2017
04:12 PM
@Raja Sekhar Chintalapati
"will ambari create keytab entries for mysql?"
As far as I remember, the answer to this is no.
"If not, how do we create it manually?" You haven't given much context on this, but if you really want to create it manually, you can do it using the kadmin command (more details).
02-03-2017
09:30 PM
@Jagdish Saripella Are you using Ranger for HDFS ACLs? If yes, configure the Ranger policy accordingly.
02-03-2017
09:28 PM
@Narasimha K Have you added this host to the exclude list? Check /etc/hadoop/conf/yarn.exclude:
- If this host exists in that file, remove it.
- Execute yarn rmadmin -refreshNodes so YARN re-reads the configuration file.
01-05-2017
11:21 PM
@Sami Ahmad
1 - Where should HADOOP_HOME point to? HADOOP_HOME usually points to the Hadoop installation dir; for Hortonworks it is /usr/hdp/current/hadoop/ or /usr/hdp/current/hadoop-client/.
2 - Where should HADOOP_CONF_DIR point to? /etc/hadoop/conf, or a symlink under /etc/hadoop/.
3 - Where are the Hadoop log files? (Needed to debug YARN issues.) By default all logs are under /var/log (e.g. /var/log/hadoop).
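The same answers as shell, using typical HDP paths (adjust if your layout differs):

export HADOOP_HOME=/usr/hdp/current/hadoop-client
export HADOOP_CONF_DIR=/etc/hadoop/conf
ls /var/log/hadoop/ /var/log/hadoop-yarn/    # HDFS and YARN daemon logs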
05-11-2016
02:42 PM
@Steve Kaufman Could you please elaborate on what exactly you mean by transaction-type data? Do you want to store it in HDFS, Hive, or HBase? What kind of processing do you want to do on it, and how will you be consuming it?
05-10-2016
07:09 PM
@Pradeep kumar ResourceManager UI -> Application -> ApplicationMaster -> job link -> map/reduce link
05-10-2016
05:04 AM
@Pradeep kumar While an application is still running it will not be available in the JobHistory UI; only finished applications appear there. As you correctly identified, its tracking URL points to the ApplicationMaster while it runs, and once it is finished the link points to the History Server. After the completion of a MapReduce job, its history files are written to HDFS under the directory specified by mapreduce.jobhistory.intermediate-done-dir. The History Server continuously scans that intermediate directory, picks up any new files, and copies them to the directory specified by mapreduce.jobhistory.done-dir.
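A quick way to see where those two directories point on a given cluster; /etc/hadoop/conf is the usual HDP config dir and /mr-history/done is only the common default, so trust the grep output over it:

grep -E -A2 "mapreduce.jobhistory.(intermediate-)?done-dir" /etc/hadoop/conf/mapred-site.xml
hdfs dfs -ls /mr-history/done    # common default for the done-dir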
05-04-2016
08:22 PM
@Bindu Nayidi You can edit the corresponding log4j file: Ambari -> <Service> -> Configs -> Advanced log4j.
05-02-2016
11:07 PM
Hi @simran kaur Change oozie.libpath=hdfs://serverFQDN:8020/user/oozie/share/lib to oozie.libpath=hdfs://serverFQDN:8020/user/oozie/shared/lib, since you mentioned your Oozie lib is under the /user/oozie/shared/lib dir.
05-02-2016
11:02 PM
As others explained, it depends on the number of containers available on your cluster.
05-02-2016
06:25 PM
Adding HADOOP_TOKEN_FILE_LOCATION resolved the issue:

-D mapreduce.job.credentials.binary=$HADOOP_TOKEN_FILE_LOCATION
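For context, a hedged illustration of how the flag is passed to a job launched from inside another job or an Oozie action; distcp is just an example, and the cluster names and paths are placeholders:

hadoop distcp -D mapreduce.job.credentials.binary=$HADOOP_TOKEN_FILE_LOCATION \
    hdfs://source-nn:8020/src/path hdfs://target-nn:8020/dst/path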