Member since: 01-09-2019
Posts: 401
Kudos Received: 163
Solutions: 80
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2594 | 06-21-2017 03:53 PM |
|  | 4282 | 03-14-2017 01:24 PM |
|  | 2388 | 01-25-2017 03:36 PM |
|  | 3831 | 12-20-2016 06:19 PM |
|  | 2098 | 12-14-2016 05:24 PM |
04-28-2016
12:40 PM
The advantage of using HDF here is that you can do any preprocessing/filtering on your logs before you put them into Elasticsearch. This is one of the common use cases where logs are preprocessed before being fed into a system like Splunk/Logstash.
04-28-2016
12:30 PM
From the Hadoop FAQ on apache.org, 3.12: "On an individual data node, how do you balance the blocks on the disk?" Hadoop currently does not have a method to do this automatically. To do it manually:
1. Shut down the DataNode involved.
2. Use the UNIX mv command to move the individual block replica and meta pairs from one directory to another on the selected host. On releases that include HDFS-6482 (Apache Hadoop 2.6.0+), you also need to ensure the subdir-named directory structure remains exactly the same when moving the blocks across disks. For example, if the block replica and its meta pair were under /data/1/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/ and you wanted to move them to the /data/5/ disk, then they MUST be moved into the same subdirectory structure underneath it, i.e. /data/5/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/. If this is not maintained, the DataNode will no longer be able to locate the replicas after the move.
3. Restart the DataNode.
However, this is not something that I recommend. A cleaner approach is to decommission the node, change the mount point, and add it back to the cluster. I say cleaner because touching the data directories directly can corrupt your data with a single misstep. If you do go the manual route, a sketch of the move is below.
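This is only a minimal sketch of step 2 above, reusing the block pool and subdir paths from the example; the block id is hypothetical, so use the actual replica/meta file names found on your disk, and run it only while the DataNode is stopped.

```bash
# Sketch: move one block replica and its .meta file from /data/1 to /data/5,
# preserving the exact subdir path underneath the block pool directory.
BP=BP-1788246909-172.23.1.202-1412278461680
SUBDIR=current/finalized/subdir0/subdir1
BLOCK=blk_1073741825   # hypothetical block id; use the real file names on your disk

mkdir -p /data/5/dfs/dn/current/${BP}/${SUBDIR}
mv /data/1/dfs/dn/current/${BP}/${SUBDIR}/${BLOCK} \
   /data/1/dfs/dn/current/${BP}/${SUBDIR}/${BLOCK}_*.meta \
   /data/5/dfs/dn/current/${BP}/${SUBDIR}/
```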
04-27-2016
09:21 PM
5 Kudos
There are pros and cons to both approaches.

VM-based pros:
1. 'Easier' node management. Some IT infrastructure teams insist on VMs even if you map one physical node to one virtual node, because all their other infrastructure is VM-based.
2. Taking advantage of NUMA and memory locality. There are some articles on this from virtual infrastructure providers that you can take a look at.

VM-based disadvantages:
1. Overhead. As an example, if you are running 4 VMs per physical node, you are running 4 operating systems, 4 DataNode services, 4 NodeManagers, 4 ambari-agents, 4 metrics collectors, and 4 of every other worker service instead of one of each, and those extra services carry overhead.
2. Data locality and redundancy. There is support for making the cluster aware of physical nodes so that no two replicas land on the same physical node, but that is extra configuration. You can also run into virtual disk performance problems if the disks are not configured properly.

Given a choice, I prefer physical servers. However, it's not always your choice. In those cases, try to get the following (see the sketch below for the topology piece):
1. Explicit virtual-disk-to-physical-disk mapping. Say you have 2 VMs per physical node and each physical node has 16 data drives: give 8 drives to one VM and the other 8 to the second VM, so physical disks are not shared between VMs.
2. No more than 2 VMs per physical node, to minimize the overhead from the extra services.

Regarding your question about mixing physical and virtual machines, try to keep all your worker nodes on similar hardware. While heterogeneous hardware is supported, you can run into issues when nodes have different hardware profiles. That said, some customers have used VMs for master services and physical nodes for worker nodes; this was one way of getting away from the NameNode SPOF issues in the Hadoop 1 days.
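One hedged illustration of the "extra configuration" mentioned above: HDFS rack awareness is driven by a topology script (net.topology.script.file.name in core-site.xml), and a script that groups VMs by the physical host they run on lets the block placement policy treat each physical host like a rack. The mapping file name and format here are assumptions, not a standard.

```bash
#!/bin/bash
# Hypothetical topology script: Hadoop passes one or more VM hostnames/IPs as
# arguments and expects one "rack" path per argument on stdout.
# /etc/hadoop/conf/vm-to-host.map (assumed format):  <vm-hostname>  <physical-host-id>
MAP_FILE=/etc/hadoop/conf/vm-to-host.map

for node in "$@"; do
  phys=$(awk -v n="$node" '$1 == n {print $2}' "$MAP_FILE")
  # Fall back to a default group if the VM is not in the map.
  echo "/${phys:-default-host}"
done
```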
04-27-2016
08:04 PM
1 Kudo
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/bk_installing_manually_book-20160301.pdf Take a look at page 58. TeraSort is already part of hadoop-mapreduce-examples-<version>.jar, and that section has the steps for running TeraSort once the install is complete.
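Not taken from the linked document, just a minimal sketch of the usual TeraGen/TeraSort/TeraValidate sequence; the jar path assumes an HDP layout, and the sizes and output directories are arbitrary.

```bash
# Generate ~10 GB of input (100,000,000 rows x 100 bytes), sort it, then validate.
EXAMPLES_JAR=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar

hadoop jar "$EXAMPLES_JAR" teragen 100000000 /tmp/teragen-out
hadoop jar "$EXAMPLES_JAR" terasort /tmp/teragen-out /tmp/terasort-out
hadoop jar "$EXAMPLES_JAR" teravalidate /tmp/terasort-out /tmp/teravalidate-out
```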
04-27-2016
03:17 PM
The example above shows you that. --hcatalog-storage-stanza "stored as orcfile" creates a new Hive table through HCatalog, with the data stored as ORC. You can also create a Hive table in ORC format beforehand and then use --hcatalog-table to write into it directly as ORC (see the sketch below).
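A hedged sketch of that second path, reusing the connection details from my Sqoop example in this thread; the ORC table name and columns are made up for illustration. The table is created in Hive first, then Sqoop writes into it through HCatalog, so --create-hcatalog-table and the storage stanza are not needed.

```bash
# Create the ORC-backed table up front (table and column names are illustrative).
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE TABLE default.testtable_orc (id INT, name STRING)
  STORED AS ORC"

# Import into the existing ORC table via HCatalog.
sqoop import \
  --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --driver com.mysql.jdbc.Driver \
  --username hive --password hive \
  --table testtable \
  --hcatalog-database default \
  --hcatalog-table testtable_orc \
  -m 1
```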
04-27-2016
03:14 PM
@Terry Padgett If you want to store these temporary tables as ORC, that is still possible. Here is an example: create temporary table tp1 stored as orcfile as select count(*) from table_params; My earlier answer was about whether the default text format is compressed on HDFS.
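If you want to confirm the format, a small sketch (assuming Beeline can reach HiveServer2 at the usual local address) is to create the temporary table and run DESCRIBE FORMATTED in the same session; the InputFormat/OutputFormat lines in the output will show ORC:

```bash
# Temporary tables are session-scoped, so create and inspect in one Beeline session.
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE TEMPORARY TABLE tp1 STORED AS ORCFILE AS SELECT count(*) FROM table_params;
  DESCRIBE FORMATTED tp1;"
```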
04-27-2016
03:41 AM
Exact same error here too. Below is the snippet from the ambari-server log that contains the error: metronambariserver.txt
04-26-2016
10:44 PM
1 Kudo
@Terry Padgett These are stored as uncompressed text files.
04-26-2016
02:44 PM
1 Kudo
@Nilesh Below is an example where I imported a table that is in MySQL into Hive in ORC format. You don't need to create the ORC-backed table in advance; the key is --hcatalog-storage-stanza.
sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" --driver com.mysql.jdbc.Driver --username hive --password hive --table testtable --hcatalog-database default --hcatalog-table testtable --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile" -m 1
04-26-2016
02:32 PM
You can use any special character that is not part of your data, like '|', as a delimiter. But make sure you generate your raw data in that format (fields terminated by some special character, Control-A being the default, and lines terminated by another special character). Another option, if you are using Sqoop to import this data, is to have Sqoop explicitly drop or replace delimiter characters that appear inside the data (--hive-drop-import-delims and --hive-delims-replacement). A sketch of both options is below.
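This is only a minimal sketch of both options, with hypothetical table, column, and connection names; --fields-terminated-by sets the custom '|' delimiter, and --hive-drop-import-delims strips Hive's default delimiter characters (\n, \r, \01) if they occur inside the source columns.

```bash
# Option 1: a Hive table whose raw text files use '|' between fields.
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE TABLE logs_pipe (ts STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE"

# Option 2: let Sqoop write '|'-delimited data into Hive and drop Hive's
# default delimiter characters from string columns on the way in.
sqoop import \
  --connect "jdbc:mysql://dbhost/sourcedb" \
  --username user --password pass \
  --table logs \
  --hive-import --hive-table logs_pipe \
  --fields-terminated-by '|' \
  --hive-drop-import-delims \
  -m 1
```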