Member since: 01-09-2019
Posts: 401
Kudos Received: 163
Solutions: 80
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2594 | 06-21-2017 03:53 PM |
|  | 4282 | 03-14-2017 01:24 PM |
|  | 2388 | 01-25-2017 03:36 PM |
|  | 3831 | 12-20-2016 06:19 PM |
|  | 2098 | 12-14-2016 05:24 PM |
04-28-2016
12:40 PM
The advantage of using HDF here is that you can do any preprocessing/filtering on your logs before you put them into Elasticsearch. This is one of the common use cases where logs are preprocessed before being fed into a system like Splunk/Logstash.
04-28-2016
12:30 PM
From the Hadoop FAQ on apache.org, 3.12: "On an individual data node, how do you balance the blocks on the disk?" Hadoop currently does not have a method to do this automatically. To do it manually:
1. Shut down the DataNode involved.
2. Use the UNIX mv command to move the individual block replica and meta pairs from one directory to another on the selected host. On releases that include HDFS-6482 (Apache Hadoop 2.6.0+), you also need to ensure the subdir-named directory structure remains exactly the same when moving the blocks across disks. For example, if the block replica and its meta pair were under /data/1/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/ and you wanted to move them to the /data/5/ disk, then they MUST be moved into the same subdirectory structure underneath it, i.e. /data/5/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/. If this is not maintained, the DataNode will no longer be able to locate the replicas after the move.
3. Restart the DataNode.
However, this is not something that I recommend. A cleaner approach is to decommission the node, change the mount point, and add it back to the cluster. I say cleaner because touching the data directories directly can corrupt your data with a single misstep. If you do go the manual route, a sketch of the move is below.
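This is only a minimal sketch of step 2 above, reusing the block pool and subdir paths from the example; the block id is hypothetical, so use the actual replica/meta file names found on your disk, and run it only while the DataNode is stopped.

```bash
# Sketch: move one block replica and its .meta file from /data/1 to /data/5,
# preserving the exact subdir path underneath the block pool directory.
BP=BP-1788246909-172.23.1.202-1412278461680
SUBDIR=current/finalized/subdir0/subdir1
BLOCK=blk_1073741825   # hypothetical block id; use the real file names on your disk

mkdir -p /data/5/dfs/dn/current/${BP}/${SUBDIR}
mv /data/1/dfs/dn/current/${BP}/${SUBDIR}/${BLOCK} \
   /data/1/dfs/dn/current/${BP}/${SUBDIR}/${BLOCK}_*.meta \
   /data/5/dfs/dn/current/${BP}/${SUBDIR}/
```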
04-27-2016
09:21 PM
5 Kudos
There are pros and cons to both approaches.

VM-based pros:
1. 'Easier' node management. Some IT infrastructure teams insist on VMs even if you map one physical node to one virtual node, because all their other infrastructure is VM-based.
2. Taking advantage of NUMA and memory locality. There are some articles on this from virtual infrastructure providers that you can take a look at.

VM-based disadvantages:
1. Overhead. As an example, if you are running 4 VMs per physical node, you are running 4 operating systems, 4 DataNode services, 4 NodeManagers, 4 ambari-agents, 4 metrics collectors, and 4 of every other worker service instead of one of each, and those extra services carry overhead.
2. Data locality and redundancy. There is support for making the cluster aware of physical nodes so that no two replicas land on the same physical node, but that is extra configuration. You can also run into virtual disk performance problems if the disks are not configured properly.

Given a choice, I prefer physical servers. However, it's not always your choice. In those cases, try to get the following (see the sketch below for the topology piece):
1. Explicit virtual-disk-to-physical-disk mapping. Say you have 2 VMs per physical node and each physical node has 16 data drives: give 8 drives to one VM and the other 8 to the second VM, so physical disks are not shared between VMs.
2. No more than 2 VMs per physical node, to minimize the overhead from the extra services.

Regarding your question about mixing physical and virtual machines, try to keep all your worker nodes on similar hardware. While heterogeneous hardware is supported, you can run into issues when nodes have different hardware profiles. That said, some customers have used VMs for master services and physical nodes for worker nodes; this was one way of getting away from the NameNode SPOF issues in the Hadoop 1 days.
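One hedged illustration of the "extra configuration" mentioned above: HDFS rack awareness is driven by a topology script (net.topology.script.file.name in core-site.xml), and a script that groups VMs by the physical host they run on lets the block placement policy treat each physical host like a rack. The mapping file name and format here are assumptions, not a standard.

```bash
#!/bin/bash
# Hypothetical topology script: Hadoop passes one or more VM hostnames/IPs as
# arguments and expects one "rack" path per argument on stdout.
# /etc/hadoop/conf/vm-to-host.map (assumed format):  <vm-hostname>  <physical-host-id>
MAP_FILE=/etc/hadoop/conf/vm-to-host.map

for node in "$@"; do
  phys=$(awk -v n="$node" '$1 == n {print $2}' "$MAP_FILE")
  # Fall back to a default group if the VM is not in the map.
  echo "/${phys:-default-host}"
done
```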
04-27-2016
08:04 PM
1 Kudo
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/bk_installing_manually_book-20160301.pdf Take a look at page 58. TeraSort is already part of hadoop-mapreduce-examples-<version>.jar, and that section has the steps for running TeraSort once the install is complete.
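Not taken from the linked document, just a minimal sketch of the usual TeraGen/TeraSort/TeraValidate sequence; the jar path assumes an HDP layout, and the sizes and output directories are arbitrary.

```bash
# Generate ~10 GB of input (100,000,000 rows x 100 bytes), sort it, then validate.
EXAMPLES_JAR=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar

hadoop jar "$EXAMPLES_JAR" teragen 100000000 /tmp/teragen-out
hadoop jar "$EXAMPLES_JAR" terasort /tmp/teragen-out /tmp/terasort-out
hadoop jar "$EXAMPLES_JAR" teravalidate /tmp/terasort-out /tmp/teravalidate-out
```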
04-27-2016
03:17 PM
The example above shows you that. --hcatalog-storage-stanza "stored as orcfile" creates a new Hive table through HCatalog, with the data stored as ORC. You can also create a Hive table in ORC format beforehand and then use --hcatalog-table to write into it directly as ORC (see the sketch below).
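A hedged sketch of that second path, reusing the connection details from my Sqoop example in this thread; the ORC table name and columns are made up for illustration. The table is created in Hive first, then Sqoop writes into it through HCatalog, so --create-hcatalog-table and the storage stanza are not needed.

```bash
# Create the ORC-backed table up front (table and column names are illustrative).
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE TABLE default.testtable_orc (id INT, name STRING)
  STORED AS ORC"

# Import into the existing ORC table via HCatalog.
sqoop import \
  --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --driver com.mysql.jdbc.Driver \
  --username hive --password hive \
  --table testtable \
  --hcatalog-database default \
  --hcatalog-table testtable_orc \
  -m 1
```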
04-27-2016
03:14 PM
@Terry Padgett If you want to store these temporary tables as ORC, that is still possible. Here is an example: create temporary table tp1 stored as orcfile as select count(*) from table_params; My earlier answer was about whether the default text format is compressed on HDFS.
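If you want to confirm the format, a small sketch (assuming Beeline can reach HiveServer2 at the usual local address) is to create the temporary table and run DESCRIBE FORMATTED in the same session; the InputFormat/OutputFormat lines in the output will show ORC:

```bash
# Temporary tables are session-scoped, so create and inspect in one Beeline session.
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE TEMPORARY TABLE tp1 STORED AS ORCFILE AS SELECT count(*) FROM table_params;
  DESCRIBE FORMATTED tp1;"
```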
04-27-2016
03:41 AM
Exact same error here too. Below is the snippet from the ambari-server log that contains the error: metronambariserver.txt
04-26-2016
10:44 PM
1 Kudo
@Terry Padgett These are stored as uncompressed text files.
04-26-2016
02:44 PM
1 Kudo
@Nilesh Below is an example where I imported a table that is in MySQL into Hive in ORC format. You don't need to create the ORC-backed table in advance; the key is --hcatalog-storage-stanza.
sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" --driver com.mysql.jdbc.Driver --username hive --password hive --table testtable --hcatalog-database default --hcatalog-table testtable --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile" -m 1
04-26-2016
02:32 PM
You can use any special character that is not part of your data, like '|', as a delimiter. But make sure you generate your raw data in that format (fields terminated by some special character, Control-A being the default, and lines terminated by another special character). Another option, if you are using Sqoop to import this data, is to have Sqoop explicitly drop or replace delimiter characters that appear inside the data (--hive-drop-import-delims and --hive-delims-replacement). A sketch of both options is below.
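This is only a minimal sketch of both options, with hypothetical table, column, and connection names; --fields-terminated-by sets the custom '|' delimiter, and --hive-drop-import-delims strips Hive's default delimiter characters (\n, \r, \01) if they occur inside the source columns.

```bash
# Option 1: a Hive table whose raw text files use '|' between fields.
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE TABLE logs_pipe (ts STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE"

# Option 2: let Sqoop write '|'-delimited data into Hive and drop Hive's
# default delimiter characters from string columns on the way in.
sqoop import \
  --connect "jdbc:mysql://dbhost/sourcedb" \
  --username user --password pass \
  --table logs \
  --hive-import --hive-table logs_pipe \
  --fields-terminated-by '|' \
  --hive-drop-import-delims \
  -m 1
```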