Member since: 06-17-2015
Posts: 61
Kudos Received: 20
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1117 | 01-21-2017 06:18 PM
 | 1285 | 08-19-2016 06:24 AM
 | 1155 | 06-09-2016 03:23 AM
 | 1340 | 05-27-2016 08:27 AM
02-02-2017
02:37 AM
Yes, similar to this: https://community.hortonworks.com/questions/79103/what-is-the-best-way-to-store-small-files-in-hadoo.html#comment-80387
... View more
01-29-2017
06:26 PM
@mqureshi Thanks a lot for your help and guidance 🙂. Thanks for explaining everything in detail.
... View more
01-29-2017
03:46 PM
Thanks so much for answering; I think I am getting closer to an answer. Please elaborate on the solutions you advised below, as I am not very familiar with Hive.
You advised: "Put data into Hive. There are ways to put XML data into Hive: at a very dirty level you can use the xpath UDF to work on XML data in Hive, or you can package it more elegantly by converting the XML to Avro and then using a SerDe to map the fields to column names (let me know if you want to go over this in more detail and I can help you there). Combine a bunch of files, zip them up, and upload them to HDFS. This option is good if your access is very cold (once in a while) and you are going to access the files physically (e.g. hadoop fs -get)."
FYI below: Please advise on how to store and read archived data from Hive. While storing data in Hive, should I save it as a .har file in HDFS? Our application generates small XML files which are stored on NAS, with the associated XML metadata in a database. The plan is to extract the metadata from the DB into one file and compress the XML into one huge archive, say 10 GB. Assuming each archive is 10 GB and the data is 3 months old, I want to know the best solution for storing and accessing this archived data in Hadoop --> HDFS/Hive/HBase. Please advise which approach you think is better for reading this archived data: if I store it in Hive, how do I retrieve it? Please guide me on storing archived data in Hive, and also on retrieving/reading it from Hive when needed.
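For reference, a minimal sketch of the xpath UDF route mentioned above, assuming one XML document per line in the files; the table name, location, and XPath expressions are hypothetical and would need to match your actual XML layout:
$ hive -e "CREATE EXTERNAL TABLE xml_archive (xml_text STRING) LOCATION '/data/archive/xml';"
$ hive -e "SELECT xpath_string(xml_text, '/record/id'), xpath_string(xml_text, '/record/created') FROM xml_archive LIMIT 10;"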
... View more
01-29-2017
03:42 PM
Hi, thanks for your reply, I appreciate your inputs. Please advise on how to store and read archived data from Hive. While storing data in Hive, should I save it as a .har file in HDFS? Our application generates small XML files which are stored on NAS, with the associated XML metadata in a database. The plan is to extract the metadata from the DB into one file and compress the XML into one huge archive, say 10 GB. Assuming each archive is 10 GB and the data is 3 months old, I want to know the best solution for storing and accessing this archived data in Hadoop --> HDFS/Hive/HBase. Please advise which approach you think is better for reading this archived data: if I store it in Hive, how do I retrieve it? Please guide me on storing archived data in Hive, and also on retrieving/reading it from Hive when needed.
... View more
01-29-2017
03:42 PM
@hduraiswamy I appreciate your inputs. Please advise on how to store and read archived data from Hive. While storing data in Hive, should I save it as a .har file in HDFS? Our application generates small XML files which are stored on NAS, with the associated XML metadata in a database. The plan is to extract the metadata from the DB into one file and compress the XML into one huge archive, say 10 GB. Assuming each archive is 10 GB and the data is 3 months old, I want to know the best solution for storing and accessing this archived data in Hadoop --> HDFS/Hive/HBase. Please advise which approach you think is better for reading this archived data: if I store it in Hive, how do I retrieve it? Please guide me on storing archived data in Hive, and also on retrieving/reading it from Hive when needed.
... View more
01-29-2017
03:40 PM
@mqureshi Hi, our application generates small XML files which are stored on NAS, with the associated XML metadata in a database. The plan is to extract the metadata from the DB into one file and compress the XML into one huge archive, say 10 GB. Assuming each archive is 10 GB and the data is 3 months old, I want to know the best solution for storing and accessing this archived data in Hadoop --> HDFS/Hive/HBase. Please advise which approach you think is better for reading this archived data: if I store it in Hive, how do I retrieve it? Please guide me on storing archived data in Hive, and also on retrieving/reading it from Hive when needed.
... View more
01-23-2017
05:38 PM
I would like to know if anyone has tried running the Cloudera or Hortonworks Docker image in Kubernetes. Is anyone aware of a good GitHub project that runs Cloudera/Hortonworks in containers?
... View more
Labels:
- Apache Hadoop
01-22-2017
06:55 PM
Hi Team, consider a Hadoop cluster with a default block size of 64 MB. We have a case where we would like to use Hadoop for storing historical data and retrieving it as needed. The historical data would be in the form of archives containing many (millions of) small files, which is why we would like to reduce the default block size in Hadoop to 32 MB. I also understand that changing the default to 32 MB may adversely affect the cluster if we plan to use it for applications that store very large files, so can anyone advise what to do in such situations?
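One thing worth noting (a sketch under the assumption that only the small-file ingest needs the smaller size, not the whole cluster): the block size can be overridden per command at write time instead of changing the cluster-wide default, for example:
$ hdfs dfs -D dfs.blocksize=33554432 -put archive-part-0001.xml /data/archive/
where 33554432 is 32 MB in bytes and the file/path names are hypothetical.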
... View more
Labels:
- Apache Hadoop
01-22-2017
06:23 PM
We have a situation where we have lots of small XML files residing on a Unix NAS, with the associated metadata in an Oracle DB.
We want to combine the 3-month-old XML and its associated metadata into one archive file (~10 GB) and store it in Hadoop. What is the best way to implement this in Hadoop? Note that after creating one big archive, we will still have many small files (each maybe 1 MB or less) inside the archive, so I might reduce the block size to 32 MB, for example.
I have read about Hadoop archive (.har) files and about storing data in HBase, and I would like to know the pros and cons from the Hadoop community's experience. What is the recommended practice for such situations? Can you please also advise on reducing the HDFS block size to 32 MB to cater to this requirement; how does that look? I want to read this data from Hadoop whenever needed without affecting performance. Thanks in advance.
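For what it's worth, a minimal sketch of the Hadoop Archive (.har) option discussed here; the directory names are hypothetical:
$ hadoop archive -archiveName xml-2016-q4.har -p /data/xml/2016-q4 /data/archive
$ hdfs dfs -ls har:///data/archive/xml-2016-q4.har
The archive packs the many small files into a few large HDFS files, while the har:// URI still lets you list and read the original files.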
... View more
Labels:
- Apache Hadoop
01-21-2017
06:18 PM
Thanks for confirming. So what I wrote is correct, i.e. changing dfs.blocksize; a restart will happen anyway.
... View more
01-19-2017
07:12 AM
Can you please advise whether we need to change any other parameter apart from dfs.blocksize in the HDFS config? Any other suggestions are also welcome, as long as they help save block space rather than wasting it. Also, do we need any change on the DataNode side?
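For context, a sketch of the property in question (33554432 bytes = 32 MB), which Ambari writes into hdfs-site.xml when you change it under the HDFS configs:
dfs.blocksize=33554432
Note that dfs.blocksize is applied by the client at write time, so existing files keep the block size they were written with.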
... View more
Labels:
- Apache Hadoop
09-19-2016
10:41 AM
1 Kudo
Please see the options below. NOTE: for both options (CopyTable and Export/Import), since the cluster is up, there is a risk that edits could be missed in the export process.
http://hbase.apache.org/0.94/book/ops_mgt.html#copytable
CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The usage is as follows:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename
http://hbase.apache.org/0.94/book/ops_mgt.html#export
Export is a utility that dumps the contents of a table to HDFS as a sequence file. Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
Import is a utility that loads data that has been exported back into HBase. Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
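As an illustration only (the table name and the peer ZooKeeper quorum are hypothetical), a concrete CopyTable invocation to another cluster might look like:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=zk1,zk2,zk3:2181:/hbase-unsecure mytable
Here --peer.adr is the destination cluster's ZooKeeper quorum, client port, and parent znode.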
... View more
08-24-2016
09:20 AM
@da li Hey, please have a look at the link below; if it helps, accept the answer: https://community.hortonworks.com/questions/153/impersonation-error-while-trying-to-access-ambari.html
You need to create the proxy settings for 'root', since Ambari runs as root. This allows it to impersonate the user in HDFS. You need to do a similar thing for the oozie user, like it is done for root:
hadoop.proxyuser.root.groups=*
hadoop.proxyuser.root.hosts=*
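A minimal sketch of the corresponding core-site entries for the oozie user (the * values are wide open; you may want to restrict them to the Oozie server hosts and specific groups):
hadoop.proxyuser.oozie.groups=*
hadoop.proxyuser.oozie.hosts=*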
... View more
08-22-2016
04:16 PM
@Scott Shaw Thanks, but please see the questions below:
1. Can I get the same performance that I get in my optimized, purpose-built HDP cluster? Since the data lake is central, can I tune it specifically for one application?
2. How can I manage different HDP versions in a data lake?
3. If something goes wrong with security or configuration because of one application, will my whole data lake be impacted?
... View more
08-22-2016
03:54 PM
1 Kudo
Hi, I have a small application that generates some reports without using any MapReduce code, and I want to understand the real benefits of using a data lake. I think it would be useful for an enterprise where many products write data to various Hadoop clusters, in order to have a unified view of the various issues and a common data store; apart from this, what are the other real benefits? How does a data lake work if I want a particular HDP version? I think it is easier to switch to a particular HDP version in a separate cluster via Ambari, but what about a data lake? Also, if multiple applications use the data lake and just one application requires frequent changes, like an HBase coprocessor for testing various things, is it advisable to go for a data lake? We get HA in a dedicated cluster as well, so what are the main technical advantages if we don't consider cost?
... View more
Labels:
- Apache Hadoop
- Apache HBase
08-19-2016
11:02 AM
1 Kudo
Hi Team, is anyone aware of installation issues that cause so many broken symlink problems during installation? I faced this with HDP 2.3.4 and Ambari 2.2.2.0. Please see: https://community.hortonworks.com/questions/33492/hdp-234-failed-parent-directory-usrhdpcurrenthadoo.html
I was installing a 3-node HDP 2.4.0.0 cluster; at the "Install, Start and Test" step the installation went fine on one node, but the other two nodes had random symlink issues. I had to fix the broken symlinks manually most of the time, and after spending a lot of time I was finally able to install HDP 2.4.0.0 successfully. The issues looked like the log below (also shown in the attached image):
2016-08-18 21:20:17,474 - Directory['/etc/hive'] {'mode': 0755}
2016-08-18 21:20:17,474 - Directory['/usr/hdp/current/hive-client/conf'] {'owner': 'hive', 'group': 'hadoop', 'recursive': True}
2016-08-18 21:20:17,474 - Creating directory Directory['/usr/hdp/current/hive-client/conf'] since it doesn't exist
I had the proper prerequisites in place before starting the installation, as given in http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/_hdp_24_repositories.html, and retrying randomly eventually works 😞. Please advise if you think I am doing something wrong, and share any good best practices for installation and debugging. Thanks,
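For anyone hitting the same thing, the /usr/hdp/current symlinks can usually be inspected and recreated with hdp-select; a sketch (the version string below is just an example and should match the installed stack):
$ hdp-select versions
$ hdp-select status hive-client
$ hdp-select set all 2.4.0.0-169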
... View more
Labels:
- Apache Ambari
- Apache Hadoop
- Apache Hive
08-19-2016
06:24 AM
2 Kudos
@Ted Yu @emaxwell @Josh Elser Thanks all for your confirmation; that's why I asked whether the RPM is relocatable 🙂. So the bottom line is that the Hortonworks installation directories cannot be changed: all HDP binary and config files go in /usr and /etc, since that is hardcoded in the RPMs and the RPMs are not relocatable. I will close this thread. But I believe relocatability should be supported from a corporate IT policy point of view, where we often have issues putting files in /usr and /etc. I also suggest that at RPM creation time Hortonworks make the RPMs relocatable, so that binary and config files can be installed in directories other than /usr and /etc. I understand that HDP consists of other software, but ultimately Hortonworks can customize the bundle to support user-specific needs. I should open this as an idea, WDYT?
... View more
08-19-2016
06:23 AM
@Ted Yu @emaxwell @Josh Elser Thanks all for your confirmation; that's why I asked whether the RPM is relocatable 🙂. I will close this thread. But I believe this should be supported from a corporate IT policy point of view, where we often have issues putting files in /usr and /etc. I should open this as an idea, WDYT?
... View more
08-18-2016
07:19 PM
1 Kudo
Hi team, I see that HDP stores its lib files and packages in /usr/hdp and maintains different versions there. Can we control the HDP installation packages/RPMs and make the installation relocatable to other directories such as /opt? If my IT team does not permit installation inside /usr, then what can we do?
# ls /usr/hdp/
2.4.0.0-169 2.4.2.0-258 current
Please advise.
$ rpm -ql hdp-select-2.4.2.0-258.el6.noarch
/usr/bin/conf-select
/usr/bin/hdp-select
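For what it's worth, whether an installed RPM is relocatable can be checked from its header; a sketch using the package listed above:
$ rpm -qi hdp-select-2.4.2.0-258.el6.noarch | grep -i relocat
For the HDP packages this shows "(not relocatable)", which is why the /usr prefix cannot simply be moved at install time.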
... View more
Labels:
- Apache Hadoop
- Apache HBase
08-18-2016
06:35 AM
2 Kudos
Hi Team, we see that the logs of the various Hadoop services are stored under /var/log. Can we change this to a customized location if we don't want to store logs in the locations below?
/var/log/ambari-agent/
/var/log/ambari-metrics-monitor/
/var/log/ambari-server/
/var/log/hbase
/var/log/zookeeper
I see that changing the log location in Ambari is disabled?
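In case it is useful, the service log directories are normally driven by the *-env configs in Ambari, while the Ambari server/agent paths live in their own files under /etc. The property names below are the usual ones for this kind of stack, so treat them as assumptions to verify on your version:
hdfs_log_dir_prefix=/custom/log (HDFS > Advanced hadoop-env)
hbase_log_dir=/custom/log/hbase (HBase > Advanced hbase-env)
zk_log_dir=/custom/log/zookeeper (ZooKeeper > Advanced zookeeper-env)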
... View more
Labels:
- Apache Hadoop
- Apache HBase
08-03-2016
06:26 AM
1 Kudo
Hi Team, my Hadoop NameNode servers are without HBA storage, but the servers use RAID 10. Do I still need an NFS mount point to save the NameNode metadata (fsimage, edits, etc.) if I have an active NameNode in the cluster as well? Also, if my hardware has no HBA storage and uses RAID 10, can I connect to an NFS mount point from such hardware? Basically, what are the recommendations for NameNode HA?
... View more
Labels:
- Apache Hadoop
- Apache HBase
08-01-2016
07:09 AM
Thanks a lot, Kuldeep. I agree, and that's why I wanted suggestions from experts like you 🙂
... View more
08-01-2016
06:11 AM
1 Kudo
Hi Team, I have 3 virtual machines in an HDP cluster. If I have huge capacity (TBs) on the DataNode disks, can I use the same disk with different mount points to store the DataNode data, the NameNode data, the Secondary NameNode data, the JobTracker data (master node data), and /usr and /var? I know that if the disk has an issue, then all of that data will be affected.
Basically, I wanted to know: if my DataNode disks have a lot of space (TBs), do you recommend creating different mounts on the same DataNode disks for different purposes, like /usr and /var and storing the NN/SN/JT data? Also, each HDP version's data lives in /usr/hdp.
... View more
Labels:
- Apache Hadoop
- Apache HBase
07-19-2016
06:02 AM
We can store application-related data and logs on SAN/NAS. However, SAN/NAS is not at all recommended for I/O-sensitive and CPU-bound jobs, to avoid bottlenecks when reading data from disk or over the network, or when processing data. So:
- Logs/application data --> SAN/NAS
- DataNode data --> DAS with JBOD configuration, no RAID
- NN/SN/JT nodes --> should be highly available [RAID 5/10, depending on the use case]
Hadoop is a scale-out, shared-nothing architecture.
http://www.bluedata.com/blog/2015/12/separating-hadoop-compute-and-storage/
https://community.emc.com/servlet/JiveServlet/previewBody/41473-102-1-132603/Virtualizing%20Hadoop%20in%20Large%20Scale%20Infrastructures.pdf
Also, I understand that sometimes the true cost of DAS is higher once you account for Hadoop replication, but this is how Hadoop thrives (one of the key tenets of Hadoop is to bring the compute to the storage instead of the storage to the compute).
... View more
07-19-2016
05:55 AM
@Sbandaru: I researched this more deeply, and the conclusion is that we don't need an edge node. An edge node is not needed if the Hadoop cluster and the application are in the same network; it is only needed when the Hadoop cluster and the application are in different networks, in which case the edge node acts as a gateway to the Hadoop cluster (like a proxy). Thanks for your inputs.
... View more
07-16-2016
03:34 AM
1 Kudo
Hi Team,
We are going to deploy HDP 2.3.4 for a big environment setup.
Can someone please explain the architecture of an edge node in Hadoop? I am only able to find the definition on the internet. I have some queries:
1) What is an edge node? 2) When and why do we need it? 3) Does every production cluster contain an edge node? 4) Is the edge node part of the cluster? (What advantages do we have if it is inside the cluster? Does it store any blocks of data in HDFS? Any performance improvement?) 5) Should the edge node be outside the cluster? 6) Please refer me to any docs where I can learn more about it, preferably Hortonworks docs.
... View more
Labels:
- Apache Hadoop
07-13-2016
05:32 PM
That will be helpful.
... View more
07-13-2016
02:20 PM
The answer looks good. Thanks for your answer. Can you please advise how to decide the disk I/O for a cluster? Which factors should be considered in the disk I/O calculation?
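By way of illustration only (every number below is a made-up assumption, not a recommendation), a back-of-the-envelope disk I/O estimate usually runs along these lines:
- assume 1 TB of data must be read within a 4-hour batch window --> required read throughput ≈ 1,000,000 MB / 14,400 s ≈ 70 MB/s
- allow roughly 3x for shuffle and write amplification --> ≈ 210 MB/s aggregate
- assume ≈ 100 MB/s sequential throughput per SATA disk --> at least 2-3 disks' worth of aggregate throughput, spread across nodes so the I/O happens in parallel
The real inputs are your ingest rate, the processing window, the replication factor, and how much of the workload is sequential versus random I/O.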
... View more