Member since: 08-12-2016
Posts: 39
Kudos Received: 7
Solutions: 3

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 765 | 03-27-2017 01:36 PM
 | 406 | 03-21-2017 07:44 PM
 | 2915 | 03-21-2017 11:31 AM
01-05-2018
10:33 AM
No matter what I tried, all YARN applications launched by Titan ended up in the default queue instead of "myqueue". These are the things I tried:

1) Setting the property in titan-hbase-solr.properties (none of the following worked):

mapred.job.queue.name=myqueue
mapreduce.job.queue.name=myqueue
mapred.mapreduce.job.queue.name=myqueue

2) Setting the property in the gremlin shell:

gremlin> graph = TitanFactory.open("/usr/iop/4.2.5.0-0000/titan/conf/titan-hbase-solr.properties")
gremlin> mgmt = graph.openManagement()
gremlin> desc = mgmt.getPropertyKey("desc2")
gremlin> mr = new MapReduceIndexManagement(graph)
gremlin> mgmt.set('gremlin.hadoop.mapred.job.queue.name', 'myqueue')
Unknown configuration element in namespace [root.gremlin]: hadoop
gremlin> mgmt.set('hadoop.mapred.job.queue.name', 'myqueue')
Unknown configuration element in namespace [root]: hadoop
Display stack trace? [yN] n
gremlin> mgmt.set('titan.hadoop.mapred.job.queue.name', 'myqueue')
Unknown configuration element in namespace [root]: titan
Display stack trace? [yN] n
gremlin> mgmt.set('mapred.job.queue.name', 'myqueue')
Unknown configuration element in namespace [root]: mapred
Display stack trace? [yN] n
gremlin> mgmt.set('mapreduce.mapred.job.queue.name', 'myqueue')
Unknown configuration element in namespace [root]: mapreduce
Display stack trace? [yN] n
gremlin> mgmt.set('gremlin.mapred.job.queue.name', 'myqueue')
Unknown configuration element in namespace [root.gremlin]: mapred
Display stack trace? [yN] n
gremlin> mgmt.set('gremlin.hadoop.mapred.job.queue.name', 'myqueue')
Unknown configuration element in namespace [root.gremlin]: hadoop
Display stack trace? [yN] n
gremlin>
Labels: Apache YARN
09-14-2017
12:08 PM
1 Kudo
1) Start Atlas in debug mode

First you need to add extra JVM options in the startup script, so in atlas_start.py replace this line:

DEFAULT_JVM_OPTS="-Dlog4j.configuration=atlas-log4j.xml -Djava.net.preferIPv4Stack=true -server"

with this:

DEFAULT_JVM_OPTS="-Dlog4j.configuration=atlas-log4j.xml -Djava.net.preferIPv4Stack=true -server -Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,address=54371,server=y,suspend=y"

Now, when you start Atlas, it will hang until you connect with the debugger (because of suspend=y).

2) Connect from the Eclipse remote debugger

Make sure you have imported the Atlas project into Eclipse based on this document: http://atlas.apache.org/EclipseSetup.html Then create a new debug configuration under the menu Run / Debug Configurations... Make sure the port is set to the same as above (54371), the connection type is Standard (Socket Attach), and the Eclipse JDT launcher is used.
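Before attaching from Eclipse, it can help to confirm that the Atlas JVM is actually listening on the debug port. A minimal check from the Atlas host (assuming port 54371 as configured above) could be:

# check that something is listening on the JDWP debug port
netstat -tlnp | grep 54371
# or, alternatively:
lsof -i :54371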
08-22-2017
04:00 PM
Thanks for your response, @bkosaraju. Can you give me an example of any of the options you mentioned?
08-22-2017
01:16 PM
1 Kudo
For testing purposes I want to create a very large number of empty directories in HDFS, let's say 1 million. What I tried is to use `hdfs dfs -mkdir` to create 8,000 directories at a time and repeat this in a for loop:

for i in {1..125}
do
dirs=""
for j in {1..8000}; do
dirs="$dirs /user/d$i.$j"
done
echo "$dirs"
hdfs dfs -mkdir $dirs
done
Apparently it takes hours to create 1M folders this way. My question is, what would be the fastest way to create 1M empty folders?
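One variant worth benchmarking (a sketch, not something measured here) is to run the mkdir batches in parallel instead of one after the other; whether this is actually faster depends on how many concurrent requests the NameNode and the client host can absorb:

for i in {1..125}
do
  dirs=""
  for j in {1..8000}; do dirs="$dirs /user/d$i.$j"; done
  # launch each 8000-directory batch in the background, at most 8 batches at a time
  hdfs dfs -mkdir $dirs &
  (( i % 8 == 0 )) && wait
done
wait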
Tags: Hadoop Core, HDFS
Labels: Apache Hadoop
04-20-2017
12:57 PM
It is not clear what is being asked here. @manyatha reddy, could you please be more specific?
03-28-2017
11:53 AM
What exactly do you mean by "if an user arun is trying to access hdfs"? Are you trying to access a file or folder with the "hadoop fs" command while you are logged into Linux as user "arun"?
03-28-2017
11:49 AM
Assuming that you want to connect in direct, binary transport mode in a nonsecure environment, this is how the JDBC connection string should look (which is what you have tried):

jdbc:hive2://<host>:<port>/<db>

If your HiveServer2 runs on m1.hdp.local:10000 and the database name is default, then the connection string you tried should have worked:

jdbc:hive2://m1.hdp.local:10000/default

Since it did not work, I suppose your HiveServer2 uses a different port or runs on another host. You should be able to check these in Ambari. Please consult these docs on the various connection modes and the corresponding connection strings:

https://community.hortonworks.com/articles/4103/hiveserver2-jdbc-connection-url-examples.html
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLs
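A quick way to verify the host, port and database before going back to your application is to try the same URL from Beeline on a cluster node (the values below simply reuse the ones from the question):

beeline -u "jdbc:hive2://m1.hdp.local:10000/default" -n <your_username>

If Beeline connects, the URL is fine and the problem is on the client side; if not, the error it prints usually tells you whether the host or the port is wrong.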
03-27-2017
01:36 PM
Not sure if I got your question right: the CreateEvent will contain the HDFS path, and the name of the file is part of that path (it comes after the last '/'). I hope this answers your question. See the inotify patch here: https://issues.apache.org/jira/secure/attachment/12665452/HDFS-6634.9.patch#file-8

/**
* Sent when a new file is created (including overwrite).
*/
public static class CreateEvent extends Event {
public static enum INodeType {
FILE, DIRECTORY, SYMLINK;
}
private INodeType iNodeType;
private String path;
private long ctime;
private int replication;
private String ownerName;
private String groupName;
private FsPermission perms;
private String symlinkTarget;
03-27-2017
12:56 PM
I tried both VirtualBox and Docker on the same MacBook; Docker used fewer resources and was faster.
03-23-2017
04:20 PM
From the error code, it seems that the problem is that you do not have hive-cli-*.jar in the Oozie sharelib folder. Could you also check the error message and post it as well?
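If you want to see what is actually in the Hive sharelib, something along these lines should work (the HDFS path below is the usual default and may differ on your cluster):

oozie admin -oozie http://<oozie-host>:11000/oozie -shareliblist hive
hdfs dfs -ls /user/oozie/share/lib/lib_*/hive | grep hive-cli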
03-23-2017
04:09 PM
1 Kudo
You should have posted the code you are trying to run and how you submit the job to Spark; without that it is harder to get an answer. From the error I can see you are trying to run wordcount on this file: hdfs:%20/user/midhun/f.txt (note the "%20", an encoded space, which suggests the URI you passed was malformed). Have you tried something like this?

hdfs dfs -put f.txt /user/midhun/f.txt
spark-submit --class com.cloudera.sparkwordcount.SparkWordCount \
--master local --deploy-mode client --executor-memory 1g \
--name wordcount --conf "spark.app.id=wordcount" \
sparkwordcount-1.0-SNAPSHOT-jar-with-dependencies.jar hdfs://namenode_host:8020/user/midhun/f.txt 2
03-23-2017
03:56 PM
Can you please post the whole stack trace of the error message? It seems like you omitted some meaningful parts.
03-23-2017
03:54 PM
1 Kudo
This question is too broad in this form. You need to understand this: if you want to get advice on which solution (computing engine) to choose, you should first give a description of what you are trying to accomplish, what kind of problem you are trying to solve, and what the nature of your workload is.
03-23-2017
03:49 PM
If I understand correctly, you are saying that your large tables (3 million records) return a query like this relatively fast:

Select * from example_table Limit 10
or
Where serial = "SomeID"

but when you run a similar query against an external table stored on AWS S3, it performs badly. Did you try to copy the table data file to HDFS and then create an external table on the HDFS file? I bet that could make a big difference in performance. I assume the difference exists because, when the table data is stored on S3, Hive first needs to copy the data from S3 onto a node where Hive runs, and the speed of that operation depends on the available network bandwidth.
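Just to illustrate the suggested test (the bucket, paths and table name below are made up, not values from your setup): copy the data to HDFS, point a second external table at it, and run the same query against both tables.

hadoop distcp s3a://my-bucket/example_table/ hdfs:///tmp/example_table_hdfs/
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e \
  "CREATE EXTERNAL TABLE example_table_hdfs LIKE example_table LOCATION '/tmp/example_table_hdfs';"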
03-23-2017
03:36 PM
Saurabh, it is not possible to answer your question because it contains insufficient information. Please update your post and add the following info:
- the configuration that is supposed to tell Falcon to run your shell script 10 times
- exactly what you see in the Oozie web UI
03-23-2017
03:18 PM
You can use the load balancer that is built into Solr. In case you have only one shard, you can specify a list of replicas to choose from (for load-balancing purposes) by using the pipe symbol (|):

http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted

Consult the docs for more complex scenarios (like multiple shards): https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
03-23-2017
03:10 PM
What kind of client do you use? (What kind of interface does the client application use to connect to Solr?)
03-23-2017
03:02 PM
This error message does not look like it is specific to NiFi in any way. I would consult EC2 support about this error and see what they say as a potential cause, based on the error message. Then I would come back here and ask how NiFi could cause that.
03-23-2017
02:56 PM
1) Store the PDF files in HDFS

It would be possible to store your individual PDF files in HDFS and have the HDFS path as an additional field stored in the Solr index. What you need to consider here: HDFS is best at storing a small number of very large files, so it is not efficient to store a large number of relatively small PDF files in HDFS.

2) Store the PDF files in HBase

It would also be possible to store the PDF files in an object store like HBase. This option is definitely feasible and I have seen several real-life implementations of this design. In this case, you would store the HBase id in the Solr index.

3) Store the PDF files in the Solr index itself

I think it is also possible to store the original PDF file in the Solr index as well. You would use a BinaryField type and set the stored property to true. (Note that you could even accomplish the same with older versions of Solr lacking the BinaryField type: you would convert your PDF into text, e.g. with base64 encoding, store this text value in a stored=true field, and convert it back to PDF upon retrieval.)

Without an estimate of the number of PDF files and the average size of a PDF, it is hard to choose the best design. It can also be an important factor whether you want to update your documents frequently or just add them to the index once, after which they won't change anymore.
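As an illustration of the first two options (not from the original answer): you can let Solr index the text content of a PDF via its extracting request handler while the binary lives in HDFS or HBase, and only store a pointer field. The collection name, the hdfs_path field and the handler being enabled in solrconfig.xml are assumptions here.

curl "http://localhost:8983/solr/pdfcollection/update/extract?literal.id=doc1&literal.hdfs_path=/data/pdf/doc1.pdf&commit=true" \
  -F "file=@doc1.pdf"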
03-23-2017
02:07 PM
By the way, is it a final decision that you will store the data in HBase? Kevin, you wrote:

> The web application allows the user to specify certain filters that are mandatory, as well as optional. The result coming from HBase are precalculated aggregates based on the filters of the application.

I am afraid this is not sufficient information to design your HBase tables. Can you please elaborate on what kind of data will be stored in the database? What kind of filters will be applied, and what kind of aggregate calculations will be done?
03-22-2017
10:18 AM
Have you copied the jar file to HDFS? If you run this command, what is the result?

hadoop fs -ls /path/to/your/spark.jar
03-21-2017
07:44 PM
1 Kudo
The Linux username/password is root/hadoop. This will also help: https://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/
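If you are connecting to the sandbox over SSH, the SSH port is usually forwarded to 2222 on the host machine (an assumption about the default sandbox setup, so adjust it if yours differs):

ssh root@127.0.0.1 -p 2222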
03-21-2017
04:48 PM
First things first: it was not clear to me whether you can build your project and create a jar file by running a Maven command. Once you can build a jar file out of your project, your pom.xml is fine. To check your pom.xml, run this command:

mvn package

If it returns an error, please update your original post and include the Maven error.
03-21-2017
03:42 PM
@mayki wogno I understand that you want to construct a dashboard, and that is perfectly fine. If you need further help with that, then I recommend you close this thread and ask a new, more specific question in a new thread. You see, your original question in this thread has already been answered: yes, it is possible, and you need to use the Counters API. This community works best if the questions are specific and new questions are asked in new threads. So I kindly ask you to put some effort into implementing your dashboard, and if you run into another problem, share it in a new thread.
03-21-2017
02:34 PM
@Nitin Kaushik, this question seems to be a duplicate; please see my answer to your similar question: https://community.hortonworks.com/answers/89919/view.htm
03-21-2017
02:12 PM
Why do I have this feeling that someone will just come and say: 'go for Druid'?
03-21-2017
02:10 PM
Have you tried to install a client whose version matches the server version?
03-21-2017
02:02 PM
The cause of this error is that the JVM cannot find the javax.net.ssl.trustStore required for SSL, or the truststore does not contain the required certificates. This means you will need to properly configure the truststore for your NiFi installation. Consult the following posts for further help with that:

https://community.hortonworks.com/articles/886/securing-nifi-step-by-step.html
https://batchiq.com/nifi-configuring-ssl-auth.html
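As a rough sketch of what that usually involves (the paths, alias and password below are placeholders, not values from your environment): import the remote endpoint's certificate into a JKS truststore and point NiFi at it in nifi.properties.

# import the remote server's certificate into a JKS truststore
keytool -import -alias remote-server -file server-cert.pem -keystore truststore.jks -storepass changeit

# then reference the truststore in nifi.properties, e.g.:
#   nifi.security.truststore=/path/to/truststore.jks
#   nifi.security.truststoreType=JKS
#   nifi.security.truststorePasswd=changeit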
03-21-2017
01:46 PM
@vnandigam, are you literally running this code?

event1.saveAsTextFile("");

saveAsTextFile expects one argument, a path to the file; if you pass an empty string as the path, you should not expect your RDD data to be saved into any file. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html#saveAsTextFile(java.lang.String)
03-21-2017
01:35 PM
Just to elaborate on @Rob's answer: consult the Counters section in the REST API docs: https://nifi.apache.org/docs/nifi-docs/rest-api/index.html. You will probably want to create a counter that is incremented by the ingest/delete processors, and then you can query the counter value via the REST API. See also this post: https://pierrevillard.com/2017/02/07/using-counters-in-apache-nifi/
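For example, once the counters exist, reading them back is a single REST call (host, port and any authentication are assumptions about your setup):

curl -s "http://<nifi-host>:8080/nifi-api/counters"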