Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
06-07-2016
04:40 PM
Phoenix is a library that ships with the HBase installation, so if that is what you mean by "client" then the answer is yes. Phoenix is essentially a client library that translates your JDBC query into HBase calls, plus a server-side library of functions (for example HBase coprocessors). If you use the normal, non-PQS client, your client (Java) program does some of the aggregations itself and needs access to all RegionServers. It is fast, simple and elegant. If you cannot give clients access to all data nodes, you can use PQS (the Phoenix Query Server): you put PQS instances on edge nodes and reach only those with the "thin" JDBC client; the PQS then connects to the RegionServers. It will be a bit slower because of the extra hop in the middle, but it makes sure the potentially heavy client-side aggregations happen on a dedicated server machine rather than in your client. So it is a trade-off.
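For illustration, here is a minimal sketch of the "thick" client path described above. It assumes the phoenix-client jar is on the classpath; the ZooKeeper quorum, znode and table name are placeholders. This client resolves regions itself and therefore needs network access to every RegionServer:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixThickClientExample {
    public static void main(String[] args) throws Exception {
        // Thick client: talks to ZooKeeper and then directly to the RegionServers.
        // zk1,zk2,zk3 and /hbase-unsecure are placeholders for your quorum and znode.
        String url = "jdbc:phoenix:zk1,zk2,zk3:2181:/hbase-unsecure";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM MY_TABLE")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```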
06-07-2016
04:20 PM
You obviously know that better than I do, Josh, but I had the impression that on a typical cluster that is not too full, most data would end up local anyway, since each compaction writes at least one copy to the local DataNode. I suppose that changes a bit once you install RegionServers on only a couple of nodes or HDFS gets really full.
06-07-2016
04:17 PM
3 Kudos
1) You don't have to.
2) It is relatively benign, since HBase will have most data available on the local DataNode (because of HDFS's local-first write policy). The biggest problem is that it makes your cluster configuration more complex: you need two configs, one for HBase and one for non-HBase nodes, and there are potential performance implications when you run heavy workloads in the same cluster as HBase.
3) You don't HAVE to install the Query Server at all. It is optional; normally the connection goes directly through the HBase API and a lot of computation happens client-side. That is usually the better way to do things; the Query Server is like a proxy that takes over that client-side functionality. In general I would prefer the normal client, but there can be access issues, for example the normal API needs access to all data nodes.
4) The PQS, as said, is like a proxy. Normally the client-side computations are quite light, but that can change if you return large amounts of data from distributed RegionServers, so you need to keep an eye on CPU usage in the PQS. In addition, it will use some CPU and compete with the other services on the node.
5) Same as 4, except for the additional CPU competition on those nodes. Most of the time PQS is used because of firewall issues: if clients cannot have access to all nodes, it is convenient to put PQS on a couple of edge nodes so you do not have to open up the full cluster to client connections.
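As a contrast to the direct (thick) client, here is a hedged sketch of the thin-client route through PQS. The edge-node hostname, the table name and PROTOBUF serialization (the default in recent Phoenix versions) are assumptions, and the Phoenix thin-client jar is assumed to be on the classpath; only the PQS host on port 8765 needs to be reachable:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixThinClientExample {
    public static void main(String[] args) throws Exception {
        // Thin client: only needs to reach the PQS on the edge node;
        // the PQS then talks to the RegionServers on the client's behalf.
        String url = "jdbc:phoenix:thin:url=http://pqs-edge-node:8765;serialization=PROTOBUF";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM MY_TABLE")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```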
06-07-2016
03:58 PM
1 Kudo
"Let me put the question this way. If I have hive.execution.engine=tez; why do I need the property hive.server2.tez.initialize.default.sessions to set it to "True"? Whats the use-case for this property? I ran multiple tests but my hive.execution.engine property drives how the query works and not this default sessions property" The default session parameter has nothing to do with the way the query is executed. It is for pre-creating Tez sessions. If this is false the first query on an empty system will take at least 20seconds to create a session. Time for a Tez query: Hiveserver prepare, compilation, ...: ~1sec Not much you can do here however it continuously gets faster. Initialize Tez Application Master ( Session 😞 ~10 seconds To reduce that Hive can reuse Sessions, that are idle, AMs are kept for normally 120s after a query is run. Or you can instantiate default sessions if you cannot live with that delay. Initialize Containers : 3-10s The next step is to allocate the work containers to the Session, again Tez can reuse containers or you can preheat containers. ( pre allocate the containers ) The actual query That depends on your data.
06-07-2016
12:45 PM
2 Kudos
The plugin connects to Ranger to get the updated policy file and writes it to a local cache file in case it has to restart. It does not monitor the local file for changes and reload it; there is no reason it would. I assume that if you stop Ranger (so it cannot be contacted for an update) and then restart HDFS, the plugin would pick up your changes (I didn't try it, but it sounds like a reasonable assumption). So to exploit this you would need to force an HDFS restart and block access to Ranger. Now, how do you stop people from tampering with it? Just make sure random people cannot become root or users in the hadoop group on your system; the policy cache files can only be written by the hive, ranger, etc. users. Once somebody is root on any node of the cluster you will have a hard time stopping them from doing anything, especially if they can log in to the master servers. On the client nodes you still might have a chance.
06-07-2016
12:13 PM
I did a couple of test queries on a 15 GB data set once; uncompressed was slowest, everything else was more or less identical. I tested Snappy and Zlib with 64 MB and 256 MB stripes. However, the setup was heavily tuned for short-running queries on a single column, so the majority of the time was not spent in the execution; you might see slightly different results in a heavily CPU-restricted system. As Gopal said, proper sorting and distribution is much more important.
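For reference, a rough sketch of where those codec and stripe-size combinations are set when writing ORC directly with the ORC core writer API; the file path, schema and rows are made-up placeholders. In a Hive table the same knobs are the orc.compress and orc.stripe.size table properties:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcCompressionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString("struct<id:bigint,name:string>");

        // Snappy with a 64 MB stripe; swap in CompressionKind.ZLIB and
        // 256L * 1024 * 1024 to reproduce the other combinations.
        Writer writer = OrcFile.createWriter(new Path("/tmp/test-snappy-64mb.orc"),
                OrcFile.writerOptions(conf)
                        .setSchema(schema)
                        .compress(CompressionKind.SNAPPY)
                        .stripeSize(64L * 1024 * 1024));

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];
        for (int r = 0; r < 10; r++) {
            int row = batch.size++;
            id.vector[row] = r;
            byte[] bytes = ("row-" + r).getBytes(StandardCharsets.UTF_8);
            name.setRef(row, bytes, 0, bytes.length);
        }
        writer.addRowBatch(batch);
        writer.close();
    }
}
```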
06-06-2016
03:53 PM
1 Kudo
I see different issues here:
a) Split up a line with multiple records. If you have multiple communications per line you will need some preprocessing. Hive provides maps and arrays, but they are hard to use in normal SQL. There are tons of ways to do this, but my suggestion would be to write a Pig UDF that splits one line into multiple records, potentially adding a column with the line information if you need to group them together somehow (see the sketch below). http://stackoverflow.com/questions/11287362/splitting-a-tuple-into-multiple-tuples-in-pig
b) Get the date from the filename. There are some ways to get at the filename in MapReduce, but it is awkward; MapReduce by definition abstracts filenames away. Your options: 1) use a little Python/Java/shell preprocessing script OUTSIDE Hadoop that takes the date from the filename and adds it as a field to each row of each file (easy, but not that scalable); 2) write your own RecordReader; 3) Pig provides a tagsource option that can do the same: http://stackoverflow.com/questions/9751480/how-can-i-incorporate-the-current-input-filename-into-my-pig-latin-script
c) Do graph analysis. You can use Hive/Pig/Spark for preprocessing, and Spark provides a nice graph API; there are tons of examples out there. http://spark.apache.org/graphx/
Good luck.
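For part a), a minimal sketch of such a Pig UDF could look like the following; the class name, the ';' record separator and the single-column output are assumptions you would adapt to your format. In the Pig script you would REGISTER the jar and then GENERATE FLATTEN(SplitLineToBag(line)) so that every element of the returned bag becomes its own record:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Splits one input line into a bag of single-field tuples; FLATTEN the
// result in the Pig script so every communication becomes its own record.
public class SplitLineToBag extends EvalFunc<DataBag> {
    private static final TupleFactory TUPLES = TupleFactory.getInstance();
    private static final BagFactory BAGS = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag out = BAGS.newDefaultBag();
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return out;
        }
        String line = input.get(0).toString();
        // ';' as the separator between communications is an assumption.
        for (String part : line.split(";")) {
            Tuple t = TUPLES.newTuple(1);
            t.set(0, part.trim());
            out.add(t);
        }
        return out;
    }
}
```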
06-06-2016
09:49 AM
If you say that increasing the heap doesn't help, are we talking about decent sizes like 8 GB+? Also, did you increase the Java opts AND the container size?
set hive.tez.java.opts="-Xmx3400m";
set hive.tez.container.size=4096;
If yes, then you most likely have a different problem, for example loading data into a partitioned table. ORC writers keep one buffer open for every output file, so if you load badly into a partitioned table they will hold a lot of memory. There are ways around it, like an optimized sorted load or the DISTRIBUTE BY keyword: http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data If, however, your task uses significantly less than 4-8 GB, then you should increase that.
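To make the DISTRIBUTE BY idea concrete, here is a hedged sketch over Hive JDBC; host, credentials, table and column names are placeholders. Distributing by the partition column means each reducer writes only a handful of partitions and therefore keeps only a handful of ORC writer buffers open at a time:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionedOrcLoad {
    public static void main(String[] args) throws Exception {
        // HiveServer2 URL, user, tables and columns are placeholders.
        String url = "jdbc:hive2://hiveserver2-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("set hive.tez.java.opts=-Xmx3400m");
            stmt.execute("set hive.tez.container.size=4096");
            // Needed for a fully dynamic partition insert.
            stmt.execute("set hive.exec.dynamic.partition.mode=nonstrict");
            // DISTRIBUTE BY the partition column so each reducer receives
            // whole partitions instead of a slice of every partition.
            stmt.execute("INSERT OVERWRITE TABLE sales_part PARTITION (sale_date) "
                    + "SELECT id, amount, sale_date FROM sales_staging "
                    + "DISTRIBUTE BY sale_date");
        }
    }
}
```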
06-06-2016
08:19 AM
2 Kudos
Bulk upload? In that case, use an edge node with the Hadoop client and run hadoop fs -put commands. You can expect roughly 300 GB/h for each put into HDFS, but you can parallelize the commands: if you have multiple files you can run multiple puts in parallel, essentially until you saturate the internal network of the cluster or the reading of the files from local/network storage. A little bash or Python script will normally do the trick. NiFi will work too, obviously, and might provide some retry/error handling that you would otherwise have to code yourself, so it depends a bit on your requirements.
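The same parallel puts can also be done against the HDFS Java API instead of a shell script; a rough sketch, with the local source directory, target path and thread count as placeholders (it assumes core-site.xml is on the classpath so fs.defaultFS points at HDFS):

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelHdfsUpload {
    public static void main(String[] args) throws Exception {
        // Assumes the local directory exists and contains the files to load.
        File[] files = new File("/data/incoming").listFiles();
        Path target = new Path("/landing/raw");

        ExecutorService pool = Executors.newFixedThreadPool(4); // ~4 parallel puts
        for (File f : files) {
            pool.submit(() -> {
                try {
                    // Picks up fs.defaultFS from core-site.xml on the classpath.
                    FileSystem fs = FileSystem.get(new Configuration());
                    fs.copyFromLocalFile(new Path(f.getAbsolutePath()), target);
                } catch (Exception e) {
                    e.printStackTrace(); // real retry/error handling would go here
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```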
06-02-2016
07:43 PM
"Does this mean that my user (later on it will be a system user) needs to have a keytab created on the linux file system and distributed to all the nodes?" You would put the keytab in HDFS with access rights for only the user and use the oozie files tag to load it to your temp execution directory, https://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#a3.2.7_Java_Action "Moreover, it might not be a great option, but isn't this authentication possible using only username/password ?" To do this you need PAM or LDAP authentication, thats why I mentioned it :-). You can either hardcode it or do the same thing we discussed above with a password file in hdfs. For this you can set access rights. "Option 3 - I'm using the current mechanism as it is the only one I found some examples on the net. I checked shortly on PAM/LDAP, I'm not sure yet if that will require some changes from the Hadoop cluster side. If not, I'll be happy to try it" https://community.hortonworks.com/articles/591/using-hive-with-pam-authentication.html 🙂