Member since: 08-16-2016
Posts: 642
Kudos Received: 131
Solutions: 68
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3436 | 10-13-2017 09:42 PM |
|  | 6200 | 09-14-2017 11:15 AM |
|  | 3178 | 09-13-2017 10:35 PM |
|  | 5102 | 09-13-2017 10:25 PM |
|  | 5736 | 09-13-2017 10:05 PM |
01-16-2017
10:25 PM
1 Kudo
Yes. Cloudera does not support Tez on any CDH version, so they do not ship the Tez jars or put them on the classpath. It would take quite a bit of work to build Tez yourself and maintain it with each CDH release. Here is a link if you are up to it; otherwise, be satisfied with Hive on Spark or Impala. https://gist.github.com/epiphani/dd37e87acfb2f8c4cbb0
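If you do go the Hive on Spark route, the execution engine is a session-level setting. A minimal sketch, assuming Hive on Spark is already configured on the cluster (my_table is just a placeholder):

hive -e "set hive.execution.engine=spark; select count(*) from my_table;"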
01-16-2017
09:53 PM
1 Kudo
I struggled to scope this response so it wouldn't balloon out. Let's separate the discussion into reads and writes.

HBase isn't terribly great at data ingestion. It has been the Achilles' heel of the system and is probably where most large orgs eventually start looking elsewhere, so I am not surprised that MySQL did better at ingesting data.

When it comes to large datasets, HBase shines in a few ways. First, it lives on HDFS, which is an excellent distributed file system built specifically for large datasets. HBase doesn't store empty columns. HBase also splits the data into regions. This makes it efficient and effective to either fetch a single row or column, or scan through and grab many. For what it is worth, the latency of HBase in retrieving a single record is comparable to other RDBMSs; I don't think it is necessarily faster. So think of it in terms of "hey, I have this petabyte of data but I want sub-second latency to get one row or a set of rows". HBase can do that while MySQL cannot.

I thought of using an example, but maybe a summary is better. HBase writes are comparable to most RDBMSs in transactions per second. HBase has comparable latency to most RDBMSs on single-row lookups. HBase can efficiently and effectively retrieve a large chunk of rows better than any RDBMS. HBase, thanks to HDFS, can operate on petabyte-and-larger datasets. So if you need to scale to TBs and PBs of data and want comparable performance and latency to an RDBMS, go with HBase.

There is something about how HBase stores the data that makes it efficient for pulling a single column, but I don't recall it off the top of my head and am not digging through the HBase docs tonight. The gist of it: due to the storage format, HBase can find the specific row in the region's index file, comparably to MySQL, and then fetch the specific column and return only that. MySQL, on the other hand, can do a fast index search, find the row, and return the entire row.
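To make the read path concrete, here is a rough sketch in the HBase shell; the table name, column family, and row keys are made up for illustration:

get 'mytable', 'row-key-1', 'cf:col1'
scan 'mytable', {STARTROW => 'row-key-1', STOPROW => 'row-key-9'}

The get returns just that one cell; the scan streams back every row whose key falls in the range, which is the "grab a chunk of rows" case above.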
01-16-2017
09:06 PM
@justin3113 to run jobs across all nodes, a user must exist on each node, e.g. justin3113. Each user also needs an HDFS home directory under /user, to which that user has read and write access. This is so the job can write temporary data to HDFS from whatever node it happens to run on. The error is saying that it is trying to create that user directory, but only the hdfs user has permission to do so. Opening up access gets around it, but that is not advisable. Instead, create the directory for each user as the hdfs user: su - hdfs, then hdfs dfs -mkdir /user/justin3113 (see the sketch below).
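As a sketch, from any node with an HDFS client (the chown is so the user, not hdfs, ends up owning and can write to the directory):

sudo -u hdfs hdfs dfs -mkdir /user/justin3113
sudo -u hdfs hdfs dfs -chown justin3113 /user/justin3113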
01-16-2017
08:48 PM
What does hdfs dfs -du -s -h /path/to/table output?
01-16-2017
08:03 PM
Both query examples read through all of the data, so the expectation is that they take however long it takes to read all of it across X mappers. You need to look at the MR job counters to see if there is a bottleneck somewhere. On a 3-node cluster, it is probably maxing out on what it can do. With that said, an index should have sped up the query with the restrictive WHERE clause. Was the index over the column in the WHERE clause? How did you set it up, and have you checked that it exists with show index on table1;?
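For reference, a rough sketch of setting one up with the pre-Hive-3.0 index DDL; idx_col1 and col1 are placeholders for your index name and WHERE-clause column:

hive -e "CREATE INDEX idx_col1 ON TABLE table1 (col1) AS 'COMPACT' WITH DEFERRED REBUILD;"
hive -e "ALTER INDEX idx_col1 ON table1 REBUILD;"
hive -e "SHOW FORMATTED INDEX ON table1;"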
01-13-2017
01:40 PM
describe formatted/extended <table> partition <partition spec> will output stats like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. So it is not exactly "this table is X size", but it would seem that if you include the partition it will give you the raw data size. Otherwise, hdfs dfs -du -s -h /path/to/table will do. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe
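For example, with a placeholder table name and partition spec:

hive -e "DESCRIBE FORMATTED my_table PARTITION (dt='2017-01-01');"
hdfs dfs -du -s -h /path/to/table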
01-09-2017
12:46 AM
I figured it had to do with the history server but haven't dealt with this specifically. Never fear, StackOverflow is here. The short version is that event logging is enabled but the path it is trying to write to is not valid. Either give it a valid path or turn it off. I'd give it a valid path, as that is how the UI gets all the wonderful information.
spark.eventLog.enabled=true
spark.eventLog.dir=/user/spark/applicationHistory (make sure it exists and the spark user has r/w access)
http://stackoverflow.com/questions/36038188/error-sparkcontext-error-initializing-sparkcontext
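A rough sketch of setting that up, using the path above (adjust the owner/group to your setup):

sudo -u hdfs hdfs dfs -mkdir -p /user/spark/applicationHistory
sudo -u hdfs hdfs dfs -chown spark:spark /user/spark/applicationHistory

Then set spark.eventLog.enabled=true and spark.eventLog.dir=/user/spark/applicationHistory in spark-defaults.conf (or via the Spark configuration in Cloudera Manager).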
01-09-2017
12:38 AM
1 Kudo
1. For implementing the Impala JDBC username & password based connection, is it mandatory to have Kerberos, LDAP, or Sentry implemented?

You need to have LDAP auth enabled for Impala.

2. If not, does using JDBC with Impala local user credentials need any additional parameters, changes at the server level, or changes to the Impala local user's permissions?

No changes to Impala or the local user beyond installing and configuring the JDBC driver and connection.

3. Can you please provide an example, like any server-level or client-level changes, if my implementation is wrong?

Driver type 1 (Hive JDBC driver, jdbc:hive2):

Impala with LDAP: DriverManager.getConnection("jdbc:hive2://server:21050/default;auth=noSasl;", "hostuser", "password")
Impala without any auth: DriverManager.getConnection("jdbc:hive2://server:21050/default;auth=noSasl;")
Note: noSasl is required when SSL is not configured.

Driver type 2 (Cloudera Impala JDBC driver, jdbc:impala):

Impala with LDAP: DriverManager.getConnection("jdbc:impala://Server:21050/default;AuthMech=3;UID=hostuser;PWD=password")
Impala with Kerberos: DriverManager.getConnection("jdbc:impala://Server:21050/default;AuthMech=1;")
Impala with Kerberos and SSL: DriverManager.getConnection("jdbc:impala://Server:21050/;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=Server.REALM.COM;KrbServiceName=impala;SSL=1;SSLTrustStore=C:\\Users\\mbigelow\\.ssl\\bdp3-prd-ts.jks;SSLTrustStorePwd=*****;")
Impala without any auth: DriverManager.getConnection("jdbc:impala://Server:21050/default;AuthMech=0;")

I think AuthMech could be left off in the last case, as it should be the default. Cloudera has plenty of documents covering enabling Kerberos and LDAP, although both should be considered well before delving into them.
01-09-2017
12:16 AM
@benassi check who all belongs to the hadoop group. It should be hdfs, mapred, and yarn. The yarn account, as that is what the RM, NM, and JH run as, will need read/write access to be able to remove any old logs.
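A quick sketch of checking and fixing this, assuming the default aggregated log directory of /tmp/logs:

getent group hadoop
sudo -u hdfs hdfs dfs -chgrp -R hadoop /tmp/logs
sudo -u hdfs hdfs dfs -chmod -R 770 /tmp/logs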
01-05-2017
01:14 PM
Is /tmp/logs, and all of its subdirs, set to 770 and the hadoop group? Have you checked for actual log files? The log directories themselves are not removed when logs are cleaned up, so it may only appear that the logs are lingering. Use hdfs dfs -du -s -h /tmp/logs/ to see if there is any decrease over time or if it is just increasing.
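For example:

hdfs dfs -ls -d /tmp/logs
hdfs dfs -ls /tmp/logs
hdfs dfs -du -s -h /tmp/logs

The first line shows the mode and group on /tmp/logs itself, the second shows the per-user subdirectories and their permissions, and the last gives the total size to sample now and again later.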