Member since: 10-18-2017
Posts: 52
Kudos Received: 2
Solutions: 5
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 387 | 01-27-2022 01:11 AM |
 | 3741 | 05-03-2021 08:03 AM |
 | 3464 | 02-06-2018 02:32 AM |
 | 4996 | 01-26-2018 07:36 AM |
 | 3069 | 01-25-2018 01:29 AM |
01-27-2022
01:11 AM
For future reference: I am on an HBase cluster and also need access to the Hive metastore. It turns out that if the hive-site.xml contains incorrect values, you can see this behavior.
08-13-2021
12:51 AM
Maybe you are still asking for more than what is available? It really depends on the cluster you have. The following parameters matter:
1) Cloudera Manager -> YARN -> Configuration -> yarn.nodemanager.resource.memory-mb (the amount of physical memory, in MiB, that can be allocated for containers, i.e. all the memory that YARN can use on one worker node)
2) yarn.scheduler.minimum-allocation-mb (container memory minimum: every container will request at least this much memory)
3) yarn.nodemanager.resource.cpu-vcores (container virtual CPU cores)
4) The number of worker nodes in the cluster.
I also noticed you are requesting a lot of cores. Maybe you can try to reduce these a bit? That might be a bottleneck as well. A rough capacity estimate is sketched below.
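To make this concrete, here is a rough back-of-the-envelope sketch of how those parameters bound what YARN can hand out. All numbers are hypothetical placeholders, not values from your cluster:

```java
public class YarnCapacityEstimate {
    public static void main(String[] args) {
        // Hypothetical values -- substitute the ones from Cloudera Manager.
        int nodeMemoryMb = 16 * 1024;      // yarn.nodemanager.resource.memory-mb per worker
        int nodeVcores = 8;                // yarn.nodemanager.resource.cpu-vcores per worker
        int minAllocationMb = 2 * 1024;    // yarn.scheduler.minimum-allocation-mb
        int workerNodes = 4;               // number of NodeManagers in the cluster

        // Requested executor shape (again hypothetical).
        int executorMemoryMb = 4 * 1024;   // spark.executor.memory
        int executorOverheadMb = Math.max(384, (int) (executorMemoryMb * 0.10));
        int executorCores = 4;             // spark.executor.cores

        // YARN typically rounds each request up to a multiple of the minimum allocation.
        int containerMb = (int) (Math.ceil((executorMemoryMb + executorOverheadMb)
                / (double) minAllocationMb) * minAllocationMb);

        int byMemory = (nodeMemoryMb / containerMb) * workerNodes;
        int byCores = (nodeVcores / executorCores) * workerNodes;

        System.out.println("Container size granted by YARN: " + containerMb + " MiB");
        System.out.println("Max executors by memory: " + byMemory);
        System.out.println("Max executors by vcores: " + byCores);
        System.out.println("Cluster can run at most " + Math.min(byMemory, byCores) + " executors");
    }
}
```

Whichever of the two limits (memory or vcores) is smaller is the one that caps the number of executors you actually get.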
08-12-2021
01:32 AM
I see two things that would be good to understand: 1) why the YARN containers exceed their size, and 2) why YARN does not provide the number of executors that you request.
1) It seems like you are exceeding the YARN container size of 10 GB. The executors run inside YARN containers. Maybe you need to increase the YARN container size a bit? I think the message suggests the container size for YARN is 10 GB: if you request an 8 GB executor and there is some (2 GB) overhead, it can hit the ceiling of what was assigned to it, and that executor will exit.
2) It looks like your cluster is not capable of providing the requested 10 executors of 8 GB. Other relevant info to share would be: how many nodes do you have, and for each node, how much memory is assigned to YARN and what is the YARN minimum container size?
Example: suppose the YARN container size is 3 GB, you have 9 nodes, your executor memory is 1 GB, and 10 GB on each node is allocated to YARN. This means each node has enough memory to start 3 containers (3 x 3 GB < 10 GB). Therefore, when dynamic allocation is enabled, YARN will start 27 executors. Even if you ask for more than 27, it will only be capable of providing 27. The same arithmetic is sketched below. Maybe this helps?
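For clarity, the arithmetic of that example as a tiny sketch (the numbers are the hypothetical ones from the example above, not measured values):

```java
public class ExecutorCountExample {
    public static void main(String[] args) {
        // Hypothetical numbers from the example above.
        int containerSizeGb = 3;       // size YARN grants per container
        int yarnMemoryPerNodeGb = 10;  // memory assigned to YARN on each node
        int nodes = 9;

        int containersPerNode = yarnMemoryPerNodeGb / containerSizeGb; // 10 / 3 = 3
        int maxExecutors = containersPerNode * nodes;                  // 3 * 9 = 27

        // With dynamic allocation, YARN can never hand out more than this,
        // no matter how many executors the job requests.
        System.out.println("Containers per node: " + containersPerNode);
        System.out.println("Maximum executors the cluster can provide: " + maxExecutors);
    }
}
```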
08-11-2021
07:02 AM
Thank you for this very valuable input! (I had somehow missed the response.) I do indeed see increased latencies, but I read that this should be negligible for hot data. I have observed this, but I think there is a limit to how much data you can keep 'hot'. This depends on a combination of settings at the level of the HBase catalog properties and the HBase cluster. We have also discussed this in the following thread: https://community.cloudera.com/t5/Support-Questions/simplest-method-to-read-a-full-hbase-table-so-it-is-stored/m-p/317194#M227055 It would be very interesting if a more in-depth study were ever conducted and reported, as this is very relevant for applications with HBase as a back-end that require more advanced querying of the data (like, in my case, aggregations to compute a heatmap from a high volume of data points).
08-11-2021
06:52 AM
Dear experts, when I try to load HBase data in pyspark, it tells me:

java.io.IOException: Expecting at least one region for table : myhbasetable
at org.apache.hadoop.hbase.mapreduce.MultiTableInputFormatBase.getSplits(MultiTableInputFormatBase.java:195)
at org.locationtech.geomesa.hbase.jobs.GeoMesaHBaseInputFormat.getSplits(GeoMesaHBaseInputFormat.scala:43)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:130)

It looks like it is telling me the table has to have data in at least one region. This is the relevant piece of code (https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/MultiTableInputFormatBase.java):

    try (Connection conn = ConnectionFactory.createConnection(context.getConfiguration())) {
      while (iter.hasNext()) {
        Map.Entry<TableName, List<Scan>> entry = (Map.Entry<TableName, List<Scan>>) iter.next();
        TableName tableName = entry.getKey();
        List<Scan> scanList = entry.getValue();
        try (Table table = conn.getTable(tableName);
            RegionLocator regionLocator = conn.getRegionLocator(tableName)) {
          RegionSizeCalculator sizeCalculator =
              new RegionSizeCalculator(regionLocator, conn.getAdmin());
          Pair<byte[][], byte[][]> keys = regionLocator.getStartEndKeys();
          for (Scan scan : scanList) {
            if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
              throw new IOException("Expecting at least one region for table : "
                  + tableName.getNameAsString());
            }

I can see in the HBase Master that this table has data spread out over 4 regions, and in hbase shell I can scan the data with no error. This is on HBase 2.1. It seems the job does not see that there are regions for this table, and I wonder what could cause this. Did anyone ever encounter this error?
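For reference, the same RegionLocator call that MultiTableInputFormatBase relies on can be run standalone against the configuration my Spark job uses. This is only a hedged debugging sketch (the table name is a placeholder), meant to show whether the client configuration itself sees any regions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Pair;

public class CheckRegions {
    public static void main(String[] args) throws Exception {
        // Picks up whatever hbase-site.xml is on the classpath -- point it at
        // the same configuration the Spark job uses.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("myhbasetable"))) {
            // The exact call the input format uses to decide on splits.
            Pair<byte[][], byte[][]> keys = locator.getStartEndKeys();
            int regions = (keys == null || keys.getFirst() == null) ? 0 : keys.getFirst().length;
            System.out.println("Regions visible to this client: " + regions);
        }
    }
}
```

If this prints 0 while the HBase Master UI shows 4 regions, the problem is most likely in the hbase-site.xml/connection configuration the Spark job picks up, not in the table itself.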
Labels:
- Apache HBase
- Apache Spark
05-26-2021
09:04 AM
Would you by any chance know of any page that clearly describes the fields shown in the HBase Master UI? I find lots of blog posts about it, but I would be interested in one that brings clarity to all the metrics.
05-26-2021
07:17 AM
Many thanks for this detailed answer, which again far exceeded my expectations. I marked the whole discussion as the correct solution, as it will be of interest to others. I only wanted to add: I confirm I do see an effect of the row count. If I do my aggregation queries after running the RowCounter MapReduce job, the time improves by a factor of 3-4. Possibly this is a consequence of one of the other settings I apply (like the bloom filter, or something else)? When I look at the difference between the L1/L2 cache, I see the following behavior after running the MR job for RowCounter: on one region, about 12K blocks end up in the L1 cache (bloom/index), and many more blocks end up in the L2 cache (data). This shows that most of the caching happens in the L2 cache. According to this article, https://blog.cloudera.com/hbase-blockcache-101/, for me (using CombinedBlockCache) the L1 is used to store index/bloom filter blocks and the L2 is used to store data blocks. Now when I really use my application (and fire heavy aggregations, which now go much faster thanks to the MR job), I see similar behavior. As I assume the true aggregations make use of caching, I am almost inclined to think that the RowCounter MR job does help me fill the necessary L2 cache: after running my aggregate queries (now faster thanks to the MR job), the L1 cache is again slightly filled and the L2 cache is heavily filled.
05-21-2021
08:23 AM
Thank you for this very useful answer. Some remarks:
1) I have the feeling the RowCounter does help. When I let it run, the read time for that table afterwards is reduced by a factor of 3. I did not look into the details and would expect it just counts the rows and does not access the data, but my test suggests it does also store some of the data.
2) Isn't there any MapReduce job available that will access all the data once, then? If this were SQL, I would write an ORDER BY or SORT (something that kicks off a full table scan). I was also trying CellCounter, but this makes my regionservers become overloaded.
3) I have already set IN_MEMORY=true for my tables.
4) For my use case I need to do lots of heavy aggregation queries, always on the same couple of tables. I would prefer to have these cached completely; it would be very useful if there were a method to permanently cache them. I guess the best thing I can do is IN_MEMORY=true, but it seems like it doesn't keep them cached permanently. I have the feeling that after a while the cache is cleaned randomly. I guess this is what you mentioned about the LRU cache... I think I am just using the standard implementation (I only changed hfile.block.cache.size in Cloudera Manager to a higher value of 0.55). I did not understand this comment: "We can use RegionServer Grouping for the concerned Table's Region & ensure the BlockCache of the concerned RegionServer Group is used for Selective Tables' Regions, thereby reducing the impact of LRU." So, there is a way to prevent the cache from being cleaned again for these tables? I can tell from the HBase Master that the on-heap memory is only partially used (still more than 50% available after I have accessed all my tables once). I have 180 GB available and the tables would fit in there; too bad I cannot keep them in memory permanently. The relevant settings are:
HBase: hbase.bucketcache.size = 10 GB
HBase: hfile.block.cache.size = 0.55
HBase: Java heap size of the HBase regionserver = 20 GB
and at the catalog level:
CACHE_DATA_ON_WRITE true
CACHE_INDEX_ON_WRITE true
PREFETCH_BLOCKS_ON_OPEN true
IN_MEMORY true
BLOOMFILTER ROWCOL
5) The article is very interesting; I am still digesting it. One important thing I see is that the cached data is decompressed (I was not aware of this). I see that it is recommended to have block sizes comparable to the row size of the data blocks you are interested in; otherwise you are caching a majority of data you are not interested in. But I only use this cluster to access the same few tables, which I would all like to have completely cached, so for me all data is relevant and can be cached, and maybe this will not help me. Finally, I should figure out how to know the size of one row of data. Once again, your responses are very advanced and helpful, so thank you!
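For reference, the catalog-level properties listed above can also be applied through the HBase 2.x admin API. A minimal sketch, assuming a placeholder table 'mytable' with a single column family 'cf':

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class ApplyCachingProperties {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("mytable");          // placeholder table name
            TableDescriptor existing = admin.getDescriptor(table);
            // Start from the existing column family so other settings are preserved.
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                .newBuilder(existing.getColumnFamily(Bytes.toBytes("cf"))) // placeholder family
                .setCacheDataOnWrite(true)        // CACHE_DATA_ON_WRITE
                .setCacheIndexesOnWrite(true)     // CACHE_INDEX_ON_WRITE
                .setPrefetchBlocksOnOpen(true)    // PREFETCH_BLOCKS_ON_OPEN
                .setInMemory(true)                // IN_MEMORY
                .setBloomFilterType(BloomType.ROWCOL) // BLOOMFILTER ROWCOL
                .build();
            admin.modifyColumnFamily(table, cf);
        }
    }
}
```

Note that modifying a column family typically causes its regions to be reopened, so this is best done outside a heavy query window.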
05-19-2021
08:50 AM
I have the feeling this is a good guess: hbase org.apache.hadoop.hbase.mapreduce.RowCounter. It counts the rows, but I am not entirely sure it will cache everything.
05-19-2021
04:04 AM
Dear experts, I am interested in accessing all the data in an HBase table once, so that it gets cached. I am using an application that reads from HBase, and I notice it is a lot faster when the data has been read before, so I would like to "preheat" the data by warming it up: accessing it all once, so it is stored in the cache. I am quite familiar with the cluster configuration necessary to enable caching, but I wonder what the easiest way to do this is. Is there a command that accesses all the data in all regions once?
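For completeness, the most direct thing I can think of is a plain client-side full scan with block caching enabled. A minimal sketch (the table name is a placeholder); I am not sure this is the recommended way, hence the question:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class WarmTableCache {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("mytable"))) {  // placeholder name
            Scan scan = new Scan();
            scan.setCacheBlocks(true);   // ask the regionservers to populate the block cache
            scan.setCaching(1000);       // rows fetched per RPC; tune for your row size
            long rows = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    rows++;              // just iterating forces every block to be read
                }
            }
            System.out.println("Touched " + rows + " rows");
        }
    }
}
```

Whether the blocks actually stay resident afterwards still depends on the block cache size and eviction, which is what the rest of this thread is about.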
Labels:
- Apache HBase
05-12-2021
03:08 AM
I am wondering what difference in I/O can be expected for HBase with storage in the cloud versus storage on HDFS. I would expect that when data is retrieved from HDFS, it will be a lot faster than from the cloud (like ADLS; in my specific case ADLS Gen2, i.e. ABFS). Is there somewhere I can test this, or find a previous study on it? If this is the case, then one would expect that current HBase read performance is a bit lower than some years ago, when everything was on premise and HDFS was typically used. Maybe I am missing something obvious here, so any insight is appreciated!
Labels:
- Apache HBase
05-03-2021
08:03 AM
For future reference: I would like to add that the reason for the observed behaviour was an overcommit of memory. While I am writing, the memory used on the box at some point comes so close to the maximum available on the regionservers that the problems start. In my example, at the start of writing I use about 24/31 GB on the regionserver; after a while this becomes > 30/31 GB and eventually failures start. I had to take away a bit of memory from both the off-heap bucket cache and the regionserver heap. Then the process starts with 17/31 GB used, and after writing for an hour it maxes out at about 27 GB, but the failure was not observed anymore. The reason I was trying to use as much of the memory as possible is that when reading, I would like to have the best performance. When reading, making use of all resources does not lead to errors; while writing, however, it does. Lesson learned: when going from a period that is write-intensive to a period that is read-intensive, it can be recommended to change the HBase config. Hope this can help others! PS: although the reply of @smdas was of very high quality and led me to many new insights, I believe the explanation above in the current post should be marked as the solution. I sincerely want to thank you for your contribution, as your comments, in combination with the current answer, will help others in the future.
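To make the overcommit concrete, here is a small budgeting sketch; the numbers are hypothetical placeholders, not my exact configuration:

```java
public class RegionServerMemoryBudget {
    public static void main(String[] args) {
        // Hypothetical values -- substitute your own configuration.
        double physicalGb = 31.0;            // RAM available on the regionserver host
        double heapGb = 20.0;                // regionserver Java heap size
        double offHeapBucketCacheGb = 16.0;  // off-heap bucket cache
        double osAndOtherGb = 2.0;           // rough allowance for OS, DataNode, etc.

        double committed = heapGb + offHeapBucketCacheGb + osAndOtherGb;
        System.out.printf("Committed: %.1f GB of %.1f GB physical%n", committed, physicalGb);
        if (committed > physicalGb) {
            System.out.println("Overcommitted: heavy writes can push usage past the box limit.");
        }
    }
}
```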
03-19-2021
10:22 AM
Hello, when posting I had never hoped to get such a fast and remarkably clear and useful answer! It really helped me think more about the problem. Here are some comments:
SOLUTION 1: Indeed, allowing some more failures might be a quick fix. Will try. But the true fix probably lies in solving #2 below.
SOLUTION 2: If I understand correctly, when the JVM heap is full, GC takes place to clean up, and if this is really urgent, the JVM actually pauses. If this lasts longer than the ZooKeeper timeout (= 60 seconds), then the regionserver is believed to have died, and the master will move all its regions to other healthy regionservers. (I am not an expert on GC, but I see that my regionserver starts with "-XX:+UseParNewGC -XX:+UseConcMarkSweepGC".) But I had expected to see this mentioned clearly somewhere in the regionserver's logs or in Cloudera Manager, and I fail to find it. When my Spark job says "regionserver worker-x not available", at that exact timestamp I see no ERROR in the worker-x regionserver log.
Here is some more info with regard to your comments:
1) If the regionserver ran out of memory, I assume this should definitely show up as an error/warning in /var/log/hbase/regionserver*out. This seems not to be the case.
2) I believe that in case there were a JVM pause, this would show up in the regionserver's logs as "Detected pause in JVM or host machine (eg GC): pause of approximately 129793ms No GCs detected" (https://community.cloudera.com/t5/Support-Questions/Hbase-region-server-went-down-because-of-a-JVM-pause-what/td-p/231140). I see no such message.
4) Note: 32 GB is the total memory of the server. In fact I was wrong: the regionserver heap size is 10 GB (not 20 GB). You make a very good point: the other 29 days of the month we want read efficiency, which is why the memstore only receives 0.25. I should change it to 0.4 while writing and see if the error still persists.
5) I have defined my table to have 3x more shards than there are regionservers; I think this should avoid hotspotting. Bulk load would indeed bypass the need for memory. I understand it would then directly create the HFiles, but I am using some external geo-related libraries and am not sure it is possible.
Thanks again for your valuable contribution!
03-18-2021
04:08 AM
I have a question about regionservers that go into bad health when I am writing data from Spark. My question is: how do I know (in which logs to look exactly) what the cause of the bad health is?
BACKGROUND: My Spark job processes a month of data and writes it to HBase. This runs once a month. For most months there are no problems. For some months (probably with slightly higher traffic), I notice the regionservers go into bad health. The master notices this, and when the server goes down it moves all regions to another regionserver, after which it becomes healthy again. But as my writing continues, the same happens to other regionservers, and eventually my Spark job fails. I am confident the write failure is not due to corrupt data (like an impossible UTC time), since when that happens I clearly see "caused by value out of bounds" in my Spark logs. Now I see:
21/03/17 15:54:22 INFO client.AsyncRequestFutureImpl: id=5, table=sometable, attempt=10/11, failureCount=2048ops, last exception=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server myregionserver4,16020,1615996455401 is not running yet
LOGS: In the logs of the master I mainly see that it starts to identify all regions on the regionserver that is dying, and moves them. In the logs of the regionserver around the time of failure I noticed "org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=3.39 GB, freeSize=2.07 GB, max=5.46 GB, blockCount=27792, accesses=0, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=1049, evicted=0, evictedPerRun=0.0 2021-03-17 15:52:31,442 INFO", but this might not be relevant.
[Charts not reproduced here: the number of regions on the dying regionserver collapses as the regions are moved; there are two failures of the regionserver, at 3:54 and 4:15, followed by copying of the regions. The CPU looks OK to me. The memory looks like it comes close to the maximum available.]
Maybe it fails because of memory? For example, all writes are first written to memory and the available memory has been used up? But I had expected to see that kind of error message somewhere (logs or Cloudera Manager). These are the HBase settings (on CDP 7.2, HBase 2.2):
32 GB regionservers
Java heap size of the HBase regionserver = 20 GB
hfile.block.cache.size = 0.55
hbase.regionserver.global.memstore.size = 0.25
hbase.bucketcache.size = 16 GB
I am wondering how I can better understand the reason the regionserver fails. Then I can change the HBase config accordingly.
Labels:
- Apache HBase
09-25-2020
04:07 AM
We are trying to use HBase with ADLS as a storage layer. Is this still possible in CDH 6.1?
This is described in the docs for CDH 5.12:
https://docs.cloudera.com/documentation/enterprise/5-12-x/topics/admin_using_adls_storage_with_hbase.html#hbase_adls_configuration_steps
but we are using CDH 6.1, and there I don't see a similar page anymore.
Of course I ask because I have an issue following the 5.12 docs. When I try to change the following in the HBase configuration in Cloudera Manager:
hbase.root.dir = adl://ourproject.azuredatalakestore.net/hbase
I cannot save this value, as I get an error saying the variable is not in the allowed format:
HDFS Root Directory: Path adls://ourproject.azuredatalakestore.net/hbase does not conform to the pattern "(/[-+=_.a-zA-Z0-9]+)+(/)*".
07-10-2019
06:46 AM
Thank you for your interest! We are using CDH 6.2, Impala 3.2.0-cdh6.2.0, Hive 2.1.1-cdh6.2.0. Some updated info on the case above: I notice the files are not corrupted. So the files are created in some table in Hive, but when queried, Impala sometimes shows special characters. I have also seen it show a concatenation of two unrelated lines on one line. I could even trace that it was taking one line from one of the files that make up the table and incorrectly combining it with the end of another line that was in a different file! A situation like this:
file 1 contains: some_id_1, data1, data2, data3
file 10 contains: some_id_2, otherdata1, otherdata2, otherdata3
SELECT * FROM <problematictable> WHERE id='some_id_1'
should return --> some_id_1, data1, data2, data3
but returns --> some_id_1, data1, data2herdata2,otherdata3
When I restart the Impala services and query the table, it shows the results as expected. When you create a new external table with a different location, cp the files to that location, and query this new table, the results are also as expected. It might have to do with the metadata store? Maybe Impala has problems knowing where it needs to retrieve the data, and after a restart of the services everything is flushed and it does this correctly.
07-01-2019
11:57 PM
I noticed my title is wrong (I did not find the edit button). It should be: Table created in Hive as text and queried by Impala shows special characters in Impala, not in Hive.
07-01-2019
08:24 AM
Dear forum,
Another case to think about!
I created a table in Hive as a text file. When I query it, it looks fine for all records.
Next, in Impala, I use the INVALIDATE METADATA statement and afterwards query the table. Now Impala shows me, for some records, a question mark as if there are special characters in a couple of records (�). I notice the data in these fields cannot be used anywhere else in subsequent steps (instead of reading the values, it complains the values are not valid).
When I examine the text file on HDFS (through a text editor like Sublime with UTF-8 encoding), I see no special characters, and all characters look as expected. As said, neither INVALIDATE METADATA nor REFRESH fixes the issue, but after a restart of the Impala services the data is available as expected in Impala.
Currently we create the table as a text file and get the behavior described above. Before, we created the table as a Parquet file, but then got the error:
File 'hdfs://ourcluster/user/hive/warehouse/tmp.db/thetable/000000_0' has an invalid version number: <some value> This could be due to stale metadata. Try running "refresh tmp.thetable".
Note that this <some value> would always be something that comes from the data (a part of a string that is in the data). The refresh would not fix it (and, as said, we already do an INVALIDATE METADATA). Note that when we restart the Impala service, this error goes away and the data can be queried; the files then seem to be "uncorrupted". I have read a similar post elsewhere that suggests the data would be corrupted when one encounters this error.
Note: we use a collect_set function to create the field that gives the problem during the creation of the table in Hive. Our current train of thought is that in some cases (15 out of several million) this gives problematic results, but what happens exactly is not understood.
Thanks for any input!
Labels:
- Apache Hive
- Apache Impala
- Apache Spark
05-24-2019
08:26 AM
Note that this error was solved by using the built-in Hue option to add streaming to an Oozie workflow. Through the graphical user interface, one can select a "streaming" step and then specify the mapper/reducer. Before, we were running an Oozie workflow that invoked a shell script, and in that shell script we called the streaming jar with the appropriate parameters. This approach worked on CDH 5.12 but no longer on CDH 6.2. Luckily we were able to overcome this problem.
05-21-2019
08:25 AM
We are running an Oozie workflow.
All Hive/Impala steps run fine. One step in the Oozie workflow is a shell script that launches the Hadoop streaming jar. We have run the exact same code on an old cluster (CDH 5.12) and it worked fine. Everything related to this script also runs fine locally. However, when run through Oozie it fails. We are using CDH 6.2.
The job logs show the following error:
Error launching job , Invalid job conf : cache file (mapreduce.job.cache.files) scheme: "hdfs" host: "poccluster" port: -1 file: "/thefilepath" conflicts with cache file (mapreduce.job.cache.files) /user/yarn/.staging/job_1558348093683_0051/files/run_it_parallel.sh Streaming Command Failed!
We have tried clearing the user/app cache in the YARN directories of our users. There is no configuration parameter mapreduce.job.cache.files set. The /user/yarn/.staging directory is empty.
The container logs show the following error:
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherAM.runActionMain(LauncherAM.java:410)
at org.apache.oozie.action.hadoop.LauncherAM.access$300(LauncherAM.java:55)
at org.apache.oozie.action.hadoop.LauncherAM$2.run(LauncherAM.java:223)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.oozie.action.hadoop.LauncherAM.run(LauncherAM.java:217)
at org.apache.oozie.action.hadoop.LauncherAM$1.run(LauncherAM.java:153)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.oozie.action.hadoop.LauncherAM.main(LauncherAM.java:141)
Caused by: org.apache.oozie.action.hadoop.LauncherMainException
at org.apache.oozie.action.hadoop.ShellMain.run(ShellMain.java:76)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:104)
at org.apache.oozie.action.hadoop.ShellMain.main(ShellMain.java:63)
... 16 more
This looks like it is related to access rights. We do not have Kerberos, and Sentry is disabled for all services.
We are quite sure this process does not have an error on our side (we have used it on a different cluster, and it runs locally on a worker node without issues). Possibly the fact that when we run it locally we are using our own user (and not the users used by Oozie/YARN) is related to why we get this error.
Any input, especially on the first error from the job logs, would be hugely appreciated.
Thanks!
Labels:
- Apache Oozie
- Apache YARN
- HDFS
- MapReduce
- Security
02-22-2019
02:38 AM
I found the problem: I was working with the official Cloudera Docker image, which is 3 years old and has a version of Impala that does not support Kudu yet. I then grabbed the CDH 5.13 image from Cloudera's quickstart website, and then it works. This might help others. NB: it might be useful if Cloudera made a more recent version available in the official Docker repo.
02-20-2019
12:18 AM
I am working with the Cloudera quickstart image on Docker, which includes Impala but not Kudu. I would like to add Kudu. What is the best approach? I am currently pursuing #1:
APPROACH 1
1) Go inside the Docker container and install the Kudu master and tablet server.
2) Check that the Kudu master and server are running.
3) I tried to add "-kudu_master_hosts=localhost" to the /etc/default/impalad file and restart the Impala service, but it does not recognize this flag.
4) Hence the failure during CREATE TABLE: it fails since it considers the words PRIMARY KEY a syntax error.
APPROACH 2
I could do it using Cloudera Manager, using the parcels, but this looks harder to put in a Dockerfile.
APPROACH 3
Using docker-compose: is there a reference somewhere on exactly how to do it?
Or other suggestions? This sounds like a standard problem, but I did not find any guideline online.
Thanks!
Labels:
- Apache Kudu
- Cloudera Manager
- Quickstart VM
10-09-2018
08:50 AM
A kudu table
10-02-2018
08:21 AM
Note that this was resolved by restarting the Impala and Sentry services in Cloudera Manager.
10-02-2018
02:18 AM
Hi all,
I have a cluster that was working fine for weeks; I am mainly using Impala on Kudu tables, and Sentry is running on the cluster. Recently I started getting an error on the DROP TABLE command:
`ImpalaRuntimeException: Error making 'dropTable' RPC to Hive Metastore: CAUSED BY: MetaException: Failed to connect to Sentry service null`.
I believe the data was indeed deleted, since a SELECT query on the table complains that it cannot find the file in HDFS.
When I run the INVALIDATE METADATA command before dropping the table, the error goes away, but not always. If I try to drop the same table again (I believe at this point the data is already removed), on the second attempt I get the following error:
`ImpalaRuntimeException: Table xxx no longer exists in the Hive MetaStore. Run 'invalidate metadata xxx' to update the Impala catalog.`
Note: this does not happen with tables I created in Hive and now try to query in Impala; these tables were all created in Impala. I did not have the error before, and I feel like it started after I recently ran an 'invalidate metadata' statement for some other reason for the first time.
Thanks for your input!
Labels:
- Apache Impala
- Apache Kudu
- Apache Sentry
03-06-2018
12:29 AM
Hi all, I notice that "ingest (near) real time streaming" is a topic for the CCA175 exam. Does this mean one can expect questions on Kafka/Flume/Spark Streaming? I believe this is a new topic that was not yet included, e.g., at the beginning of 2017. Also, back then the Avro tools were part of the required skills, while now I do not see them mentioned anymore. Does anyone have more info on these two questions for CCA175: 1) is Kafka/Flume/Spark Streaming part of the exam skills? 2) are the Avro tools no longer included? Many thanks!!
Labels:
- Certification
03-05-2018
12:06 AM
I was referring to the following, which is not available yet in Spark 1.6: 1) create a DataFrame, 2) create a table to write direct SQL queries on: df.createGlobalTempView("people"), 3) query on this table: spark.sql("SELECT * FROM global_temp.people"). But I think what is required for the section "data analysis: use Spark SQL to interact with the metastore programmatically in your application" is to create a SQLContext/HiveContext and then query tables that are already stored in the Hive metastore, as in the sketch below. Any idea if this is correct?
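A minimal sketch of what I mean, using the Spark 1.6 API in Java (the database, table, and column names are placeholders):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class MetastoreQueryExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("metastore-query");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // HiveContext reads the Hive metastore configured via hive-site.xml,
        // so tables already registered in Hive can be queried directly.
        HiveContext hiveContext = new HiveContext(jsc.sc());
        DataFrame result = hiveContext.sql(
            "SELECT col1, COUNT(*) AS cnt FROM some_db.some_table GROUP BY col1"); // placeholder query

        result.show();
        jsc.stop();
    }
}
```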
03-02-2018
01:03 AM
Dear community, I notice the CCA175 exam will use Spark version 1.6. One of the main topics of the exam is data analysis using Spark SQL. I notice that the functionality to register a DataFrame in a form that can be used to run SQL queries only exists since Spark versions newer than 1.6 (e.g. registerTempTable or createOrReplaceTempView). Any thoughts on this? I am surprised that such an outdated version of Spark is used for the exam. Best to all!
Labels:
- Apache Spark
- Certification
02-15-2018
12:45 AM
Well, that is a subjective question; it depends on who you ask. There was nothing completely unexpected: definitely some simple questions and some that require a little more knowledge of the whole system. It just depends how much affinity you have with Cloudera, Linux, MapReduce, the Hadoop services, etc. I think one of the best references out there is this blog post: http://www.hadoopandcloud.com/hadoop/cca131-cloudera-administration-certification/ . Good luck to you all.
02-14-2018
12:45 AM
1 Kudo
I took the exam and am now able to answer my own questions: 1) It is a Cloudera exam, not a Hadoop exam, so indeed Cloudera Manager can be used for most questions (in combination with basic shell/Unix skills). 2) The complete Cloudera documentation was available; this means everything under this link: https://www.cloudera.com/documentation/enterprise/latest.html . If you click through, the 'search docs' feature is also available. In addition, if there is a service that needs to be installed and then used to show that it works correctly, the user guide for this service will also be available. The Apache Hadoop documentation was available as well. I think what you get depends on the nature of the questions and can vary slightly from exam to exam (I mean the documentation of a service you might have to use). I hope this can help others in preparing for the exam.