Member since
04-22-2016
67
Posts
6
Kudos Received
2
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1291 | 11-14-2017 11:43 AM |
 | 246 | 10-21-2016 05:14 AM |
05-24-2018
06:04 PM
I have a table in Phoenix used for ETL auditing. This table has a composite key of three columns. I recently added a secondary index on a fourth column to speed up queries on that column. However, I have noticed an odd behaviour: for all rows that were in the table before the index was created, every column other than the key columns and the indexed column is <null>. For all rows inserted after the index was created, all columns are populated. Since this is a write-heavy table (~500 000 rows inserted per day), I have used a local index. I have two questions: 1) Is this expected behaviour for a Phoenix index, or does it indicate an error on my part? 2) How can I get the <null> fields populated? I am willing to recreate the index if necessary.
- Tags:
- Phoenix
02-16-2018
12:48 PM
I am currently experiencing an issue with HDFS storage. We have (intentionally) deleted the majority of the data on our cluster. According to hdfs du, the total usage on HDFS is approximately 1 TB. However, Ambari reports the DFS used as 238.9 TB. I could understand a small discrepancy, for blocks that have not yet been deleted and the like, but a difference this large is worrying. On top of this, a huge number of the underlying disks are 100% full, and no amount of HDFS balancing changes this. HDFS has been incredibly unstable over the past few weeks, and it's possible that this is the underlying cause. Is there any way I can safely clear this space? I don't mind losing data (we are repurposing this cluster) as long as HDFS remains stable. The full disks are my bigger concern, but fixing that should also clear a lot of the falsely reported space on HDFS. Any advice will be appreciated.
11-20-2017
06:43 AM
Unfortunately the only way I could get this to work was by reverting to batch processing. With Spark streaming it remained very slow.
11-14-2017
11:43 AM
@Josh Elser Disabling hbase backups did not improve the situation. After sifting through the logs for the cleaner, I have identified the following series of warnings:
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] master.ReplicationHFileCleaner: ReplicationHFileCleaner received abort, ignoring. Reason: Failed to get stat of replication hfile references node.
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] master.ReplicationHFileCleaner: Failed to read hfile references from zookeeper, skipping checking deletable files
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] zookeeper.ZKUtil: replicationHFileCleaner-0x15fb38de0a0007a, quorum=server01:2181,server02:2181,server03:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/replication/hfile-refs
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/hfile-refs
These repeat multiple times. So it appears that the replication HFile cleaner is failing due to an issue with zookeeper. We recently had some fairly severe zookeeper issues, but things have returned to a completely stable state now, apart from this. Do you have any advice for how I can move forward, either with forcing the HFile cleaner to run or with repairing the state of zookeeper?
11-12-2017
08:14 AM
Our cluster recently started having a problem with one of our ZooKeeper servers - this server consistently refuses any connections made to ZooKeeper, which has led to some problems in our cluster. This is not a network issue, as all other connections are successful. Our investigation has identified that ZooKeeper CPU usage on this server is absurdly high - ranging from 300% to 2200%. For comparison, the other servers in the quorum rarely show ZooKeeper in top at all, and when they do, the CPU usage is <1%. A misconfiguration of this ZooKeeper server seems unlikely, since the cluster is managed through Ambari - all ZooKeeper servers should have exactly the same configuration. We have restarted ZooKeeper on this machine multiple times with no improvement. Even restarting the physical host did not help. We are having authentication issues on that machine, which may contribute to the problem; however, the zookeeper user is accessible, and all commands through Ambari are successful, with no permission-denied errors. Some possibly relevant information: we are running HDP-2.6.0.3 with ZooKeeper 3.4.6, and our quorum size is three. Does anybody have any suggestions for how this can be improved or resolved?
11-03-2017
04:51 AM
Thanks for the reply, Josh. hbase.backup.enable is not defined on our cluster, so it defaults to true. I'll set it to false and see whether things get to a more reasonable level. If that doesn't work, I will turn on TRACE logging and update with extra information. In case it changes anything, we're running HDP 2.6.0.3.
11-02-2017
11:53 AM
We are having an issue with running out of disk space on HDFS. A little investigation has shown that the largest directory, by far, is /apps/hbase/data/archive. As I understand it, this directory keeps HFiles that still need to be retained, typically because of snapshots. I know that having too many snapshots is the usual culprit for a large archive directory. However, snapshots do not seem to be the issue here: /apps/hbase/data/archive is a little larger than 110 TB, while the sum of all of our snapshots is <50 TB. We have not set hbase.master.hfilecleaner.ttl, but I have read that the default is 5 minutes - this is definitely not the explanation for many of the HFiles we have, which frequently date back many months. What steps can I follow to try to reduce this usage?
10-12-2017
04:35 AM
I set that configuration through Ambari, which should propagate the config to all nodes in the cluster if I understand correctly. Should I perhaps include a trailing / in the config? I am only using /tmp for testing purposes in replicating from our dev cluster. Once this goes to our production and DR clusters, I will choose a more suitable location for the configuration files.
10-11-2017
12:52 PM
I am currently in the process of setting up replication on HBase. Normal HBase replication (based on the WALs) is set up and working well. However, I am struggling to set up replication for bulk loads as specified in [HBASE-13153]. I am running HDP 2.6. I initially thought the version of HBase was the issue, but according to the release notes this was patched in HDP 2.4.3, so the HBase version is not the issue here. I have followed the instructions provided in the link as well as I can understand them, but bulk-loaded rows are not being replicated. I have set the following configurations:
- On the source cluster: hbase.replication.bulkload.enabled=true
- On the source cluster: hbase.replication.cluster.id=source
- On the peer cluster: hbase.replication.conf.dir=/tmp/fs_conf
- On the peer cluster: hbase.replication.source.fs.conf.provider=org.apache.hadoop.hbase.replication.regionserver.DefaultSourceFSConfigurationProvider
I have copied core-site.xml, hdfs-site.xml, yarn-site.xml and hbase-site.xml from the source cluster to all of the region servers on the peer cluster, under /tmp/fs_conf/source. These are the only changes specified in the JIRA (at least as far as I understand it), yet the bulk-loaded rows are not being replicated. I suspect I have missed or misinterpreted part of the instructions. Any help will be appreciated.
07-20-2017
01:27 PM
As I understand it, you're correct - the number of partitions should match the number of regions. However, I discovered that my DataFrame was defaulting to 200 partitions, even though it comes from an RDD with only 1 partition. Coalescing into fewer partitions doesn't significantly improve performance.
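For reference, a minimal sketch of how I checked and reduced the partition count (the DataFrame here is a toy stand-in for the one built from my single-partition RDD):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-check-sketch")
sqlContext = SQLContext(sc)

# Toy stand-in for the DataFrame I am writing to Phoenix
df = sqlContext.createDataFrame(sc.parallelize([(1, "a")], 1), ["ID", "VAL"])

print(df.rdd.getNumPartitions())   # in the real job this reported 200
df = df.coalesce(1)                # coalescing did not noticeably speed up the save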
07-19-2017
08:48 AM
Hi. Yes, that is what I mean. I am using Spark 1.6.2 and Phoenix 4.7. The Phoenix table has no salt buckets, and only 1 region.
07-18-2017
12:14 PM
I am using Spark-Phoenix integration to load data from a DataFrame into a Phoenix table. Unfortunately this is ridiculously slow - pushing 23 rows of 25 columns each takes 7-8 seconds. This is with two executors, meaning it's effectively twice as slow. This makes it unusable in my case, since it is planned for use in a streaming application - the records received in a 15-second window take up to a minute to load. When I look at the Spark History Server, I see two really strange things: the slowest part, by far, is 'saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55', so the problem is not in my code; and for 23 rows, 200 tasks are launched, which seems excessive. Does anybody have experience with how I could improve these loading speeds? Ideally I want to keep using Phoenix in some way, since we have secondary indexes on the table.
07-07-2017
06:00 AM
Hi Matt. I switched "Remove Trailing Newlines" to false and the number of fragments came to 66443 as you suggested. This is a little confusing to me, as when I check the original file the number of lines is 66430. However, your point is 100% correct. Thank you for opening the Jira request. While I wait for this, do you know of any useful workaround I can use in the meantime to get the number of actually emitted fragments? It would be slower, but would it be possible, after the split, to merge the fragments (which would now include no newlines) and split them again? Thanks, Mark
07-06-2017
09:52 AM
I have a flow in NiFi which splits a file into individual lines, inserts those lines into a database and, after they have all been inserted, updates a control table. The control table must only be updated after every line has been inserted. To achieve this, fragment.index is compared to fragment.count - if these are equal, then I know that every line has been processed and we can move on to updating the control table. However, recently some of our files failed to update the control table. I have written the attributes of the flow files to disk, and they show something that confuses me: the number of flow files that comes out of the SplitText processor is 66430, which matches the number of lines in the file, yet the fragment.count attribute is 66443. Does anybody know why the fragment count would be incorrect, and how I can fix this?
06-30-2017
12:53 PM
I'm experimenting with broadcast variables in PySpark at the moment, and I've noticed that whenever I create an explicit Java object using sc._jvm, I get errors when I try to broadcast these variables. Looking at the stack trace, the problem seems to be related to pickling. Does anybody know how I can broadcast such variables?
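For reference, a stripped-down reproduction of what I mean (a sketch assuming a plain SparkContext; the HashMap is just a stand-in for the real object I create through sc._jvm):

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-jvm-object-sketch")

# Broadcasting ordinary, picklable Python data works fine
bc_ok = sc.broadcast({"key": "value"})

# Creating an explicit Java object through the py4j gateway...
java_map = sc._jvm.java.util.HashMap()
java_map.put("key", "value")

# ...and trying to broadcast it fails, because the py4j proxy cannot be pickled
bc_fail = sc.broadcast(java_map)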
06-30-2017
05:10 AM
It is the final step (saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55) that is slow. All the previous steps - two subtracts and a distinct - are reasonably fast. However, saving the data to Phoenix is the slow part.
06-29-2017
07:45 AM
I have a Spark ETL process which reads from a CSV file into an RDD, performs some transformations and data quality checks, converts the result into a DataFrame and pushes the DataFrame into Phoenix using Spark-Phoenix integration. Unfortunately, actually pushing the data to Phoenix is ridiculously slow - the 'saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55' portion kicks off 120 tasks (which seems quite high for a file of ~3.5 GB), and each task then runs for between 15 minutes and 1 hour. I have 16 executors with 4 GB of RAM each, which should be more than sufficient for the job. The job has currently been running for over two hours and will probably run for another hour or more, which is very long to push only 5.5 million rows. Does anybody have any insight into a) why this is so slow and b) how to speed it up? Thanks in advance.
Edit 1: The job completed in 4 hours.
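For reference, a simplified sketch of the job (the table name, ZooKeeper quorum, schema and transformations are anonymised placeholders, and this assumes the phoenix-spark plugin is on the classpath):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="phoenix-etl-sketch")
sqlContext = SQLContext(sc)

# Read the CSV, apply the (anonymised) transformations and quality checks,
# then convert the result into a DataFrame
rdd = sc.textFile("/data/input.csv").map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, ["COL1", "COL2", "COL3"])

# The slow part: writing the DataFrame to Phoenix. This is what shows up as
# 'saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55' in the History Server.
df.write \
    .format("org.apache.phoenix.spark") \
    .mode("overwrite") \
    .option("table", "MY_TABLE") \
    .option("zkUrl", "zk-host:2181") \
    .save()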
06-19-2017
09:51 AM
In the hopes that this will help somebody else in the future, I will post what I have discovered. The first important realisation I had was that each file was being loaded into its own partition within the underlying RDD, and that the RDD's debug string shows the source file for each of these partitions. Knowing that, it's fairly straightforward to get the filename for each element:

filenames = []

def mapIndex(index, iterator):
    # Pair each partition's contents with the filename recorded for that partition index
    values = list(iterator)
    yield (filenames[index], values)

def f(rdd):
    # The debug string lists the source file backing each partition; skip the
    # first two lines and pull the filename out of each remaining line
    filenames[:] = []   # reset for each micro-batch so indexes match this batch's files
    debug = rdd.toDebugString()
    lines = debug.split("\n")[2:]
    for l in lines:
        filenames.append(l.split()[1].split("/")[-1])
    if not rdd.isEmpty():
        rdd = rdd.mapPartitionsWithIndex(mapIndex)
        # ... process the (filename, lines) pairs here

if __name__ == "__main__":
    # Set up SparkContext and StreamingContext
    stream = streamingContext.textFileStream(<your_directory>)
    stream.foreachRDD(f)
    streamingContext.start()
    streamingContext.awaitTermination()
The RDD initially contains only the lines of the files. After the execution of mapPartitionsWithIndex, it contains pairs of the form <filename, (line0, line1, line2, ..., linen)>. If you want the lines split out again, do the following:

rdd = rdd.flatMapValues(lambda x: x)

The RDD then contains pairs of the form <filename, line> for each line in each file. I hope this will help at least one other person.
06-15-2017
12:58 PM
I am currently developing a Spark Streaming application in Python to watch a directory and load new files into a Phoenix table. This part is fairly straightforward using Spark-Phoenix integration. However, for auditing purposes I need to keep track of which files are being loaded, and how many records come from each of them. textFileStream does not record the filename in the elements of the RDD, and the only way I have found so far to get the filenames is from the RDD's debug string. While this does give me the filenames, it in no way indicates which elements belong to which files. What would be ideal is something like SparkContext's wholeTextFiles(...), which gives back an RDD of the form (filename, contents), whereas textFileStream acts more like SparkContext's textFile(...). Is there any reasonable way to determine which elements belong to which files?
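For reference, a minimal sketch of the difference I mean (the directory path and batch interval are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="filename-tracking-sketch")
streamingContext = StreamingContext(sc, 30)

# Batch API: each element is a (filename, whole_file_contents) pair
pairs = sc.wholeTextFiles("/landing/dir")

# Streaming API: each element is a single line of text, with no filename attached
lines = streamingContext.textFileStream("/landing/dir")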
06-12-2017
05:15 AM
My problem went away on its own, strangely. I left it without restarting and, about 5 hours after the most recent restart, everything was back to normal. For some reason the start-up was very slow. Since then I've had to restart once, and that restart was quick. Sorry I can't help.
06-06-2017
09:45 AM
Today I encountered an error with NiFi that I have never seen before. We have been successfully running NiFi, managed by Ambari, for a number of months. This morning, while working in NiFi, the UI suddenly died. When refreshing the UI, I get ERR_CONNECTION_REFUSED. I have tried this in both Chrome and Internet Explorer, so it is not a browser-specific issue. I have now spent the morning trying to solve this issue. I can give the following information:
- NiFi is managed by Ambari. According to Ambari, NiFi is up.
- NiFi is SSL secured. This is not a security issue, since when switching off SSL authentication and restarting through Ambari, the non-secure UI gives the same issue.
- Our network team has confirmed that this is not a firewall issue - port 9091 is not blocked, but is refusing all connections.
- selinux is disabled.
- ps -ef | grep nifi shows NiFi is up and running on the machine.
- bin/nifi.sh status shows: 2017-06-06 11:34:47,657 INFO [main] org.apache.nifi.bootstrap.Command Apache NiFi is currently running, listening to Bootstrap on port 51190, PID=13513
- nifi-app.log stops writing anything shortly after startup. The last line is: 2017-06-06 11:16:28,561 INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded a total of 96 properties. Including precedence overrides effective accessible registry key size is 96 (That was nearly half an hour ago. Nothing since.)
I have restarted NiFi a number of times trying to solve this, but it has had no effect. Any help would be greatly appreciated.
06-02-2017
06:07 AM
hbck gives one inconsistency - a single Empty REGIONINFO_QUALIFIER. I know hbck has a tool to fix this, but I haven't run it yet. That's the only inconsistency shown. So would this indicate that the offline regions are normal? Thanks.
05-25-2017
05:27 AM
Thanks for the response. No, the 700 regions were still up, but there were also 600 offline regions. If these are the child regions from the merge, do you know when they would be cleaned up? We had to do a restore_snapshot because of some system instability, and now the online region count is correct but the offline region count is over 2000. Will these clear up on their own, or only after a master restart? We can't restart the cluster because this is a customer-facing system.
05-24-2017
09:06 AM
Yesterday I performed a number of region merges on one of our larger tables. The table had ~1400 regions to start with, but many of these were small - we wanted to get the average region size closer to our region size limit of 15 GB. The merges went well, leaving us with just over 700 regions. However, since then we have had a huge number of offline regions - nearly 600 are currently listed as offline. Does anybody know the cause of these offline regions, and how to fix them? I understand regions do sometimes go offline legitimately, but this many seems to me to indicate a problem.
05-12-2017
02:17 PM
We recently had a failure of all of the region servers in our cluster, although the active and standby masters stayed up. When the region servers were brought back up, regions were reassigned relatively quickly. However, two regions have not come back, and according to the UI they are stuck in the OFFLINE state.
I have tried running hbase hbck -repair a number of times, as well as various other options that I hoped would help (-fixAssignments, -fixSplitParents). None of these successfully brought the regions online. I checked the logs of the region servers for these regions, and there is no reference to them after they were closed prior to the region server failure.
When I checked the master logs, however, I found the following:
master.AssignmentManager: Skip assigning table_name,13153,1485460927890.3d68e485cb6294345fe1469097fa5aca., it is on a dead but not processed yet server: server05,16020,1494493877392
The server listed as dead is alive and well, with over 200 regions already assigned to it. This error message led me to HBASE-13605, HBASE-13330 and HBASE-12440, which all describe pretty much the same issue. Unfortunately, none of these JIRAs describe any way to fix the issue once it occurs. Does anybody have any advice for resolving this? This is a production system, so shutting down the master is a last resort.
04-25-2017
12:46 PM
We have a cluster running HDP 2.4 with 8 worker nodes. Recently, two of our datanodes have been going down frequently - usually both go down at least once a day, often more than that. While they can be started up again without any difficulty, they will usually fail again within 12 hours. There is nothing out of the ordinary in the logs except very long GC pauses before failure. For example, shortly before a failure this morning, I saw the following in the logs:
2017-04-25 03:49:27,529 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23681ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=23769ms
I checked the free memory on that node, and it was slightly more than the free memory on a similar node that isn't shutting down. Since it is the same two nodes repeatedly, I assume the problem is something to do with the nodes themselves. Does anybody have any advice for this problem?
04-04-2017
06:26 AM
I am currently having a serious problem with the output file for Zookeeper. I recently noticed that /var/log was using 15 GB, but du -h /var/log only reported less than 1 GB.
When I checked for deleted files (lsof | grep deleted | grep /var/log) I noticed that there are a number of log files that have been deleted but are still held open. The most concerning of these is /var/log/zookeeper/zookeeper-zookeeper-server-xxxx.out, which is over 13 GB in size. In spite of being deleted, the file is still open.
Our system admin suggested restarting Zookeeper to release the file handle - unfortunately this is a production cluster, so restarting Zookeeper is an absolute last resort. I have come up with some possible solutions:
- Since we have a quorum of 3 servers, would it be realistic to restart only the Zookeeper server on the machine that is giving the problem?
- If that is not realistic, can I truncate the file (as per this article) to clear the space without causing problems for Zookeeper?
If neither of these is possible, what are my other options? Thanks in advance.
03-22-2017
12:49 PM
I am using HBase snapshots for backups in my cluster: I take weekly snapshots to facilitate recovery from HBase failure. However, something concerns me. I was under the impression that HBase snapshots store only metadata, without replicating any data, making them ideal for low-footprint backups. However, after a short time (3+ weeks) a snapshot will often be exactly the same size as the source table, sharing 0% of its data with the source table. This is a problem, since it means that keeping even a few weeks of backups can consume 25+ TB of space. Can anybody explain why this happens, and whether there is any way to avoid it?
03-17-2017
11:13 AM
I have set up my NiFi instance (NiFi 1.0) to communicate over SSL and authenticate the admin user with a certificate, based on this article. This works correctly. Now I would like to add new users to NiFi and have those users authenticate with a username/password. I know this is possible with LDAP, as well as with Ranger and Kerberos, but I would prefer to manage my users directly through NiFi. Adding new users is straightforward, but I can see no way to set a password for a user. Is there any way to achieve this? I know it is possible to create my own LoginIdentityProvider. Does anybody have an example of code that could do what I want?
03-17-2017
09:09 AM
I am currently hitting a timeout on a Hive query over an HBase table, similar to this article. As per the instructions in that article, I intend to increase the HBase timeouts, including the RPC timeout. However, before going ahead I would like to understand the potential consequences of this change for HBase and for external services that access HBase, since this is a production cluster. If someone could give me an idea of what negative consequences this might have, it would help a lot.