Member since: 08-28-2016
Posts: 4
Kudos Received: 2
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3751 | 08-28-2016 11:01 PM
02-18-2017
08:44 AM
I did, but again, just the last row is returned.
02-16-2017
04:30 AM
2 Kudos
Dear All,

At NOKIA Technologies we are evaluating the SHC connector to seamlessly read/write Spark DataFrames in HBase. So far, writes and modifications work perfectly, but version-based reads are failing: only the latest version is ever returned.

I have an HBase table with multiple versions of a column family cf1:

hbase(main):008:0* describe 'gtest'
Table gtest is ENABLED
gtest
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0350 seconds

hbase(main):038:0> scan 'gtest',{VERSIONS=>5}
ROW                                                               COLUMN+CELL
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577  column=cf1:event_result, timestamp=1487148399503, value=138
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577  column=cf1:event_result, timestamp=1487148399425, value=1
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577  column=cf1:event_result, timestamp=1487148399345, value=59
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577  column=cf1:event_result, timestamp=1487148376205, value=138
0000adb15e1d04181ab9da3507e75dd7f61c946f5be445ef38718ca5f9fc2577  column=cf1:event_result, timestamp=1487148376095, value=1
1 row(s) in 0.0290 seconds

I am trying to read all the versions of this column family using the SHC connector. Scanning the table from the HBase shell fetches all the versions as shown above; with the SHC connector, however, the resulting DataFrame contains only the latest version.

def catalog = s"""{
"table":{"namespace":"default", "name":"gtest"},
"rowkey":"account_id",
"columns":{
"account_id":{"cf":"rowkey", "col":"account_id", "type":"string"},
"event_result":{"cf":"cf1", "col":"event_result", "type":"string"}
}
}""".stripMargin
val df = sqlContext.read
  .options(Map(
    HBaseTableCatalog.tableCatalog -> catalog,
    HBaseRelation.MIN_STAMP -> "0",
    HBaseRelation.MAX_STAMP -> "1487148399504",
    HBaseRelation.MAX_VERSIONS -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
df.show()
+--------------------+------------+
|          account_id|event_result|
+--------------------+------------+
|0000adb15e1d04181...|         138|
+--------------------+------------+
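As a cross-check, here is a sketch that reruns the same scan through the plain HBase client API instead of SHC, using the same table, time range, and version count as above (the HBase 1.x client API is assumed):

    import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    // Scan 'gtest' directly, mirroring the shell scan and the SHC options above.
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("gtest"))

    val scan = new Scan()
    scan.setMaxVersions(5)                // same as HBaseRelation.MAX_VERSIONS -> "5"
    scan.setTimeRange(0L, 1487148399504L) // same as MIN_STAMP / MAX_STAMP above

    val scanner = table.getScanner(scan)
    for (result <- scanner.asScala; cell <- result.rawCells()) {
      println(s"ts=${cell.getTimestamp} value=${Bytes.toString(CellUtil.cloneValue(cell))}")
    }
    scanner.close()
    table.close()
    connection.close()

This direct scan should print all five cells, matching the shell output, which would point the truncation at the connector rather than the table itself.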
Labels:
- Apache HBase
- Apache Spark
08-28-2016
11:01 PM
Well, found the problem... It seems the HUE server sends the request to webhdfs via the company proxy server, even though the HUE server is set up on the same host as the webhdfs server. The proxy server obviously didn't know the hostname of the webhdfs server, so it couldn't forward the request. I changed the hostname to its public IP address in hue.ini and everything works. If anyone knows how to make the HUE server bypass that proxy, please let me know.
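One thing we have not yet verified: HUE's HTTP client generally honors the standard proxy environment variables, so exporting a no_proxy exclusion for the webhdfs host before starting the HUE server might keep that request off the proxy. The hostname below is a placeholder:

    # Untested sketch: exclude local hosts from the proxy before starting HUE.
    # "namenode.internal" stands in for the real webhdfs host.
    export no_proxy="localhost,127.0.0.1,namenode.internal"
    export NO_PROXY="$no_proxy"
    # then restart the HUE server so it picks up the new environment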
08-28-2016
10:28 PM
Dear HUE community,

We at NOKIA Technologies have been using the Cloudera stack for our analytics-based products for a few months. We have had a very good experience with it, especially the HUE-enabled Oozie workflow designer; it has made our lives much easier than before. We have now decided to use HUE for the same purpose on our old clusters that do not run Cloudera.

We followed the instructions and installed HUE (and the other required components) on our old clusters. But when we start the UI in a browser, the HUE server just cannot communicate with webhdfs. We know webhdfs itself works because the same GET command copied from the HUE logs works fine when run from a browser window, and an entry is logged in the webhdfs log. We know HUE cannot reach webhdfs because no request entry is logged in the webhdfs log when the file browser is accessed via the HUE UI. I have checked the superuser configuration and privileges, and all of that is fine. For simplicity, we have just one user for all services, which is also the Hadoop superuser.

I paste the relevant logs below:

...
[28/Aug/2016 22:21:33 -0700] middleware DEBUG {"1472473293": {"status": 200, "impersonator": null, "service": "jobbrowser", "url": "/jobbrowser/", "user": "spark1", "ip_address": "135.x.x.x", "authorization_failure": false}}
[28/Aug/2016 22:22:04 -0700] access INFO 135.x.x.x spark1 - "GET /jobbrowser/ HTTP/1.1"
[28/Aug/2016 22:22:04 -0700] connectionpool DEBUG "GET http://XXXX:8088/ws/v1/cluster/apps?limit=10000&user=spark1&finalStatus=UNDEFINED HTTP/1.1" 302 90
[28/Aug/2016 22:22:04 -0700] connectionpool INFO Resetting dropped connection: proxy.XXXXXX.com
[28/Aug/2016 22:22:10 -0700] connectionpool DEBUG "GET http://www.XXXX.com/ws/v1/cluster/apps?limit=10000&user=spark1&finalStatus=UNDEFINED HTTP/1.1" 404 1245
...

(error 404):
Traceback (most recent call last):
  File "/opt/hue/build/env/lib/python2.6/site-packages/Django-1.6.10-py2.6.egg/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/hue/build/env/lib/python2.6/site-packages/Django-1.6.10-py2.6.egg/django/db/transaction.py", line 371, in inner
    return func(*args, **kwargs)
  File "/opt/hue/apps/jobbrowser/src/jobbrowser/views.py", line 121, in jobs
    raise ex
RestException: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/> <title>404 - File or directory not found.</title> <style type="text/css">
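For reference, this is the kind of direct check we ran against webhdfs from the HUE host (hostname, port, and user below are placeholders for our setup):

    # Query webhdfs directly; namenode-host, 50070 and spark1 are placeholders.
    curl -i "http://namenode-host:50070/webhdfs/v1/?op=LISTSTATUS&user.name=spark1"
    # The same call with any proxy disabled, to compare routing:
    curl -i --noproxy '*' "http://namenode-host:50070/webhdfs/v1/?op=LISTSTATUS&user.name=spark1"

The direct call returns the directory listing and shows up in the webhdfs log; the identical request from HUE never does.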
Labels:
- Apache Hadoop
- Apache Oozie
- Cloudera Hue