Created 09-28-2016 08:26 PM
I'm trying to get Polybase working with the Hortonworks Data Platform 2.5 sandbox.
The Polybase errors in the sys.dm_exec_compute_node_errors DMV boil down to:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException: Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone) at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)
The file I'm trying to get (/tmp/ootp/secondbasemen.csv) exists and I am able to access it through the WebHDFS interface on port 50070. I can do that through curl or through the Microsoft.Hadoop.MapReduce NuGet package. I've confirmed that the CSV file has global read access, all folders (/tmp and /tmp/ootp) have global read and execute access, and I left the "Allow All" rule on in Ranger.
Here are all of the relevant bits from logs that I could find on my Hadoop sandbox. My data node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-datanode-sandbox.hortonworks.com.log) and name node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-namenode-sandbox.hortonworks.com.log) logs did not appear to have any useful information.
http://sandbox.hortonworks.com:50070/logs/hdfs-audit.log:
2016-09-28 03:05:15,675 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=getfileinfo src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc 2016-09-28 03:05:16,088 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc 2016-09-28 03:05:39,004 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc 2016-09-28 03:06:03,502 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc 2016-09-28 03:06:38,392 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
YARN logs show nothing at all.
The audit log shows that the pdw_user account is allowed to get info on /tmp/ootp/secondbasemen.csv and then is allowed to open the file, which it tries to do four times before the Polybase engine gives up and spits out the following error: Msg 8680, Level 17, State 26, Line 1 Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase.
I've tried a couple of things like rebuilding the HDP 2.5 Docker container to expose port 50010 (in case Polybase *really* needed access to the data node) and creating an explicit pdw_user account, but neither did anything.
Here are my data source and table definitions within the SQL Server external table:
CREATE EXTERNAL DATA SOURCE [HDP2] WITH (TYPE = Hadoop, LOCATION = N'hdfs://sandbox.hortonworks.com:8020', RESOURCE_MANAGER_LOCATION=N'sandbox.hortonworks.com:8050')GOCREATE EXTERNAL TABLE [dbo].[SecondBasemen] ( [FirstName] [varchar](50) NULL, [LastName] [varchar](50) NULL, [Age] [int] NULL, [Throws] [varchar](5) NULL, [Bats] [varchar](5) NULL ) WITH(LOCATION = N'/tmp/ootp/secondbasemen.csv',DATA_SOURCE = HDP2,FILE_FORMAT = TextFileFormat,REJECT_TYPE = Value,REJECT_VALUE = 5)GOLOCATION = N'/tmp/ootp/secondbasemen.csv',DATA_SOURCE = HDP2,FILE_FORMAT = TextFileFormat,REJECT_TYPE = Value,REJECT_VALUE = 5)GO
I was able to get this working under HDP 2.4, and I'd be happy to do more legwork if anyone has potential leads.
Created 05-30-2017 08:00 AM
Hi @Kevin Feasel and @Pedro Faro
To give a bit of background on my setup and how I got it working.
1) Create HDP 2.5 VM Sandbox in Azure and make IP static.
2) Create SQL Server 2016 VM in Azure and make IP static.
3) Install putty on SQL Server 2016 vm.
4) Create putty session to Sandbox IP with the following tunnels
10000 10015 10500 11000 19888 2222 4200 50010 50070 50075 8020 8050 8080 8088 8886 8888 9995 9996
5) Create entry in hosts for 127.0.0.1 mapping to sandbox.hortonworks.com
6) Connect to sandbox using putty
7) Open Ambari and get copies of configs core-site, hdfs-site, hive-site, mapred-site and yarn-site from sandbox and move over to C:\Program Files\Microsoft SQL Server\MSSQL13.SQL_2016\MSSQL\Binn\Polybase\Hadoop\conf\ on SQL Server 2016 vm.
😎 Restart Polybase services.
9) Follow instructions on https://hortonworks.com/hadoop-tutorial/opening-sandbox-ports-azure/ to add port 50010.
10 ) Use Ambari to add a test.csv file to HDFS under folder /user/maria_dev/data
11) Create external source
CREATE EXTERNAL DATA SOURCE [Data-Source-Name]
WITH ( TYPE = HADOOP, LOCATION ='hdfs://sandbox.hortonworks.com:8020', RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050' );
12) Create external file format
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR =',', USE_TYPE_DEFAULT = TRUE));
13) Create external table
CREATE EXTERNAL TABLE [dbo].[Test]
( CustomerID INT,
CustomerName NCHAR(10),
NumberOfOrders INT )
WITH (LOCATION='/user/maria_dev/data/Test.csv',
DATA_SOURCE = [Data-Source-Name],
FILE_FORMAT = TextFileFormat );
14) SELECT * FROM [dbo].[Test]
I am logged onto the SQL Server 2016 VM the from step 3 onwards.
This is just for testing the functionality. Hope this helps
Created 09-29-2016 12:22 PM
Hi @Kevin Feasel. Try setting authorization to "none" within the Hive configs in Ambari.
Created 09-29-2016 12:55 PM
Thanks for the reply, but I still get the same error message. Polybase doesn't use the Hive engine, so that would have been a bit surprising if it had been the answer.
Created 09-30-2016 03:56 AM
As an update, I ran a Wireshark trace on my VM setup with HDP 2.4 and another with 2.5. It looks like what's happening with Polybase is that the Polybase engine makes a request on port 8020, using the getBlockLocations command to ask for my file (/tmp/ootp/secondbasemen.csv in this example). The name node responds by giving the block name and a data node where that block lives. Here's where things differ a bit:
In HDP 2.4: Polybase connects to the data node (whose IP is the same as the name node) on port 50010 and retrieves the file contents I want through a TCP stream connecting to port 50010 on my HDP sandbox. In HDP 2.5: Polybase tries to connect to the data node, whose IP is 172.17.0.2. Every two seconds for the duration of my check, I get back the following message: "Standard query response [hex ID] No such name PTR 2.0.17.172.in-addr.arpa." So it's trying to connect directly to the data node but can't because the node is now within a Docker container. I'll keep plugging away at this (because I want to use HDP 2.5 instead of 2.4), but if there are any workarounds or bits of advice on how to get it functional, I'd be grateful.
Created 12-10-2016 11:41 PM
Hi Kevin,
Do we have a solution to this error "Could not obtain block“ ?
I am also getting this error while trying to query a table after setting up Polybase, I am using HortonWorks Hadoop 2.5 and Sql Server 2016 SP1. Any response will be highly appreciated. Thanks in advance. Aman
,Hi Kevin, did you ever get a solution to this error "Could not obtain block“ ?I am also getting this error while trying to query a table after setting up Polybase, I am using HortonWorks Hadoop 2.5 and Sql Server 2016 SP1, any response will be highly appreciated.
Thanks in advance. Aman
Created 12-11-2016 12:33 AM
Hey, Aman. The short of it is that Polybase does not currently support Dockerized Hadoop data nodes. Polybase queries hit the data nodes directly, so when you're putting data nodes on a non-routing subnet (like Docker does), the SQL Server node can't access the Hadoop data node and so queries fail.
For now, I've dropped back to the HDP 2.4 sandbox for working with Polybase. I've spoken to the relevant team at Microsoft and I'm hopeful that they'll support Dockerized nodes in the future, but I'm sure it won't be an easy task for them given the way Polybase works.
Created 12-11-2016 06:50 PM
Hi Kevin, Thanks a lot for your response, very helpful!!
Can you please point to where could i find 2.4 HDP sandbox? i was looking for some archives on Hortonworks website but no luck.
Created 12-11-2016 08:40 PM
Hi Kevin, I was able to find a link to HDP 2.3 VM download and Polybase seems to be working now, thanks alot for your help.
Created 04-18-2017 05:57 PM
I managed to fix the error on HDP 2.5
Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742700_1894 file=/user/maria_dev/data/Test.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException: Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742700_1894 file=/user/maria_dev/data/Test.csv at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone) at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData() at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)
The issue is that the VM is not listening on Port 50010, strange I know as it's fundamental.
This link shows you have to add the port
https://hortonworks.com/hadoop-tutorial/opening-sandbox-ports-azure/
Any questions i'd be happy to help
Created 05-29-2017 06:50 PM
Hi, @James Braun, thank for the information. You can forward port 50010 using Docker, but that's only half the battle. The other half is that Polybase requires direct access to all data nodes, and the data nodes are set up to run on Docker IPs (172.17.0.2 in your case). It might work if you set up a VM and connect from the same machine, but from a different machine, you won't be able to route to 172.17.0.2. That's the issue I ran into, either using an Azure VM or VMware locally. The Polybase architecture requires that the Polybase data nodes are able directly to access all relevant Hadoop data nodes.
The best solution I've found was to install from Ambari and skip the sandbox altogether. That way, the data nodes are not Dockerized and therefore are all accessible to other machines on the network.