Support Questions

Find answers, ask questions, and share your expertise

Polybase And HDP 2.5

Explorer

I'm trying to get Polybase working with the Hortonworks Data Platform 2.5 sandbox.

The Polybase errors in the sys.dm_exec_compute_node_errors DMV boil down to:

Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone)
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read()
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData()
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)

The file I'm trying to get (/tmp/ootp/secondbasemen.csv) exists and I am able to access it through the WebHDFS interface on port 50070. I can do that through curl or through the Microsoft.Hadoop.MapReduce NuGet package. I've confirmed that the CSV file has global read access, all folders (/tmp and /tmp/ootp) have global read and execute access, and I left the "Allow All" rule on in Ranger.
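For anyone repeating that WebHDFS check, the URL follows the standard WebHDFS REST form. Here's a minimal Python sketch (the host, port, and path mirror this thread's setup and are placeholders for yours; `webhdfs_open_url` is just an illustrative helper, not part of any library):

```python
# Build the WebHDFS URL for reading a file over HTTP. Port 50070 is the
# default name node web/WebHDFS port on HDP 2.x sandboxes.
def webhdfs_open_url(host, path, port=50070):
    return f"http://{host}:{port}/webhdfs/v1{path}?op=OPEN"

url = webhdfs_open_url("sandbox.hortonworks.com", "/tmp/ootp/secondbasemen.csv")
# Fetch it with urllib.request.urlopen(url), or from the shell: curl -L "<url>"
```

Note that OPEN normally answers with a redirect to a data node's HTTP port, so a successful WebHDFS read only proves the HTTP path works, not the RPC path Polybase uses.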

Here are all of the relevant bits from logs that I could find on my Hadoop sandbox. My data node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-datanode-sandbox.hortonworks.com.log) and name node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-namenode-sandbox.hortonworks.com.log) logs did not appear to have any useful information.

http://sandbox.hortonworks.com:50070/logs/hdfs-audit.log:

2016-09-28 03:05:15,675 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=getfileinfo src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:16,088 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:39,004 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:03,502 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:38,392 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc

YARN logs show nothing at all.

The audit log shows that the pdw_user account is allowed to get info on /tmp/ootp/secondbasemen.csv and then is allowed to open the file, which it tries to do four times before the Polybase engine gives up and spits out the following error:

Msg 8680, Level 17, State 26, Line 1
Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase.
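If you need to sift through hdfs-audit.log yourself, the key=value fields split apart easily. A quick illustrative sketch (the sample line is one of the log lines above; `parse_audit_line` is a hypothetical helper, not part of Hadoop):

```python
# Pull the key=value fields (allowed, ugi, ip, cmd, src, ...) out of an
# hdfs-audit.log line; tokens without '=' (timestamp, log level) are skipped.
def parse_audit_line(line):
    fields = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

line = ("2016-09-28 03:05:16,088 INFO FSNamesystem.audit: allowed=true "
        "ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open "
        "src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc")
fields = parse_audit_line(line)
# e.g. fields["cmd"] is "open" and fields["src"] is "/tmp/ootp/secondbasemen.csv"
```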

I've tried a couple of things like rebuilding the HDP 2.5 Docker container to expose port 50010 (in case Polybase *really* needed access to the data node) and creating an explicit pdw_user account, but neither did anything.

Here are my data source and table definitions within the SQL Server external table:

CREATE EXTERNAL DATA SOURCE [HDP2] WITH
(
    TYPE = Hadoop,
    LOCATION = N'hdfs://sandbox.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050'
);
GO

CREATE EXTERNAL TABLE [dbo].[SecondBasemen]
(
[FirstName] [varchar](50) NULL,
[LastName] [varchar](50) NULL,
[Age] [int] NULL,
[Throws] [varchar](5) NULL,
[Bats] [varchar](5) NULL
)
WITH
(
    LOCATION = N'/tmp/ootp/secondbasemen.csv',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = Value,
    REJECT_VALUE = 5
);
GO

I was able to get this working under HDP 2.4, and I'd be happy to do more legwork if anyone has potential leads.

1 ACCEPTED SOLUTION

New Contributor

Hi @Kevin Feasel and @Pedro Faro

To give a bit of background on my setup and how I got it working.

1) Create HDP 2.5 VM Sandbox in Azure and make IP static.

2) Create SQL Server 2016 VM in Azure and make IP static.

3) Install putty on SQL Server 2016 vm.

4) Create putty session to Sandbox IP with the following tunnels

10000 10015 10500 11000 19888 2222 4200 50010 50070 50075 8020 8050 8080 8088 8886 8888 9995 9996

5) Create an entry in the hosts file mapping 127.0.0.1 to sandbox.hortonworks.com

6) Connect to sandbox using putty

7) Open Ambari and get copies of configs core-site, hdfs-site, hive-site, mapred-site and yarn-site from sandbox and move over to C:\Program Files\Microsoft SQL Server\MSSQL13.SQL_2016\MSSQL\Binn\Polybase\Hadoop\conf\ on SQL Server 2016 vm.

8) Restart Polybase services.

9) Follow instructions on https://hortonworks.com/hadoop-tutorial/opening-sandbox-ports-azure/ to add port 50010.

10) Use Ambari to add a test.csv file to HDFS under folder /user/maria_dev/data

11) Create external source

CREATE EXTERNAL DATA SOURCE [Data-Source-Name]

WITH ( TYPE = HADOOP, LOCATION ='hdfs://sandbox.hortonworks.com:8020', RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050' );

12) Create external file format

CREATE EXTERNAL FILE FORMAT TextFileFormat

WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR =',', USE_TYPE_DEFAULT = TRUE));

13) Create external table

CREATE EXTERNAL TABLE [dbo].[Test]

( CustomerID INT,

CustomerName NCHAR(10),

NumberOfOrders INT )

WITH (LOCATION='/user/maria_dev/data/Test.csv',

DATA_SOURCE = [Data-Source-Name],

FILE_FORMAT = TextFileFormat );

14) SELECT * FROM [dbo].[Test]

I am logged onto the SQL Server 2016 VM from step 3 onwards.

This is just for testing the functionality. Hope this helps


12 REPLIES


Hi @Kevin Feasel. Try setting authorization to "none" within the Hive configs in Ambari.

(Screenshot attached: 2016-08-23-09-34-47.png)

Explorer

Thanks for the reply, but I still get the same error message. Polybase doesn't use the Hive engine, so that would have been a bit surprising if it had been the answer.

Explorer

As an update, I ran a Wireshark trace on my VM setup with HDP 2.4 and another with 2.5. It looks like what's happening with Polybase is that the Polybase engine makes a request on port 8020, using the getBlockLocations command to ask for my file (/tmp/ootp/secondbasemen.csv in this example). The name node responds by giving the block name and a data node where that block lives. Here's where things differ a bit:

In HDP 2.4: Polybase connects to the data node (whose IP is the same as the name node's) on port 50010 and retrieves the file contents I want through a TCP stream connecting to port 50010 on my HDP sandbox.

In HDP 2.5: Polybase tries to connect to the data node, whose IP is 172.17.0.2. Every two seconds for the duration of my check, I get back the following message: "Standard query response [hex ID] No such name PTR 2.0.17.172.in-addr.arpa." So it's trying to connect directly to the data node but can't, because the node is now within a Docker container. I'll keep plugging away at this (because I want to use HDP 2.5 instead of 2.4), but if there are any workarounds or bits of advice on how to get it functional, I'd be grateful.
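The routing dead end is easy to confirm programmatically. A small sketch using Python's ipaddress module (172.17.0.0/16 is Docker's default bridge subnet, which is RFC 1918 private space):

```python
import ipaddress

# Docker's default bridge network allocates from 172.17.0.0/16 -- RFC 1918
# private space, so nothing outside the Docker host can route to it.
docker_bridge = ipaddress.ip_network("172.17.0.0/16")
datanode_ip = ipaddress.ip_address("172.17.0.2")

in_bridge = datanode_ip in docker_bridge  # the data node sits on the bridge
is_private = datanode_ip.is_private       # hence unroutable from SQL Server's box
```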

New Contributor

Hi Kevin,

Do we have a solution to this "Could not obtain block" error?

I am also getting this error while trying to query a table after setting up Polybase. I am using Hortonworks HDP 2.5 and SQL Server 2016 SP1. Any response will be highly appreciated. Thanks in advance. Aman


Explorer

Hey, Aman. The short of it is that Polybase does not currently support Dockerized Hadoop data nodes. Polybase queries hit the data nodes directly, so when you're putting data nodes on a non-routable subnet (like Docker does), the SQL Server node can't access the Hadoop data node and so queries fail.

For now, I've dropped back to the HDP 2.4 sandbox for working with Polybase. I've spoken to the relevant team at Microsoft and I'm hopeful that they'll support Dockerized nodes in the future, but I'm sure it won't be an easy task for them given the way Polybase works.

New Contributor

Hi Kevin, Thanks a lot for your response, very helpful!!

Can you please point me to where I could find the HDP 2.4 sandbox? I was looking for archives on the Hortonworks website but had no luck.

New Contributor

Hi Kevin, I was able to find a link to the HDP 2.3 VM download and Polybase seems to be working now. Thanks a lot for your help.

New Contributor

Hi @Kevin Feasel

I managed to fix the error on HDP 2.5

Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742700_1894 file=/user/maria_dev/data/Test.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException:
Could not obtain block: BP-1464254149-172.17.0.2-1477381671113:blk_1073742700_1894 file=/user/maria_dev/data/Test.csv
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone)
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read()
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData()
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)

The issue is that the VM is not listening on port 50010; strange, I know, as it's fundamental.
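A quick way to check whether anything is listening on a port like 50010 is a plain TCP connect. A minimal self-contained Python sketch (the throwaway local listener exists only so the example runs anywhere; against the sandbox you would test sandbox.hortonworks.com:50010):

```python
import socket

def port_open(host, port, timeout=2.0):
    # True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Self-contained demo: bind a throwaway listener so there is something to hit.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))        # the OS picks a free port
listener.listen(1)
port = listener.getsockname()[1]
reachable = port_open("127.0.0.1", port)
listener.close()
# For the sandbox: port_open("sandbox.hortonworks.com", 50010)
```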

This link shows how to add the port:

https://hortonworks.com/hadoop-tutorial/opening-sandbox-ports-azure/

Any questions, I'd be happy to help.

Explorer

Hi @James Braun, thanks for the information. You can forward port 50010 using Docker, but that's only half the battle. The other half is that Polybase requires direct access to all data nodes, and the data nodes are set up to run on Docker IPs (172.17.0.2 in your case). It might work if you set up a VM and connect from the same machine, but from a different machine, you won't be able to route to 172.17.0.2. That's the issue I ran into, whether using an Azure VM or VMware locally. The Polybase architecture requires that the Polybase compute nodes be able to access all relevant Hadoop data nodes directly.

The best solution I've found is to install HDP via Ambari and skip the sandbox altogether. That way, the data nodes are not Dockerized and are therefore accessible to other machines on the network.