
Polybase And HDP 2.5

Explorer

I'm trying to get Polybase working with the Hortonworks Data Platform 2.5 sandbox.

The Polybase errors in the sys.dm_exec_compute_node_errors DMV boil down to:

Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone)
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read()
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData()
     at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)

The file I'm trying to get (/tmp/ootp/secondbasemen.csv) exists and I am able to access it through the WebHDFS interface on port 50070. I can do that through curl or through the Microsoft.Hadoop.MapReduce NuGet package. I've confirmed that the CSV file has global read access, all folders (/tmp and /tmp/ootp) have global read and execute access, and I left the "Allow All" rule on in Ranger.
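For reference, the WebHDFS check can be scripted outside of curl. This is a minimal sketch using Python's standard library; the hostname, port, path, and user are the ones from this setup, and it only builds the request URL (the commented lines actually fetch the file, which requires the sandbox to be reachable):

```python
from urllib.parse import urlencode, urlunsplit

def webhdfs_open_url(host, path, user, port=50070):
    """Build the WebHDFS OPEN URL for reading a file over HTTP.

    The name node answers on port 50070 and redirects the actual
    read to a data node (port 50075 by default).
    """
    query = urlencode({"op": "OPEN", "user.name": user})
    return urlunsplit(("http", f"{host}:{port}", f"/webhdfs/v1{path}", query, ""))

url = webhdfs_open_url("sandbox.hortonworks.com",
                       "/tmp/ootp/secondbasemen.csv", "pdw_user")
# To actually fetch the file:
#   import urllib.request
#   data = urllib.request.urlopen(url).read()
```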

Here are all of the relevant bits from logs that I could find on my Hadoop sandbox. My data node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-datanode-sandbox.hortonworks.com.log) and name node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-namenode-sandbox.hortonworks.com.log) logs did not appear to have any useful information.

http://sandbox.hortonworks.com:50070/logs/hdfs-audit.log:

2016-09-28 03:05:15,675 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=getfileinfo src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:16,088 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:39,004 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:03,502 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:38,392 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc

YARN logs show nothing at all.

The audit log shows that the pdw_user account is allowed to get info on /tmp/ootp/secondbasemen.csv and then to open the file, which it tries four times before the Polybase engine gives up and returns:

Msg 8680, Level 17, State 26, Line 1
Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase.

I've tried a couple of things, like rebuilding the HDP 2.5 Docker container to expose port 50010 (in case Polybase *really* needed access to the data node) and creating an explicit pdw_user account, but neither helped.
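Since the failure is a block read ("Could not obtain block"), it's worth checking whether the data node port (50010) is even reachable from the SQL Server machine, separately from WebHDFS on 50070. A minimal sketch in Python; the host and ports in the comments are the ones from this setup:

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the sandbox ports exposed, both of these should succeed:
#   port_reachable("sandbox.hortonworks.com", 8020)   # name node RPC
#   port_reachable("sandbox.hortonworks.com", 50010)  # data node block transfer
```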

Here are my data source and external table definitions in SQL Server:

CREATE EXTERNAL DATA SOURCE [HDP2] WITH
(
    TYPE = Hadoop,
    LOCATION = N'hdfs://sandbox.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050'
);
GO

CREATE EXTERNAL TABLE [dbo].[SecondBasemen]
(
    [FirstName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [Age] [int] NULL,
    [Throws] [varchar](5) NULL,
    [Bats] [varchar](5) NULL
)
WITH
(
    LOCATION = N'/tmp/ootp/secondbasemen.csv',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = Value,
    REJECT_VALUE = 5
);
GO

I was able to get this working under HDP 2.4, and I'd be happy to do more legwork if anyone has potential leads.

1 ACCEPTED SOLUTION

New Contributor

Hi @Kevin Feasel and @Pedro Faro

Here is a bit of background on my setup and how I got it working.

1) Create HDP 2.5 VM Sandbox in Azure and make IP static.

2) Create SQL Server 2016 VM in Azure and make IP static.

3) Install putty on SQL Server 2016 vm.

4) Create a putty session to the sandbox IP with the following tunnels:

10000 10015 10500 11000 19888 2222 4200 50010 50070 50075 8020 8050 8080 8088 8886 8888 9995 9996

5) Create entry in hosts for 127.0.0.1 mapping to sandbox.hortonworks.com

6) Connect to sandbox using putty

7) Open Ambari and get copies of configs core-site, hdfs-site, hive-site, mapred-site and yarn-site from sandbox and move over to C:\Program Files\Microsoft SQL Server\MSSQL13.SQL_2016\MSSQL\Binn\Polybase\Hadoop\conf\ on SQL Server 2016 vm.

8) Restart Polybase services.

9) Follow instructions on https://hortonworks.com/hadoop-tutorial/opening-sandbox-ports-azure/ to add port 50010.

10) Use Ambari to add a test.csv file to HDFS under folder /user/maria_dev/data

11) Create external source

CREATE EXTERNAL DATA SOURCE [Data-Source-Name]

WITH ( TYPE = HADOOP, LOCATION ='hdfs://sandbox.hortonworks.com:8020', RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050' );

12) Create external file format

CREATE EXTERNAL FILE FORMAT TextFileFormat

WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR =',', USE_TYPE_DEFAULT = TRUE));

13) Create external table

CREATE EXTERNAL TABLE [dbo].[Test]

( CustomerID INT,

CustomerName NCHAR(10),

NumberOfOrders INT )

WITH (LOCATION='/user/maria_dev/data/Test.csv',

DATA_SOURCE = [Data-Source-Name],

FILE_FORMAT = TextFileFormat );

14) SELECT * FROM [dbo].[Test]

I am logged onto the SQL Server 2016 VM from step 3 onwards.

This is just for testing the functionality. Hope this helps.
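If you end up redoing step 7 often, the config copy can be scripted. A minimal sketch in Python; the file names assume the standard .xml extensions for the configs named in step 7, and the example paths in the comment are assumptions (the source is wherever you saved the Ambari downloads; the destination matches the instance path above):

```python
import shutil
from pathlib import Path

# Hadoop config files PolyBase reads (step 7 above), assuming .xml names.
CONFIGS = ["core-site.xml", "hdfs-site.xml", "hive-site.xml",
           "mapred-site.xml", "yarn-site.xml"]

def copy_configs(src: Path, dst: Path, names=CONFIGS):
    """Copy each Hadoop config file that exists in src into dst."""
    copied = []
    for name in names:
        source = src / name
        if source.exists():
            shutil.copy2(source, dst / name)
            copied.append(name)
    return copied

# Example (hypothetical source path; destination from step 7):
#   copy_configs(Path(r"C:\temp\sandbox-configs"),
#                Path(r"C:\Program Files\Microsoft SQL Server\MSSQL13.SQL_2016"
#                     r"\MSSQL\Binn\Polybase\Hadoop\conf"))
```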



New Contributor

Hi @James,

I know this post is a little old, but I'm having the same issue.

When I psping IP_of_vm:50010 I get:

The remote computer refused the network connection.

I already have port 50010 open in the NSG of the Azure VM. What else needs to be done?

Thanks,

Pedro


Explorer

That's insane and something I'd never do in any environment; kudos to you for figuring out a way to do it. Here's hoping Microsoft is able to provide a reasonable Docker-friendly Polybase experience in the future.