Created 09-28-2016 08:26 PM
I'm trying to get Polybase working with the Hortonworks Data Platform 2.5 sandbox.
The Polybase errors in the sys.dm_exec_compute_node_errors DMV boil down to:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv
Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException: Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv
   at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone)
   at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DataReader.ExternalMoveBufferReader.Read()
   at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.ReadAndSendData()
   at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status)
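For anyone who wants to pull these errors themselves, a minimal query against that DMV is enough (just an illustration, not necessarily the exact query I ran; add whatever columns you care about):

SELECT TOP (20) *
FROM sys.dm_exec_compute_node_errors
ORDER BY create_time DESC;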
The file I'm trying to get (/tmp/ootp/secondbasemen.csv) exists and I am able to access it through the WebHDFS interface on port 50070. I can do that through curl or through the Microsoft.Hadoop.MapReduce NuGet package. I've confirmed that the CSV file has global read access, all folders (/tmp and /tmp/ootp) have global read and execute access, and I left the "Allow All" rule on in Ranger.
Here are all of the relevant bits from logs that I could find on my Hadoop sandbox. My data node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-datanode-sandbox.hortonworks.com.log) and name node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-namenode-sandbox.hortonworks.com.log) logs did not appear to have any useful information.
http://sandbox.hortonworks.com:50070/logs/hdfs-audit.log:
2016-09-28 03:05:15,675 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=getfileinfo src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:16,088 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:39,004 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:03,502 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:38,392 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
YARN logs show nothing at all.
The audit log shows that the pdw_user account is allowed to get info on /tmp/ootp/secondbasemen.csv and then is allowed to open the file, which it tries to do four times before the Polybase engine gives up and spits out the following error:
Msg 8680, Level 17, State 26, Line 1
Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase.
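As a quick sanity check that the PolyBase services themselves are healthy (i.e., that the failure really is on the Hadoop read rather than inside the engine), the service-level DMVs can be queried directly; a minimal check looks like this:

SELECT * FROM sys.dm_exec_compute_nodes;
SELECT * FROM sys.dm_exec_dms_services;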
I've tried a couple of things like rebuilding the HDP 2.5 Docker container to expose port 50010 (in case Polybase *really* needed access to the data node) and creating an explicit pdw_user account, but neither did anything.
Here are my external data source and external table definitions in SQL Server:
CREATE EXTERNAL DATA SOURCE [HDP2] WITH
(
    TYPE = Hadoop,
    LOCATION = N'hdfs://sandbox.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050'
);
GO
CREATE EXTERNAL TABLE [dbo].[SecondBasemen]
(
    [FirstName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [Age] [int] NULL,
    [Throws] [varchar](5) NULL,
    [Bats] [varchar](5) NULL
)
WITH
(
    LOCATION = N'/tmp/ootp/secondbasemen.csv',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = Value,
    REJECT_VALUE = 5
);
GO
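For completeness, the TextFileFormat referenced above is just a stock comma-delimited external file format; a typical definition looks like this (shown for reference, in case anyone wants to reproduce the setup):

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE)
);
GO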
I was able to get this working under HDP 2.4, and I'd be happy to do more legwork if anyone has potential leads.
Created 05-27-2017 11:32 PM
Hi @James,
I know this post is a little old, but I'm having the same issue.
When I psping IP_of_vm:50010, I get:
The remote computer refused the network connection.
I already have port 50010 open in the NSG of the Azure VM.
What else do I need to do?
Thxs
Pedro
Created 05-30-2017 08:00 AM
Hi @Kevin Feasel and @Pedro Faro
To give a bit of background on my setup, here is how I got it working:
1) Create the HDP 2.5 Sandbox VM in Azure and make its IP static.
2) Create a SQL Server 2016 VM in Azure and make its IP static.
3) Install PuTTY on the SQL Server 2016 VM.
4) Create a PuTTY session to the sandbox IP with the following tunnels:
10000 10015 10500 11000 19888 2222 4200 50010 50070 50075 8020 8050 8080 8088 8886 8888 9995 9996
5) Create an entry in the hosts file mapping sandbox.hortonworks.com to 127.0.0.1.
6) Connect to the sandbox using PuTTY.
7) Open Ambari and get copies of the core-site, hdfs-site, hive-site, mapred-site, and yarn-site configs from the sandbox, then copy them to C:\Program Files\Microsoft SQL Server\MSSQL13.SQL_2016\MSSQL\Binn\Polybase\Hadoop\conf\ on the SQL Server 2016 VM.
8) Restart the Polybase services.
9) Follow the instructions at https://hortonworks.com/hadoop-tutorial/opening-sandbox-ports-azure/ to add port 50010.
10) Use Ambari to add a test.csv file to HDFS under the folder /user/maria_dev/data.
11) Create the external data source:
CREATE EXTERNAL DATA SOURCE [Data-Source-Name]
WITH ( TYPE = HADOOP, LOCATION ='hdfs://sandbox.hortonworks.com:8020', RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050' );
12) Create the external file format:
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR =',', USE_TYPE_DEFAULT = TRUE));
13) Create the external table:
CREATE EXTERNAL TABLE [dbo].[Test]
( CustomerID INT,
CustomerName NCHAR(10),
NumberOfOrders INT )
WITH (LOCATION='/user/maria_dev/data/Test.csv',
DATA_SOURCE = [Data-Source-Name],
FILE_FORMAT = TextFileFormat );
14) SELECT * FROM [dbo].[Test]
I am logged onto the SQL Server 2016 VM from step 3 onwards.
This is just for testing the functionality. Hope this helps.
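One setting that isn't called out in the numbered steps above: PolyBase's instance-level 'hadoop connectivity' value also has to match the Hadoop distribution, and the Polybase services need another restart after changing it. For HDP 2.x on Linux that should be option 7 (double-check the option value against the PolyBase connectivity configuration documentation for your SQL Server build):

EXEC sp_configure 'hadoop connectivity', 7;
GO
RECONFIGURE;
GO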
Created 06-05-2017 09:01 PM
That's insane and something I'd never do in any environment; kudos to you for figuring out a way to do it. Here's hoping Microsoft is able to provide a reasonable Docker-friendly Polybase experience in the future.