That's insane and something I'd never do in any environment; kudos to you for figuring out a way to do it. Here's hoping Microsoft is able to provide a reasonable Docker-friendly Polybase experience in the future.
... View more
Hi, @James Braun, thank for the information. You can forward port 50010 using Docker, but that's only half the battle. The other half is that Polybase requires direct access to all data nodes, and the data nodes are set up to run on Docker IPs (172.17.0.2 in your case). It might work if you set up a VM and connect from the same machine, but from a different machine, you won't be able to route to 172.17.0.2. That's the issue I ran into, either using an Azure VM or VMware locally. The Polybase architecture requires that the Polybase data nodes are able directly to access all relevant Hadoop data nodes. The best solution I've found was to install from Ambari and skip the sandbox altogether. That way, the data nodes are not Dockerized and therefore are all accessible to other machines on the network.
... View more
Hey, Aman. The short of it is that Polybase does not currently support Dockerized Hadoop data nodes. Polybase queries hit the data nodes directly, so when you're putting data nodes on a non-routing subnet (like Docker does), the SQL Server node can't access the Hadoop data node and so queries fail. For now, I've dropped back to the HDP 2.4 sandbox for working with Polybase. I've spoken to the relevant team at Microsoft and I'm hopeful that they'll support Dockerized nodes in the future, but I'm sure it won't be an easy task for them given the way Polybase works.
... View more
As an update, I ran a Wireshark trace on my VM setup with HDP 2.4 and another with 2.5. It looks like what's happening with Polybase is that the Polybase engine makes a request on port 8020, using the getBlockLocations command to ask for my file (/tmp/ootp/secondbasemen.csv in this example). The name node responds by giving the block name and a data node where that block lives. Here's where things differ a bit: In HDP 2.4: Polybase connects to the data node (whose IP is the same as the name node) on port 50010 and retrieves the file contents I want through a TCP stream connecting to port 50010 on my HDP sandbox.
In HDP 2.5: Polybase tries to connect to the data node, whose IP is 172.17.0.2. Every two seconds for the duration of my check, I get back the following message: "Standard query response [hex ID] No such name PTR 126.96.36.199.in-addr.arpa." So it's trying to connect directly to the data node but can't because the node is now within a Docker container.
I'll keep plugging away at this (because I want to use HDP 2.5 instead of 2.4), but if there are any workarounds or bits of advice on how to get it functional, I'd be grateful.
... View more
I'm trying to get Polybase working with the Hortonworks Data Platform 2.5 sandbox. The Polybase errors in the sys.dm_exec_compute_node_errors DMV boil down to:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv Microsoft.SqlServer.DataWarehouse.Common.ErrorHandling.MppSqlException:
Could not obtain block: BP-61928784-172.17.0.2-1473764148444:blk_1073742592_1771 file=/tmp/ootp/secondbasemen.csv
at Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsBridgeReadAccess.Read(MemoryBuffer buffer, Boolean& isDone)
at Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.ExternalMoveReaderWorker.Execute(Object status) The file I'm trying to get (/tmp/ootp/secondbasemen.csv) exists and I am able to access it through the WebHDFS interface on port 50070. I can do that through curl or through the Microsoft.Hadoop.MapReduce NuGet package. I've confirmed that the CSV file has global read access, all folders (/tmp and /tmp/ootp) have global read and execute access, and I left the "Allow All" rule on in Ranger. Here are all of the relevant bits from logs that I could find on my Hadoop sandbox. My data node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-datanode-sandbox.hortonworks.com.log) and name node (http://sandbox.hortonworks.com:50070/logs/hadoop-hdfs-namenode-sandbox.hortonworks.com.log) logs did not appear to have any useful information. http://sandbox.hortonworks.com:50070/logs/hdfs-audit.log:
2016-09-28 03:05:15,675 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=getfileinfo src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:16,088 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:05:39,004 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:03,502 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc
2016-09-28 03:06:38,392 INFO FSNamesystem.audit: allowed=true ugi=pdw_user (auth:SIMPLE) ip=/192.168.58.1 cmd=open src=/tmp/ootp/secondbasemen.csv dst=null perm=null proto=rpc YARN logs show nothing at all. The audit log shows that the pdw_user account is allowed to get info on /tmp/ootp/secondbasemen.csv and then is allowed to open the file, which it tries to do four times before the Polybase engine gives up and spits out the following error:
Msg 8680, Level 17, State 26, Line 1
Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase. I've tried a couple of things like rebuilding the HDP 2.5 Docker container to expose port 50010 (in case Polybase *really* needed access to the data node) and creating an explicit pdw_user account, but neither did anything. Here are my data source and table definitions within the SQL Server external table:
CREATE EXTERNAL DATA SOURCE [HDP2] WITH (TYPE = Hadoop, LOCATION = N'hdfs://sandbox.hortonworks.com:8020', RESOURCE_MANAGER_LOCATION=N'sandbox.hortonworks.com:8050')GOCREATE EXTERNAL TABLE [dbo].[SecondBasemen]
[FirstName] [varchar](50) NULL,
[LastName] [varchar](50) NULL,
[Age] [int] NULL,
[Throws] [varchar](5) NULL,
[Bats] [varchar](5) NULL
WITH(LOCATION = N'/tmp/ootp/secondbasemen.csv',DATA_SOURCE = HDP2,FILE_FORMAT = TextFileFormat,REJECT_TYPE = Value,REJECT_VALUE = 5)GOLOCATION = N'/tmp/ootp/secondbasemen.csv',DATA_SOURCE = HDP2,FILE_FORMAT = TextFileFormat,REJECT_TYPE = Value,REJECT_VALUE = 5)GO I was able to get this working under HDP 2.4, and I'd be happy to do more legwork if anyone has potential leads.
... View more