Created 02-18-2016 09:16 AM
I noticed the following problem with HDP Sandbox 2.3.1:
My hardware is a MacBook Pro with 8 GB of memory and a 256 GB SSD, running OS X El Capitan. I run a few hundred Pig Latin Hadoop jobs (on Tez) per day (I am testing a script with a 180 MB data sample), and I noticed that Pig Latin loses a column, or several columns. This happens after a week or two of active testing with the same HDP Sandbox installation. I checked everything: the data is correct and the position of the field is correct, but I get empty results if I try to access one column. The column is of type chararray, and even if I take all columns without any filtering, the whole column is gone. When I reinstall HDP and run the same Pig Latin script without changing anything (from the Hue Pig Latin editor), everything is fine and the column is there as it should be. So the question is: is there some sort of SQL database or similar store that Pig uses for schemas, which fills up and causes some of the information to be lost when you "heavily use" the sandbox environment?
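One way to double-check the "data is correct, position of field is correct" part outside of Pig is to count empty values in that column directly in the sample file. The following is only a minimal sketch: it assumes the sample is a tab-delimited text file (tab is the default delimiter of PigStorage), and the function name `count_empty` and the file name in the usage note are hypothetical.

```python
def count_empty(lines, col, delimiter="\t"):
    """Return (total_rows, rows_where_column_is_empty_or_missing).

    `lines` is any iterable of text lines, `col` is the zero-based
    position of the field to inspect. Assumes a simple delimited
    format with no quoting or escaping.
    """
    total = 0
    empty = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip completely blank lines
        total += 1
        fields = line.split(delimiter)
        # A row that is too short counts as "empty" for this column.
        if col >= len(fields) or fields[col].strip() == "":
            empty += 1
    return total, empty
```

For example, `count_empty(open("sample.tsv", encoding="utf-8"), 4)` would report how many rows have nothing in the fifth field. If that count is zero while the Pig output for the same column is empty, the input file itself is probably not what changed.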
Created 02-18-2016 10:45 AM
Hi @petri koski, well, I'm sorry, but as you can imagine the sandbox is not of industrial strength 🙂 It's intended for functional testing, to make sure your scripts are working as expected. Several hundred Pig jobs are a piece of cake for a small cluster, but they are heavy lifting for the sandbox on 8 GB of RAM. Now, to answer your question: I have also noticed some strange behavior of my sandbox after my Mac reboots (or crashes) while the sandbox is running (I'm using VirtualBox). Some files can get damaged, I guess, but IMHO it doesn't make sense to troubleshoot the sandbox to that level.
Created 02-18-2016 10:44 AM
@petri koski Pig is designed with no schema in mind. Pigs eat everything, schema or none. The sandbox is not designed for heavy use, because a Hadoop cluster is not designed to run on one node. You're probably filling up some directory with logs or whatnot. Errors or logs are welcome. Also, I noticed you're using the 2.3.1 sandbox; the current version is 2.3.2, and honestly a new version of the sandbox is on the way soon.
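To test the "filling up some directory" guess, a quick look at disk usage inside the sandbox VM can help. This is a minimal sketch using only the Python standard library; the helper names `usage_percent` and `dir_size` are made up for illustration, and which directories are worth checking on a given sandbox is an assumption you would need to confirm yourself.

```python
import os
import shutil


def usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return 100.0 * usage.used / usage.total


def dir_size(path):
    """Total size in bytes of regular files under `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total
```

Calling `usage_percent("/")` and then `dir_size(...)` on a suspected log directory shows whether the local disk is close to full and which directory is responsible. Note that this only looks at the local filesystem inside the VM, not at HDFS.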
Created 02-18-2016 11:18 AM
Thanks for the answers. I know the sandbox is just for testing scripts, but testing scripts needs some data, and in my opinion 180 MB of data is still a "sample" that should work fine with the sandbox. Maybe I am wrong. My guess is that the problem is some files getting corrupted (when VirtualBox is shut down or crashes). Surely Pig Latin, even if it "eats everything", needs some storage of its own to save information about the data, and that place, wherever it is, somehow gets corrupted. So again we are talking about corrupted files, lack of space, or something similar. Whatever the cause, a production cluster is a whole different thing.