Member since: 09-23-2015
Posts: 800
Kudos Received: 897
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2089 | 08-12-2016 01:02 PM
 | 1233 | 08-08-2016 10:00 AM
 | 1214 | 08-03-2016 04:44 PM
 | 2543 | 08-03-2016 02:53 PM
 | 707 | 08-01-2016 02:38 PM
07-15-2016
11:06 AM
There is a control in Ambari under HDFS, right at the top, for the NameNode heap. You should not set the same heap size for all components, since most need much less memory than the NameNode. For the NameNode a good rule of thumb is 1 GB of heap per 100 TB of data in HDFS (plus a couple of GB as a base, so 4-8 GB minimum), but it needs to be tuned based on workload. If you suspect your memory settings are insufficient, look at the memory and GC behaviour of the JVM.
07-15-2016
10:09 AM
3 Kudos
Ambari applies some heuristics, but it never hurts to double-check.
- DataNode: 1 GB works, but for typical nodes (16 cores, 12-14 drives, 128-256 GB RAM) I normally set it to 4 GB.
- NameNode: depends on the HDFS size. A good rule of thumb is 1 GB of heap per 100 TB of data in HDFS.
- Spark: this depends entirely on your Spark needs (not talking about the history server, but the defaults for executors). The more power you need, the more executors and the more RAM in them (up to 32 GB per executor is apparently good).
- YARN: Ambari's heuristics are decent, but I normally tune them anyway. In the end you should adjust the sizes until your cluster has good CPU utilization, but that is a more elaborate exercise and depends on your use cases.
07-15-2016
09:55 AM
@mike harding Hmmm, I think your cluster has bigger problems. It shows that only 2 of the 700 tasks are actually running. How many slots do you have in your cluster, i.e. what are the YARN min/max container settings and the total memory in the cluster? The tasks still shouldn't take that long, because each should only get 1/700th of the data, so something is just weird. I think you might have to call support.
07-13-2016
11:20 AM
@mike harding Yeah, the problem is that Tez tries to create an appropriate number of map tasks. Too many is wasteful because of per-task overhead; too few and tasks are slow. Since your data volume is tiny, it essentially creates one task. However, since you use a cartesian join and blow up your 50k rows into 2.5 billion computations, that is not appropriate in your case. You need at least as many tasks as you have task slots in your cluster to properly utilize your processing capacity. To increase the number of tasks you have multiple options:

a) My option with the grouping parameters. You can tell Tez to use more mappers by changing the grouping parameters (essentially telling Tez how many bytes each map task should process). However, this is a bit of trial and error, since you need to specify a byte size per mapper, and at these small data volumes it gets a bit iffy.

b) Gopal's approach using a reducer. This is more elegant: instead of doing the join in the mappers, you add a shuffle before it and set the number of reducers to the number you want (Gopal used 4096, but your cluster is small, so I would use less). You add a subselect with a SORT BY, essentially forcing a reduce step before your computation. In the end you will have Map (1) -> Shuffle -> Reducer (SORT BY, 4096 tasks) -> your spatial computations -> Reduce (group by aggregate). A rough sketch follows below. Makes sense?
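A minimal sketch of option b), assuming hypothetical tables point_a/point_b and a dist() UDF (all names are illustrative; pick a reducer count roughly matching your task slots):

-- fix the number of reducers used by the SORT BY stage
set mapred.reduce.tasks=24;

select a.id, b.id
from (select * from point_a sort by id) a,   -- the SORT BY subselect forces a shuffle/reduce step
     point_b b
where dist(a.lat, a.lon, b.lat, b.lon) < 1000;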
07-13-2016
11:07 AM
1 Kudo
My first question would be: why do you want to do that? If you want to manage your cluster, you would normally install something like pssh, Ansible, or Puppet and use that to manage it. You put that on one control node, define a list of servers, and can then move data or execute scripts on all of them at the same time. You can also do something very simple like that with a one-line ssh loop.

To execute a script on all nodes: for i in server1 server2; do echo $i; ssh $i "$1"; done

To copy a file to all nodes: for i in server1 server2; do scp "$1" $i:"$2"; done

(Both require passwordless ssh from the control node to the cluster nodes.)

If, on the other hand, you want to ship job dependencies, something like the distributed MapReduce cache is normally a good idea. Oozie provides the <file> tag to upload files from HDFS into the execution directory of the job. So honestly, if you go into more detail about what you ACTUALLY want, we might be able to help more.
07-13-2016
08:04 AM
2 Kudos
In general the normal (thick) client is faster, unless your client server is heavily CPU constrained. The PQS is essentially a proxy in front of the thick client, so you add extra work and latency (especially with larger result sets, which the thick client reads in parallel from all regions but which have to be pulled through a single pipe from the PQS). If you can connect the clients directly to the region servers and the clients are not heavily CPU constrained, I would go with the normal client.
07-12-2016
11:57 PM
Good to have the higher power double-check 😉. The SORT BY is a nice trick, I will use that from now on. I didn't actually see the sum; I thought he returned a list of close coordinates.
07-12-2016
11:13 AM
5 Kudos
"from a, b" Unless I am mistaken ( I didn't look at it in too much detail ) You are doing a classic mistake. A cartesian join. There is no join condition in your join. So you are creating: 45,000 * 54000 = 2000000000 rows. This is supposed to take forever. If you want to make this faster you need to join the two tables by a join condition like a key or at the very least restrict the combinations a bit by joining regions with each other. If you HAVE to compare every row with every row you need to somehow get him to distribute data more. You can decrease the split size for example to make this work on more mappers. Tez has the min and max group size parameters to say how much data there will be in a mapper. You could reduce that to something VERY small to force him to split up the base data set into a couple rows each. But by and large cartesian joins are just mathematically terrible. You seem to be doing some geodetic stuff so perhaps you could generate a set of overlapping regions in your data set and make sure that you only compare values that are in the same regions there are a couple tricks here. Edit: a) If you HAVE to compare all rows with all rows. 2b comparisons are not too much for a cluster. Use the following tez parameters to split up your computations into a lot of small tasks. Per default this is up to a GB but you could force it to 100 kb or so and make sure you get at least 24 mappers or so out of it: tez.grouping.max-size tez.grouping.min-size b) Put a distance function in a join condition where distance < 1000 or so. So you discard all unwanted rows before the shuffle. He still needs to do 2b computations but at least he doesn't need to shuffle 2b rows around. c) if all of that fails you need to become creative. One thing I have seen for distance functions is to add quadratic regions of double your max. distance length to your data set ( overlapping ) assign them to the regions and only join them if the two points at least belong to one of the regions.
07-11-2016
05:16 PM
3 Kudos
Hmmm, an access issue? If it is in the root folder, only root can access it. Hive error messages can be pretty generic here and do not distinguish between missing access rights and missing files. If you run beeline, the read is executed by the Hive server, which runs under the hive user, so only that user could access the data. Try it in /tmp; if it works there, you know.
07-11-2016
04:28 PM
1 Kudo
I think the majority of people do not use ssh fencing at all. The reason is that NameNode HA works fine without it. The only issue can be that during a network partition, old client connections to the NameNode that is now supposed to be standby might still exist and return stale data on read-only operations.
- It cannot do any write transactions, since the JournalNode majority prohibits that.
- Normally, if ZKFC works correctly, an active NameNode does not go into zombie mode; it is either dead or it is not.
So the chances of a split brain are low and the impact is pretty limited. If you do use ssh fencing, the important part is that your script must not block, otherwise the failover is stopped; all scripts need to return in a sensible amount of time even if the node is not reachable. Fencing is by definition only an attempt, since most of the time the node is simply down, and the scripts need to return success in the end. So you need to fork with a timeout and then return true. https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Verifying_automatic_failover
07-11-2016
04:20 PM
1 Kudo
If you have a general location, you can only use time-based scheduling (every 15 minutes, for example); you cannot use retention or handle late arrivals etc. All the advanced "wait for data to arrive" features of Oozie are essentially out. It essentially mimics the datasets in Oozie: https://oozie.apache.org/docs/4.2.0/CoordinatorFunctionalSpec.html#a5._Dataset
07-11-2016
03:45 PM
1 Kudo
You need to look into the logs, most likely the YARN logs of the map task of your Oozie launcher. They contain the Sqoop command execution and any errors you would normally see on the command line. You can get them from the ResourceManager UI (click on your Oozie launcher job and drill through to the map task) or with yarn logs -applicationId <application id>. Any issues in the actual data transfer show up in the MapReduce job that Sqoop kicks off, which is a separate job.
07-11-2016
03:42 PM
1 Kudo
Pretty sure it's a community feature and not supported yet, even if it has found its way into the release. It is also very limited right now. Can you really replace your scheduling with it? It still seems to need a bit of time in the oven.
07-11-2016
10:31 AM
Regarding how: refer to Sunile. Pig is nice and flexible, Hive is good if you know SQL and your RFID data is already basically in a flat table format, and Spark also works well... But the question is whether you really want to process 100 GB of data on the sandbox. The memory settings are tiny, there is a single drive, data is not replicated... If you do it like this, you could just as well use Python on a local machine. If you want a decent environment, you might want to set up 3-4 nodes on a VMware server, perhaps with 32 GB of RAM each. That would give you a nice little environment, and you could actually do some fast processing.
07-11-2016
09:50 AM
@Sunile Manjee yeah, it works; Alex tested it as well. Falcon does not point to one RM. It goes to the yarn-site.xml, finds the RM pair that contains the one you specified, and then tries both.
07-07-2016
11:49 AM
Yeah, not sure why they call it multiple times. I think the record reader classes are simply instantiated multiple times during split creation and other steps, for various reasons. In the end your code needs to be able to survive empty calls and avoid duplicating connection objects; I know of no way to prevent the extra calls. In my example I could see relatively easily whether the conf object was valid, because my config fields (the storage handler parameters) were not always present in the object. I then simply initialized the connection object once and made sure not to create it a second time.
07-06-2016
12:03 PM
You mean to exclude two columns? That one would definitely work: (id1|id2)?+.+ Your version says: id1 once or not at all, followed by id2 once or not at all, followed by anything else. So it should work too, I think.
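For reference, a minimal sketch against a hypothetical table t with columns id1, id2, name, age (remember that quoted identifiers have to be disabled first):

set hive.support.quoted.identifiers=none;

-- selects every column of t except id1 and id2
select `(id1|id2)?+.+` from t;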
07-05-2016
09:45 PM
3 Kudos
It depends a bit on what you do with it. Each time you send the data over the network, a native datatype will be much faster: 999999999 is 4 bytes as an integer but about 10 bytes as a string (characters plus length). That holds for shuffles and any operation that needs to copy data (although some operations like a map might not be affected, since in Java the strings are not necessarily copied if you only assign them). It also costs more RAM in the executor, which can be a considerable factor in Spark when doing aggregations/joins/sorts etc. And finally, when you actually need to do computations on these columns, you have to cast them, which is a fairly costly operation you could otherwise save yourself. So depending on your use case, the performance and RAM difference can be anywhere between insignificant and considerable.
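As a tiny illustration in Spark SQL, assuming a hypothetical table events whose amount column is stored as a string:

-- the cast has to run on every row before the aggregation can use the value
select id, sum(cast(amount as int)) as total
from events
group by id;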
07-05-2016
07:26 PM
3 Kudos
One common problem in SQL is that when you join two tables you get duplicate column names from the two tables. When you then, for example, want to create a CTAS, you get "duplicate column name" errors. You also often want to exclude the join key from the result set, since it is by definition duplicated. Database schemas often prefix column names with a letter derived from the table to fix at least the first issue, like TPC-H: lineitem.l_key and orders.o_key. A common approach is to explicitly specify all column names in your SELECT list. However, Hive has a cool/dirty trick up its sleeve to make this easier: regular expressions to specify column names.

Test setup:

describe tb1;
id int
name string

describe tb2;
id int
age int

You then have to disable the use of quotes in identifiers, because that interferes with the regex:

set hive.support.quoted.identifiers=none;

Now you can use a Java regex to select all columns from the right table except the key, i.e. all columns but the duplicate. If you have non-join-key columns that are duplicated, you can exclude them too and re-add them renamed with an AS clause (there is a small sketch of that at the end of this article):

create table tb3 as
select tb1.*, tb2.`(id)?+.+`
from tb1, tb2
where tb1.id = tb2.id;

You can see that I select all columns from the left table and then use backquotes to specify a regular expression for the columns I want from the right side. The regex essentially asks for any string unless it starts with id (it means "the string id once or not at all (?+), followed by any string"). If the whole column name is id, it is not matched, because the remainder of the regex still needs to match something. You could also specify multiple columns: (id|id2)?+.+ or (id*)?+.+. This gives a result table with all columns from the left table and all columns but the key column from the right table:

describe tb3;
id int
name string
age int

Hive is really cool.
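As a quick illustration of the rename trick, a sketch assuming tb2 also had a duplicate non-key column called name: exclude it via the regex and re-add it under a new alias (this relies on hive.support.quoted.identifiers=none from above):

create table tb4 as
select tb1.*, tb2.`(id|name)?+.+`, tb2.name as tb2_name
from tb1, tb2
where tb1.id = tb2.id;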
Tags: Data Processing, Hive, How-To/Tutorial, sql
07-05-2016
04:14 PM
9 Kudos
I was about to say it's not possible. I assume both tables have a column with the same name, which is one of the reasons database schemas often prefix column names with a letter derived from the table, like TPC-H: lineitem.l_key and orders.o_key. In that case you would have to suck it up and name all columns in the join using AS. However, it looks like Hive has some cool/dirty tricks up its sleeve: regular expressions to specify column names. Here is my setup:

hive> describe tb1;
id int
name string

hive> describe tb2;
id int
age int

You then have to disable the use of quotes in identifiers, because that interferes with the regex:

hive> set hive.support.quoted.identifiers=none;

And now you can use a Java regex to select all columns from the right table except the key, i.e. all columns but the duplicate. If you have non-join-key columns that are duplicated, you can exclude them and rename them with AS afterwards:

hive> create table tb3 as select tb1.*, tb2.`(id)?+.+` from tb1, tb2 where tb1.id = tb2.id;

You can see that I select all columns from the left table and then use backquotes to specify a regular expression for the columns I want from the right side. The regex is a mean trick, essentially asking for any string unless it starts with id (it says "id once or not at all, followed by something"). If the whole column name is id, it is not matched, because the remainder of the regex still needs to match something. You could also do (id|id2)?+.+ or (id*)?+.+. This gives me a result table with all columns from the left table and all columns but the key column from the right table:

hive> describe tb3;
id int
name string
age int

Neat. You never stop learning something new. Hive is really cool.

Edit: I actually made a little article out of the question, because these regexes would have made my life much easier multiple times before, so giving them a bit more attention seems to make sense: https://community.hortonworks.com/articles/43510/excluding-duplicate-key-columns-from-hive-using-re.html
07-05-2016
02:23 PM
4 Kudos
The Oozie installation makes changes to the core-site.xml. Specifically, the oozie user needs to be able to impersonate other users so it can kick off a job as the user who owns the Oozie workflow. HDFS lets you configure such powerful users via the proxyuser settings, so to enable them you need to restart HDFS and YARN: hadoop.proxyuser.oozie.groups and hadoop.proxyuser.oozie.hosts.
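For reference, a minimal sketch of what typically ends up in core-site.xml; the wildcard values below are illustrative, and you can restrict them to the Oozie server host and specific groups instead:

hadoop.proxyuser.oozie.hosts=*
hadoop.proxyuser.oozie.groups=*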
07-05-2016
12:59 PM
1) Normally MapReduce reads and creates very large amounts of data. The framework is also parallel, and failed tasks can be rerun, so until all tasks have finished you cannot be sure what the output is. You can obviously write a program that returns data to the caller directly, but this is not the norm. Hive, for example, writes files to a temporary directory, and the HiveServer then uses the HDFS client to read the results. In Pig you have the options to store (save in HDFS) or dump (show on screen) data, but I am not sure whether Pig also uses a temp file here. In MapReduce you can do whatever you want. 2) MapReduce is used when you want to run computations in parallel on the cluster, which is why Pig/Hive utilize it. You can also just read the data directly using the client; in that case, however, you have a single-threaded read.
07-05-2016
11:16 AM
1 Kudo
Not sure what you mean by chunk. Essentially a stream of data is piped into the HDFS write API; every 128 MB a new block is created internally, and inside each block the client buffer sends data whenever a network packet is full (64 KB or so). So essentially a 1 GB file is written into the HDFS API as follows:
- Block1 is created on (ideally the local) node1, with copies on node2 and node3.
- Data is streamed into it in 64 KB packets from the client to node1. Whenever the datanode receives a 64 KB packet, it writes it to disk into the block, tells the client that the write was successful, and at the same time sends a copy to node2.
- node2 writes the packet to its replica of the block and sends the data on to node3.
- node3 writes the packet to its block on disk.
- The next 64 KB packet is sent from the client to node1, and so on.
- Once 128 MB are full, the next block is created.
The write is successful once the client has received notification from node1 that it successfully wrote the last block. If node1 dies during the write, the client rewrites the blocks on a different node.
07-05-2016
10:45 AM
7 Kudos
1. Where does this splitting of the huge file take place? A client is a (mostly Java) program using the HDFS filesystem API to write a file to HDFS. This can be the hadoop command line client or a program running in the cluster (like MapReduce, ...). In that case each mapper/reducer that writes to HDFS writes one file (you may have seen MapReduce output folders that contain part-0000, part-0001 files; these are the files written by each mapper/reducer, and MapReduce treats such a folder as if it were one big file). If you write a file with a client (let's say 1 TB), the file is written into the API and transparently chunked into 128 MB blocks by it.

2. Does the client form 3 pipelines for each block to replicate, which run in parallel? No, it's a chain. The block is committed once it is persisted on the first node, but it is written to the other two nodes in parallel along a chain: Client -> node1 -> node2 (different rack) -> node3 (same rack as node2).

3. Does DN1, which received B1, start sending the data to DN2 before 128 MB of its block is full? Yes, the HDFS API writes in buffered packets (I think 64 KB or so), and every packet is forwarded straight through.

"Doesn't that contradict the replication principle where we get the complete block of data and then start replicating?" I have never heard of that replication principle, and it is definitely not true in HDFS. A file doesn't even need three copies to be written successfully; a put operation succeeds once the data is persisted on ONE node. The NameNode makes sure that the correct replication level is reached eventually.

"Can you also provide the possible reasons why the flow is not the other way?" Because it's faster. If you had to wait for three copies sequentially, a put would take much longer to succeed. A lot in HDFS is about efficiency.
07-04-2016
11:42 AM
1 Kudo
OK, for an OLTP requirement with user lookups you are looking at HBase/Phoenix in the HDP distribution (other possibilities would be Cassandra, Gemfire, ...). HBase is a NoSQL datastore that provides a simple get/put/scan API on flat tables with a unique primary key and an arbitrary number of fields. Phoenix is a SQL layer on top of HBase, so it is very fast for key lookups/inserts of single rows and can also do aggregations over thousands to millions of rows very efficiently. Both scale out very well, since you can add region servers dynamically. Phoenix can also maintain secondary indexes, which might be helpful in your scenario (you can obviously do that directly in HBase by adding a translation table and maintaining it yourself).
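A minimal Phoenix sketch, assuming a hypothetical users table keyed by user id with a secondary index on email (all names are illustrative):

-- Phoenix table backed by HBase; the primary key becomes the HBase row key
create table if not exists users (
    user_id    varchar not null primary key,
    email      varchar,
    last_login timestamp
);

-- secondary index so lookups by email avoid a full table scan
create index if not exists users_email_idx on users (email);

-- point lookup on the key: a fast single-row get
select * from users where user_id = 'u-42';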
07-01-2016
01:06 PM
38 digits is the max for Decimal. You cannot do Decimal(38,2), but you could do Decimal(36,2) or Decimal(10,10) or whatever.
07-01-2016
11:17 AM
5 Kudos
The reason is that a distributed transactional system with Paxos or a similar algorithm needs a quorum (majority): a transaction is committed once more than 50% of the nodes confirm it. You could run 4 JournalNodes / ZooKeeper nodes as well, but you would not get any benefit over 3 nodes and you would add overhead: 4 nodes can still only survive 1 failed node, because 3 nodes are the smallest majority. Therefore you use an odd number. 3 nodes can survive 1 failure, 5 nodes can survive 2 failures, 7 nodes can survive 3 failures, and so on. https://en.wikipedia.org/wiki/Paxos_(computer_science)
06-30-2016
05:17 PM
2 Kudos
You should be able to simply cache your access object in a class variable and only create it if it has not been created already. You just need to be a bit careful, since the functions are sometimes called with empty conf objects first. That is what the JDBC storage handler does.
06-30-2016
02:42 PM
2 Kudos
This is something you should discuss with a sales person, especially licence costs: http://hortonworks.com/contact-us/ The big advantage of HDP is that it is 100% open source and always includes the full Apache version of all components. The components are also standard open Hadoop, so there is no lock-in like you have with the proprietary components of MapR (the MapR file system and their other proprietary components). The second big advantage of Hortonworks is that we have the widest range of committers across the different components, so you get the best support and the most influence when adding features. A simple feature-by-feature comparison doesn't really capture these fundamental differences.
06-30-2016
10:45 AM
2 Kudos
As said, it's a question of priorities. Connectors normally go from a community project into the actual product. Once they are in, they have to be tested and supported, and often need changes such as Kerberos support, so including something is not zero effort. It is just lower on the priority list than the Kafka connection, which is the standard approach for ingesting realtime data into a big data environment. If you need the integration, you could post an idea on the community board; our product management teams read them. Regarding the machine learning inquiry, please open a new question; that would allow people who have tried this before to see it more easily. Sorry for not being of more help 🙂