Member since: 09-17-2014
Posts: 88
Kudos Received: 3
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1997 | 07-15-2015 08:57 PM
 | 7688 | 07-15-2015 06:32 PM
10-10-2018
10:51 AM
hi experts! there are a few storage levels that can be used for Spark persist and cache operations (https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/which_storage_level_to_choose.html). By default MEMORY_ONLY is used. From my observations, MEMORY_AND_DISK_SER may be more efficient in most of my cases, so I'd like to change the default StorageLevel. Does anyone have an idea how to do this? thanks!
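As far as I know there is no single configuration property that changes the default level used by rdd.cache()/persist(), so a minimal sketch of the usual workaround is to name the level explicitly on every persist call; the spark-shell invocation and the table path below are placeholders:
# explicit-persist workaround, run through spark-shell (path is a placeholder)
spark-shell <<'EOF'
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("/user/hive/warehouse/some_table")  // placeholder path
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)             // instead of the MEMORY_ONLY default
println(rdd.count())
EOF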
... View more
Labels:
- Apache Spark
10-10-2018
10:34 AM
hi Borg! i think it may be too late for this response 🙂 but let me try 🙂
> Are you asking how to change an active topic size or set the default for newly created topics?
An active topic's size.
> Also in terms of "size" of the topic are you referring to partitions or messages?
GBytes 🙂
> Lastly, are you referring to topics being created via the command line or newly created topics from a client?
Any 🙂
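In case it is still useful, a hedged sketch of changing the size-based retention of an existing topic with the stock tooling (the ZooKeeper host, topic name and byte value are placeholders; the script may be called kafka-configs.sh depending on the distribution):
# cap an existing topic at roughly 10 GB per partition via size-based retention
kafka-configs --zookeeper zkhost:2181 --alter --entity-type topics \
  --entity-name my_topic --add-config retention.bytes=10737418240
# the default for newly created topics comes from the broker setting log.retention.bytes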
... View more
10-08-2018
09:28 AM
hi experts! the HDFS service has a tool called the balancer, whose purpose is to ensure an even distribution of blocks across the cluster. My question is: how frequently does it kick in to check whether the cluster is imbalanced or not? Is there any way to change this frequency? thanks!
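For context, a hedged note: the balancer has no built-in schedule of its own, it only evaluates the cluster when it is invoked (from the command line or via the Rebalance action in Cloudera Manager), so the frequency is whatever schedule you give it; a minimal sketch:
# run one balancing pass; -threshold is the allowed per-DataNode utilization deviation in percent
hdfs balancer -threshold 10
# to make it periodic, wrap the command in cron or trigger it on a schedule from Cloudera Manager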
... View more
Labels:
- HDFS
10-05-2018
07:47 AM
awesome! thank you!
... View more
10-05-2018
06:08 AM
@Fawze thanks for the script... unfortunately it throws an error for me: ./users_resource_cons.sh: line 14: syntax error: unexpected end of file
... View more
10-05-2018
06:02 AM
Thank you Thomas! Do you mean some concrete charts? 🙂 I checked Cloudera Manager -> YARN -> Resource Pools; there are indeed lots of useful charts, but they show pool consumption. For example, there could be a pool root.marketing, but within this pool there could be multiple users. So I want to understand which users consume which resources.
... View more
10-05-2018
04:13 AM
So, as I can see, the solution here may be to disallow undeclared pools and not create pools that match the undesirable groups, right? thank you!
... View more
10-05-2018
12:48 AM
Thanks for the idea. So if Bob belongs to the marketing group and to the low group, and only the low pool exists, he will be mapped to that pool, right? Another question: what if I have the pools root.low and root.marketing, which one will pick up Bob, given that he belongs to both (secondary) groups?
... View more
10-04-2018
06:50 AM
Hi dear experts! I have a challenge. I have a dynamic resource pool, let's say root.marketing. Many users who belong to this pool submit jobs to it (Bob, Alice, Tom). I want to know the resource consumption of each of these users, e.g. over the last day Bob used on average 33 cores, Alice 12, Tom 118... or something like this. In other words, I want to know who consumes what within the same pool. thanks!
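One hedged way to get per-user numbers, assuming the ResourceManager REST API is reachable on its default port 8088 (the host name, user and timestamp below are placeholders): each application report carries vcoreSeconds and memorySeconds, which can be summed per user.
# sum the vcore-seconds of one user's applications finished since a given epoch-millis timestamp
curl -s 'http://rm-host:8088/ws/v1/cluster/apps?user=bob&finishedTimeBegin=1538611200000' \
  | python -c 'import sys, json; apps = json.load(sys.stdin)["apps"]["app"]; print(sum(a["vcoreSeconds"] for a in apps))'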
... View more
Labels:
- Apache YARN
10-04-2018
06:36 AM
Well, I'd like to create only 3 pools with different priorities and put different groups, of which there could be many (up to 50 different groups), into those 3 pools... I was thinking about some workaround, but nothing came to mind.
... View more
10-04-2018
05:14 AM
Thank you so much for sharing this! But I didn't find how to map the name of the primary/secondary group to a pool with a different name. In other words, I want to put a user who belongs to the Marketing group into the root.low pool, rather than into root.marketing. thanks!
... View more
10-04-2018
01:59 AM
Hi dear experts! I'm looking for a way to map a certain user group to a YARN resource pool. For example, I have a user Bob who belongs to the group QA and a user Alice who belongs to the group Dev, and in YARN I have the pools root.low, root.medium and root.high. I want to configure placement rules in such a way that all users who belong to the QA group (Bob) are mapped to the root.low pool, and everyone who belongs to Dev (Alice) is mapped to root.high. Does anyone know whether this is possible or not? thanks!
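For reference, a hedged sketch of the Fair Scheduler placement policy that sits underneath the Dynamic Resource Pools screen (in CDH these rules are normally edited in the Cloudera Manager UI rather than in fair-scheduler.xml by hand); note that the stock rules map a group to a pool with the same name, so a direct QA -> root.low mapping is not among them and needs the kind of workaround discussed later in this thread:
<queuePlacementPolicy>
  <rule name="specified" create="false"/>
  <rule name="secondaryGroupExistingQueue" create="false"/>
  <rule name="default" queue="root.low"/>
</queuePlacementPolicy>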
... View more
Labels:
- Apache YARN
07-19-2017
05:21 PM
Hi gurus! Is there any way to do the subject? thanks!
... View more
Labels:
- Apache Kafka
04-19-2016
03:29 PM
Hi dear experts! I'm trying to load data with the ImportTsv tool, like this:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dmapreduce.job.reduces=1000 -Dimporttsv.columns="data:SS_SOLD_DATE_SK, HBASE_ROW_KEY" -Dimporttsv.separator="|" -Dimporttsv.bulk.output=/tmp/store_sales_hbase store_sales /user/root/benchmarks/bigbench/data/store_sales/*
but I get only one reducer (despite the -Dmapreduce.job.reduces=1000 setting). I even set mapreduce.job.reduces=1000 cluster-wide, but I still get only one reducer. Could anybody hint how to resolve this? Thank you in advance for any input!
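A possibly relevant detail, hedged: when -Dimporttsv.bulk.output is used, the job goes through HFileOutputFormat and the reducer count generally follows the number of regions of the target table rather than mapreduce.job.reduces, so pre-splitting the table is one way to get more reducers; a minimal sketch for a freshly created table (the split points are placeholders):
# pre-split the target table so the bulk-load job gets one reducer per region
hbase shell <<'EOF'
create 'store_sales', 'data', SPLITS => ['0250000', '0500000', '0750000']
EOF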
... View more
Labels:
- Apache HBase
04-19-2016
11:17 AM
Hi dear experts! I'm trying to load data in CSV format from HDFS into HBase with ImportTsv (importtsv). It works perfectly fine when HBASE_ROW_KEY is a single CSV column, but I don't know how to create a composite HBASE_ROW_KEY (from two columns). For example, I have a CSV with 3 columns:
row1, 1, abc
row1, 2, dd
row2, 1, iop
row3, 1, kk
and a row can be uniquely identified by the first two columns. Any input will be highly appreciated!
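In case a sketch helps: ImportTsv expects the row key to be a single field, so one hedged workaround is to pre-concatenate the two key columns before the import (the paths, table name, and column family/qualifier below are placeholders):
# build a pipe-separated file whose first field is the composite key "col1_col2"
hdfs dfs -cat /user/root/input/*.csv \
  | awk -F', ' '{ print $1"_"$2"|"$3 }' \
  | hdfs dfs -put -f - /tmp/with_composite_key
# then import with the composite field as HBASE_ROW_KEY
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator='|' \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:c3 \
  my_table /tmp/with_composite_key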
... View more
Labels:
- Apache HBase
01-22-2016
05:01 PM
Hi dear expert! I'm trying to export data with the sqoop.export.records.per.statement parameter, but for some reason sqoop doesn't recognize it:
sqoop export --direct --connect jdbc:oracle:thin:@scaj43bda01:1521:orcl --username bds --password bds --table orcl_dpi --export-dir /tmp/dpi --input-fields-terminated-by ',' --lines-terminated-by '\n' -m 70 --batch -Dsqoop.export.records.per.statement=10000 -Dsqoop.export.statements.per.transaction=100
Warning: /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1168.923/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/01/22 19:59:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.1
16/01/22 19:59:38 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/01/22 19:59:38 ERROR tool.BaseSqoopTool: Error parsing arguments for export:
16/01/22 19:59:38 ERROR tool.BaseSqoopTool: Unrecognized argument: -Dsqoop.export.records.per.statement=10000
16/01/22 19:59:38 ERROR tool.BaseSqoopTool: Unrecognized argument: -Dsqoop.export.statements.per.transaction=100
I've tried removing the --direct key (the target DB is Oracle), but it also doesn't help:
sqoop export --connect jdbc:oracle:thin:@host:1521:orcl --username user --password pass --table orcl_dpi --export-dir /tmp/dpi --input-fields-terminated-by ',' --lines-terminated-by '\n' -m 70 --batch -Dsqoop.export.records.per.statement=10000 -Dsqoop.export.statements.per.transaction=100
Warning: /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1168.923/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/01/22 20:00:29 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.1
16/01/22 20:00:29 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/01/22 20:00:29 ERROR tool.BaseSqoopTool: Error parsing arguments for export:
16/01/22 20:00:29 ERROR tool.BaseSqoopTool: Unrecognized argument: -Dsqoop.export.records.per.statement=10000
16/01/22 20:00:29 ERROR tool.BaseSqoopTool: Unrecognized argument: -Dsqoop.export.statements.per.transaction=100
thank you!
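If it is still relevant, a hedged note: the -D options are generic Hadoop arguments, and Sqoop only accepts them immediately after the tool name, before the tool-specific options, so an ordering like the sketch below should get past the "Unrecognized argument" errors (connection details are the same placeholders as above):
sqoop export \
  -Dsqoop.export.records.per.statement=10000 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:oracle:thin:@host:1521:orcl \
  --username user --password pass \
  --table orcl_dpi --export-dir /tmp/dpi \
  --input-fields-terminated-by ',' --lines-terminated-by '\n' \
  -m 70 --batch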
... View more
Labels:
01-13-2016
02:10 PM
So, I've started to play with this and ran into an interesting thing. When I process data compressed with lzma, I read twice as much data as I actually have on HDFS. For example, the hadoop client (hadoop fs -du) shows something like 100GB; then I run an MR job (like select count(1)) over this data, check the MR counters, and find that "HDFS bytes read" is twice as large (like 200GB). With the gzip and bzip2 codecs, the hadoop client file size and the MR counters are similar.
... View more
01-11-2016
07:19 PM
Thank you for your reply! It seems promising, but as far as I understand it requires rebuilding your Hadoop distribution package. What if I just have the CDH package and want to plug this in as an extension (for example, like lzo does through the parcels)... thanks!
... View more
01-09-2016
05:09 PM
Hi experts! It seems that the LZMA algorithm could be pretty suitable for some Hadoop cases (like storing historical immutable data). Does someone know whether it is possible to implement it somehow or to reuse some existing library? Any ideas are very welcome! thanks!
... View more
Tags:
- compression
- HDFS
- lzma
Labels:
- Apache Hadoop
11-11-2015
02:45 PM
Hi dear experts! I'm wondering whether there is any way to force block redistribution for a particular file/directory. My case is:
1) load a file from a node that runs a DataNode process, with replication factor 1
2) increase the replication factor by executing: hdfs dfs -setrep 3 /tmp/path/to/my/file
3) check the distribution with a specific Java tool: hadoop jar FileDistribution.jar /tmp/path/to/my/file
and get:
Files distribution in directory across cluster is : {scaj31bda05.us.oracle.com=400, scaj31bda03.us.oracle.com=183, scaj31bda04.us.oracle.com=156, scaj31bda01.us.oracle.com=151, scaj31bda02.us.oracle.com=154, scaj31bda06.us.oracle.com=156}
It's obvious that the first node contains 400 blocks, while the other 400*2=800 blocks are evenly distributed across the remaining nodes. Is there any way to force block redistribution to make it even? thanks!
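A hedged sketch of the usual workaround: -setrep only adds replicas, it does not move the copies that were written locally, so rewriting the file (ideally from a host that is not a DataNode, or with distcp) lets the normal placement policy spread the blocks again; running the HDFS balancer afterwards may also help.
# rewrite the file so block placement is decided again, then swap it into place
hdfs dfs -cp /tmp/path/to/my/file /tmp/path/to/my/file.rebalanced
hdfs dfs -rm /tmp/path/to/my/file
hdfs dfs -mv /tmp/path/to/my/file.rebalanced /tmp/path/to/my/file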
... View more
Labels:
- HDFS
11-08-2015
08:05 PM
Thanks! hdfs fsck will work, but its output is hard to analyze for a big file. Maybe there is another way to get aggregate values?
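A hedged aggregation sketch, assuming a Hadoop 2.x fsck output where each replica location is printed as DatanodeInfoWithStorage[...] (the path is a placeholder):
# rough per-DataNode block count for one file
hdfs fsck /tmp/path/to/my/file -files -blocks -locations \
  | grep -o 'DatanodeInfoWithStorage\[[^,]*' \
  | sort | uniq -c | sort -rn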
... View more
11-08-2015
07:40 PM
Hi dear expert! I'm wondering whether there is any way to check file distribution among nodes in HDFS, i.e. some way to check on which nodes the blocks of a particular file or directory are placed? thanks!
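A minimal sketch with the stock tooling (the path is a placeholder): fsck can print which DataNodes hold each block of a file or directory.
hdfs fsck /tmp/path/to/my/file -files -blocks -locations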
... View more
Labels:
- HDFS
11-06-2015
11:06 AM
Hi dear experts! I'm struggling with configuring sqoop2 + Hue (3.7) + Oracle DB. I'm trying to create a connection in Hue, but I'm getting an error. I have ojdbc6.jar in the /var/lib/sqoop2/ directory (as some forums hinted):
[root@sqoop2server ~]# ll /var/lib/sqoop2/
total 7684
-rw-r--r-- 1 sqoop2 sqoop 2677451 Nov 6 12:46 derby-10.8.2.2.jar
-rw-r--r-- 1 sqoop2 sqoop2 960396 Nov 6 13:39 mysql-connector-java.jar
-rw-r--r-- 1 root root 3670975 Nov 6 13:52 ojdbc6.jar
-rw-r--r-- 1 sqoop2 sqoop2 539705 Nov 6 13:39 postgresql-9.0-801.jdbc4.jar
drwxr-xr-x 3 sqoop2 sqoop2 4096 Nov 5 21:17 repository
drwxr-xr-x 3 sqoop2 sqoop2 4096 Nov 2 19:06 repositoy
drwxr-xr-x 5 sqoop2 sqoop 4096 Nov 6 13:39 tomcat-deployment
Plus one more question: is there any way to configure Oraoop sqoop with Hue? thanks!
... View more
Labels:
- Apache Sqoop
- Cloudera Hue
09-09-2015
06:03 PM
Thank you for your reply! Could you point me at the source class where I could read about this in more detail? thanks!
... View more
09-09-2015
10:01 AM
Thank you for your reply! Just to clarify:
> stream the data via a buffered read
Is the size of this buffer defined by the io.file.buffer.size parameter? thanks!
... View more
09-08-2015
06:20 PM
Hi dear experts! I'm curious how it is possible to control the read IO size in my MR jobs. For example, I have some file in HDFS; under the hood it is stored as files in the Linux filesystem, /disk1/hadoop/.../.../blkXXX. In the ideal case each such file should be equal to the block size (128-256MB). My question is: how is it possible to set the IO size for the read operation? thank you!
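A hedged sketch, tied to the parameter mentioned later in this thread: the buffer used when streaming a block is sized by io.file.buffer.size (in bytes), and assuming the driver goes through ToolRunner/GenericOptionsParser it can be passed per job as a generic option (the jar, class and paths below are placeholders):
# run an MR job with a 128 KB stream read buffer
hadoop jar my-job.jar com.example.MyDriver -Dio.file.buffer.size=131072 /input /output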
... View more
Labels:
- Apache Hadoop
- HDFS
07-27-2015
02:53 AM
But in the second case I read the whole dataset just as in the first case (without any map operation), so in both cases I read the whole dataset... Regarding shuffle: I use coalesce instead of repartition, so it is supposed to avoid shuffle operations...
... View more
07-26-2015
04:47 PM
Hi dear experts! I'm exploring Spark's persist capabilities and noticed interesting behaviour of DISK_ONLY persistence. As far as I understand, the main goal is to store reusable and intermediate RDDs that were produced from permanent data (which lives on HDFS).
import org.apache.spark.storage.StorageLevel
val input = sc.textFile("/user/hive/warehouse/big_table");
val result = input.coalesce(600).persist(StorageLevel.DISK_ONLY)
scala> result.count()
……
// and repeat the command
……..
scala> result.count()
So I was surprised to see that the second iteration was significantly faster... Could anybody explain why? thanks!
... View more
Labels:
- Apache Hive
- Apache Spark
- HDFS
07-18-2015
10:35 AM
Hi everyone! I'm trying to understand sort shuffle in Spark and would very much appreciate it if someone could answer a simple question. Let's imagine:
1) I have 600 partitions (HDFS blocks, for simplicity)
2) they are placed on a 6-node cluster
3) I run Spark with the following parameters: --executor-memory 13G --executor-cores 6 --num-executors 12 --driver-memory 1G --properties-file my-config.conf
That means that on each server I will have 2 executors with 6 cores each.
4) according to my config, the reduce phase has only 3 reducers.
So, my question is how many files will be on each server after the sort shuffle:
- 12, like the number of active map tasks
- 2, like the number of executors on each server
- 100, like the number of partitions placed on this server (for simplicity I just divide 600 by 6)
And the second question is: what is the name of the buffer that stores intermediate data before it is spilled to disk during the map stage? thanks!
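A hedged way to check this empirically, assuming the NodeManager local dirs live under /yarn/nm (that path and the application id pattern are placeholders, adjust to yarn.nodemanager.local-dirs): the sort shuffle writes its per-map output as shuffle_*.data files, so counting them on one worker answers the first question for that host.
# count the sort-shuffle data files produced on one worker
find /yarn/nm/usercache/*/appcache/application_*/ -name 'shuffle_*.data' 2>/dev/null | wc -l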
... View more
Labels:
- Apache Spark
- HDFS