Member since: 06-13-2017
Posts: 25
Kudos Received: 3
Solutions: 0
12-02-2017
04:15 PM
Usually people use HDFS, S3, or Kudu, or you can use Alluxio (formerly Tachyon) as off-heap storage, which is faster and more scalable.
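For example, with the Alluxio client jar on Spark's classpath you can read and write alluxio:// paths much like HDFS. A minimal sketch (the master host and paths below are made up):

// assumes the Alluxio client jar is on the Spark classpath and an Alluxio
// master is reachable at alluxio-master:19998 (hypothetical host and paths)
val df = spark.read.parquet("alluxio://alluxio-master:19998/data/events")
df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/data/events_curated")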
08-31-2017
05:08 AM
I don't think there is Kudu support in PySpark yet; see KUDU-1603.
08-30-2017
12:28 AM
1 Kudo
Hard to tell based on the information you provided, but see if you can increase Pentaho's memory settings (edit spoon.bat). If that doesn't work, check the memory setting of Impala's catalogd. Hope this helps.
08-30-2017
12:22 AM
There is a uuid() function in Impala that you can use to generate surrogate keys for Kudu, or you can write an Impala UDF to generate unique BIGINTs.
08-27-2017
04:51 AM
Sounds good to me.
08-22-2017
04:36 AM
I don't think this can be done in Spark alone. You have to use JDBC API style syntax (import java.sql.*) for this and wrap your DML statements inside a transaction, i.e. setAutoCommit(false), commit if everything is OK, and roll everything back if any one of the DML statements fails.
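A minimal sketch of that pattern with foreachPartition (the MySQL URL, table, and column names are made up for illustration; df is your DataFrame):

import java.sql.DriverManager

df.foreachPartition { rows =>
  // one connection and one transaction per partition
  val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "*****")
  conn.setAutoCommit(false)
  try {
    val stmt = conn.prepareStatement("INSERT INTO mytable (id, name) VALUES (?, ?)")
    rows.foreach { row =>
      stmt.setInt(1, row.getAs[Int]("id"))
      stmt.setString(2, row.getAs[String]("name"))
      stmt.executeUpdate()
    }
    conn.commit()      // everything OK: commit
  } catch {
    case e: Exception =>
      conn.rollback()  // any failure: roll back this partition's writes
      throw e
  } finally {
    conn.close()
  }
}

Note that with foreachPartition each partition gets its own transaction, so a failure only rolls back that partition; a truly all-or-nothing write across the whole DataFrame would have to go through a single connection on the driver.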
08-20-2017
05:57 PM
Hi, you might be able to do this if your destination database and its JDBC driver support transactions (rollbacks and commits).
07-28-2017
06:15 PM
1 Kudo
Not the fastest way to do it, but you can create a Hive table on top of the HBase table and use Spark JDBC to create your HBase DataFrame. You can then join that DataFrame with your Oracle DataFrame.
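A rough sketch of that approach (host names, table names, and the join key are hypothetical; spark is an existing SparkSession):

import java.util.Properties

// Hive table defined on top of the HBase table, read through the Hive JDBC driver
val hiveProps = new Properties()
val hbaseDF = spark.read.jdbc("jdbc:hive2://hiveserver2:10000/default", "hbase_backed_table", hiveProps)

// Oracle table read through the Oracle JDBC driver
val oracleProps = new Properties()
oracleProps.put("user", "scott")
oracleProps.put("password", "*****")
val oracleDF = spark.read.jdbc("jdbc:oracle:thin:@//orahost:1521/ORCL", "CUSTOMERS", oracleProps)

// join the two DataFrames on a shared key
val joined = hbaseDF.join(oracleDF, Seq("customer_id"))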
06-20-2017
07:11 PM
Impala 2.9 has several Impala-Kudu performance improvements. A partial list:
- IMPALA-4859 - Push down IS NULL / IS NOT NULL to Kudu
- IMPALA-3742 - INSERTs into Kudu tables should partition and sort
- IMPALA-5156 - Drop VLOG level passed into Kudu client - "In some simple concurrency testing, Todd found that reducing the vlog level resulted in an increase in throughput from ~17 qps to 60 qps."
Also make sure you have a large enough MEM_LIMIT and limit the number of joins in your queries. Good luck 🙂
06-19-2017
05:24 PM
The best way to deal with small files is to not have to deal with them at all. You might want to explore using Kudu or HBase as your storage engine instead of HDFS (Parquet).
06-18-2017
12:23 AM
Actual source here: https://github.com/apache/spark/blob/v2.1.1/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
More info here: https://spark.apache.org/docs/latest/streaming-programming-guide.html
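The gist of that example is roughly the following (see the linked source for the full listing):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// count words arriving on a TCP socket, in 1-second batches
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()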
06-16-2017
12:24 AM
1 Kudo
If writing to Parquet you just have to do something like:
df.write.mode("append").parquet("/user/hive/warehouse/Mytable")
And if you want to prevent the "small file" problem:
df.coalesce(1).write.mode("append").parquet("/user/hive/warehouse/Mytable")
06-15-2017
05:30 PM
You need to configure NTP correctly. "Four NTP servers is the recommended minimum. Four servers protects against one incorrect timesource, or 'falseticker'." See https://access.redhat.com/solutions/58025 for tips on configuring NTP.
06-14-2017
08:10 PM
Hi, you need to increase your YARN container memory settings in Cloudera Manager; they have to be bigger than your --executor-memory.
06-14-2017
06:32 PM
Does it have to be a sequence, or would a unique value be sufficient? If the latter, Impala has a uuid() function that you can use. Or, if a BIGINT is required, you can hash the uuid() to get a BIGINT value.
06-14-2017
06:06 PM
Simplifying the SQL statement by denormalizing (using several temp/intermediate tables) is a common way of tuning extremely large queries. Regarding your second question: there shouldn't be any difference (the way you wrote the query), but using the API gives you flexibility, such as taking advantage of caching.
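For example, with the DataFrame API you can cache an intermediate result once and reuse it across several queries (table and column names below are made up):

import spark.implicits._

// materialize a filtered intermediate result once and reuse it
val recent = spark.table("events").filter($"event_date" >= "2017-01-01").cache()
recent.createOrReplaceTempView("recent_events")

spark.sql("SELECT country, COUNT(*) AS cnt FROM recent_events GROUP BY country").show()
spark.sql("SELECT product, SUM(amount) AS total FROM recent_events GROUP BY product").show()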
06-14-2017
12:07 AM
Yes, you have to use foreachRDD. From https://stackoverflow.com/questions/44088090/spark-streaming-saving-data-to-mysql-with-foreachrdd-in-scala:

import java.util.Properties

// JDBC writer configuration
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "*****")

structuredData.foreachRDD { rdd =>
  val df = rdd.toDF() // create a DataFrame from the schema RDD
  df.write.mode("append")
    .jdbc("jdbc:mysql://192.168.100.8:3306/hadoopguide", "topics", connectionProperties)
}
06-14-2017
12:04 AM
You might have to add your GPFS libraries to your SPARK_CLASSPATH and LD_LIBRARY_PATH.
06-13-2017
11:57 PM
Hi, let me clarify: you're not able to access Kudu tables created via Impala, is that correct?
06-13-2017
11:46 PM
Just cast it back to the correct data type using selectExpr:
val convertedDF = myDF.selectExpr("id", "name", "cast(age as tinyint) age")
06-13-2017
11:44 PM
One way is to use selectExpr with a cast:
val convertedDF = joined.selectExpr("id", "cast(mydoublecol as double) mydoublecol")