Member since: 04-07-2017
Posts: 80
Kudos Received: 33
Solutions: 0
06-15-2018
02:04 PM
Hi, I want to write an HDFS file comparison in Scala using functional programming, without using Spark. To start with, I have written some code (adapted from examples found online for handling file closing and catching exceptions) to read a single file. So far I can read the first line, but the code does not loop to read the next lines. Any help please.

import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}
import scala.util.{Failure, Success, Try}

object DRCompareHDFSFiles {
  def main(args: Array[String]): Unit = {
    val hdfs = FileSystem.get(new Configuration())
    val path1 = new Path(args(0))
    val path2 = new Path(args(1))
    readHDFSFile(hdfs, path1, path2)
  }

  // Accept a parameter which implements a close method
  def using[A <: { def close(): Unit }, B](resource: A)(f: A => B): B =
    try {
      f(resource)
    } finally {
      resource.close()
    }

  def readHDFSFile(hdfs: FileSystem, path1: Path, path2: Path): Option[Stream[(String, String)]] = {
    Try(using(new BufferedReader(new InputStreamReader(hdfs.open(path1))))(readFileStream))
  } match {
    case Success(result) => {
      // I am expecting a collection of strings here but get only a single string
    }
    case Failure(ex) => {
      println(s"Could not read file $path1, detail ${ex.getClass.getName}:${ex.getMessage}")
      None
    }
  }

  def readFileStream(br: BufferedReader) = {
    for {
      line <- Try(br.readLine())
      if (line != null)
    } yield line
  }
}
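A minimal sketch of one way to make the read loop over every line, reusing the same using helper as in the question (object and method names here are illustrative, not the final compare logic): readLine() is only evaluated once inside the for comprehension above, so the reader has to be pulled repeatedly, for example with Iterator.continually, and the lines must be materialised (e.g. into a List) before the reader is closed.

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.language.reflectiveCalls
import scala.util.{Failure, Success, Try}

object ReadAllLinesSketch {
  // Loan pattern, same as the using helper in the question
  def using[A <: { def close(): Unit }, B](resource: A)(f: A => B): B =
    try f(resource) finally resource.close()

  // Reads every line of an HDFS file; toList forces evaluation before the reader is closed
  def readLines(hdfs: FileSystem, path: Path): Try[List[String]] =
    Try {
      using(new BufferedReader(new InputStreamReader(hdfs.open(path)))) { br =>
        Iterator.continually(br.readLine()).takeWhile(_ != null).toList
      }
    }

  def main(args: Array[String]): Unit = {
    val hdfs = FileSystem.get(new Configuration())
    readLines(hdfs, new Path(args(0))) match {
      case Success(lines) => lines.foreach(println)
      case Failure(ex)    => println(s"Could not read file ${args(0)}: ${ex.getMessage}")
    }
  }
}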
Labels:
- Apache Hadoop
02-14-2018
02:02 PM
Hi, could you explain how Sqoop export works internally when exporting from Hive to DB2? Does it do a load or an insert? Thank you.
Labels:
- Apache Sqoop
12-18-2017
06:31 PM
In the Sqoop user guide, it is mentioned that delimiters may be specified as: a character (--fields-terminated-by X) or an escape character (--fields-terminated-by \t). Supported escape characters include: \0 (NUL) - this will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
12-18-2017
06:27 PM
Hi, I want to load data from the HDFS directory underlying a Hive table through Sqoop into MySQL and DB2. The data uses \000 as the field delimiter. Hive accepts \000 as a delimiter, but Sqoop fails with it; Sqoop works for other control characters, as I tried \001 and \020. Could you suggest how to run Sqoop with \000 as the delimiter? This will be used for both MySQL and DB2. I couldn't find this delimiter in the Sqoop-generated .java file, so I am not sure how it is handled. Could someone explain this, please? My guess is that MySQL and DB2 treat \000 as NULL while Hive treats \N as NULL. Thank you.
Labels:
- Apache Sqoop
12-15-2017
11:12 AM
Hi, I have data in a Hive table that uses a multi-character delimiter (#|), and I now want to export this data to MySQL using Sqoop. I have tried several options, but Sqoop does not support multi-character delimiters. One of the fields is a name field that contains special characters, so to avoid a single special character I used a two-character delimiter. Is there a way in Sqoop to export with a multi-character delimiter? The Sqoop job generates a .java file in which the delimiters are declared as char. Can I modify this generated file to use a string delimiter instead and then use it in Sqoop? Please advise how this can be done. Also, if I use enclosed-by (#) and field-terminated-by (|) in Hive and then Sqoop that data, how will a value that contains | as part of the data be handled? e.g. #12345#|#ABC|def# Thank you.
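For the enclosed-by case described above, a hedged sketch of what such an export invocation might look like (the connection string, credentials, table, and export directory are hypothetical); with --input-enclosed-by, a | that appears inside the # enclosure should be parsed as data rather than as a field separator:

sqoop export \
  --connect jdbc:mysql://dbhost/dbname \
  --username dbuser -P \
  --table target_table \
  --export-dir /apps/hive/warehouse/mydb.db/src_table \
  --input-fields-terminated-by '|' \
  --input-enclosed-by '#'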
Labels:
- Apache Hadoop
- Apache Hive
- Apache Sqoop
11-26-2017
07:04 PM
I used TextInputFormat and a multi-Avro output format. This worked for me. Thank you.
11-08-2017
05:06 PM
Thank you. We do not want to use Spark for the transformation. Any details on the Avro file format?
11-07-2017
03:47 PM
Hi, I would like to load fixed-width records (delimited by \n) into a Hive table stored as Avro. I have a few questions on Avro:
1. I have seen .avro and .avsc files where the data and schema live in separate files, and I have also seen files where the schema and data are in the same file. Which approach is better for loading into an Avro-backed Hive table?
2. I would like to understand schema evolution better. From other postings I understand that using SERDEPROPERTIES is better than TBLPROPERTIES. Does schema evolution only cover adding columns at the end, or does it also include changing data types and/or adding columns in between? (A small example schema is sketched after this list.)
3. I am writing a mapper to convert the fixed-width file to a delimited file and then convert the delimited file to .avro. This is required since we have to filter a few records. Would converting the delimited file to Avro be better as a separate Java application or inside the mapper?
4. Is there any tool/utility to generate an .avsc file from a record layout (text file)?
Thank you.
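On question 2, a minimal sketch of what a backward-compatible evolution step usually looks like (the record and field names here are hypothetical): a field added with a default value, typically a nullable field defaulting to null, can still be read against older data, whereas renames or incompatible type changes are not covered by this pattern.

{
  "type": "record",
  "name": "CustomerRecord",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "segment", "type": ["null", "string"], "default": null}
  ]
}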
Labels:
- Apache Hive
10-31-2017
06:37 PM
Thanks for the suggestion. I am opting for the second solution, as the data is not one big continuous row and the first solution did not work.
10-31-2017
12:14 PM
Hi, I am not an expert in Java and am trying to analyse FixedLengthInputFormat and FixedLengthRecordReader so I can customise them for my project. I copied both classes from the GitHub link below and am testing them through a driver and mapper class: https://github.com/apache/hadoop/tree/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input
The input is fixed-length, like this:
1234abcvd123mnfvds6722
6543abcad123aewert1234
While running this I get the error "Partial record found at the end of split". The input split includes the newline characters and calculates the split length as 46 instead of 44, and therefore 3 records instead of 2. How can the newline character be excluded from the input split? I appreciate any help on this. Thank you.
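A hedged sketch of the usual workaround, assuming each data row is 22 characters followed by a single newline character: configure the record length as 23 so the newline is consumed as part of every record (the splits then divide evenly), and trim the trailing newline in the mapper before parsing the fields. The object and job names below are illustrative.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

object FixedWidthDriverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // 22 data characters + 1 newline character per record (assumption based on the sample rows)
    FixedLengthInputFormat.setRecordLength(conf, 23)
    val job = Job.getInstance(conf, "fixed-width-read")
    job.setInputFormatClass(classOf[FixedLengthInputFormat])
    // set the mapper class and input/output paths as usual; each record value will end with
    // the newline byte, which the mapper should strip before splitting the fixed-width fields
  }
}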
Tags:
- FileSystem
- Mapreduce
Labels:
- Apache Hadoop
10-18-2017
08:52 AM
It is a sequence file, not ORC. Are there similar parameters for sequence files?
10-16-2017
06:52 PM
Hi, I have a basic question on the number of mappers; this is to understand how the number of mappers is arrived at. I have a partitioned Hive table stored as sequence files, partitioned by date and cut-off number. I read three days of data, about 634 GB in size. With a 256 MB split size I expected around 2536 mappers, and the system created 2304 (close to the expectation). When I increase the split size to 1 GB, 622 mappers are created (again close to the expectation). But when I increase the split size to 2 GB, I expected around 300 mappers, yet the actual count was 512. Could you help me understand why increasing the split size further has not reduced the number of mappers? Is it because each file smaller than the logical split size still gets its own mapper, or is there something else? I would like to understand this because I have around 14 TB of data to process, which creates around 80,000 mappers, and for that job the application master is not created at all. I am analysing how to reduce the number of mappers; could you suggest any other options? The map and reduce sizes are 6 GB and 8 GB, and the application master 10 GB. One more question: would there be any difference in the number of mappers when the data is accessed through HCatalog? Thank you.
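A hedged sketch of settings that are commonly used to combine many small per-partition files into larger splits (the values are illustrative); note that a plain split never spans files, so without combining, every file smaller than the split size still produces at least one mapper:

-- combine small files into larger splits before computing mappers
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- cap combined splits at roughly 2 GB and ask for at least roughly 1 GB
SET mapreduce.input.fileinputformat.split.maxsize=2147483648;
SET mapreduce.input.fileinputformat.split.minsize=1073741824;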
Labels:
- Apache Hadoop
- Apache HCatalog
- Apache Hive
05-09-2017
08:29 AM
2 Kudos
Hi, I am trying to develop some MapReduce jobs and UDFs, and I am wondering whether there are any specific Java coding standards for Hadoop. I could not find any on Google; the Google style guide covers Java programming in general, and the coding standard on the Oracle site seems to be an old one. Could you share some details if there is anything specific to Java as used in Hadoop? Thank you.
Labels:
- Apache Hadoop
04-20-2017
07:23 PM
Hi, I want to test a custom store UDF. I took inspiration from TestStoreBase.java, which uses MiniGenericCluster to set up a dummy cluster, but when I import it in my Java code it is unable to resolve MiniGenericCluster. I am using this POM dependency:
<dependency><groupId>org.apache.pig</groupId><artifactId>pig</artifactId><version>0.8.0</version></dependency>
Could you help me resolve the MiniGenericCluster issue, or could you suggest some unit test approaches that do not use MiniGenericCluster? Please reply if more details are required. Thank you.
Labels:
- Apache Hadoop
- Apache Pig
04-07-2017
02:31 AM
Hi, I am invoking an HQL script from a .ksh script. The HQL formats data from input files (.csv). First I load the data into a table (tbl1) whose columns match the input file, then I create another table (tbl2) with the format I require and perform an INSERT ... SELECT from tbl1 into tbl2. Since a substring of the file name contains a field I need, I retrieve it using INPUT__FILE__NAME, but this returns the complete HDFS table location along with the file name because the data comes from tbl1. Is there a way to get only the table location as a parameter in HQL? I couldn't find the metastore database in core-site.xml. We are using a Cloudera cluster; is there a way to find the metastore database so that I can query the table location? Hope these details help. Thank you.
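A hedged sketch of one way to strip the directory part inside the HQL itself, assuming the needed field is a prefix of the file name (tbl1 is from the description above; the prefix length of 8 is hypothetical):

SELECT regexp_extract(INPUT__FILE__NAME, '[^/]+$', 0) AS file_name,
       substr(regexp_extract(INPUT__FILE__NAME, '[^/]+$', 0), 1, 8) AS file_key
FROM tbl1;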
Labels:
- Apache Hive
04-05-2017
05:27 PM
The error was that I was not executing the script with the -f option. Thanks, it is working.
04-05-2017
05:24 PM
Hi, I have an HQL script in which I have to SET some properties:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
SET hivevar:SCHEMA=xxx;
When I try to execute name.hql it errors with "SET: not found [No such file or directory]". How do I assign these properties in the HQL?
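That error is consistent with the resolution noted above: the script was being executed as a shell script, so the shell tried to run SET as a command. A minimal sketch of running it through the Hive CLI instead (the HiveServer2 host name is hypothetical):

hive -f name.hql
# or, through beeline
beeline -u jdbc:hive2://hiveserver-host:10000 -f name.hql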
Tags:
- Data Processing
- Upgrade to HDP 2.5.3 : ConcurrentModificationException When Executing Insert Overwrite : Hive
Labels:
- Apache Hive
04-03-2017
04:49 PM
It is working fine now, thanks. I had to use /* to point to the sub-directory.
04-03-2017
03:53 PM
I tried setting the below parameters:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
but they didn't help.
Is there a way to load the partitioned data?
04-03-2017
03:48 PM
Hi, I have Pig output that is partitioned first by date and then by customer segment (two levels of partitioning).
I want to load the data for a particular date into Hive. In the LOAD command, if I point the INPATH location at the date directory, I get the error "source contains directory". Could you suggest how to load the data from a partition directory?
I also tried creating an external table with LOCATION set to the partition folder; the table was created, but there was no data.
I want to do a LOAD because I need to remove the header and footer while loading.
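Consistent with the resolution noted above (pointing at the files with /*), a hedged sketch of what such a load might look like; the HDFS path and table name are hypothetical:

LOAD DATA INPATH '/user/etl/pig_output/20170403/*' INTO TABLE stg_customer;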
Tags:
- Data Processing
- hive-serde
- Upgrade to HDP 2.5.3 : ConcurrentModificationException When Executing Insert Overwrite : Hive
Labels:
- Apache Hive
04-03-2017
03:21 PM
Hi,
I have data in HDFS that is output by Pig. The data is partitioned first by date and then by cust_segment, and each file under a segment has a header and a footer. I want to load this data into Hive while ignoring the header and footer. I found 'org.apache.hadoop.hive.serde2.OpenCSVSerde' to remove the header; is there a similar SerDe that removes both the header and the footer?
Or could you suggest an approach to remove the footer? Both the header and the footer are single lines.
Thank you.
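One commonly used alternative to a SerDe for this, sketched here with hypothetical table, column, and path names: for text-backed tables, Hive can skip leading and trailing lines per file via table properties.

CREATE EXTERNAL TABLE stg_cust_segment (
  cust_id  STRING,
  seg_code STRING,
  amount   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/etl/pig_output/20170403/segment_a'
TBLPROPERTIES ('skip.header.line.count'='1', 'skip.footer.line.count'='1');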
Tags:
- Data Processing
- HDFS
- Hive
- hive-serde
- Pig
- Upgrade to HDP 2.5.3 : ConcurrentModificationException When Executing Insert Overwrite : Hive
Labels:
- Apache Hive
- Apache Pig
06-09-2016
06:04 PM
SBT: commons-beanutils:commons-beanutils:1.7.0:jar
SBT: commons-beanutils:commons-beanutils-core:1.8.0:jar
SBT: commons-cli:commons-cli:1.2:jar
SBT: commons-codec:commons-codec:1.10:jar
SBT: commons-collections:commons-collections:3.2.1:jar
SBT: commons-configuration:commons-configuration:1.6:jar
SBT: commons-dbcp:commons-dbcp:1.4:jar
SBT: commons-digester:commons-digester:1.8:jar
SBT: commons-el:commons-el:1.0:jar
SBT: commons-httpclient:commons-httpclient:3.1:jar
SBT: commons-io:commons-io:2.4:jar
SBT: commons-lang:commons-lang:2.6:jar
SBT: commons-logging:commons-logging:1.1.3:jar
SBT: commons-net:commons-net:3.1:jar
SBT: commons-pool:commons-pool:1.5.4:jar
SBT: io.dropwizard.metrics:metrics-core:3.1.2:jar
SBT: io.dropwizard.metrics:metrics-graphite:3.1.2:jar
SBT: io.dropwizard.metrics:metrics-json:3.1.2:jar
SBT: io.dropwizard.metrics:metrics-jvm:3.1.2:jar
SBT: io.netty:netty:3.8.0.Final:jar
SBT: io.netty:netty-all:4.0.29.Final:jar
SBT: io.spray:spray-json_2.10:1.3.2:jar
SBT: javax.activation:activation:1.1:jar
SBT: javax.inject:javax.inject:1:jar
SBT: javax.jdo:jdo-api:3.0.1:jar
SBT: javax.transaction:jta:1.1:jar
SBT: javax.xml.bind:jaxb-api:2.2.2:jar
SBT: javolution:javolution:5.5.1:jar
SBT: jline:jline:2.12:jar
SBT: joda-time:joda-time:2.9.1:jar
SBT: junit:junit:4.11:jar
SBT: log4j:apache-log4j-extras:1.2.17:jar
SBT: log4j:log4j:1.2.17:jar
SBT: net.hydromatic:eigenbase-properties:1.1.5:jar
SBT: net.java.dev.jets3t:jets3t:0.7.1:jar
SBT: net.jpountz.lz4:lz4:1.3.0:jar
SBT: net.razorvine:pyrolite:4.9:jar
SBT: net.sf.opencsv:opencsv:2.3:jar
SBT: net.sf.py4j:py4j:0.9:jar
SBT: org.antlr:antlr-runtime:3.4:jar
SBT: org.antlr:ST4:4.0.4:jar
SBT: org.antlr:stringtemplate:3.2.1:jar
SBT: org.apache.avro:avro:1.7.7:jar
SBT: org.apache.avro:avro-ipc:1.7.7:jar
SBT: org.apache.avro:avro-ipc:1.7.7:tests:jar
SBT: org.apache.avro:avro-mapred:1.7.7:hadoop2:jar
SBT: org.apache.calcite:calcite-avatica:1.2.0-incubating:jar
SBT: org.apache.calcite:calcite-core:1.2.0-incubating:jar
SBT: org.apache.calcite:calcite-linq4j:1.2.0-incubating:jar
SBT: org.apache.commons:commons-compress:1.4.1:jar
SBT: org.apache.commons:commons-csv:1.1:jar
SBT: org.apache.commons:commons-lang3:3.1:jar
SBT: org.apache.commons:commons-math3:3.4.1:jar
SBT: org.apache.commons:commons-math:2.2:jar
SBT: org.apache.curator:curator-client:2.6.0:jar
SBT: org.apache.curator:curator-framework:2.6.0:jar
SBT: org.apache.curator:curator-recipes:2.6.0:jar
SBT: org.apache.derby:derby:10.10.2.0:jar
SBT: org.apache.directory.api:api-asn1-api:1.0.0-M20:jar
SBT: org.apache.directory.api:api-util:1.0.0-M20:jar
SBT: org.apache.directory.server:apacheds-i18n:2.0.0-M15:jar
SBT: org.apache.directory.server:apacheds-kerberos-codec:2.0.0-M15:jar
SBT: org.apache.hadoop:hadoop-annotations:2.6.0:jar
SBT: org.apache.hadoop:hadoop-auth:2.6.0:jar
SBT: org.apache.hadoop:hadoop-client:2.6.0:jar
SBT: org.apache.hadoop:hadoop-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-hdfs:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-app:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-core:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-jobclient:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-shuffle:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-api:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-client:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-server-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-server-web-proxy:2.2.0:jar
SBT: org.apache.hbase:hbase-client:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-common:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-hadoop2-compat:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-prefix-tree:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-protocol:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-server:0.98.0-hadoop2:jar
SBT: org.apache.httpcomponents:httpclient:4.3.2:jar
SBT: org.apache.httpcomponents:httpcore:4.3.1:jar
SBT: org.apache.ivy:ivy:2.4.0:jar
SBT: org.apache.mesos:mesos:0.21.1:shaded-protobuf:jar
SBT: org.apache.parquet:parquet-column:1.7.0:jar
SBT: org.apache.parquet:parquet-common:1.7.0:jar
SBT: org.apache.parquet:parquet-encoding:1.7.0:jar
SBT: org.apache.parquet:parquet-format:2.3.0-incubating:jar
SBT: org.apache.parquet:parquet-generator:1.7.0:jar
SBT: org.apache.parquet:parquet-hadoop:1.7.0:jar
SBT: org.apache.parquet:parquet-jackson:1.7.0:jar
SBT: org.apache.spark:spark-catalyst_2.10:1.6.0:jar
SBT: org.apache.spark:spark-core_2.10:1.6.0:jar
SBT: org.apache.spark:spark-core_2.10:1.6.0:tests:jar
SBT: org.apache.spark:spark-hive_2.10:1.6.0:jar
SBT: org.apache.spark:spark-launcher_2.10:1.6.0:jar
SBT: org.apache.spark:spark-network-common_2.10:1.6.0:jar
SBT: org.apache.spark:spark-network-shuffle_2.10:1.6.0:jar
SBT: org.apache.spark:spark-sql_2.10:1.6.0:jar
SBT: org.apache.spark:spark-sql_2.10:1.6.0:tests:jar
SBT: org.apache.spark:spark-unsafe_2.10:1.6.0:jar
SBT: org.apache.spark:spark-yarn_2.10:1.6.0:jar
SBT: org.apache.thrift:libfb303:0.9.2:jar
SBT: org.apache.thrift:libthrift:0.9.2:jar
SBT: org.apache.xbean:xbean-asm5-shaded:4.4:jar
SBT: org.apache.zookeeper:zookeeper:3.4.6:jar
SBT: org.cloudera.htrace:htrace-core:2.04:jar
SBT: org.codehaus.groovy:groovy-all:2.1.6:jar
SBT: org.codehaus.jackson:jackson-core-asl:1.9.13:jar
SBT: org.codehaus.jackson:jackson-jaxrs:1.9.13:jar
SBT: org.codehaus.jackson:jackson-mapper-asl:1.9.13:jar
SBT: org.codehaus.jackson:jackson-xc:1.9.13:jar
SBT: org.codehaus.janino:commons-compiler:2.7.8:jar
SBT: org.codehaus.janino:janino:2.7.8:jar
SBT: org.codehaus.jettison:jettison:1.1:jar
SBT: org.datanucleus:datanucleus-api-jdo:3.2.6:jar
SBT: org.datanucleus:datanucleus-core:3.2.10:jar
SBT: org.datanucleus:datanucleus-rdbms:3.2.9:jar
SBT: org.eclipse.jdt:core:3.1.1:jar
SBT: org.eclipse.jetty.orbit:javax.servlet:3.0.0.v201112011016:jar
SBT: org.fusesource.leveldbjni:leveldbjni-all:1.8:jar
SBT: org.hamcrest:hamcrest-core:1.3:jar
SBT: org.htrace:htrace-core:3.0.4:jar
SBT: org.iq80.snappy:snappy:0.2:jar
SBT: org.jamon:jamon-runtime:2.3.1:jar
SBT: org.joda:joda-convert:1.8:jar
SBT: org.jodd:jodd-core:3.5.2:jar
SBT: org.json4s:json4s-ast_2.10:3.2.10:jar
SBT: org.json4s:json4s-core_2.10:3.2.10:jar
SBT: org.json4s:json4s-jackson_2.10:3.2.10:jar
SBT: org.json:json:20090211:jar
SBT: org.mortbay.jetty:jetty:6.1.26:jar
SBT: org.mortbay.jetty:jetty-sslengine:6.1.26:jar
SBT: org.mortbay.jetty:jetty-util:6.1.26:jar
SBT: org.mortbay.jetty:jsp-2.1:6.1.14:jar
SBT: org.mortbay.jetty:jsp-api-2.1:6.1.14:jar
SBT: org.mortbay.jetty:servlet-api-2.5:6.1.14:jar
SBT: org.objenesis:objenesis:1.2:jar
SBT: org.roaringbitmap:RoaringBitmap:0.5.11:jar
SBT: org.scala-lang:scala-compiler:2.10.0:jar
SBT: org.scala-lang:scala-library:2.10.6:jar
SBT: org.scala-lang:scala-reflect:2.10.5:jar
SBT: org.scala-lang:scalap:2.10.0:jar
SBT: org.scalamock:scalamock-core_2.10:3.2:jar
SBT: org.scalamock:scalamock-scalatest-support_2.10:3.2:jar
SBT: org.scalatest:scalatest_2.10:2.2.5:jar
SBT: org.scoverage:scalac-scoverage-plugin_2.10:1.1.1:jar
SBT: org.scoverage:scalac-scoverage-runtime_2.10:1.1.1:jar
SBT: org.slf4j:jcl-over-slf4j:1.7.10:jar
SBT: org.slf4j:jul-to-slf4j:1.7.10:jar
SBT: org.slf4j:slf4j-api:1.7.10:jar
SBT: org.slf4j:slf4j-log4j12:1.7.10:jar
SBT: org.sonatype.sisu.inject:cglib:2.2.1-v20090111:jar
SBT: org.spark-project.hive:hive-exec:1.2.1.spark:jar
SBT: org.spark-project.hive:hive-metastore:1.2.1.spark:jar
SBT: org.spark-project.spark:unused:1.0.0:jar
SBT: org.tachyonproject:tachyon-client:0.8.2:jar
SBT: org.tachyonproject:tachyon-underfs-hdfs:0.8.2:jar
SBT: org.tachyonproject:tachyon-underfs-local:0.8.2:jar
SBT: org.tachyonproject:tachyon-underfs-s3:0.8.2:jar
SBT: org.tukaani:xz:1.0:jar
SBT: org.uncommons.maths:uncommons-maths:1.2.2a:jar
SBT: org.xerial.snappy:snappy-java:1.1.2:jar
SBT: oro:oro:2.0.8:jar
SBT: sbt-and-plugins
SBT: stax:stax-api:1.0.1:jar
SBT: tomcat:jasper-compiler:5.5.23:jar
SBT: tomcat:jasper-runtime:5.5.23:jar
SBT: xerces:xercesImpl:2.9.1:jar
SBT: xml-apis:xml-apis:1.3.04:jar
SBT: xmlenc:xmlenc:0.52:jar
06-09-2016
04:44 PM
Hi Benjamin, is it required to install the Thrift plugin in IntelliJ to execute a Spark SQL application written in Scala? I have installed the Scala plugin and imported the Spark libraries. Below are the external libraries. When I run the Spark SQL/Scala project it finds the Hive warehouse directory, but I get the error message "input refs are invalid: /user/hive/warehouse/bkfs.db/asmt_02013". Could you help? Thanks!
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
spark-1.5.1-yarn-shuffle.jar
spark-assembly-1.5.1-hadoop2.4.0.jar
spark-examples-1.5.1-hadoop2.4.0.jar
< 1.8 > java-8-oracle
SBT: antlr:antlr:2.7.7:jar
SBT: aopalliance:aopalliance:1.0:jar
SBT: asm:asm:3.2:jar
SBT: com.clearspring.analytics:stream:2.7.0:jar
SBT: com.databricks:spark-csv_2.10:1.3.0:jar
SBT: com.esotericsoftware.kryo:kryo:2.21:jar
SBT: com.esotericsoftware.minlog:minlog:1.2:jar
SBT: com.esotericsoftware.reflectasm:reflectasm:1.07:shaded:jar
SBT: com.fasterxml.jackson.core:jackson-annotations:2.4.4:jar
SBT: com.fasterxml.jackson.core:jackson-core:2.4.4:jar
SBT: com.fasterxml.jackson.core:jackson-databind:2.4.4:jar
SBT: com.fasterxml.jackson.module:jackson-module-scala_2.10:2.4.4:jar
SBT: com.github.stephenc.findbugs:findbugs-annotations:1.3.9-1:jar
SBT: com.github.stephenc.high-scale-lib:high-scale-lib:1.1.1:jar
SBT: com.google.code.findbugs:jsr305:1.3.9:jar
SBT: com.google.code.gson:gson:2.2.4:jar
SBT: com.google.guava:guava:16.0.1:jar
SBT: com.google.inject.extensions:guice-servlet:3.0:jar
SBT: com.google.inject:guice:3.0:jar
SBT: com.google.protobuf:protobuf-java:2.5.0:jar
SBT: com.googlecode.javaewah:JavaEWAH:0.3.2:jar
SBT: com.iheart:ficus_2.10:1.0.2:jar
SBT: com.jolbox:bonecp:0.8.0.RELEASE:jar
SBT: com.ning:compress-lzf:1.0.3:jar
SBT: com.sun.jersey.contribs:jersey-guice:1.9:jar
SBT: com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:1.9:jar
SBT: com.sun.jersey:jersey-client:1.9:jar
SBT: com.sun.jersey:jersey-core:1.9:jar
SBT: com.sun.jersey:jersey-json:1.9:jar
SBT: com.sun.jersey:jersey-server:1.9:jar
SBT: com.sun.xml.bind:jaxb-impl:2.2.3-1:jar
SBT: com.thoughtworks.paranamer:paranamer:2.6:jar
SBT: com.twitter:chill-java:0.5.0:jar
SBT: com.twitter:chill_2.10:0.5.0:jar
SBT: com.twitter:parquet-hadoop-bundle:1.6.0:jar
SBT: com.typesafe.akka:akka-actor_2.10:2.3.11:jar
SBT: com.typesafe.akka:akka-remote_2.10:2.3.11:jar
SBT: com.typesafe.akka:akka-slf4j_2.10:2.3.11:jar
SBT: com.typesafe:config:1.2.1:jar
SBT: com.univocity:univocity-parsers:1.5.1:jar
SBT: com.yammer.metrics:metrics-core:2.1.2:jar
05-26-2016
03:05 PM
Hi Robert, thanks for the details. Could you help me with reading a Parquet file?
I have loaded some data into Hive, and to validate it I ran the TopNotch script (https://github.com/blackrock/TopNotch), which uses Spark SQL. The script wrote the bad records to a fileName.gz.parquet file in HDFS under my home directory. Now I want to read/see these invalid records. I tried the script above, but it fails.
Can the script above be used to read data from a Parquet file?
val newDataDF = sqlContext.read.parquet("/user/user1/topnotch/part-r-00000-1513f167-1c5a-4ca8-bb08-6b7cb70a64dc.gz.parquet")
The line above throws a "not found" error.
I want these invalid records loaded into a Hive table for querying:
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
Thank you.
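A minimal end-to-end sketch of reading that Parquet output and persisting it into a Hive table, assuming a Spark 1.5-era spark-shell built with Hive support where sc and sqlContext are already available; the target database and table name are hypothetical:

// read the bad-record file written by TopNotch (path taken from the question above)
val bad = sqlContext.read.parquet("/user/user1/topnotch/part-r-00000-1513f167-1c5a-4ca8-bb08-6b7cb70a64dc.gz.parquet")
bad.registerTempTable("bad_records")
sqlContext.sql("SELECT * FROM bad_records LIMIT 10").show()
// persist the invalid records into a Hive table for later querying
bad.write.saveAsTable("bkfs.bad_records")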
05-25-2016
05:55 AM
Thank you. Do you know of any generic scripts developed in Spark for data profiling and data cleaning that you could share?
05-25-2016
04:41 AM
Hi Neeraj, for data quality testing, is there a model script developed in Pig or Spark, rather than using a tool? Thanks.
05-25-2016
04:30 AM
1 Kudo
Hi, could you share some details on analysing the quality of data loaded into Hive? I have a text file of around 250 million records which I have loaded into Hive and stored as Parquet.
My next task is to analyse the quality of the data. Since I am not from an ETL background, this is new to me.
Could you share some approaches that can be used on Hive tables? I would prefer Spark or Pig. Thanks in advance!
Labels:
- Apache Pig
- Apache Spark
05-06-2016
12:54 AM
Hi Abdel, I haven't tried this one; I used a join instead. I will try it. Thank you.