Member since: 04-07-2017
Posts: 80
Kudos Received: 33
Solutions: 0
10-31-2017
06:37 PM
Thanks for the suggestion. I am opting for the second solution, as the data is not one big continuous row and the first solution did not work.
10-31-2017
12:14 PM
Hi, I am not an expert in Java, and I am trying to analyse FixedLengthInputFormat and FixedLengthRecordReader in order to customize them for my project. I copied both classes from the GitHub link below and am testing them through a driver and mapper class: https://github.com/apache/hadoop/tree/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input
The input is fixed-length data like this:
1234abcvd123mnfvds6722
6543abcad123aewert1234
While running this I get the error "Partial record found at the end of split". The input split includes the newline characters, so the split length is calculated as 46 instead of 44 and the reader sees 3 records instead of 2. How can the newline character be kept out of the input split? I appreciate any help on this. Thank you.
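For reference, a minimal driver-side sketch of the usual fix, written in Scala against the stock org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat (the paths and job wiring are placeholders): configure the record length to include the newline (22 data bytes + 1-byte '\n' = 23), then strip the trailing byte in the mapper.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FixedLengthInputFormat}
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

val conf = new Configuration()
// Each record is 22 data bytes plus a 1-byte '\n' terminator on disk, so the
// on-disk record length is 23; a 46-byte split then holds exactly 2 records.
FixedLengthInputFormat.setRecordLength(conf, 23)

val job = Job.getInstance(conf, "fixed-length-input")
job.setInputFormatClass(classOf[FixedLengthInputFormat])
FileInputFormat.addInputPath(job, new Path("/data/fixed/in"))    // placeholder path
FileOutputFormat.setOutputPath(job, new Path("/data/fixed/out")) // placeholder path
// In the mapper, drop the trailing newline from each value:
//   val record = value.toString.take(22)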
Labels:
- Apache Hadoop
04-03-2017
03:21 PM
Hi,
I have data in HDFS that is output by Pig. The data is partitioned first by date and then by cust_segment. Each file under a segment has a header and a footer, both single lines. I want to load this data into Hive while ignoring the header and footer. I found that 'org.apache.hadoop.hive.serde2.OpenCSVSerde' can remove the header. Is there a similar SerDe that removes both the header and the footer, or could you suggest another approach to drop the footer?
Thank you.
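One possible approach, sketched below and untested against this data, is to skip a SerDe entirely and use Hive's table-level skip properties; the table name, columns, and location here are hypothetical:

CREATE EXTERNAL TABLE cust_data (        -- hypothetical table and columns
  cust_id STRING,
  amount  DOUBLE
)
PARTITIONED BY (load_date STRING, cust_segment STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/pig/output/cust_data'    -- hypothetical path
TBLPROPERTIES (
  'skip.header.line.count' = '1',  -- ignore the one-line header in each file
  'skip.footer.line.count' = '1'   -- ignore the one-line footer in each file
);

Both properties apply per file on text-format tables, which matches the Pig output described above.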
Labels:
- Apache Hive
- Apache Pig
06-09-2016
06:04 PM
SBT: commons-beanutils:commons-beanutils:1.7.0:jar
SBT: commons-beanutils:commons-beanutils-core:1.8.0:jar
SBT: commons-cli:commons-cli:1.2:jar
SBT: commons-codec:commons-codec:1.10:jar
SBT: commons-collections:commons-collections:3.2.1:jar
SBT: commons-configuration:commons-configuration:1.6:jar
SBT: commons-dbcp:commons-dbcp:1.4:jar
SBT: commons-digester:commons-digester:1.8:jar
SBT: commons-el:commons-el:1.0:jar
SBT: commons-httpclient:commons-httpclient:3.1:jar
SBT: commons-io:commons-io:2.4:jar
SBT: commons-lang:commons-lang:2.6:jar
SBT: commons-logging:commons-logging:1.1.3:jar
SBT: commons-net:commons-net:3.1:jar
SBT: commons-pool:commons-pool:1.5.4:jar
SBT: io.dropwizard.metrics:metrics-core:3.1.2:jar
SBT: io.dropwizard.metrics:metrics-graphite:3.1.2:jar
SBT: io.dropwizard.metrics:metrics-json:3.1.2:jar
SBT: io.dropwizard.metrics:metrics-jvm:3.1.2:jar
SBT: io.netty:netty:3.8.0.Final:jar
SBT: io.netty:netty-all:4.0.29.Final:jar
SBT: io.spray:spray-json_2.10:1.3.2:jar
SBT: javax.activation:activation:1.1:jar
SBT: javax.inject:javax.inject:1:jar
SBT: javax.jdo:jdo-api:3.0.1:jar
SBT: javax.transaction:jta:1.1:jar
SBT: javax.xml.bind:jaxb-api:2.2.2:jar
SBT: javolution:javolution:5.5.1:jar
SBT: jline:jline:2.12:jar
SBT: joda-time:joda-time:2.9.1:jar
SBT: junit:junit:4.11:jar
SBT: log4j:apache-log4j-extras:1.2.17:jar
SBT: log4j:log4j:1.2.17:jar
SBT: net.hydromatic:eigenbase-properties:1.1.5:jar
SBT: net.java.dev.jets3t:jets3t:0.7.1:jar
SBT: net.jpountz.lz4:lz4:1.3.0:jar
SBT: net.razorvine:pyrolite:4.9:jar
SBT: net.sf.opencsv:opencsv:2.3:jar
SBT: net.sf.py4j:py4j:0.9:jar
SBT: org.antlr:antlr-runtime:3.4:jar
SBT: org.antlr:ST4:4.0.4:jar
SBT: org.antlr:stringtemplate:3.2.1:jar
SBT: org.apache.avro:avro:1.7.7:jar
SBT: org.apache.avro:avro-ipc:1.7.7:jar
SBT: org.apache.avro:avro-ipc:1.7.7:tests:jar
SBT: org.apache.avro:avro-mapred:1.7.7:hadoop2:jar
SBT: org.apache.calcite:calcite-avatica:1.2.0-incubating:jar
SBT: org.apache.calcite:calcite-core:1.2.0-incubating:jar
SBT: org.apache.calcite:calcite-linq4j:1.2.0-incubating:jar
SBT: org.apache.commons:commons-compress:1.4.1:jar
SBT: org.apache.commons:commons-csv:1.1:jar
SBT: org.apache.commons:commons-lang3:3.1:jar
SBT: org.apache.commons:commons-math3:3.4.1:jar
SBT: org.apache.commons:commons-math:2.2:jar
SBT: org.apache.curator:curator-client:2.6.0:jar
SBT: org.apache.curator:curator-framework:2.6.0:jar
SBT: org.apache.curator:curator-recipes:2.6.0:jar
SBT: org.apache.derby:derby:10.10.2.0:jar
SBT: org.apache.directory.api:api-asn1-api:1.0.0-M20:jar
SBT: org.apache.directory.api:api-util:1.0.0-M20:jar
SBT: org.apache.directory.server:apacheds-i18n:2.0.0-M15:jar
SBT: org.apache.directory.server:apacheds-kerberos-codec:2.0.0-M15:jar
SBT: org.apache.hadoop:hadoop-annotations:2.6.0:jar
SBT: org.apache.hadoop:hadoop-auth:2.6.0:jar
SBT: org.apache.hadoop:hadoop-client:2.6.0:jar
SBT: org.apache.hadoop:hadoop-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-hdfs:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-app:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-core:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-jobclient:2.6.0:jar
SBT: org.apache.hadoop:hadoop-mapreduce-client-shuffle:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-api:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-client:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-server-common:2.6.0:jar
SBT: org.apache.hadoop:hadoop-yarn-server-web-proxy:2.2.0:jar
SBT: org.apache.hbase:hbase-client:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-common:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-hadoop2-compat:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-prefix-tree:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-protocol:0.98.0-hadoop2:jar
SBT: org.apache.hbase:hbase-server:0.98.0-hadoop2:jar
SBT: org.apache.httpcomponents:httpclient:4.3.2:jar
SBT: org.apache.httpcomponents:httpcore:4.3.1:jar
SBT: org.apache.ivy:ivy:2.4.0:jar
SBT: org.apache.mesos:mesos:0.21.1:shaded-protobuf:jar
SBT: org.apache.parquet:parquet-column:1.7.0:jar
SBT: org.apache.parquet:parquet-common:1.7.0:jar
SBT: org.apache.parquet:parquet-encoding:1.7.0:jar
SBT: org.apache.parquet:parquet-format:2.3.0-incubating:jar
SBT: org.apache.parquet:parquet-generator:1.7.0:jar
SBT: org.apache.parquet:parquet-hadoop:1.7.0:jar
SBT: org.apache.parquet:parquet-jackson:1.7.0:jar
SBT: org.apache.spark:spark-catalyst_2.10:1.6.0:jar
SBT: org.apache.spark:spark-core_2.10:1.6.0:jar
SBT: org.apache.spark:spark-core_2.10:1.6.0:tests:jar
SBT: org.apache.spark:spark-hive_2.10:1.6.0:jar
SBT: org.apache.spark:spark-launcher_2.10:1.6.0:jar
SBT: org.apache.spark:spark-network-common_2.10:1.6.0:jar
SBT: org.apache.spark:spark-network-shuffle_2.10:1.6.0:jar
SBT: org.apache.spark:spark-sql_2.10:1.6.0:jar
SBT: org.apache.spark:spark-sql_2.10:1.6.0:tests:jar
SBT: org.apache.spark:spark-unsafe_2.10:1.6.0:jar
SBT: org.apache.spark:spark-yarn_2.10:1.6.0:jar
SBT: org.apache.thrift:libfb303:0.9.2:jar
SBT: org.apache.thrift:libthrift:0.9.2:jar
SBT: org.apache.xbean:xbean-asm5-shaded:4.4:jar
SBT: org.apache.zookeeper:zookeeper:3.4.6:jar
SBT: org.cloudera.htrace:htrace-core:2.04:jar
SBT: org.codehaus.groovy:groovy-all:2.1.6:jar
SBT: org.codehaus.jackson:jackson-core-asl:1.9.13:jar
SBT: org.codehaus.jackson:jackson-jaxrs:1.9.13:jar
SBT: org.codehaus.jackson:jackson-mapper-asl:1.9.13:jar
SBT: org.codehaus.jackson:jackson-xc:1.9.13:jar
SBT: org.codehaus.janino:commons-compiler:2.7.8:jar
SBT: org.codehaus.janino:janino:2.7.8:jar
SBT: org.codehaus.jettison:jettison:1.1:jar
SBT: org.datanucleus:datanucleus-api-jdo:3.2.6:jar
SBT: org.datanucleus:datanucleus-core:3.2.10:jar
SBT: org.datanucleus:datanucleus-rdbms:3.2.9:jar
SBT: org.eclipse.jdt:core:3.1.1:jar
SBT: org.eclipse.jetty.orbit:javax.servlet:3.0.0.v201112011016:jar
SBT: org.fusesource.leveldbjni:leveldbjni-all:1.8:jar
SBT: org.hamcrest:hamcrest-core:1.3:jar
SBT: org.htrace:htrace-core:3.0.4:jar
SBT: org.iq80.snappy:snappy:0.2:jar
SBT: org.jamon:jamon-runtime:2.3.1:jar
SBT: org.joda:joda-convert:1.8:jar
SBT: org.jodd:jodd-core:3.5.2:jar
SBT: org.json4s:json4s-ast_2.10:3.2.10:jar
SBT: org.json4s:json4s-core_2.10:3.2.10:jar
SBT: org.json4s:json4s-jackson_2.10:3.2.10:jar
SBT: org.json:json:20090211:jar
SBT: org.mortbay.jetty:jetty:6.1.26:jar
SBT: org.mortbay.jetty:jetty-sslengine:6.1.26:jar
SBT: org.mortbay.jetty:jetty-util:6.1.26:jar
SBT: org.mortbay.jetty:jsp-2.1:6.1.14:jar
SBT: org.mortbay.jetty:jsp-api-2.1:6.1.14:jar
SBT: org.mortbay.jetty:servlet-api-2.5:6.1.14:jar
SBT: org.objenesis:objenesis:1.2:jar
SBT: org.roaringbitmap:RoaringBitmap:0.5.11:jar
SBT: org.scala-lang:scala-compiler:2.10.0:jar
SBT: org.scala-lang:scala-library:2.10.6:jar
SBT: org.scala-lang:scala-reflect:2.10.5:jar
SBT: org.scala-lang:scalap:2.10.0:jar
SBT: org.scalamock:scalamock-core_2.10:3.2:jar
SBT: org.scalamock:scalamock-scalatest-support_2.10:3.2:jar
SBT: org.scalatest:scalatest_2.10:2.2.5:jar
SBT: org.scoverage:scalac-scoverage-plugin_2.10:1.1.1:jar
SBT: org.scoverage:scalac-scoverage-runtime_2.10:1.1.1:jar
SBT: org.slf4j:jcl-over-slf4j:1.7.10:jar
SBT: org.slf4j:jul-to-slf4j:1.7.10:jar
SBT: org.slf4j:slf4j-api:1.7.10:jar
SBT: org.slf4j:slf4j-log4j12:1.7.10:jar
SBT: org.sonatype.sisu.inject:cglib:2.2.1-v20090111:jar
SBT: org.spark-project.hive:hive-exec:1.2.1.spark:jar
SBT: org.spark-project.hive:hive-metastore:1.2.1.spark:jar
SBT: org.spark-project.spark:unused:1.0.0:jar
SBT: org.tachyonproject:tachyon-client:0.8.2:jar
SBT: org.tachyonproject:tachyon-underfs-hdfs:0.8.2:jar
SBT: org.tachyonproject:tachyon-underfs-local:0.8.2:jar
SBT: org.tachyonproject:tachyon-underfs-s3:0.8.2:jar
SBT: org.tukaani:xz:1.0:jar
SBT: org.uncommons.maths:uncommons-maths:1.2.2a:jar
SBT: org.xerial.snappy:snappy-java:1.1.2:jar
SBT: oro:oro:2.0.8:jar
SBT: sbt-and-plugins
SBT: stax:stax-api:1.0.1:jar
SBT: tomcat:jasper-compiler:5.5.23:jar
SBT: tomcat:jasper-runtime:5.5.23:jar
SBT: xerces:xercesImpl:2.9.1:jar
SBT: xml-apis:xml-apis:1.3.04:jar
SBT: xmlenc:xmlenc:0.52:jar
06-09-2016
04:44 PM
Hi Benjamin, is it required to install the Thrift plugin in IntelliJ to execute a Spark SQL application written in Scala? I have installed the Scala plugin and imported the Spark libraries; the external libraries are listed below. When I run the Spark SQL/Scala project it is finding the Hive warehouse directory, but I get an error message: input refs are invalid: /user/hive/warehouse/bkfs.db/asmt_02013. Could you help? Thanks!!!
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
spark-1.5.1-yarn-shuffle.jar
spark-assembly-1.5.1-hadoop2.4.0.jar
spark-examples-1.5.1-hadoop2.4.0.jar
<1.8> java-8-oracle
SBT: antlr:antlr:2.7.7:jar
SBT: aopalliance:aopalliance:1.0:jar
SBT: asm:asm:3.2:jar
SBT: com.clearspring.analytics:stream:2.7.0:jar
SBT: com.databricks:spark-csv_2.10:1.3.0:jar
SBT: com.esotericsoftware.kryo:kryo:2.21:jar
SBT: com.esotericsoftware.minlog:minlog:1.2:jar
SBT: com.esotericsoftware.reflectasm:reflectasm:1.07:shaded:jar
SBT: com.fasterxml.jackson.core:jackson-annotations:2.4.4:jar
SBT: com.fasterxml.jackson.core:jackson-core:2.4.4:jar
SBT: com.fasterxml.jackson.core:jackson-databind:2.4.4:jar
SBT: com.fasterxml.jackson.module:jackson-module-scala_2.10:2.4.4:jar
SBT: com.github.stephenc.findbugs:findbugs-annotations:1.3.9-1:jar
SBT: com.github.stephenc.high-scale-lib:high-scale-lib:1.1.1:jar
SBT: com.google.code.findbugs:jsr305:1.3.9:jar
SBT: com.google.code.gson:gson:2.2.4:jar
SBT: com.google.guava:guava:16.0.1:jar
SBT: com.google.inject.extensions:guice-servlet:3.0:jar
SBT: com.google.inject:guice:3.0:jar
SBT: com.google.protobuf:protobuf-java:2.5.0:jar
SBT: com.googlecode.javaewah:JavaEWAH:0.3.2:jar
SBT: com.iheart:ficus_2.10:1.0.2:jar
SBT: com.jolbox:bonecp:0.8.0.RELEASE:jar
SBT: com.ning:compress-lzf:1.0.3:jar
SBT: com.sun.jersey.contribs:jersey-guice:1.9:jar
SBT: com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:1.9:jar
SBT: com.sun.jersey:jersey-client:1.9:jar
SBT: com.sun.jersey:jersey-core:1.9:jar
SBT: com.sun.jersey:jersey-json:1.9:jar
SBT: com.sun.jersey:jersey-server:1.9:jar
SBT: com.sun.xml.bind:jaxb-impl:2.2.3-1:jar
SBT: com.thoughtworks.paranamer:paranamer:2.6:jar
SBT: com.twitter:chill-java:0.5.0:jar
SBT: com.twitter:chill_2.10:0.5.0:jar
SBT: com.twitter:parquet-hadoop-bundle:1.6.0:jar
SBT: com.typesafe.akka:akka-actor_2.10:2.3.11:jar
SBT: com.typesafe.akka:akka-remote_2.10:2.3.11:jar
SBT: com.typesafe.akka:akka-slf4j_2.10:2.3.11:jar
SBT: com.typesafe:config:1.2.1:jar
SBT: com.univocity:univocity-parsers:1.5.1:jar
SBT: com.yammer.metrics:metrics-core:2.1.2:jar
05-26-2016
03:05 PM
Hi Robert, thanks for the details. Could you help me with reading a Parquet file?
I have loaded some data into Hive, and to validate it I ran the TopNotch script (https://github.com/blackrock/TopNotch), which uses Spark SQL. The script wrote the bad records to a fileName.gz.parquet file in HDFS under my home directory. Now I want to read these invalid records, but the script above fails when I try it. Can it be used to read data from a Parquet file? This line:
val newDataDF = sqlContext.read.parquet("/user/user1/topnotch/part-r-00000-1513f167-1c5a-4ca8-bb08-6b7cb70a64dc.gz.parquet")
throws a "not found" error. I then want to load the invalid records into a Hive table for querying:
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
Thank you.
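For reference, here is that sequence consolidated into a minimal Spark 1.x sketch; the path is the one from the post, and in spark-shell sc and sqlContext already exist, so the first block is only needed in a standalone application:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("read-topnotch-parquet"))
val sqlContext = new SQLContext(sc)

// Read the Parquet output written by TopNotch.
val badRecords = sqlContext.read.parquet(
  "/user/user1/topnotch/part-r-00000-1513f167-1c5a-4ca8-bb08-6b7cb70a64dc.gz.parquet")

// Inspect the schema and a few rows before querying.
badRecords.printSchema()
badRecords.show(20)

// Register as a temporary table and query it with Spark SQL.
badRecords.registerTempTable("bad_records")
sqlContext.sql("SELECT * FROM bad_records LIMIT 20").show()

If the read still fails with "not found", the usual cause is that the path does not exist in HDFS as written; hdfs dfs -ls /user/user1/topnotch will confirm.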
05-25-2016
05:55 AM
Thank you. Do you know of any generic scripts developed in Spark for data profiling and data cleaning that you could share?
05-25-2016
04:41 AM
Hi Neeraj, for data quality testing, is there a model script developed in Pig or Spark, rather than using a tool? Thanks.
05-25-2016
04:30 AM
1 Kudo
Hi, could you share some details on analysing the quality of data loaded in Hive? I have a text file of around 250 million records, which I have loaded into Hive and stored as Parquet.
My next task is to analyse the quality of the data. Since I am not from an ETL background, this is new to me.
Could you share some techniques that can be used on Hive tables? I would prefer Spark or Pig. Thanks in advance!!!
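For context, a minimal Spark 1.x profiling sketch of the kind being asked about, assuming a hypothetical Hive table my_db.my_table; it reports the row count plus per-column null and distinct counts, a common first pass at data quality:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)        // sc: the existing SparkContext
val df = hiveContext.table("my_db.my_table") // hypothetical table name

// Overall row count.
val total = df.count()
println(s"rows: $total")

// Per-column null and distinct counts -- a basic data-quality profile.
for (colName <- df.columns) {
  val nulls = df.filter(df(colName).isNull).count()
  val distinct = df.select(colName).distinct().count()
  println(s"$colName: nulls=$nulls, distinct=$distinct")
}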
Labels:
- Apache Pig
- Apache Spark
05-06-2016
12:54 AM
Hi Abdel, I haven't tried this one; I used a join instead. I will give it a try. Thank you.