KITE SDK 'Provided partitioners do not reference a source field and instead require that a value is'

New Contributor

Hi,

 

I am using the Kite SDK on the QuickStart VM to create some datasets, but I cannot see how to pass a provided partition value when I run csv-import or json-import.

 

How can we achieve that?

 

Thanks.

Re: KITE SDK 'Provided partitioners do not reference a source field and instead require that a value

Expert Contributor

Hi Khalef,

 

Have you had a look at this blog post covering Kite? https://blog.cloudera.com/blog/2014/06/how-to-use-kite-sdk-to-easily-store-and-configure-data-in-apa...

 

Cheers, Lars

Re: KITE SDK 'Provided partitioners do not reference a source field and instead require that a value

New Contributor

Hi Lars,

 

Yes, I believe I have read most of the articles and documentation written on the Kite SDK.

 

However, my partition fields (year, month, day) are not part of my data files, and there is no date or timestamp field that indicates whether the data is from today or from a month ago.

 

My partition config (if I can use one) would be:

[{
  "type" : "provided",
  "name" : "year"
},
{
  "type" : "provided",
  "name" : "month"
},
{
  "type" : "provided",
  "name" : "day"
}]
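
For reference, a minimal sketch of how a config like the one above could be attached when the dataset is created. It assumes the config is saved to a local file called partition.json (a hypothetical name) and uses the `--partition-by` option of `kite-dataset create`. It only covers dataset creation; how csv-import or json-import would then receive the provided values is the open question in this thread.

```shell
# Write the partition strategy to a local file (partition.json is a hypothetical name).
cat > partition.json <<'EOF'
[{
  "type" : "provided",
  "name" : "year"
},
{
  "type" : "provided",
  "name" : "month"
},
{
  "type" : "provided",
  "name" : "day"
}]
EOF

# Create the dataset with that strategy; skipped when the Kite CLI is not installed.
if command -v kite-dataset >/dev/null 2>&1; then
  kite-dataset create dataset:hdfs:/user/caf/macleasing/stage/ml/introducerGroups \
    -s hdfs:/user/caf/macleasing/format/introducerGroup.avsc \
    -f parquet \
    --partition-by partition.json
fi
```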

 

When I run csv-import or json-import on my files, I don't see how to tell kite-dataset explicitly that I want the imported file stored in the partition year=2016, month=05, day=30.

 

Right now this is what I am doing: I create a dataset, import the file, create a partition directory, and then move the Parquet file into it:

 

> kite-dataset csv-schema ods_ml_au.Introducer_Group_30_05_2016.psv --class IntroducerGroup --delimiter '|' -o introducerGroup.avsc

 

> hdfs dfs -put introducerGroup.avsc /user/caf/macleasing/format

 

> kite-dataset create dataset:hdfs:/user/caf/macleasing/stage/ml/introducerGroups -s hdfs:/user/caf/macleasing/format/introducerGroup.avsc -f parquet

 

> hdfs dfs -put ods_ml_au.Introducer_Group_30_05_2016.psv /user/caf/macleasing/source

 

> kite-dataset csv-import hdfs:/user/caf/macleasing/source/ods_ml_au.Introducer_Group_30_05_2016.psv dataset:hdfs:/user/caf/macleasing/stage/ml/introducerGroups --delimiter '|'

 

> hdfs dfs -mkdir -p /user/caf/macleasing/stage/ml/introducerGroups/year=2016/month=05/day=30

 

> hdfs dfs -mv /user/caf/macleasing/stage/ml/introducerGroups/*.parquet /user/caf/macleasing/stage/ml/introducerGroups/year=2016/month=05/day=30/
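
As a stopgap, the last two steps above could be scripted so that the partition path is not typed by hand. The sketch below assumes the date in the file name (here `_30_05_2016`, read as day, month, year) is the partition date; `partition_dir` is a hypothetical helper, not part of Kite. It does not remove the mkdir and mv steps, it only derives their target.

```shell
# Hypothetical helper: turn a file name ending in _DD_MM_YYYY.<ext>
# into a year=YYYY/month=MM/day=DD partition path.
partition_dir() {
  pd_base=$(basename "$1")
  pd_base=${pd_base%.*}            # drop the extension (.psv)
  pd_year=${pd_base##*_}           # last underscore-separated token
  pd_rest=${pd_base%_*}
  pd_month=${pd_rest##*_}
  pd_rest=${pd_rest%_*}
  pd_day=${pd_rest##*_}
  # Reject anything that is not three all-digit tokens of the expected width.
  case "$pd_year$pd_month$pd_day" in
    ''|*[!0-9]*) echo "no DD_MM_YYYY date in: $1" >&2; return 1 ;;
  esac
  if [ "${#pd_year}" -ne 4 ] || [ "${#pd_month}" -ne 2 ] || [ "${#pd_day}" -ne 2 ]; then
    echo "no DD_MM_YYYY date in: $1" >&2
    return 1
  fi
  printf 'year=%s/month=%s/day=%s\n' "$pd_year" "$pd_month" "$pd_day"
}

# Usage (needs an HDFS client, so shown as comments):
#   dataset=/user/caf/macleasing/stage/ml/introducerGroups
#   part=$(partition_dir ods_ml_au.Introducer_Group_30_05_2016.psv) || exit 1
#   hdfs dfs -mkdir -p "$dataset/$part"
#   hdfs dfs -mv "$dataset/*.parquet" "$dataset/$part/"
```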

 

How can I avoid explicitly creating the directory and moving the file?

 

I want to use my partition config instead.

 

Cheers Khalef

 

Re: KITE SDK 'Provided partitioners do not reference a source field and instead require that a value

Expert Contributor

Hi Khalef,

 

I don't know the answer to your question myself, but I asked around for an expert on Kite and learned that the best source for help would be the Kite project itself. You probably found their website already: http://kitesdk.org/docs/current/

 

It also links to their mailing list and I would like to ask you to post your question there: https://groups.google.com/a/cloudera.org/forum/#!forum/cdk-dev

 

Apologies for the inconvenience. Best wishes, Lars

Re: KITE SDK 'Provided partitioners do not reference a source field and instead require that a value

New Contributor
Thanks Lars.