Defined and documented consumesData and producesData properties by proycon · Pull Request #5 · SoftwareUnderstanding/software_types

proycon · 2022-06-10T15:29:33Z

This formalizes some of the earlier discussion in codemeta/codemeta#188 and introduces two new properties that allow us to describe input and output data for software at a high level.

PS: this includes and builds upon the earlier PR #4

dgarijo

I think this PR needs more discussion.
It is not clear from the doc and spec what the range of consumesData and producesData is. Is it a specific dataset? Or a dataset type? (I understand the latter). If that's the case, from my experience in workflows most people tend to think that's a data format. For example, "this software consumes CSV files" rather than "this software consumes meteorological files in CSV format", which is different. Also, what happens if it consumes several types of data, with different roles? (e.g., a meteorological file and a snowmelt file). These two properties may open a can of worms, and need to be better scoped if we want them to work for simple cases.

proycon · 2022-08-03T18:55:31Z

> It is not clear from the doc and spec what the range of consumesData and producesData is. Is it a specific dataset? Or a dataset type? (I understand the latter).

Mostly the latter indeed: it acts like a template based on what's specified, but I'm trying to keep the options open to also accommodate the former. If one

I may need to clarify a bit more that this is a pretty high-level and descriptive proposal because I'm indeed afraid to open a can of worms otherwise. My aim here is just that we can encode and communicate to the end-user, in the software metadata, at least some information on what input and output a particular piece of software can accept or produce. Currently in codemeta, we don't have this ability at all.

What I'm proposing is deliberately limited and more descriptive than prescriptive; it's fairly open-ended. It is more in line with codemeta than with things like OpenAPI, which go into machine-parseable detail (aka the can of worms). You're not going to be able to automate calling tools or APIs on the basis of this information, but it can at least communicate to the user some aspects of the data input/output of the software. I think this is valuable information for users/researchers, for example to judge from the software metadata whether the software might be suitable for them and worth looking into. It could also be used for some automated tool suggestions.
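To illustrate the intended level of detail, here is a minimal sketch in Python that builds such a description as JSON-LD. Only the property names `consumesData` and `producesData` come from this proposal; the `iodata` prefix, its namespace URL, the tool name, the simplified `@context`, and the chosen values are hypothetical placeholders, not part of any published spec.

```python
import json

# Hypothetical placeholder namespace for the two proposed properties.
IODATA_NS = "https://example.org/software-iodata#"


def describe_tool(name, consumes, produces):
    """Build a simplified JSON-LD description with high-level I/O metadata."""
    return {
        # Simplified context (an assumption): schema.org as default vocabulary,
        # plus the placeholder prefix for the proposed properties.
        "@context": {"@vocab": "https://schema.org/", "iodata": IODATA_NS},
        "@type": "SoftwareSourceCode",
        "name": name,
        "iodata:consumesData": consumes,
        "iodata:producesData": produces,
    }


# Illustrative values only: a tool that reads plain text and writes CSV.
metadata = describe_tool(
    "example-tagger",
    consumes=[{"@type": "TextDigitalDocument", "encodingFormat": "text/plain"}],
    produces=[{"@type": "Dataset", "encodingFormat": "text/csv"}],
)

print(json.dumps(metadata, indent=2))
```

The description is enough for a human or a catalogue to see roughly what goes in and what comes out, but it carries nothing that would let a client actually invoke the tool.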

> Also, what happens if it consumes several types of data, with different roles? (e.g., a meteorological file and a snowmelt file).

The roles are not distinguished. If there are several types of input/output data, you can specify them all, but the precise relation between the things that are consumed and the things that are produced is not expressed, nor is whether it consumes/produces any or all of them. I'll see if I can explain it better in the text.
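As a rough sketch of what this means in practice, the Python snippet below lists two consumed data types side by side and shows the kind of coarse question that can still be answered. The dataset names, the media types, and both helper functions are illustrative assumptions, not part of the proposal.

```python
# Two consumed data types, listed side by side with no role attached.
# Names and values are made up for illustration.
consumes = [
    {"@type": "Dataset", "name": "Meteorological data", "encodingFormat": "text/csv"},
    {"@type": "Dataset", "name": "Snowmelt data", "encodingFormat": "text/csv"},
]


def accepted_formats(templates):
    """Return the set of encoding formats mentioned across all templates."""
    return {t["encodingFormat"] for t in templates if "encodingFormat" in t}


def might_accept(templates, media_type):
    """Coarse check: is this media type mentioned by any template at all?"""
    return media_type in accepted_formats(templates)


print(accepted_formats(consumes))
print(might_accept(consumes, "text/csv"))
```

Both entries share the same format, so nothing in the metadata says which file plays which role, or whether the software needs both or only one of them.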

dgarijo · 2022-08-04T10:32:28Z

We did this in https://knowledgecaptureanddiscovery.github.io/SoftwareDescriptionOntology/release/1.9.0/index-en.html#hasInput and hasOutput and it can get tricky fast (we actually used everything for execution and definition of components). Plus, if the role is not specified, then you can do very limited things automatically. And you have to start defining a taxonomy of dataset types...

If this profile is about specifying software types, I would leave this out, to be honest. If you want to include this type of information then maybe we can start a different profile? I insist on this because contributions to schema.org are usually very modular. If people do not understand something, or think it's too complicated, it won't be merged.

proycon · 2022-08-04T22:33:22Z

> If this profile is about specifying software types, I would leave this out, to be honest. If you want to include this type of information then maybe we can start a different profile? I insist on this because contributions to schema.org are usually very modular. If people do not understand something, or think it's too complicated, it won't be merged.

I agree. It's probably best to start a different profile for this and keep both minimal. Let's do that. Can we settle on a name (something like software_iodata perhaps?) and the scope?

> We did this in https://knowledgecaptureanddiscovery.github.io/SoftwareDescriptionOntology/release/1.9.0/index-en.html#hasInput and hasOutput and it can get tricky fast (we actually used everything for execution and definition of components).

I see your DatasetSpecification indeed goes way deeper; for this new profile I wanted to keep things fairly simple. I also want to point back to the discussion at codemeta/codemeta#188 and the feedback there. One of the suggestions was to simply allow the whole of schema:CreativeWork as the range. If you interpret this CreativeWork range for consumesData/producesData as a kind of template, then it already opens up a wealth of ways to specify things like MIME type, natural language, licensing restrictions, etc. with all the existing schema.org vocabulary.
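A minimal sketch of that template reading, in Python: the properties `encodingFormat`, `inLanguage`, and `license` are existing schema.org vocabulary, while the concrete values and the `matches` helper are hypothetical. Treating an absent property as "unconstrained" is an assumption of this sketch, not something the proposal defines.

```python
# A consumed-data template built from existing schema.org properties.
# The concrete values are made up for illustration.
template = {
    "@type": "CreativeWork",
    "encodingFormat": "text/plain",
    "inLanguage": "nl",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}


def matches(template, **wanted):
    """Return True if every wanted property is either absent from the
    template (treated as unconstrained) or equal to the template's value."""
    return all(template.get(key, value) == value for key, value in wanted.items())


print(matches(template, encodingFormat="text/plain", inLanguage="nl"))
print(matches(template, inLanguage="en"))
```

A user looking for a tool that handles Dutch plain text would get a match here, and a user with English input would not; anything finer-grained is out of scope for this kind of description.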

> Plus, if the role is not specified, then you can do very limited things automatically. And you have to start defining a taxonomy of dataset types...

Yes, initially I wanted to keep it as simple as possible, which indeed limits what you can do automatically. The main aim would be to present the user with some metadata about possible input/output types for the software (as opposed to nothing at all, as it stands now with codemeta). I'm open to including and developing this more, but a bit wary about the can of worms it might open. It would probably be best if we can start very simple (basically just these two properties) and leave some room for incorporating more later?

proycon · 2022-08-05T11:53:46Z

I added some initial work on splitting and reworking things to https://github.com/proycon/software-iodata, but I'd rather move and push it to a new remote repo under https://github.com/SoftwareUnderstanding/ to keep things together, if you agree to collaborate further on this of course.

dgarijo · 2022-08-06T09:59:58Z

@proycon I invited you as a member of the organization. Now you should be able to request a repo transfer. I am happy to collaborate towards defining a loose mechanism for defining i/o (the name is fine by me too).
My only concern with allowing CreativeWork is that i/o types are most of the time types, not creative works (i.e., instances) per se, and people will likely be confused by this, as well as by the difference between type and format. We'll need examples to clarify this.

dgarijo · 2022-08-06T10:00:19Z

Then I guess this PR should be closed?

proycon · 2022-08-06T16:35:31Z

Thanks! I sent a transfer request.

> My only concern with allowing CreativeWork is that i/o types are most of the time types, not creative works (i.e., instances) per se. And people will likely be confused by this. That and the confusion with type and format too. We'll need examples to clarify this

Yes, I see your point. Let's work that out further in the new repo. Perhaps we want to encapsulate the CreativeWork in a class of our own that makes it more explicit that it's a template/type?

> Then I guess this PR should be closed?

Indeed, closing this now.

proycon added 2 commits June 10, 2022 15:38

d75a803 worked out documentation and examples for software type profile (codemeta/codemeta#271, codemeta/codemeta#267)

7cfa826 defined and documented consumesData and producesData properties (codemeta/codemeta#188)

proycon requested a review from dgarijo June 10, 2022 15:30

dgarijo requested changes Aug 1, 2022

proycon mentioned this pull request Aug 3, 2022

updated documentation and polished text #8

Merged

proycon closed this Aug 6, 2022

proycon mentioned this pull request Aug 8, 2022

Determine the scope and review the current proposed solution SoftwareUnderstanding/software-iodata#2

Closed

proycon wants to merge 2 commits into main from datatypes


2 participants
