
Defined and documented consumesData and producesData properties #5

Closed
proycon wants to merge 2 commits into main from datatypes

Collaborator

@proycon proycon commented Jun 10, 2022

This formalizes some of the earlier discussion in codemeta/codemeta#188 and introduces two new properties that allow us to describe input and output data for software at a high level.

PS: this includes and builds upon the earlier PR #4

@proycon proycon requested a review from dgarijo June 10, 2022 15:30
Contributor

@dgarijo dgarijo left a comment

I think this PR needs more discussion.
It is not clear from the doc and spec what the range of consumesData and producesData is. Is it a specific dataset or a dataset type? (I understand the latter.) If so, in my experience with workflows, most people tend to think of that as a data format: for example, "this software consumes CSV files" rather than "this software consumes meteorological files in CSV format", which is different. Also, what happens if it consumes several types of data with different roles (e.g., a meteorological file and a snowmelt file)? These two properties may open a can of worms and need to be better scoped if we want them to work for simple cases.

Collaborator Author

proycon commented Aug 3, 2022

> It is not clear from the doc and spec what the range of consumesData and producesData is. Is it a specific dataset or a dataset type? (I understand the latter.)

Mostly the latter, indeed: it acts like a template based on what's specified, but I'm trying to keep the options open to also accommodate the former. If one

I may need to clarify a bit more that this is a pretty high-level and descriptive proposal because I'm indeed afraid to open a can of worms otherwise. My aim here is just that we can encode and communicate to the end-user, in the software metadata, at least some information on what input and output a particular piece of software can accept or produce. Currently in codemeta, we don't have this ability at all.

What I'm proposing is deliberately limited and more descriptive than prescriptive; it's fairly open-ended. It is more in line with codemeta than with things like OpenAPI, which go into machine-parseable detail (aka the can of worms). You're not going to be able to automate calling tools or APIs on the basis of this information, but it can at least communicate to the user some aspects of the data input/output of the software. I think this is valuable information for users/researchers, for example to judge from the software metadata whether the software might be suitable for them and worth looking into. It could also be used for some automated tool suggestions.
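
A minimal sketch of the kind of tool suggestion mentioned above, assuming a hypothetical in-memory catalog of software metadata in which each declared input carries a schema.org `encodingFormat` MIME type. The catalog, the tool names, and the `suggest_tools` helper are invented for illustration; only `consumesData` comes from this proposal.

```python
# Hypothetical catalog of software metadata records (illustrative only).
CATALOG = [
    {"name": "csv-plotter", "consumesData": [{"encodingFormat": "text/csv"}]},
    {"name": "tokenizer", "consumesData": [{"encodingFormat": "text/plain"}]},
    {"name": "legacy-tool"},  # no input/output metadata declared at all
]


def suggest_tools(catalog, mime_type):
    """Return the names of tools that declare an input with the given MIME type."""
    return [
        tool["name"]
        for tool in catalog
        if any(
            data.get("encodingFormat") == mime_type
            for data in tool.get("consumesData", [])
        )
    ]


print(suggest_tools(CATALOG, "text/csv"))
```

This only narrows down candidates for a user to look at; it says nothing about how to actually invoke the tool, which is deliberately out of scope.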

> Also, what happens if it consumes several types of data with different roles (e.g., a meteorological file and a snowmelt file)?

The roles are not distinguished. If there are several types of input/output data, you can specify them all, but the precise relation between the things that are consumed and the things that are produced is not expressed, nor is whether it consumes/produces any or all of them. I'll see if I can explain it better in the text.
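
To make the role-free reading concrete, here is a minimal Python sketch of what such metadata could look like for the meteorological/snowmelt case. Only the property names `consumesData` and `producesData` come from this PR; the tool name, the `@type` choices, the output description, and the `encodingFormat` values are illustrative assumptions, and the JSON-LD `@context` is left out because the namespace for these properties is not settled in this thread.

```python
import json

# Hypothetical software description. Only consumesData/producesData are taken
# from the proposal; every other name and value is illustrative.
software = {
    "@type": "SoftwareSourceCode",
    "name": "snowmelt-model",  # hypothetical tool name
    "consumesData": [
        {"@type": "Dataset", "name": "Meteorological observations", "encodingFormat": "text/csv"},
        {"@type": "Dataset", "name": "Snowmelt measurements", "encodingFormat": "text/csv"},
    ],
    "producesData": [
        {"@type": "Dataset", "name": "Runoff estimates", "encodingFormat": "application/json"},
    ],
}


def summarize(sw):
    """List declared inputs and outputs; no roles or input-to-output links are implied."""
    lines = []
    for prop, verb in (("consumesData", "consumes"), ("producesData", "produces")):
        for item in sw.get(prop, []):
            fmt = item.get("encodingFormat", "unknown format")
            lines.append(f"{verb}: {item['name']} ({fmt})")
    return lines


print(json.dumps(software, indent=2))
print("\n".join(summarize(software)))
```

Both inputs share the same format and differ only in their description, which is exactly the type-versus-format distinction raised in the review. Nothing in the structure says whether the tool needs both inputs or either one.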

Contributor

dgarijo commented Aug 4, 2022

We did this in https://knowledgecaptureanddiscovery.github.io/SoftwareDescriptionOntology/release/1.9.0/index-en.html#hasInput and hasOutput, and it can get tricky fast (we actually used everything for execution and definition of components). Plus, if the role is not specified, then you can do only very limited things automatically. And you have to start defining a taxonomy of dataset types...

If this profile is about specifying software types, I would leave this out, to be honest. If you want to include this type of information then maybe we can start a different profile? I insist on this because contributions to schema.org are usually very modular. If people do not understand something, or think it's too complicated, it won't be merged.

Collaborator Author

proycon commented Aug 4, 2022

> If this profile is about specifying software types, I would leave this out, to be honest. If you want to include this type of information then maybe we can start a different profile? I insist on this because contributions to schema.org are usually very modular. If people do not understand something, or think it's too complicated, it won't be merged.

I agree. It's probably best to start a different profile for this and keep both minimal. Let's do that. Can we settle on a name (something like software_iodata perhaps?) and the scope?

> We did this in https://knowledgecaptureanddiscovery.github.io/SoftwareDescriptionOntology/release/1.9.0/index-en.html#hasInput and hasOutput, and it can get tricky fast (we actually used everything for execution and definition of components).

I see your DatasetSpecification indeed goes way deeper; for this new profile I wanted to keep things fairly simple. I also want to point back to the discussion at codemeta/codemeta#188 and the feedback there. One of the suggestions was to simply allow the whole of schema:CreativeWork as the range. If you interpret this CreativeWork range for consumesData/producesData as a kind of template, then it already opens up a wealth of ways to specify things like MIME type, natural language, licensing restrictions, etc., with all the existing schema.org vocabulary.
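
As a rough sketch of the template reading, assuming the schema.org properties `encodingFormat`, `inLanguage`, and `license` are used for MIME type, natural language, and licensing: a CreativeWork given as the value of `consumesData` constrains only the properties it sets, and any concrete file agreeing on those properties fits. The `fits_template` helper and the sample values are hypothetical and not part of the proposal.

```python
# A CreativeWork used as a template: it describes a kind of input,
# not one particular document. All values are illustrative.
template = {
    "@type": "TextDigitalDocument",
    "encodingFormat": "text/plain",
    "inLanguage": "nl",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}


def fits_template(template, candidate):
    """Return True if the candidate agrees with every property the template sets."""
    return all(candidate.get(key) == value for key, value in template.items())


# A concrete document may carry extra properties the template leaves open.
document = dict(template, name="example-corpus.txt")
other_language = dict(document, inLanguage="en")

print(fits_template(template, document))        # extra "name" is ignored
print(fits_template(template, other_language))  # differs on inLanguage
```

Exact equality is the simplest possible matching rule and is used here only to show the idea; it does not handle subtypes, language variants, or multiple allowed formats.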

> Plus, if the role is not specified, then you can do only very limited things automatically. And you have to start defining a taxonomy of dataset types...

Yes, initially I wanted to keep it as simple as possible, which indeed limits what you can do automatically. The main aim would be to present the user with some metadata about possible input/output types for the software (as opposed to nothing at all, as it stands now with codemeta). I'm open to including and developing this more, but I'm a bit wary of the can of worms it might open. It would probably be best if we can start very simple (basically just these two properties) and leave some room for incorporating more later?

Collaborator Author

proycon commented Aug 5, 2022

I added some initial work on splitting and reworking things to https://github.com/proycon/software-iodata, but I'd rather move it to a new remote repo under https://github.com/SoftwareUnderstanding/ to keep things together, if you agree to collaborate further on this of course.

Contributor

dgarijo commented Aug 6, 2022

@proycon I invited you as a member of the organization. Now you should be able to request a repo transfer. I am happy to collaborate on a loose mechanism for describing i/o (the name is fine by me too).
My only concern with allowing CreativeWork is that i/o types are, most of the time, types and not creative works (i.e., instances) per se, and people will likely be confused by this, as well as by the difference between type and format. We'll need examples to clarify this.

Contributor

dgarijo commented Aug 6, 2022

Then I guess this PR should be closed?

Collaborator Author

proycon commented Aug 6, 2022

Thanks! I sent a transfer request.

> My only concern with allowing CreativeWork is that i/o types are, most of the time, types and not creative works (i.e., instances) per se, and people will likely be confused by this, as well as by the difference between type and format. We'll need examples to clarify this.

Yes, I see your point. Let's work that out further in the new repo. Perhaps we want to encapsulate the CreativeWork in a class of our own that makes it more explicit that it's a template/type?

> Then I guess this PR should be closed?

Indeed, closing this now.
