Comments (4)

egonw commented on September 3, 2024

Thanks for the cross-link! Often, changes on the Wikidata side need updates here. The book example with versions and editions is an important one.

Scholia indeed just visualizes what is in Wikidata, and we try to make Scholia as useful as possible (without resulting in timed-out queries). Scholia should not, imho, be a platform to discuss what Wikidata can or cannot handle. That is, mass-importing data is to be discussed on Wikidata (as it is in this case). But the simple fact is that the current Wikidata platform is not as scalable as everyone would love it to be.

fnielsen commented on September 3, 2024

I am under the impression that the bots that imported and annotated WikiCite data have been switched off due to fear of the Wikidata Query Service running into trouble.

andrawaag commented on September 3, 2024

I disagree. Bulk importing is not a solution. I have switched off some, if not all, bulk-importing bots, not because of fear of the WDQS getting into trouble, but because bulk importing will make Wikidata less useful. We don't know the exact number of all books and papers, but even the most conservative estimates give a number that is way bigger than the number of current Wikidata items. So trying to achieve complete recall is basically impossible.

IMO we should try to build AI-ready corpora not by increasing the coverage in Wikidata, but by creating independent RDF graphs on books and papers using the Wikidata namespace, as sketched below. Building RDF graphs of the size of Wikidata (or bigger) is relatively easy if they are directly constructed as RDF graphs (i.e., not having to rely on the limitations of the Wikibase API). This approach still requires Wikidata bots, because main-topic items will still need to be minted.
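A minimal sketch of what such an independent graph could look like with rdflib, reusing Wikidata's real wd:/wdt: namespaces and real identifiers (P31 "instance of", P921 "main subject", Q13442814 "scholarly article", Q2013 "Wikidata"); the local paper URI is a hypothetical identifier scheme, not anything Wikidata mints:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

# Wikidata's entity and direct-property namespaces, reused rather than re-invented.
WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")

g = Graph()
g.bind("wd", WD)
g.bind("wdt", WDT)

# A paper minted outside Wikidata but described with Wikidata's vocabulary.
paper = URIRef("https://example.org/paper/12345")  # hypothetical local URI
g.add((paper, RDFS.label, Literal("An example paper title", lang="en")))
g.add((paper, WDT.P31, WD.Q13442814))  # instance of (P31): scholarly article
g.add((paper, WDT.P921, WD.Q2013))     # main subject (P921): Wikidata (Q2013)

print(g.serialize(format="turtle"))
```

Because the graph reuses Wikidata's namespaces, it can be queried together with Wikidata's own RDF dumps in a federated or merged triple store, without each paper needing a Wikibase item.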

prototyperspective commented on September 3, 2024

> bulk importing will make Wikidata less useful

The opposite is the case. Currently, it is not really useful in the real world, at least when it comes to studies, but that would change once the data is more complete.

> We don't know the exact number of all books and papers, but even the most conservative estimates give a number that is way bigger than the number of current Wikidata items.

Not true. I have to admit that the number of Wikidata items is lower than I thought and the share of studies larger. However, per this page the count of items is somewhere around 111 million at this point. ScienceOpen contains most studies and currently has 95 million items. So if preprints are added (note that many of them have already been imported) and some of the most notable items not in ScienceOpen (maybe OpenAlex has them) were included, one could estimate the total at somewhere around 150 million items. That is indeed larger than the current number of Wikidata items, but not by much: it would roughly double its size, and this wouldn't include books or food products, which would be good to import as well. I think Scholia would start to become useful, recommendable, used, and not misleading once it reaches roughly parity with ScienceOpen, and that would be below the current number of items.
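The back-of-envelope arithmetic behind that estimate, using only the approximate figures quoted above:

```python
# All figures are the rough counts quoted above.
wikidata_items = 111_000_000    # approximate current Wikidata item count
estimated_papers = 150_000_000  # rough estimate of all papers incl. preprints

# Upper bound: if every paper got a new item and none existed yet.
factor = (wikidata_items + estimated_papers) / wikidata_items
print(f"~{factor:.1f}x")  # ~2.4x; since many papers already have items,
                          # the net growth is closer to the ~2x mentioned above.
```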

> So trying to achieve complete recall is basically impossible.

It's not impossible, and I see no reason why it would be; rather, there are many demonstrations that bulk imports work quite well and could be scaled up. I don't know whether the imports are done from a local server or written remotely using an API, nor what the current ways to improve performance would be (such as caching); a sketch of the API route follows below.
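For the remote-API route, one common pattern is pywikibot, which rate-limits its own writes; a minimal sketch of creating one paper item, assuming labels/descriptions only (statements, sources, and the actual metadata source are omitted, and the input list here is hypothetical):

```python
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def create_paper_item(title: str) -> pywikibot.ItemPage:
    """Create a new item for a paper; pywikibot's built-in throttle paces the writes."""
    item = pywikibot.ItemPage(repo)  # no ID yet; saving mints a new item
    item.editEntity(
        {"labels": {"en": title}, "descriptions": {"en": "scholarly article"}},
        summary="importing scholarly article metadata (example)",
    )
    return item

# Hypothetical input; a real import would read from a dump or the source's API.
for title in ["Paper one", "Paper two"]:
    create_paper_item(title)
```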

If that were the case, then why even spend time developing Scholia? If no more bulk importing is done, then it is kind of a waste of time and not useful. For example, charts of a person's or a topic's number of studies per year are otherwise only misleading rather than useful to human users and AI; they give a wrong picture of whatever is looked at.

Sorry if this sounds a bit hurtful, but otherwise I don't think there is much potential for Scholia except as a UI for Wikidata users to more easily spot issues, like when people use WikiFlix to improve film-related data (instead of using it to watch films or anything else), and even there the usefulness may be absent or minimal.

> because bulk importing will make Wikidata less useful.

There is no reason why that would be the case; why would it make Wikidata less useful? For example, in the search results one could filter away scholarly articles so they don't show up (potentially even with the click of a button); see the sketch below.
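A minimal sketch of that kind of filter against the real WDQS endpoint, using the real property P31 ("instance of") and item Q13442814 ("scholarly article"); the search term is just an example:

```python
import requests

# Look up items by English label, excluding scholarly articles (Q13442814).
query = """
SELECT ?item ?itemLabel WHERE {
  ?item rdfs:label "CRISPR"@en .
  MINUS { ?item wdt:P31 wd:Q13442814 . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""
r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "search-filter-example/0.1"},
)
for row in r.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```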

As for issues with the data, the scripts should be well written and well tested/investigated, and when things go wrong this can be fixed with scripts. There are further ways to mitigate issues, such as locking bot-imported articles so they are only editable by bots or after a request for unlocking, or something like that.

> AI-ready corpora

That is not what one would think of Scholia as; I never thought of Scholia as only a tool for training AI. For example, why is it sometimes linked in Wikipedia articles if it's only intended for AI? But even then, the incomplete data would mistrain AIs so they produce flawed results and similar issues.

> but by creating independent RDF graphs on books and papers using the Wikidata namespace

Well, maybe I misunderstood you above then; I'll leave the above unedited nevertheless... I see no reason why Wikidata items couldn't be as performant as RDF graphs, if not more so. Wikidata items could be converted to RDF graph nodes, essentially cached and used as such, and updated once a Wikidata item is edited; the item itself would only be retrieved when a human uses the Wikidata interface to open it. This is just an overly broad outline. For example, the API doesn't have to be used... it's the default, but theoretically one could copy files to where the data is stored in batch data upgrades.
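A minimal sketch of that cache-and-refresh idea, using Special:EntityData (Wikidata's real per-item RDF export); the in-memory dict is just for illustration, and a real setup would use a persistent triple store plus the recent-changes feed to trigger refreshes:

```python
import requests
from rdflib import Graph

_cache = {}  # qid -> rdflib.Graph; stand-in for a persistent triple store

def get_item_graph(qid: str, force_refresh: bool = False) -> Graph:
    """Return the RDF graph for a Wikidata item, fetching only when needed."""
    if force_refresh or qid not in _cache:
        url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.ttl"
        g = Graph()
        g.parse(url, format="turtle")  # fetch and parse the item's Turtle export
        _cache[qid] = g
    return _cache[qid]

g = get_item_graph("Q42")
print(len(g), "triples cached for Q42")
# On an edit notification (e.g. from the recent-changes feed), call
# get_item_graph(qid, force_refresh=True) to update the cached copy.
```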

I also commented in the WikiProject Books thread as well as on the talk pages of two editors whose bots imported lots of studies; more input would be appreciated.
I think it's critical that data of a type/field like 'academic articles' becomes fairly complete so that actual use cases/applications can be built, and that sustainable, efficient automated bulk imports are designed for that.
