Comments (15)
I have tried one here: https://synia.toolforge.org/#author/Q18618629 by just changing the endpoint. There were issues with the "Recent publications from experimental scholarly endpoint" table:
- The "instance of" statement is not there.
- The split does not include, e.g., chapters, working papers, ...
- Labeling does not work for non-scientific papers.
from scholia.
Thanks for keeping an eye on this, @dpriskorn! The link in your post does not work for me, so here is another link to what is probably the same message.
We are looking into the matter and do not have good answers to your questions yet, but here are some guesstimates:
- How many queries need to be rewritten?
- The majority of the nearly 400 Scholia queries would need to be rewritten (probably well over 300) — all those that make use of scholarly article or any of the associated properties (title, author, author name string, main subject, published in, describes a method that uses, publication date, number of pages), plus a number of indirect ones, e.g. things related to publisher or affiliation.
- Can all of them be rewritten without adverse effects like timeout?
- Not if the timeout settings remain the same, since federation adds complexity. Working with a static dataset might have some performance benefits though.
- How much effort is it to rewrite?
- We need to review all queries as to whether they are affected, i.e. as to whether they (a) run on either of the new main or scholarly endpoints and (b) give the same results as the full endpoint. This could probably be largely automated in a matter of hours by someone who understands the matter.
- For any of the queries that fail to run or where the results differ in substance, we would need to rewrite them. Assuming an average of 5-10 min per query, that means something on the order of a person-week of work time. I suspect that some queries might not work usefully, so we would need to change their functionality.
- Perhaps we need a dedicated hackathon just for such adaptations of Scholia queries.
- Can the rewriting be automated somehow?
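The review step above could indeed be largely automated. A minimal sketch of the comparison half, in Python: it assumes each query's results have already been fetched from the full endpoint and from the split endpoints as SPARQL JSON result rows (the fetching, the endpoint URLs, and the names `results_match`/`normalize` are all hypothetical, not part of any existing Scholia tooling):

```python
import json

def normalize(bindings):
    """Canonicalize a list of SPARQL JSON result rows so two result
    sets can be compared independently of row order."""
    return sorted(json.dumps(row, sort_keys=True) for row in bindings)

def results_match(reference, candidate):
    """True when the candidate result set is substantively the same
    as the reference result set (same rows, in any order)."""
    return normalize(reference) == normalize(candidate)

# Same rows in a different order count as a match:
full_endpoint  = [{"work": "Q1"}, {"work": "Q2"}]
split_endpoint = [{"work": "Q2"}, {"work": "Q1"}]
assert results_match(full_endpoint, split_endpoint)
assert not results_match(full_endpoint, [{"work": "Q2"}])
```

A script like this, looped over all ~400 queries, would flag the subset that actually needs manual rewriting.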
from scholia.
Can all of them be rewritten without adverse effects like timeout?
@dpriskorn, no, I don't think so. This initial split is suffering from the problem we highlighted in a telcon last year: queries break and cannot be easily fixed with SPARQL. The key problem is that statements (like P2860) have their object and subject split over the two QSs. This will require figuring out which statements have content in both (multiple) QSs, then doing a fusion of that data, before moving on to the next statement.
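A toy illustration of that fusion problem, in Python (the triples and store contents are made up for illustration and do not reflect the actual split): a two-pattern query returns nothing from either store alone and only works on the fused data.

```python
# Toy model of the split: each "endpoint" holds part of the graph, so
# a statement's subject-side claims and object-side claims may live in
# different stores.
main = {
    ("work1", "P31", "Q13442814"),   # instance-of lives on "main"
}
scholarly = {
    ("work1", "P2860", "work2"),     # the citation lives on "scholarly"
}

def match(store, s=None, p=None, o=None):
    """Return all triples in a store matching an optional pattern."""
    return [(s_, p_, o_) for (s_, p_, o_) in store
            if (s is None or s == s_) and (p is None or p == p_)
            and (o is None or o == o_)]

def articles_citing(store):
    """Works that are scholarly articles AND cite something —
    a join of two triple patterns."""
    arts = {s for (s, _, _) in match(store, p="P31", o="Q13442814")}
    return [s for (s, _, _) in match(store, p="P2860") if s in arts]

# Neither store alone can answer the two-pattern query:
assert articles_citing(main) == []
assert articles_citing(scholarly) == []
# Only the fused graph can:
assert articles_citing(main | scholarly) == ["work1"]
```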
Example query that returns empty is this one: https://w.wiki/98JL
from scholia.
I just tried rewriting it, but it's nasty because essential info is split over the two resources (to be run at https://query-scholarly-experimental.wikidata.org/):
select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?role where {
  # get the intention types from the "main" WDQS
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?intention wdt:P31 wd:Q96471816 .
  }
  # get the citing works from the "main" WDQS
  {
    SERVICE <https://query-main-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the citing works from the "scholarly" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }
  hint:Prior hint:runFirst true .
  # now look up some additional info (only available from the "main" WDQS)
  SERVICE <https://query-main-experimental.wikidata.org/sparql> {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
        'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?role
order by ?year
It times out.
from scholia.
When I run the query from main, I get closer, and it runs in reasonable time:
select ?year (count(distinct ?work) as ?number_of_publications) ?type_ ?venue_ ?role where {
  # get the intention types from the "main" WDQS
  ?intention wdt:P31 wd:Q96471816 .
  # get the articles from the "scholarly" WDQS
  {
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      select distinct ?work (min(?years) as ?year) ?type_ where {
        ?work wdt:P577 ?dates ;
              p:P2860 / pq:P3712 ?intention .
        bind(str(year(?dates)) as ?years) .
        OPTIONAL {
          ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
        }
      } group by ?work ?type_
    }
  }
  UNION
  # get the articles from the "main" WDQS
  {
    select distinct ?work (min(?years) as ?year) ?type_ where {
      ?work wdt:P577 ?dates ;
            p:P2860 / pq:P3712 ?intention .
      bind(str(year(?dates)) as ?years) .
      OPTIONAL {
        ?work wdt:P31 wd:Q109229154 . bind("explicit" as ?type_)
      }
    }
    group by ?work ?type_
  }
  hint:Prior hint:runFirst true .
  # now look up some additional info: venue (locally, from the "main" WDQS)
  OPTIONAL {
    ?work wdt:P1433 ?venue_ . ?venue_ rdfs:label ?venue . FILTER (LANG(?venue) = "en")
    MINUS { ?venue_ wdt:P31 wd:Q1143604 }
  }
  bind(
    coalesce(
      if(bound(?type_), ?venue,
        'other source')
    ) as ?role
  )
}
group by ?year ?type_ ?venue_ ?role
order by ?year
But you can see from the results that the venue information is split over the two QSs (the above query missed some venue info). As soon as I try looking up the venue info from both instances, it times out again.
from scholia.
Where can I find the SPARQL of the "Recent publications from experimental scholarly endpoint" table?
https://synia.toolforge.org/#author/Q18618629 - third table
from scholia.
Thanks. Now that I have seen the query, I think that one runs into exactly the problem I experienced and tried to describe.
from scholia.
I discussed that briefly with @Daniel-Mietchen today. To me it seems that the one-time split does not conceptually solve any scaling issues and should not be done in the way it is currently planned. If done, it should be done transparently to the user, i.e. the query might be executed on different back-ends, but it should not be required to change the query.
from scholia.
just changing the endpoint
@fnielsen, I think the split will mean federated SPARQL queries over the two servers. Did you try that already? Where can I find the SPARQL of the "Recent publications from experimental scholarly endpoint" table? I could not spot the link to the matching query service. Does it not have a QS link for the individual endpoints yet? That would make development a lot more difficult.
from scholia.
(crossposted from the Telegram Wikicite channel)
I just finished a query that shows how content is scattered over the two splits: https://w.wiki/98km One of the powers of SPARQL is being able to search the linking (the "web"), unlike, for example, label searching. But if we search for a link (the Statement, in Wikidata terms), this becomes hard when those links are split too: you effectively have to search in both QSs. This is what I tried yesterday with #2423 (comment) (above), but since SPARQL commonly includes a pattern of two or more links, this is not trivial at all. Indeed, I ran into timeouts. I do not think this is special to Scholia; it applies to any tool that uses SPARQL where Statements are split over the two instances. Of course, this query just looks at one direct claim, and the GitHub issue shows that the "two or more" case involves qualifiers.
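The combinatorics behind "you effectively have to search in both QSs" can be made concrete with a toy sketch in Python: if each triple pattern in a query may be satisfied by either split, a complete federated rewrite has to consider every assignment of patterns to endpoints, which grows exponentially in the number of patterns (this is a hypothetical model for illustration, not how any query engine actually plans):

```python
from itertools import product

def endpoint_combinations(n_patterns, n_endpoints=2):
    """Number of ways to assign each triple pattern in a query to one
    of the split endpoints; a complete rewrite must cover them all."""
    return len(list(product(range(n_endpoints), repeat=n_patterns)))

assert endpoint_combinations(1) == 2  # one direct claim: check both splits
assert endpoint_combinations(2) == 4  # a two-link pattern: four UNION branches
assert endpoint_combinations(3) == 8  # grows as 2^n
```

This is why even the one-claim query above already needed a UNION over both stores, and why multi-link queries become unwieldy so quickly.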
Basically, splitting works if the content can be split. But the power of Wikidata is the complexity of human language, with machine readability on top, and qualifiers are all over the place. So, when I said "I feel that Wikidata has failed", more accurately I should have said "the query service has failed", and I think that the QS is an essential part of the ecosystem (also for Wikibase, for that matter). This is just opinion. Let me stress: the problems are real and we need a real solution. This real solution is hard. This splitting is not the first solution being sought. The Scholia project has been actively looking into alternatives, including a dedicated WDQS, a QS with a lag (but see notes about loading times being days, rather than hours), and the subsetting work (see https://content.iospress.com/articles/semantic-web/sw233491). It is complicated, and 5 years ago I was naive and optimistic that computer science would develop a scalable triple store with a SPARQL endpoint that meets Wikidata's needs. Sadly, the CS field did not live up to my hopes. So, my tears (":(") are real. And the scalability problems that Wikidata is seeing are important, to me very serious, and nothing to joke about.
from scholia.
https://synia.toolforge.org/#author/Q18618629 - third table
Yes, got that :) But unlike the other tables, this one does not have a link to the matching query service. I wanted to see the SPARQL itself, not the results.
I think I should be able to find it in the wiki itself, but I wrote the Synia setup so long ago that I cannot easily find it.
from scholia.
Instead of rewriting or bothering about the split, I suggest we focus on running QLever ourselves and improving it to do what we want, no matter how much Wikidata grows. See the discussion I started in #2425.
from scholia.
i.e. the query might be executed on different back-ends, but it should not be required to change the query.
What I found is that this is not trivial at all: you cannot simply run a query on both endpoints and then merge the results.
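A minimal sketch in Python of why the naive approach fails (the data is made up; the real split is by item class, not like this): when the two triple patterns of a join match in different stores, running the whole query on each endpoint and unioning the answers returns nothing, even though the fused data contains an answer.

```python
# Toy split: the "published in" claim and the "publication date" claim
# of the same work end up in different stores.
main      = {("workA", "P1433", "venue1")}  # published-in on one split
scholarly = {("workA", "P577", "2021")}     # publication date on the other

def works_with_venue_and_date(store):
    """Works that have BOTH a publication date and a venue — a join
    of two triple patterns over a single store."""
    dated = {s for (s, p, o) in store if p == "P577"}
    return sorted(s for (s, p, o) in store if p == "P1433" and s in dated)

# Naive "run the query everywhere and merge the result rows":
naive = sorted(set(works_with_venue_and_date(main))
               | set(works_with_venue_and_date(scholarly)))
assert naive == []
# The correct answer requires the join to see both stores at once:
assert works_with_venue_and_date(main | scholarly) == ["workA"]
```

Transparent execution would therefore need the back-end itself to federate at the triple-pattern level, not the result level, which is exactly what the timing-out SERVICE rewrites above attempt by hand.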
from scholia.
See #2412 for a mitigation path.
from scholia.