Importing Author Information from CrossRef to Wikidata via QuickStatements

crossref wikidata r quickstatements sparql author metadata

This post continues my quest to improve the presence of academic journals in Wikidata. I review different search approaches for finding authors, especially those with non-English names, in Wikidata and upload the author metadata via QuickStatements.

07-05-2021

In a previous post I tried to do what is called author disambiguation, using the open metadata from CrossRef, Microsoft Academic and ORCID. I took 10 DOIs as an example and identified the ORCIDs for a number of authors. The initial idea was also to upload the author metadata to Wikidata in the same post.

While I was working on it, I realized that I had completely lost my sense of smell. I decided to publish the previous part and postpone the exercises with Wikidata till better days. I was lucky, it was not as bad as it is said it could be - a week later my only discomfort is that I still cannot smell anything.

So, I am back, full of energy, and in this post I am going to prepare and upload the author metadata to Wikidata.

Wikidata for Academic Journal

You may be wondering why to bother about the journal's standing in Wikidata, when there are many citation indices and A&I databases that present the metadata. Apologies for not having a detailed and well-argued intro for such a question. Instead, I will give you a few examples that I find illustrative:

  1. this is how the online service Scholia shows the journal Atmospheric Chemistry and Physics and the article The Alzheimer's disease-associated amyloid beta-protein is an antimicrobial peptide, based on Wikidata. If the journals and articles exist as Wikidata items and contain the metadata, the items can be cited in Wikipedia articles or be linked to other non-journal Wikidata items like author biographical facts, events, awards, organizations, topics, molecules, news, products, you name it. Altogether this forms a huge knowledge graph, absolutely incomparable with the citation/search engines we are used to. This is the only way to highlight the journal impact in multiple facets and to move beyond the citation obsession.

  2. here is a collection of Wikidata queries prepared by Martin Poulter, demonstrating how the linked items can be investigated. There are more high-brow collections, like this one with biomedical queries. You can run any of those examples yourself by clicking the Try it! or Execute url links next to the SPARQL snippets.

But not every journal looks good enough in Wikidata to benefit from this semantic blessing. In one of my previous posts I described how the Russian academic journals look in Wikidata and Wikipedia.

More argumentation on Wikidata's value for academic journals can be found in the original works (Google Scholar: Wikidata + Journal).

Editing Wikidata

How can the Wikidata contents be added or improved? There are a few ways:

  1. manually. This is straightforward: you register at Wikidata, find the journal record and start copy/pasting the metadata bits.

  2. with a help of tools, but still manually. There are some great tools like Mix'n'match, Author Disambiguator, Wikidata Link Reconciler and many others (listed here). Some of them require you to be an autoconfirmed user, a special status you get after having edited a certain number of items (a meritocracy in action, if you will).

  3. automatically, with API-based tools.

In this post I am going to use the most popular tool, named QuickStatements. This tool is fantastic, as it allows you to create/edit Wikidata items by submitting the materials in very simple formats like CSV/TSV or a URL-encoded string. Like many other Wiki tools, it was developed by Magnus Manske.

You will not be able to submit the materials in batches via QuickStatements until you get the status of autoconfirmed user. But take it as a hint: running QuickStatements through a URL does not require the status, so you can start with it to reach the status and then switch to the batches. Or you can deal with the Wikimedia API directly. In this post I am going to describe only the batch uploading available for the autoconfirmed users.
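Still, for a taste of the URL mode, below is a minimal sketch in R. I assume here the #/v1= prefill format described in the QuickStatements help page: TAB-separated columns, one command per line, URL-encoded.

Show code
# A minimal sketch (an assumption based on the QuickStatements help page):
# the "#/v1=" fragment pre-fills the import box with URL-encoded,
# TAB-separated commands.
cmd <- "Q4115189\tP31\tQ5"  # a toy claim for the Wikidata Sandbox item
paste0("https://quickstatements.toolforge.org/#/v1=",
       utils::URLencode(cmd, reserved = TRUE))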

The good practice is to:

  1. test your editing approaches on Wikidata Sandbox,

  2. always check what you have done,

  3. always fix the errors you made.

This discipline can seem very annoying to some creative minds, but this is a public tool, and without such rules it would be spoiled.

Data

In the previous post I took 10 DOIs and made a few manipulations to obtain the ORCIDs for some of the authors (see below). In the column names, CRF stands for CrossRef (as a source) and MAG for Microsoft Academic.

Show code
# Packages used throughout this post
library(tidyverse)   # dplyr, tidyr, purrr, stringr, readr
library(WikidataR)   # qid_from_identifier(), get_item()
library(jsonlite)    # fromJSON()

# "dir" is the local data folder (defined outside the shown snippets)
final_data_checked <- paste0(dir, "/final.data.xlsx") %>% 
  readxl::read_xlsx(col_types = "text") %>% 
  filter(score!="D") %>% 
  arrange(doi, score) 

final_data_checked %>% 
  mutate(orcid = ifelse(score=="A", orcid, NA_character_)) %>% 
  select(doi, order, orcid, `CRF family`, `CRF given`, 
         Scopus_AuthID, ResearcherID, `MAG Author ID`) %>%  
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: see in the text.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE,
                columnDefs = list(
                  list(width = '150px', targets = c(0,2)),
                  list(className = 'dt-center', targets = c(1)))
                  ))

But we cannot simply submit DOI-ORCID links to Wikidata. The process needs to be organized a bit differently:

  1. we have to check if the articles are present in Wikidata (searching by DOI) and what metadata are available

  2. next we should check whether each author is present in Wikidata as an item (searching by ORCID, Scopus Author ID, ResearcherID, Microsoft Academic ID, and also by the author's name)

  3. once we have collected the details about the article and the author, we can submit one of 2 statements connecting those details, as sketched below:
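In simplified QuickStatements form, the two options look like this (Q_article and Q_author are placeholders; the real statements appear at the end of the post):

  Q_article|P50|Q_author - when the author exists as a Wikidata item (P50: author)
  Q_article|P2093|"Author Name" - when the author is present only as a text string (P2093: author name string)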

The statements above are simplified a bit: the Wikidata rules require the statements to be supported with references and some specific qualifiers (statements for additional properties). The author information for a scholarly article needs to be backed with a source (i.e. CrossRef) and should have an ordinal. It may also have other properties like affiliation or affiliation string.
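With the qualifier and the reference added, such a statement takes the full form used later in this post:

  Q_article|P50|Q_author|P1545|"1"|S248|Q5188229

where P1545 (series ordinal) carries the author position and S248 references the source, CrossRef (Q5188229).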

Finding the articles in Wikidata

The package WikidataR, still in development, provides very practical options for working with Wikidata. We will use its function qid_from_identifier to find the Wikidata items by DOI.

Show code
dois_file <- paste0(dir, "dois_wiki.csv")

if(!file.exists(dois_file)){
  dois <- final_data_checked %>% select(doi) %>% distinct()
  # look up each DOI in Wikidata via WikidataR
  for (i in 1:nrow(dois)){
    print(i)
    res <- qid_from_identifier(property = "DOI", value = dois$doi[i]) %>% 
      unnest(qid)
    dois[i, "wikidata"] <- ifelse(nrow(res)==1, as.character(res$qid[1]), "") 
  }
  write_excel_csv(dois, dois_file)
} else {
  dois <- read_csv(dois_file)
}
dois %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

The publications are already present in Wikidata.

Let's check if they already have anything about the authors. We can collect the author information using another WikidataR function, named get_item.

Author Information in Wikidata

The code below obtains all the claims of the Wikidata item and extracts those that we are interested in:

Show code
wd_auth_info_file <- paste0(dir, "wikidata_author_existing_details.csv") 

if(!file.exists(wd_auth_info_file)){
  wd_pub_existing_data <- c()
  
  for (i in 1:nrow(dois)){
    b <- get_item(dois$wikidata[i]) %>% map(pluck, "claims")
    b <- b[[1]][c("P2093", "P50")]
    
    wd_pub_auth_data <- cbind(
      #1 part. author strings or Wiki items
      b %>% map(~{.x[["mainsnak"]]["datavalue"] %>% map_df("value") %>% 
          select(any_of(c("author_wiki" = "id", 
                          "author_wiki" = "datavalue")))}) %>% 
        map_df(~.x) %>% 
        mutate(author_wiki_type = ifelse(grepl("^Q\\d+",author_wiki), 
                                         "P50", "P2093")),
      #2 part. author statement IDs
      b %>% map(~{.x[["id"]]}) %>% unlist(use.names = FALSE) %>% 
        enframe(name = NULL, value = "Statement_ID"),
      #3 part. author order qualifiers
      b %>% map_df(~{.x[["qualifiers"]][["P1545"]] %>% 
            map_df("datavalue")}) %>% 
            select(any_of(c("order" = "value"))) %>% 
            mutate(qualif_prop = "P1545"),
      #4 part. references/sources
      b %>% map_df(~{.x[["references"]] %>% 
            map_df(pluck, "snaks", "P248") %>% 
            map_df(list("value","id"))}) %>%
        select(source = datavalue) %>%
        mutate(source_prop = "P248")
    ) %>% 
      mutate(qid = dois$wikidata[i]) %>% 
      relocate(qid)
    
    wd_pub_existing_data <- wd_pub_existing_data %>% bind_rows(wd_pub_auth_data)
    print(i)
  } 
  write_excel_csv(wd_pub_existing_data, wd_auth_info_file)
 } else {
  wd_pub_existing_data <- read_csv(wd_auth_info_file, 
                                   col_types = cols(.default = col_character()))
 } 

wd_pub_existing_data %>% 
  relocate(Statement_ID, .after = source_prop) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
     options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = TRUE,
                columnDefs = list(
                  list(width = '430px', targets = c(7)),
                   list(width = '150px', targets = c(1)),
                  list(className = 'dt-center', targets = c(2,3,4,5,6)))
                  ))

This is how the author relations look:

Or, in cases when the author exists as a Wikidata item and is linked to the Wikidata item for the article, the relation looks like this:

Are we satisfied with the author names present in Wikidata as text (author name strings)? Of course we should not be. With text strings for the authors, Wikidata is just like any other A&I database. To see the magic of Wikidata, we have to substitute the author name strings (defined by P2093) with the Wikidata items corresponding to the persons (defined by P50). With such authors, our analytical reach expands beyond the names, as we become able to analyze the relations of the authors (known to Wikidata).
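To give a taste of what that unlocks: once the article-author links use P50, all works connected to an author can be pulled with a trivial SPARQL query. Below is a sketch (Q4209279 is the Wikidata item of one of our authors, found later in this post):

Show code
# A sketch: with P50 links in place, one simple SPARQL query
# lists all works connected to an author item.
WikidataQueryServiceR::query_wikidata('
  SELECT ?work ?workLabel WHERE {
    ?work wdt:P50 wd:Q4209279 .
    SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
  }')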

Let's search for the Wikidata items corresponding to the authors - for this we can use the author's name and the personal identifiers (PIDs).

Finding Authors in Wikidata by PIDs

First, we will try to find the Wikidata items by PIDs. To do that, I will take the initial data and keep only the authors for whom I earlier found the ORCIDs and other PIDs.

Show code
zz1 <- final_data_checked %>% 
    filter(score=="A") %>% 
    select(doi, order, orcid, Scopus_AuthID, ResearcherID, `MAG Author ID`) %>% 
    filter(!is.na(orcid)) %>% 
    mutate_at(c("Scopus_AuthID", "ResearcherID"), ~str_split(.x, "\\|")) %>%
    unnest(c("Scopus_AuthID", "ResearcherID")) %>% 
    pivot_longer(-c("doi", "order"), names_to = "src", values_to = "id") %>%
    mutate_all(~str_squish(.x)) %>% 
    filter(!is.na(id))

zz1 %>%  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

Let's search for them one by one in Wikidata using the qid_from_identifier function from the WikidataR package, with the appropriate property for each PID type.

Show code
wd_authors_by_ids_file <- paste0(dir, "wd_authors_by_ids.csv")

if(!file.exists(wd_authors_by_ids_file)){
  for (i in 1:nrow(zz1)){
    if(zz1$src[i]=="orcid"){
      res <- qid_from_identifier(property = "ORCID iD", 
                                 value = zz1$id[i]) %>% unnest(qid)
      }
    if(zz1$src[i]=="MAG Author ID"){
      res <- qid_from_identifier(property = "Microsoft Academic ID", 
                                 value = zz1$id[i]) %>% unnest(qid)
    }
    if(zz1$src[i]=="Scopus_AuthID"){
      res <- qid_from_identifier(property = "Scopus author ID", 
                                 value = zz1$id[i]) %>% unnest(qid)
    }
    if(zz1$src[i]=="ResearcherID"){
      res <- qid_from_identifier(property = "ResearcherID", 
                                 value = zz1$id[i]) %>% unnest(qid)
    }
    zz1[i, "wikidata"] <- ifelse(nrow(res)==1, 
                                 as.character(res$qid[1]), "unusual_response") 
    print(i)
  }
  
  zz1 <- zz1 %>% select(doi, order, wikidata) %>% distinct() %>% na.omit()
  
  write_excel_csv(zz1, wd_authors_by_ids_file)
} else { 
  zz1 <- read_csv(wd_authors_by_ids_file, col_types = cols(.default = col_character()))
}

zz1 %>% DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

We found nothing.

I know for sure that some of our authors are present in Wikidata, but their items do not have the statements linking them to PIDs like ORCID, Scopus Author ID, etc.; therefore, we cannot find them by PIDs. This is just our case - many researchers do have Wikidata items with PIDs.

Our last chance in this situation is to search by name.

Let's prepare the names.

Show code
zz2 <- final_data_checked %>% 
  # anti_join is for cases when something is found by PIDs 
  anti_join(zz1, by = c("doi", "order")) %>% 
  select(doi, order, `CRF family`, `CRF given`, `MAG family`, `MAG given`) %>% 
  mutate_at(3:6, ~str_to_title(str_extract(.x, "^[^\\s]+"))) %>% 
  pivot_longer(c("CRF family", "CRF given", 
                  "MAG family", "MAG given"), 
                names_to = c("source", "name_type"), 
                names_pattern = "(.+) (.+)", 
                values_to = "name") %>% 
  distinct() %>% 
  pivot_wider(names_from = name_type, values_from = name) %>%
  filter(!is.na(family)) %>% 
  select(-source) %>% distinct()

zz2 %>% DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

Before we start looking up these names in Wikidata, let me share what I have learned about text search in Wikidata.

Text Search in Wikidata

Searching Wikidata items by text is tricky. If you are interested in knowing more about it, I can recommend the article Running a reconciliation service for Wikidata by Antonin Delpeuch, where the author explains the reasons for using 2 separate search APIs: wbsearchentities and query&list=search.

Below is my experience with both.

Wikibase API: wbsearchentities

This API is cool and simple, but you may be surprised by the variety of results that are totally irrelevant to what you had in mind.

For example, searching for the prominent Soviet physicist Sergey Vavilov via wbsearchentities (try API query, JSON result) produces 2 items for human beings and 1 for the research institute named after S.I. Vavilov.

Show code
fromJSON("https://www.wikidata.org/w/api.php?action=wbsearchentities&search=sergey%20vavilov&language=en&format=json", flatten = TRUE) -> a  

a %>% pluck("search") %>% 
  select(title, label, description) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

One strategy for making this search more specific is to query the API via SPARQL and apply some filters. For instance, we can limit the results to humans only by keeping the items that are instances of (P31) human (Q5).

The SPARQL request below returns only 2 results, omitting the one corresponding to the organization.

SELECT ?person ?personlabel_en ?persondesc_en
   WHERE {hint:Query hint:optimizer "None".
      SERVICE wikibase:mwapi {
          bd:serviceParam wikibase:endpoint "www.wikidata.org";
             wikibase:api "EntitySearch";
              mwapi:search "Sergey Vavilov"; 
              mwapi:language "en".
          ?person wikibase:apiOutputItem mwapi:item.
      }
      FILTER BOUND (?person)       
        ?person wdt:P31/wdt:P279* wd:Q5.
      optional{?person rdfs:label ?personlabel_en . FILTER(lang(?personlabel_en)='en')}
      optional{?person schema:description ?persondesc_en . FILTER(lang(?persondesc_en)='en')}
      SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
}

Check it!

I used this approach in one of my recent posts about matching the scientific terms to Wikidata items, filtering only those results that are most likely to be the scientific terms.

Querying the wbsearchentities API via SPARQL with additional filters provides a lot of flexibility, but the approach has one serious limitation - the API searches the entity labels and aliases for a perfect match. This can be a problem with non-English names that often have ambiguous Latin spellings. The name Sergey can also be written as Sergei (or as Sergej, Serhii, etc.). People whose Wikidata item contains Sergei will not be found by a query with Sergey, unless the alternative labels are present in the Wikidata item. Searching for Sergei Vavilov will return only one person. I did not find a way to make a fuzzy search with the wbsearchentities API. But the other API allows it.
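You can see this yourself by repeating the earlier call with the alternative spelling (a sketch reusing the same API URL as above):

Show code
# The same wbsearchentities call with the "Sergei" spelling - it finds
# only the items whose labels/aliases contain this exact form.
fromJSON("https://www.wikidata.org/w/api.php?action=wbsearchentities&search=sergei%20vavilov&language=en&format=json", 
         flatten = TRUE) %>% 
  pluck("search") %>% 
  select(title, label, description)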

Wikibase API: query&list=search

The Help page for this API is minimal; more detailed information about the available parameters, regular expressions, and fuzzy search is available at the CirrusSearch page. Be warned, it is brain-boiling stuff.

This API allows you to do various searches: a strict match (quoting the phrase), wildcard searches (e.g. ?), and fuzzy searches (with the tilde ~).

Let's see the results of the strict match search:

Show code
fromJSON("https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=%22sergey%20vavilov%22&format=json&srlimit=20", flatten = TRUE) -> a  

a %>% pluck("query") %>% pluck("search") %>% 
  select(title, snippet) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

In addition to the 2 persons and 1 organization, these results also include a ship (Q4059130) and a scholarly article (Q4059130).

We can also query this API via SPARQL and filter the results - let's do this for a more relaxed (Serge\?) query. Mind that a question mark in SPARQL needs to be escaped, so in R code this requires even more backslashes.

Show code
WikidataQueryServiceR::query_wikidata(sparql_query = 'select 
  ?person ?personlabel_en ?persondesc_en
Where {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "Serge\\\\? Vavilov" .          
    ?person wikibase:apiOutputItem mwapi:title.    
  } 
  FILTER BOUND (?person)       
        ?person wdt:P31/wdt:P279* wd:Q5.
      optional{?person rdfs:label ?personlabel_en . FILTER(lang(?personlabel_en)="en")}
      optional{?person schema:description ?persondesc_en . FILTER(lang(?persondesc_en)="en")}
      SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
  }'    
  ) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

This search returns 5 persons, 3 of which we could not find with the wbsearchentities API. Why? Because two of the results (Q12562032, Q19907955) have no labels, aliases, or descriptions in English - but their Wikidata items do have the first name (Sergey) and last name (Vavilov) properties, which are not queried by the wbsearchentities API. Another one, found only by this API, is spelled in Wikidata as Sergei(!) V Vavilov (Q91652500).

The main advantages of using the Search API (query&list=search) instead of the wbsearchentities API for finding people are that it searches beyond the labels and aliases (including the first and last name properties) and that it supports wildcards and fuzzy matching.

Unfortunately, it is not only the endings that change when Cyrillic names are transliterated to Latin: Sergey/Sergei differ at the end, but Yulia/Julia or Fyodor/Fedor differ at the beginning or in the middle. So you can hardly guess where to put the wildcard.

Even armed with wikibase:api "Search" + wildcards/fuzzy + SPARQL filters, we still have some limitations and have to decide which strategy is best for searching for the researchers in Wikidata by their names!

Search Strategies

I considered a few options:

  1. to substitute the ending of every given name with ? (without further thinking) and to ignore names like Oxana, Aleksei or Fyodor with ambiguous spelling in the middle of the given name. I dismissed this approach.

  2. to use the built-in fuzzy search operator (~). But there are some aspects, described at CirrusSearch.

Searching "Oxana~1" covers both Oksana and Oxana (for a single word, a search suffixed with "~2" allows up to 2 added or changed characters). Together with a family name it is processed in a different way - there the tilde tells how many extra words may fit in between. Hence, "Oksana~1 Ivanova" and "Oxana~1 Ivanova" produce different results.

Trying to UNION 2 requests (Oxana~1 and Ivanova) in SPARQL is likely to be too "expensive" for the most popular names. The same goes for searching by the family name only and then filtering with regular expressions in SPARQL (see some examples here).

  3. to find the Wikidata items corresponding to the given name (with fuzzy search), collect all the relevant name variants (P460: said to be the same as + P2440: transliterations), and use them to generate all possible combinations (family name + given name variant) for wbsearchentities.

This strategy may look like a great trick, but dealing with the names in Wikidata is not an easy walk.

First of all, a fuzzy search does not change the first 2 letters (here), so Yulia~ will never lead to Julia (and vice versa). Same with Fyodor and Fedor. The tilde will not help everyone.

If we try to search for the female given name Yulia via wbsearchentities, we will find a few Wikidata items (Yuliya of Ukrainian/Belarusian origin and Yulia of Russian/Bulgarian origin). Their (P460: said to be the same as) forms vary from Giulietta to Uliana. So P460 could be more of an extra burden than a solution.

It may seem that the English transliterations (P2440), filtered by Wikimedia language code (P424), could help, but this property is not mandatory and can be missing for some names. Moreover, the property constraints can also be totally different. For Yulia the transliterations are defined via the following constraints (Wikimedia language code / writing system / determination method), but for Oksana there is only a determination method. For Yulia the determination method for the English transliteration is "romanization of Russian", but for Oksana this method takes many more academic values - "German romanization of Ukrainian (Duden)", "BGN/PCGN romanization of Russian", "BGN/PCGN romanization of Ukrainian (1965 system)", "Ukrainian National System", "scientific transliteration of Cyrillic", "ALA-LC romanization", "modified Library of Congress system". So using the transliterations can be an option only after the most popular "determination method" values are collected (to be used as a filter for the English transliteration).

What strategy is optimal? I decided to go with this one: (1) find the Wikidata items for the given names and collect their English name variants (Step 1); (2) combine each variant with the family name (Step 2); (3) search the combinations in Wikidata with the Search API (Step 3); (4) check the found candidates manually (Step 4).

Let's do it with the code.

Step 1. Finding Given Names in Wikidata

The code below sets up a function to collect the name variants.

Show code
get_wiki_name_variants <- function(given_name){
  paste0('SELECT DISTINCT ?personlabel_en WHERE {
  SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
        wikibase:api "EntitySearch";
        mwapi:search "', given_name, '"; 
        mwapi:language "en".
      ?item wikibase:apiOutputItem mwapi:item.
  }
    ?item wdt:P31/wdt:P279* wd:Q202444. 
  SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}
  optional {?item rdfs:label|skos:altLabel ?personlabel_en . FILTER(lang(?personlabel_en)="en")}
}') %>% 
    WikidataQueryServiceR::query_wikidata() %>% 
    mutate_at(c("personlabel_en"), 
              ~str_squish(str_replace_all(.x, "\\([^\\(^\\)]+\\)", ""))) %>% 
  distinct()
# example: get_wiki_name_variants("Aleksei") 
}

Now we are going to collect all the variants of the given names for our 28 authors (present in 10 articles selected as an example).

Show code
wiki_given_names_file <- paste0(dir,"wiki_given_names.RDS")

if(!file.exists(wiki_given_names_file)){
  wiki_given_names <- unique(zz2$given) %>%
    map_df(~get_wiki_name_variants(.x) %>% 
            filter(!str_detect(personlabel_en, 
                              "[\\p{Letter}--\\p{script=latin}]+")) %>% 
            mutate(given = .x))

  write_rds(wiki_given_names, wiki_given_names_file)
} else { 
  wiki_given_names <- read_rds(wiki_given_names_file)
}

wiki_given_names %>% relocate(given) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = FALSE))

Well, converting Marina to Marinette or Alexander to Sasha may be considered an exaggeration, but for many given names (column "given") adding the name variants from Wikidata (column "personlabel_en") seems to increase the chances of being found.

Wikidata returned many name variants with special symbols (diacritical marks). I think it will do no harm if we convert those names into an ASCII form to use only English letters and unique spellings. For such transformations I use the function stri_trans_general("Latin-ASCII") from the stringi package. Even though stringr provides a lot of str_* substitutions for the original stringi functions (optimized for coding in tidyverse style), I am not aware of a stringr-based way of doing such a transformation.

Show code
wiki_given_names <- wiki_given_names %>% 
  mutate(personlabel_en = stringi::stri_trans_general(personlabel_en, "Latin-ASCII")) %>% 
  distinct()

This transformation decreases the total number of name variants from 104 to 93.
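For illustration, this is the kind of change the transliteration makes (the names below are made-up examples):

Show code
# Hypothetical examples: "Latin-ASCII" replaces diacritics with plain
# ASCII letters, merging spellings that differ only by accents.
stringi::stri_trans_general(c("Valérie", "Andrés", "Semën"), "Latin-ASCII")
# [1] "Valerie" "Andres"  "Semen"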

Step 2. Building the Name Combinations

I merged the found name variants with the family names to form a new column "name_variant", which I am going to use further for finding the persons in Wikidata.

Show code
zz2 <- zz2 %>% left_join(wiki_given_names) %>% 
  unite(name_variant, c("personlabel_en", "family"), 
        sep = " ", na.rm=TRUE, remove = FALSE) %>% 
  relocate(doi, order, family, given, personlabel_en)
  
zz2 %>%
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Bfrtip',
                   autoWidth = FALSE))

Step 3. Finding authors in Wikidata (by name)

To find the persons in Wikidata and get the most interesting details, we will be using the following function based on a SPARQL request. As R seems to truncate printed arguments of 1000+ characters (note the truncation in the listing below), I also made a separate URL to WDS where you can see the full SPARQL query.

Show code
sparql_author_query <- function(person_name){
  paste0('SELECT ?person ?personlabel_ru ?personlabel_en
          ?persondesc_en ?occupations ?employers  
        ?viaf ?magid ?isni ?lc ?elibrary ?scopus_id ?reseacher_id ?orcid ?publons
WITH {
  SELECT ?person  
        (GROUP_CONCAT(DISTINCT ?occupation; SEPARATOR=" | ") AS ?occupations)
        (GROUP_CONCAT(DISTINCT ?employer; SEPARATOR=" | ") AS ?employers)
   WHERE {
    hint:Query hint:optimizer "None".
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:api "Search";
                        wikibase:endpoint "www.wikidata.org";
                        mwapi:srsearch "', person_name, [1163 chars quoted with ''']) %>% 
    WikidataQueryServiceR::query_wikidata() 
}

This function accepts a person name, quoted or unquoted, and can also process several name variants separated by OR (as in the example below, designed to find the Wikidata profiles of 2 prominent Russian scientists from Novosibirsk).

Show code
sparql_author_query("\\\"Valentin Parmon\\\" OR \\\"Valentin Vlasov\\\"") %>%
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip',
                   autoWidth = TRUE, 
                columnDefs = list(
                  list(width = '350px', targets = c(5)),
                   list(width = '170px', targets = c(1,3,13)))
                  ))

By adding the name variants, we increased the number of requests required to find the Wikidata items for our 28 authors 3.5 times (97 name variants). Therefore, I am going to use the … OR … syntax to reduce the number of queries back to 28.

Show code
## grouping the name variants into ... OR ... strings
zz2grouped <- zz2 %>% 
  mutate(name_variant = paste0('\\\"',name_variant,'\\\"')) %>% 
  group_by(doi, order) %>% 
  summarize(name_variants = paste0(name_variant, collapse = " OR ")) %>% 
  ungroup()

wiki_persons_file <- paste0(dir,"wiki_persons.RDS")

if(!file.exists(wiki_persons_file)){
  wiki_persons <- unique(zz2grouped$name_variants) %>%
    map_df(~sparql_author_query(.x) %>%
             ## as some identifiers can be returned as characters & double
             mutate_all(~as.character(.x)) %>% 
             mutate(name_variants = .x))

  write_rds(wiki_persons, wiki_persons_file)
} else { 
  wiki_persons <- read_rds(wiki_persons_file)
}

wiki_persons %>%  
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip', autoWidth = TRUE, 
                columnDefs = list(
                  list(width = '350px', targets = c(3, 5)),
                   list(width = '450px', targets = c(15)),
                   list(width = '170px', targets = c(1,8, 13)))
                  ))

We have found 11 persons, of which 4 are the right ones (Wladimir Andreff, Peter Kaznacheev, Viktor Ryazanov, and, twice, Sergey Kalendzhyan). I decided this by manually checking the suggested variants (Step 4). It helped, of course, that 3 of the 4 are also marked as economists in the "description" or "occupations" columns. By the way, in the original articles Sergey Kalendzhyan was spelled both as Sergei and Sergey, but thanks to Step 1 (when we added all known name variants) we found his Wikidata profile in both cases.

Show code
wd_authors_by_names_file <- paste0(dir, "wd_authors_by_names.csv")
right_ones <- c("Wladimir Andreff", "Peter Kaznacheev", 
                "Viktor Ryazanov", "Sergey Kalendzhyan")
wiki_persons %>% 
  filter(personlabel_en %in% right_ones) %>% 
  left_join(zz2grouped, .) %>%
  write_excel_csv(wd_authors_by_names_file)

Now we are ready for a final step!

Uploading the Author Info into Wikidata

This is what we have by the end of our journey.

Show code
data_wd  <- bind_rows(
  read_csv(wd_authors_by_ids_file, 
           col_types = cols(.default = col_character())) %>% 
    select(any_of(c("doi", "order", "wikidata"))),
  read_csv(wd_authors_by_names_file, 
           col_types = cols(.default = col_character()))%>% 
    mutate(wikidata = str_extract(person, "Q\\d+$")) %>% 
    select(any_of(c("doi", "order", "wikidata")))
  ) %>% 
    filter(!is.na(wikidata)) %>% 
    rename(wd_author = wikidata)
  
data4wiki <- final_data_checked %>% 
  left_join(dois) %>% 
  left_join(data_wd) %>%
  left_join(wd_pub_existing_data %>% 
              select(wikidata = qid, order, current_author = author_wiki, 
                     current_statement = author_wiki_type)) %>% 
  unite("author_string", c("CRF given", "CRF family"), sep = " ") %>% 
  mutate(author_string = str_to_title(author_string)) %>% 
  mutate(author_statement = ifelse(is.na(wd_author), "P2093", "P50")) %>%  
  select(article_wiki = wikidata, order,
         current_author, current_statement, 
         author_statement, author_wiki = wd_author, author_string) 

data4wiki %>%  
  DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip', autoWidth = TRUE))

We know the Wikidata items corresponding to the articles (column "article_wiki") and how the authors are currently present in Wikidata (see the columns "current_author" and "current_statement"). The "current_author" strings have only the initials, so we can improve them by substituting the author name strings from CrossRef (the P2093 property). For the authors with Wikidata items (4 authors with non-empty values in the "author_wiki" column) we will introduce the P50 statements (connecting the Wikidata items for the article and the person).

In both cases we should also not forget to delete the existing statements (see the example statements below).

Show code
wd_statements <-  data4wiki %>% rowwise() %>% 
  mutate(statement = ifelse(!is.na(author_wiki),
           paste0("-",article_wiki,"|", current_statement,"|\"", 
                  current_author,"\"||",
                  article_wiki,"|P50|",author_wiki,"|P1545|\"", 
                  order,"\"|S248|Q5188229"), 
           paste0("-",article_wiki,"|", current_statement,"|\"", 
                  current_author,"\"||",
                  article_wiki,"|P2093|\"",author_string, "\"|P1545|\"", 
                  order,"\"|S248|Q5188229"))) %>% 
  select(statement) %>% 
  mutate(statement = str_split(statement, "\\|\\|")) %>% 
  unnest(statement) %>% 
  unlist(use.names = FALSE) 

wd_statements %>% 
  paste0(., collapse = "\n") %>% 
  write_file(paste0(dir, "for qs_auths.tsv"))  
  
wd_statements %>% enframe(name = NULL, value = "wikidata_statements") %>% 
   DT::datatable(rownames = FALSE, escape = FALSE, class = 'compact striped',
               caption = htmltools::tags$caption(
               style = 'caption-side: bottom; 
               text-align: left; font-size: 80%; 
               color: #969696; font-family: Roboto Condensed;',
               'Data: wikidata.org.'),
    options = list(searchHighlight = TRUE, dom = 'Brtip', autoWidth = FALSE))

With the line (-Q107266290|P2093|"Kalendzhyan S.O.") we remove the existing statement claiming that {Wikidata item Q107266290 has an author name string "Kalendzhyan S.O."}.

With the next line (Q107266290|P50|Q4209279|P1545|"1"|S248|Q5188229) we create a new statement claiming that {Wikidata item Q107266290 has an author (P50) Q4209279, who is positioned (P1545) first ("1"), as stated (P248) in CrossRef (Q5188229)}.

Mind the S instead of P for the property used as a reference. More details on the QuickStatements syntax can be found here.

The process looks like this:

  1. Open the QuickStatements interface. To have access, you have to be an autoconfirmed user. Paste the prepared statements (the pictures below show the result for just 2 lines, as an example!). Click Import V1 commands.
Show code
knitr::include_graphics(paste0(getwd(),"/images/wd_statement_sok1.PNG"))

  2. Check that the suggested revisions make sense (the properties are recognized and shown as active URLs). Click Run.
Show code
knitr::include_graphics(paste0(getwd(),"/images/wd_statement_sok2.PNG"))

  3. Watch. Enjoy.
Show code
knitr::include_graphics(paste0(getwd(),"/images/wd_statement_sok3.PNG"))

  4. This is how the updated article page looks in Wikidata. The "authors" and "author name strings" are separated.
Show code
knitr::include_graphics(paste0(getwd(),"/images/wd_statement_sok4.PNG"))

  5. And this is how the article page looks in Scholia.
Show code
knitr::include_graphics(paste0(getwd(),"/images/wd_statement_sok5.PNG"))

If you click on one of the authors marked as UNRESOLVED, you will be taken to the Author Disambiguator, another extra-useful tool for editing the author information for academic publications in Wikidata.

Show code
knitr::include_graphics(paste0(getwd(),"/images/wd_statement_sok6.PNG"))

This application tries to find the author profile in Wikidata (I am not sure how complex its recommendation algorithm is), and you can search for the person by ORCID and by name. If nothing is found, the application suggests creating a new Wikidata item (something I decided not to do in my exercise).

Show code
knitr::include_graphics(paste0(getwd(),"/images/wd_statement_sok7.PNG"))

The only "but": again, it is for autoconfirmed users only.

Final Remarks (to the journal editors)

Wikidata has a lot of valuable tools and options for academic journals to highlight their impact. Dealing with it is just a question of having one dedicated person enrolled to take care of uploading the articles' metadata into Wikidata and doing the author disambiguation (after getting the status of autoconfirmed user). Even if I have failed to convince you to deal with Wikidata, please do pay attention to the quality of the articles' metadata.

I will continue this exercise and will add the affiliations and other details to Wikidata in the near future.

Acknowledgments

Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2021). rmarkdown: Dynamic Documents for R. R package version 2.7, <URL: https://github.com/rstudio/rmarkdown>.

Gagolewski M (2020). R package stringi: Character string processing facilities. <URL: http://www.gagolewski.com/software/stringi/>.

Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, <URL: https://CRAN.R-project.org/package=purrr>.

Ooms J (2014). "The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects." arXiv:1403.2805 [stat.CO]. <URL: https://arxiv.org/abs/1403.2805>.

Popov M (2020). WikidataQueryServiceR: API Client Library for 'Wikidata Query Service'. R package version 1.0.0, <URL: https://CRAN.R-project.org/package=WikidataQueryServiceR>.

Shafee T, Keyes O, Signorelli S, Lum A, Graul C, Popov M (2021). WikidataR: Read-Write API Client Library for 'Wikidata'. R package version 2.2.0, <URL: https://github.com/TS404/WikidataR/issues>.

Wickham H (2020). tidyr: Tidy Messy Data. R package version 1.1.2, <URL: https://CRAN.R-project.org/package=tidyr>.

Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=stringr>.

Wickham H, Francois R, Henry L, Muller K (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.3, <URL: https://CRAN.R-project.org/package=dplyr>.

Wickham H, Hester J (2020). readr: Read Rectangular Text Data. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=readr>.

Xie Y (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.30, <URL: https://yihui.org/knitr/>.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <URL: https://yihui.org/knitr/>.

Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in R." In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595, <URL: http://www.crcpress.com/product/isbn/9781466561595>.

Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, <URL: https://bookdown.org/yihui/rmarkdown>.

Xie Y, Cheng J, Tan X (2021). DT: A Wrapper of the JavaScript Library 'DataTables'. R package version 0.17, <URL: https://CRAN.R-project.org/package=DT>.

Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, <URL: https://bookdown.org/yihui/rmarkdown-cookbook>.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Lutai (2021, July 5). ConviviaR Tools: Importing Author Information from CrossRef to Wikidata via QuickStatements. Retrieved from https://dwayzer.netlify.app/posts/2021-07-02-importing-author-information-from-crossref-to-wikidata-via-quickstatements/

BibTeX citation

@misc{lutai2021importing,
  author = {Lutai, Aleksei},
  title = {ConviviaR Tools: Importing Author Information from CrossRef to Wikidata via QuickStatements},
  url = {https://dwayzer.netlify.app/posts/2021-07-02-importing-author-information-from-crossref-to-wikidata-via-quickstatements/},
  year = {2021}
}