Checking VAK titles in CrossRef


In this post I check how many academic journals from the Russian white list (VAK) deposited their 2020/2021 publications in CrossRef. To do that I queried portal.issn.org & CrossRef and parsed PDF & JSON files, which, I think, makes this post of interest to those who analyze academic journals.

May 12, 2021

Introduction

In a previous post I ventured to parse an 853-page PDF file released by VAK (the Russian state regulator) and extracted a list of journal titles. For Russian academic journals, being listed by VAK is a sort of endorsement, as it makes their publications count:

  1. in some assessment exercises run by funders & ministries,

  2. for academic degree pre-assessment (main results must be published in at least X credible journals).

The international journals indexed in Scopus, WoS, CAS, etc. are also counted as credible titles, so the VAK list is not a big deal for those who can draft a proper research manuscript in English (or pay for that) and compete for publication in the international journals.

Other researchers (no, not “are quite dreadful,” as Oscar Wilde suggested) may have their reasons to place their materials in the VAK-listed journals. But what I think many people are missing here is that some VAK titles are absent from CrossRef and invisible to the international search engines. Sort of ferae naturae, at risk of disappearing in case of … the publisher’s bankruptcy or anything else.

So I decided to check which VAK titles are present in CrossRef with 2020/2021 publications. This is not a difficult thing to do via the CrossRef API, but life, of course, proves not to be that plain - a title can be registered in CrossRef not with the ISSN present in the VAK list, but with another one, absent from the VAK list.

This means that first I need to collect the additional ISSNs from the ISSN Portal and only then check them in CrossRef.

1. Dataset Preparation

The parsed list can be downloaded from Figshare with the following code:

Show code
library(httr)
library(rvest)

dataset_link <- read_html("https://doi.org/10.6084/m9.figshare.14561814", 
                          encoding = "UTF-8") %>% 
  html_nodes("a") %>% html_attr("href") %>% 
  .[grepl("ndownloader",.)] %>% unique()

data <- httr::GET(dataset_link) %>% httr::content()

but for speed I will take it from the hard disk.

Show code
# `dir` points to a local data folder
data <- read_csv(paste0(dir, "2021_04_vak_list_parsed_data.csv"), 
                      col_types = cols(.default = col_character()))

At the next step I extract the unique ISSNs, and here came the first observation: the VAK list contains different rows sharing the same ISSNs.

Show code
issns_list <- data %>% select(title, issns = issn) %>% 
  # split the comma-separated ISSNs into one row per ISSN
  mutate(issn = str_split(issns, ",")) %>% unnest(issn) %>% 
  distinct() %>% na.omit()

# show the rows where one ISSN is shared by several titles
issns_list %>% group_by(issn) %>% 
  mutate(n_titles = n_distinct(title)) %>% ungroup() %>% 
  filter(n_titles>1) %>% select(title, issns, issn) %>% arrange(issn) %>% 
  mutate(issns = gsub(",",", ", issns)) %>% 
  DT::datatable(escape = FALSE, class = 'compact',
                options = list(pageLength = 5, deferRender = TRUE, 
                               autoWidth = TRUE,  ordering = TRUE, scrollX = TRUE))

It appears that some journals have individual e-versions but are printed under one cover. Well, if so, this can be understood, of course. I do not check this and leave it as it is.

I also manually corrected two ISSNs that had a wrong format in the original VAK list - “999–4133” (corrected to 1999-4133) and “224–9877” (corrected to 2224-9877).
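
A minimal sketch of that fix (assuming the broken values appear verbatim, en-dash included, in the issn column; the combined issns column would need the same treatment):

issns_list <- issns_list %>% 
  # replace the two malformed ISSNs with their corrected versions
  mutate(issn = recode(issn, "999–4133" = "1999-4133", "224–9877" = "2224-9877"))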

2. Harvesting data from ISSN.org

To obtain information from ISSN.org I use a query that saves each record to the hard disk as a PDF file.
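
The Portal’s print endpoint, as used in the download loop below, follows this pattern (shown here with a sample ISSN):

https://portal.issn.org/custom-search/print/1999-4133/public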

For this I created a downloading function readUrl, which wraps the download.file function into a life-saving tryCatch. The function takes a URL and a filename, saves the file, and returns “OK” on success or NA on error.

Show code
dir0 <- paste0(dir, "issns/")

issns_list <- issns_list %>% mutate(issn_json = "", issn_pdf = "")

# download a file; return "OK" on success or NA on any error
readUrl <- function(url, filename) {
  out <- tryCatch({
    download.file(url, filename, quiet = TRUE, mode = "wb")
    return("OK")
  },
  error=function(cond){return(NA)}
  )
  return(out)
}

And now - let’s run a loop and download all the PDF files to the hard disk.

Show code
# process only the rows not yet downloaded (files already on disk are skipped)
for (k in which(issns_list$issn_pdf==""|is.na(issns_list$issn_pdf))){  
  print(k)
  q <- issns_list$issn[k]
  url <- paste0("https://portal.issn.org/custom-search/print/",q,"/public")
  filename <- paste0(dir0, gsub("-","_", q), "_issn.pdf")
  issns_list$issn_pdf[k] <- ifelse(file.exists(filename),
                             "OK", readUrl(url, filename))
}

The Portal responds quickly, so the whole process took about 30 minutes.
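
Since readUrl returns NA on failure, a quick sanity check of the results is:

table(issns_list$issn_pdf, useNA = "ifany")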

The next step is to parse the downloaded PDF files and link them to the original VAK list.

I will extract the text layer using the pdftools package. The information is arranged in a Key: Value format, so after taking some measures not to lose the lines broken across rows and not to split the special lines containing “:”, I extracted it all.
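
For illustration, a record’s text layer looks roughly like this (a made-up, abridged example: the keys are the real field names used below, the values are invented):

ISSN: 1999-4133
Key-title: Example journal title
Country: Russian Federation
URL: http://example.org
Linking ISSN (ISSN-L): 1999-4133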

Show code
library(pdftools)

tt <- tibble() 
 
for (z in 1:nrow(issns_list)){
  print(z)
  q <- issns_list$issn[z]
  filename <- paste0(dir0, gsub("-","_", q), "_issn.pdf")
  if(file.exists(filename)){
  txt <- suppressMessages(pdf_text(filename)) %>% paste(collapse = "\r\n")  
  # some PDFs contain a history of publishers, we take only the current info
  if(grepl("From:", txt)){
    txt <- strsplit(txt, split = "From:") %>% map_chr(`[`, 1)
    }
  txt <- strsplit(txt, split = "\r\n") %>% 
    map_dfc(~.x) %>% rename(extracted = 1) %>% 
    # a line containing ":" starts a new Key, other lines continue the previous one
    mutate(tag = ifelse(grepl("\\:", extracted), row_number(), NA_integer_)) %>%
    # series subtitles ("Seria:" / "Серия:") must not start a new Key
    mutate(tag = ifelse(grepl("Seri.:|Серия:", extracted), NA_integer_, tag)) %>%
    fill(tag, .direction = "down") %>% 
    group_by(tag) %>% 
    summarize(txt = paste(unique(na.omit(extracted)), collapse = " ")) %>% 
    ungroup() %>% 
    mutate(issn = q)
  
  tt <- bind_rows(tt, txt)
    }
}

# a bit of cleaning / gluing here
tt_group <- tt %>% 
  # separating the key and the value
  ## a bit complex regex to avoid splitting after the word "Seria":
  mutate(tag = ifelse(grepl("^[^:]*?Seri.:|^[^:]*?Серия:", txt), NA_integer_, tag)) %>%
  mutate(tag = ifelse(grepl("^http", txt), NA_integer_, tag)) %>%
  fill(tag, .direction = "down") %>% 
  group_by(issn, tag) %>% 
  summarize(txt = paste(unique(na.omit(txt)), collapse = " ")) %>% 
  ungroup() 

# gluing  
tt_group <- tt_group %>%   
  mutate(key = str_extract(txt, "^[^:]+?:"),
         txt = str_replace(txt, "^[^:]+?:","")) %>% 
  mutate(txt = gsub("!Seri", ": Seri", txt)) %>% 
  mutate_all(~str_squish(str_replace_all(.x, "\r\n|:$",""))) %>% 
  # gluing together
  group_by(issn, key) %>% 
  summarize(value = paste0(unique(na.omit(txt)), collapse = " | ")) %>% 
  ungroup() %>%  
  # converting into a wide format
  pivot_wider(id_cols = issn, names_from = key, values_from = value) %>% 
  distinct()

issns_list <- issns_list %>% 
  select(-issn_json, -issn_pdf) %>%  
  left_join(tt_group) %>% distinct()

write_excel_csv(issns_list, paste0(dir, "2021_04_vak_issn_info.csv"))

3. ISSN info (a quick glance)

Show code
issns_list <- read_csv(paste0(dir, "2021_04_vak_issn_info.csv"), 
                      col_types = cols(.default = col_character()))

At this stage we take a quick look at the information obtained from the ISSN Portal for the 2678 ISSNs present in the VAK list (2586 titles). The collected data is a dataset of 2686 rows and 44 columns, a large part of which is empty.
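
These counts can be quickly verified (a sketch):

n_distinct(issns_list$issn)   # unique ISSNs
n_distinct(issns_list$title)  # unique titles
dim(issns_list)               # rows and columns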

A few interesting observations:

(1) There is no information at the ISSN Portal for some titles.

The ISSNs for Russian journals are assigned by the National ISSN Centre, so all questions about the titles missing from ISSN.org should be addressed to it. The table below lists such journals.

Show code
# the titles for which all the listed issn.org fields are empty
issns_list %>% select(-issn) %>% distinct() %>% 
  filter_at(c("Country", "ISSN", "Key-title", "URL", "Title proper"), ~is.na(.x)) %>% 
  select(vak_title = title, vak_issns = issns,
         `issn.org issn` = ISSN, `issn.org title` = "Key-title") %>% 
    DT::datatable(escape = FALSE, class = 'compact',
                options = list(pageLength = 5, deferRender = TRUE, 
                               autoWidth = TRUE,  ordering = TRUE, scrollX = TRUE))

(2) Some ISSNs are not affiliated with the Russian Federation.

Show code
issns_list %>% 
  select(vak_title = title, vak_issns = issns, 
         `issn.org country` = Country, `issn.org title` = "Key-title", 
         `issn.org language` = Language,
         `issn.org prev.publisher` = `Earliest publisher`, 
         `issn.org latest.publisher` = `Latest publisher`) %>%
  filter(!is.na(`issn.org country`)&!grepl("Russia", `issn.org country`)) %>% 
  count(`issn.org country`) %>% arrange(-n) %>% 
  ggplot()+
  geom_bar(aes(y = reorder(`issn.org country`,n), x = n), stat = "identity", 
           fill = "#e05915", color = "grey50", size = 0.1)+
  labs(x = "number of titles", 
       y = "country (by issn)", 
       title =  "Countries of non-Russian VAK journals (per issn.org)", 
       subtitle = "VAK list as of April 8, 2021",
       caption = paste0("Date: ", format(Sys.time(), '%Y.%m.%d'),
                       ".\nData: portal.issn.org, vak.minobrnauki.gov.ru"))+
  scale_x_continuous(expand = expansion(mult = c(0,0.05)), 
                     breaks = scales::pretty_breaks(6))+
  my_theme

Quite a set of countries for the Russian white list!

Show code
issns_list %>% 
  select(vak_title = title, vak_issns = issns, 
         `issn.org country` = Country, `issn.org title` = "Key-title", 
         `issn.org language` = Language,
         `issn.org prev.publisher` = `Earliest publisher`, 
         `issn.org latest.publisher` = `Latest publisher`) %>%
  filter(!is.na(`issn.org country`)&!grepl("Russia", `issn.org country`)) %>% 
  arrange(`issn.org country`) %>% 
  DT::datatable(escape = FALSE, class = 'compact', rownames = FALSE,
                options = list(pageLength = 5, deferRender = TRUE, 
                               autoWidth = TRUE,  ordering = TRUE, scrollX = TRUE,
              columnDefs = list(
                list(width = '200px', targets = c(5:6)),
                list(width = '300px', targets = c(0,3)),
                list(className = 'dt-center', targets = c(1)),
                list(className = 'dt-left', targets = c(0, 2:6)))))

The Slovakia case seems plain - a Russian publisher sold its academic titles to a Slovak company (or registered a new one), so now they are de jure Slovak. But when it comes to the titles registered as Ukrainian, Belarusian, or Tajik, the explanation will not be that plain. I would refrain from speculations about this.

Now we are ready to move on to CrossRef.

4. Harvesting CrossRef data

There is a convenient R wrapper for the CrossRef API, rcrossref, but for complex queries with a few filters I prefer to use the API service directly and save the JSON files to the hard disk (rather than parsing them on the fly).

Such an approach makes sense to me - if anything goes wrong, I can interrupt the process and start it over; the saving script checks the existing files and skips whatever has already been saved.

A query to get 5 articles published after 2019 in a journal with ISSNs 1811-833X and 2311-7133 consists of 3 parts:

  1. the issn: filters (one per ISSN; CrossRef treats repeated filters of the same name as OR),

  2. the from-pub-date:2020 filter,

  3. the rows=5 and mailto= parameters (the latter routes the request to CrossRef’s “polite” pool).
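
Assembled exactly as in the download loop below, such a query looks like this (my_email stands for a real e-mail address):

https://api.crossref.org/works?filter=issn:1811-833X,issn:2311-7133,from-pub-date:2020&rows=5&mailto=my_email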

Of course, we will also exploit ISSN-Ls collected from ISSN.org.

Show code
issns_cr <- issns_list %>% 
  select(title, issn, issnL = `Linking ISSN (ISSN-L)`) %>%  
  pivot_longer(-1) %>% distinct() %>% 
  filter(!is.na(value)) %>% 
  group_by(title) %>% 
  summarize(issns = paste0(unique(value), collapse = ",")) %>% 
  mutate(cr_file = "", code = paste0("cr_",row_number()))

And the loop.

Show code
dir0 <- paste0(dir, "pdf_cr/")
for (i in which(issns_cr$cr_file==""|is.na(issns_cr$cr_file))){
  print(i)
  q <- issns_cr$issns[i] %>% strsplit(., split = "\\,")
  url <- paste0("https://api.crossref.org/works?filter=",
              unlist(q) %>% paste0("issn:",., collapse = ","),
              ",from-pub-date:2020&rows=5&mailto=", my_email)
  filename <- paste0(dir0, issns_cr$code[i],".json")
  issns_cr$cr_file[i] <- ifelse(file.exists(filename),"OK", readUrl(url, filename))
} 

I had to pause this loop a few times, so I cannot accurately estimate the total execution time, but I feel it would take no longer than an hour or so if run without interruptions.

Let’s parse the files with the jsonlite package to extract some data from the CrossRef responses.

Show code
library(jsonlite)
cr_data <- data.frame()

for (i in 1:NROW(issns_cr)){
    print(i)
    filename <- paste0(dir0, issns_cr$code[i],".json")
    jdoc <- fromJSON(filename, flatten = TRUE) %>% .$message 
    
    # total number of articles (it is in a header)
    total <- jdoc %>% .$`total-results` %>% as.character()
    
    # other data (it is in the json items)
    a <- jdoc %>% .$items
    if(length(a)>0){
      # reset the per-journal values, so that a file lacking these fields
      # does not inherit them from the previous iteration
      cristy <- NA_character_
      crisis <- NA_character_
      if(exists("issn-type", a)){
        cristy <- a$`issn-type` %>% map_dfr(~.x) %>% 
          unite(col = "issns", c("type", "value"), sep = ":") %>% 
          unique() %>% map_chr(paste0, collapse = "|")
      }
      if(exists("ISSN", a)){
        crisis <- a$ISSN %>% unlist() %>% unique() %>% paste0(., collapse = "|")
      }
   
      a2 <- a %>% select(any_of(c("prefix", "member", "publisher"))) %>% 
        mutate_all(~as.character(.x)) %>%
        mutate(code = issns_cr$code[i], 
               cr_issn_type = cristy,
               cr_issns = crisis,
               cr_2020_2021 = total)
      a2 <- a2 %>% 
        select(any_of(c("prefix", "member", "publisher", "code", 
               "cr_issn_type", "cr_issns", "cr_2020_2021"))) %>% 
        pivot_longer(-"code") %>% 
        unique() 
      cr_data <- bind_rows(cr_data, a2)
    }
}

The parsed data is then grouped and converted into a wide format.

Show code
cr_data_wide <- cr_data %>% distinct() %>% 
  group_by(code, name) %>% 
  summarize(value = paste0(unique(na.omit(value)), collapse = " | ")) %>% 
  ungroup() %>% 
  pivot_wider(id_cols = code)
  
issns_cr_merged <- issns_cr %>% left_join(cr_data_wide)

write_excel_csv(issns_cr_merged, paste0(dir, "2021_04_vak_crossref_info.csv"))

5. CrossRef info (a quick glance)

Show code
issns_cr_merged <- read_csv(paste0(dir, "2021_04_vak_crossref_info.csv"), 
                      col_types = cols(.default = col_character(), 
                                       cr_2020_2021 =col_integer()))

## let's merge all the datasets together in this chunk
issn_bit <- issns_list %>% 
  select(title, Country, issnL = `Linking ISSN (ISSN-L)`, 
         title.proper = `Title proper`) %>% 
  group_by(title) %>% 
  summarize_all(~paste(unique(na.omit(.x)), collapse = " | ")) %>% 
  ungroup()

issn_bit <- issn_bit %>% 
  mutate(Country = ifelse(is.na(Country), NA, 
                         paste0('ISSN_country=', Country)),
         issnL = ifelse(is.na(issnL), NA, 
                          paste0('ISSN_L=', issnL)),
         title.proper = ifelse(is.na(title.proper), NA, 
                          paste0('ISSN_proper_title=', title.proper))) %>%
  unite("issn_jrnl_creds", c(title.proper, Country, issnL), 
        sep = "; ", na.rm = TRUE) 
           
cr_bit <- issns_cr_merged %>% 
  select(title, cr_2020_2021, cr_issn_type, prefix,  publisher, member) %>% 
  mutate(prefix = ifelse(is.na(prefix), NA, 
                         paste0('DOI_prefix=', prefix)), 
         cr_issn_type = ifelse(is.na(cr_issn_type), NA, 
                               paste0("CR_issns=", cr_issn_type)), 
         publisher = ifelse(is.na(publisher), NA, 
                            paste0("CR_publisher=", publisher)), 
         member = ifelse(is.na(member), NA, 
                         paste0("CR_member=", member))) %>% 
  unite("cr_jrnl_creds", c(cr_issn_type, prefix), sep = "; ", na.rm = TRUE) %>% 
  unite("cr_publ_creds", c(publisher, member), sep = "; ", na.rm = TRUE) %>% 
  replace_na(list(cr_2020_2021 = c(0)))

datax <- data %>% 
  left_join(issn_bit, by = "title") %>% 
  left_join(cr_bit, by = "title")
# datax will be used later for the picture and the table 

In contrast to the dataset with the information from ISSN.org, this one has just a few of the most useful (imho) columns, like “prefix” (to analyze the citations of these titles in COCI/Lens/Scopus/WoS) or “member” (to restore the publisher details from CrossRef via the members API).
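
As a quick illustration of the latter, publisher details can be restored from a member id like this (a sketch; it assumes the member field of a given row holds a single, non-empty id):

# take one member id collected above and query the CrossRef members API
member_id <- issns_cr_merged$member[1]
member_info <- jsonlite::fromJSON(
  paste0("https://api.crossref.org/members/", member_id))$message
member_info$`primary-name`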

The main result is that for 41% of the VAK titles there are no 2020/2021 publications in CrossRef under the ISSNs either indicated in the list or collected from ISSN.org.
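
For reference, this share can be recomputed from the merged table (a sketch; it counts the rows of datax and treats missing values as zero publications):

datax %>% 
  summarize(no_cr_share = mean(replace_na(cr_2020_2021, 0) == 0))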

[Figure: the share of VAK titles with and without 2020/2021 publications in CrossRef; plotting code omitted.]

6. Final table

For convenience, the interactive table below is made with composite columns, so that one can filter the rows using the search field and quickly investigate the data. The links to the full datasets can be found below in the “Shared Data” section.

By default the table is shown with a few columns hidden (subjects, subj_codes, the original VAK title, and comments), but those columns are still used by the DT search functionality to filter the rows (try putting “14.01.07” into the search pane to filter the journals approved as credible for those defending a degree in subject 14.01.07 - Ophthalmic Diseases).

Show code
datax %>% 
  select(title_main, issn,  title,  comments, issn_jrnl_creds, 
         `cr_2020+` = cr_2020_2021, cr_jrnl_creds, cr_publ_creds,
                  subjects,   subj_codes) %>% 
 DT::datatable(rownames = FALSE, escape = FALSE, 
                class = 'compact', extensions = 'Buttons', 
         options = list(deferRender = TRUE, autoWidth = TRUE, 
                        buttons = list(I('colvis'),'csv', 'excel'),
                        ordering = TRUE, scrollX = TRUE, 
                        dom = 'Bfrtip', pageLength = 5, 
              columnDefs = list(
                list(visible = FALSE,targets = c(2,3,8,9)),
                list(width = '200px', targets = c(0,1,6,7)),
                list(width = '400px', targets = c(2,3,4,8,9)),
                list(className = 'dt-center', targets = c(5)),
                list(className = 'dt-left', targets = c(0:4,6:9)))
              )
         )

Limitations

Anyway, please use these data with caution, treating the empty fields as a warning, not as a final confirmation or recommendation.

Shared data

The original PDF file, the parsed and revised data in CSV format, and two datasets (issn+, crossref+) are available at Figshare.

For those who would like to re-use or quote the data - please do not hesitate, no permission is required. This post and the Figshare repository provide ready-to-go citation texts.

Acknowledgments

Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2021). rmarkdown: Dynamic Documents for R. R package version 2.7, <URL: https://github.com/rstudio/rmarkdown>.

Chang W (2014). extrafont: Tools for using fonts. R package version 0.17, <URL: https://CRAN.R-project.org/package=extrafont>.

Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, <URL: https://CRAN.R-project.org/package=purrr>.

Ooms J (2014). “The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. <URL: https://arxiv.org/abs/1403.2805>.

Ooms J (2020). pdftools: Text Extraction, Rendering and Converting of PDF Documents. R package version 2.3.1, <URL: https://CRAN.R-project.org/package=pdftools>.

Pedersen T (2020). patchwork: The Composer of Plots. R package version 1.1.1, <URL: https://CRAN.R-project.org/package=patchwork>.

Wickham H (2020). tidyr: Tidy Messy Data. R package version 1.1.2, <URL: https://CRAN.R-project.org/package=tidyr>.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, <URL: https://ggplot2.tidyverse.org>.

Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=stringr>.

Wickham H (2020). httr: Tools for Working with URLs and HTTP. R package version 1.4.2, <URL: https://CRAN.R-project.org/package=httr>.

Wickham H (2021). rvest: Easily Harvest (Scrape) Web Pages. R package version 1.0.0, <URL: https://CRAN.R-project.org/package=rvest>.

Wickham H, Francois R, Henry L, Muller K (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.3, <URL: https://CRAN.R-project.org/package=dplyr>.

Wickham H, Hester J (2020). readr: Read Rectangular Text Data. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=readr>.

Wickham H, Seidel D (2020). scales: Scale Functions for Visualization. R package version 1.1.1, <URL: https://CRAN.R-project.org/package=scales>.

Xie Y (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.30, <URL: https://yihui.org/knitr/>.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <URL: https://yihui.org/knitr/>.

Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595, <URL: http://www.crcpress.com/product/isbn/9781466561595>.

Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, <URL: https://bookdown.org/yihui/rmarkdown>.

Xie Y, Cheng J, Tan X (2021). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.17, <URL: https://CRAN.R-project.org/package=DT>.

Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, <URL: https://bookdown.org/yihui/rmarkdown-cookbook>.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Lutai (2021, May 12). ConviviaR Tools: Checking VAK titles in CrossRef. Retrieved from https://dwayzer.netlify.app/posts/2021-05-10-vak-titles-in-crossref/

BibTeX citation

@misc{lutai2021checking,
  author = {Lutai, Aleksei},
  title = {ConviviaR Tools: Checking VAK titles in CrossRef},
  url = {https://dwayzer.netlify.app/posts/2021-05-10-vak-titles-in-crossref/},
  year = {2021}
}