RIRO - Russian Index of the Research Organizations

russian data organization identifier scopus r wikidata web of science ror riro microsoft academic

With Web of Science & Scopus, ROR & GRID, Wikidata & Microsoft Academic, ISNI & other providers of the research organization identifiers, is there a need for another one? In this post I am going to answer this question by telling you about RIRO project, launched by me and Ivan Sterligov today. You will see what RORI is, how to get the data, and what value it can provide for the bibliometric research.

true , true
05-25-2021

What is RIRO?

RIRO is a public, open, and autonomous project, designed to link the numerous identifiers of the Russian organizations to the existing legal entities.

In its current form RIRO v.1.0 is a set of tables linking the organization identifiers and profiles provided by ROR, GRID, Wikidata, Microsoft Academic, Scopus, SciVal, Web of Science, InCites to the Russian legal entities.

The RIRO tables are available at Zenodo RIRO community: https://zenodo.org/communities/riro/.

The code & examples of use will be shared via GitHub: https://github.com/OpenRIRO.

The news & related information will be updated at web site.

Why do we need RIRO?

The databases mentioned above have assigned the identifiers to the organizations and put some efforts into curation process, but none of them is confident when it comes to:

As a result, the identifiers provided by the international services suffer from a patchy coverage and have a limited value for the research assessment and scientometric analysis.

With RIRO we aim to:

Selection Criteria for Organizations

As RORI is about research, any organization that appears in the affiliations texts of the scientific publications is a good candidate for RORI. And yet some organizations have been prioritized for v.1.0, those are:

We have not (even) reviewed the entities associated with the Russian state corporations (RosAtom, RosSpace, RosTech) and the military ones.

RIRO dataset

The RIRO dataset v.1.0 is a set of CSV tables, having a primary key named code, so one can easily join the tables or build a database.

The CSV tables can be downloaded from Zenodo community, via OAI-PMH Harvesting API or by using REST API.

Below I demonstrate how to download the dataset using zen4R package. This will require of you to do the following actions:

  1. Install [zen4R] package (https://github.com/eblondel/zen4R))

  2. Register at Zenodo

  3. Obtain a token here

Show code
# initialize a session
zenodo <- ZenodoManager$new(token = your_zenodo_token,  logger = "INFO")

# request the Concept ID (more info -> https://help.zenodo.org/#versioning)
my_rec <- zenodo$getRecordByConceptId("4775290")

# get the list of files with names and URLs
files <- my_rec$files %>% map_df(~.x %>% flatten()) %>% select(filename, filesize, download)

# if you set a folder named RIRO_tables for files, download them with
walk2(files$download, paste0(dir, files$filename), 
      ~download.file(url =.x, destfile = .y))

# At https://help.zenodo.org/#versioning Zenodo says that:
# Currently the Concept DOI resolves to the landing page of the latest version of your record. 
# This is not fully correct, and in the future we will change this to create a landing page 
# specifically representing the concept behind the record and all of its versions.
# so one day we may have to retrieve the latest DOI my_rec$getLastDOI() or smth else 

Let’s see what we have downloaded.

Show code
riro_files <- list.files(dir, full.names = TRUE)
riro_files_info <- riro_files  %>% file.info() %>% select(size)
riro_files_info
                                                 size
D://RIRO_tables/v.1.0_table_01_basic.csv      3156243
D://RIRO_tables/v.1.0_table_02_geo.csv        1819735
D://RIRO_tables/v.1.0_table_03_hierarchy.csv   191991
D://RIRO_tables/v.1.0_table_04_ror.csv         468079
D://RIRO_tables/v.1.0_table_05_wiki.csv        758197
D://RIRO_tables/v.1.0_table_06_scopus.csv      361781
D://RIRO_tables/v.1.0_table_07_mag.csv          14716
D://RIRO_tables/v.1.0_table_08_incites.csv      75484
D://RIRO_tables/v.1.0_table_09_scival.csv       24105
D://RIRO_tables/v.1.0_table_10_monitoring.csv   17944
D://RIRO_tables/v.1.0_table_11_wos_names.csv   354958

There are 11 files with an approx. total size 6.9 MB.

Below are the parts of each table corresponding to the 3 organizations:

I selected these organizations as they came through a chain of reorganizations in the last 10 years.

Table 1 - Official Info

Table 1 comprises the basic organization details - OGRN (Primary State Registration Number), INN (Taxpayer Identification Number), KPP (Tax Registration Reason Code), full & short names, status {active, liquidated, in reorganization process}, and the branch type {head or branch}.

A table below lists the legal entities found by OGRN and their branches (the branch organization and its head entity share same OGRN and INN). To save some space a column with the short names column is not included.

Show code
t1 <- riro_files[1] %>% read_csv(col_types = cols(.default = col_character())) 

# filtering data for 3 organizations by OGRN 
test_group <- t1 %>% filter(ogrn %in% c("1021100511332", "1037739552740", "1027000861568"))

test_group %>% select(code, level, status, name_full, ogrn, inn, kpp) %>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
      options = list(columnDefs = list(list(width = '650px', targets = c(3)))))

Each row has its unique “code” which serves as a primary key for all the RIRO tables.

Table 2 - Locations and Geodata

Table 2 comprises the full address and its separate parts (in Russian), accompanied with the geocode, geo coordinates and time zone. Table 2 is the only table in RIRO that corresponds 1:1 to Table 1 via “code”. The other tables can have few rows for unique code.

Show code
t2 <- riro_files[2] %>% read_csv(col_types = cols(.default = col_character()))  

t2 %>% filter(code %in% test_group$code) %>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, 
            class = "row-border", 
            options = list(columnDefs = list(
              list(width = '450px', targets = c(6)))))

Table 3 - Hierarchy

This is a very important table, as it links the parent organizations not only to its existing branches, but also to the predecessors (for convenience, in this document both will be referred as “children accounts”).

Table 3 does not pretend to be complete for few reasons:

  1. the list includes only the last predecessors of the current organizations, so there is no historical perspective from 2000s or from USSR.

  2. some organizations have many branch offices (e.g. hospitals), but information about their hierarchy has little or no value from research assessment perspective. Therefore, for some organizations RIRO does not show the branches. The organizations including the branches are mainly the federal organizations (subdued to the ministries and the federal agencies).

The “child_code” is a code for the children account, the values in the “relation” column reflect a nature of their subordination (it can be a branch or a predecessor).

Show code
t3 <- riro_files[3] %>% read_csv(col_types = cols(.default = col_character()))

full_test <- t3 %>% filter(code %in% test_group$code) %>% arrange(desc(code)) %>% 
  mutate(relation = ifelse(relation == "Филиал", "Branch","Predecessor"))

full_test %>% datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
            options = list(autoWidth = FALSE, 
                           columnDefs = list(list(className = 'dt-center', targets = c(0:2)))))

Thus, using a list of OGRNs for 3 selected organizations, we extracted from Table 1 6 entities with unique codes (head & branch organizations), further used to retrieve a list of all the predecessors. As a result we have build a list of 42 entities with unique code values.

Show code
hierarchy <- full_test %>% 
  add_row(code = unique(full_test$code), 
          child_code = unique(full_test$code),
          relation = "Parent_Org") %>% 
  rename(parent_code = code, code = child_code) %>% 
  arrange(parent_code, relation) 

Now we can see how many branches & predecessors exist for 3 selected orgs.

Show code
hierarchy %>% count(parent_code, relation) %>% 
  pivot_wider(names_from = relation, 
              values_from = n, values_fill = 0) %>%
  select(parent_code, Parent_Org, Branch, Predecessor) %>% 
  datatable(rownames = FALSE, filter = "none", 
            escape = FALSE, class = "row-border", 
            options = list(autoWidth = FALSE, 
  columnDefs = list(list(className = 'dt-center', 
                         targets = c(0:3)))))

We will use this hierarchy to gather the matched identifiers from other RIRO tables (see below).

Table 4 - ROR

Research Organizations Registry (ROR) is an international project launched in 2019 with an ambitious goal to create a public ORCID-like registry for the research organizations. It inherits a lot from GRID (Global Research Identifier Database) and (I guess) from Wikidata. The ROR organization info can be downloaded as a JSON dump or retrieved via API.

Table 4 contains not all the Russian records from ROR, but only those that we matched to the organizations present in RORI.

Show code
t4 <- riro_files[4] %>% read_csv(col_types = cols(.default = col_character()))  

hierarchy %>% inner_join(t4, by = "code") %>% 
  arrange(parent_code) %>% select(-parent_code, - relation, -ror_isni) %>% 
  datatable(rownames = FALSE, filter = "none", 
            escape = FALSE, class = "row-border", 
            options = list(columnDefs = list(
              list(width = '250px', targets = c(3:4)),
              list(width = '350px', targets = c(5)),
              list(width = '450px', targets = c(8)))))

The column “Relationships” have the composite values of following structure:

label:xxxx|type:yyyyyy|id:https://ror.org/zzzzz,

with 3 units (label, type, id) for the relative (according to ROR) organizations.

But in cases like our example such references can be misleading! According to ROR the research institutes of the Komi Federal Research Center have different parents:

So for a single organization ROR shows 4 accounts subordinating to three different RAS structures.

The actual truth is that the found 4 research institutes of the Komi Federal Research Center ceased to exist as legal entities 3 years ago, they were merged into one federal research center in May 2018.

Using the Table 3, one can gather the related identifiers and qualify them as corresponding to a branch or a predecessor.

Table 5 - Wikidata

WikiData is a public repository of structured data originating from multiple sources. Some sources are more or less consistent (like CrossRef or ISSN), but there’s also a lot of Wikidata records that are created and modified by people. As a result, even though Wikidata offers a pre-defined templates for the universities or research organizations, Wikidata profiles have a lot of unpopulated fields.

The table 5 comprises a list of fields found in the Wikidata profiles for the organizations, but not a full copy. Moreover, Table 5 include only those Russian research organizations that match to the organizations in RIRO.

Show code
t5 <- riro_files[5] %>% read_csv(col_types = cols(.default = col_character())) 

hierarchy %>% inner_join(t5, by = "code") %>% 
  arrange(parent_code) %>% 
  select(-parent_code, - relation, -wd_item) %>%  
  mutate_at(c("wikipedia_eng", "wikipedia_rus", "wd_itemaltlabel", "wd_altlabel"),
            ~ifelse(is.na(.x),.x, paste0(substr(.x,1,40),"...")))%>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
            options = list(columnDefs = list(
              list(width = '250px', targets = c(2:4)),
              list(width = '150px', targets = c(6:9)))))

Table 6 - Scopus

Scopus is a (one of leading) citation index accumulating the metadata from 20k+ journal titles, selected conference sources, and some academic book titles. Table 6 lists the Scopus affiliation profiles matched to the organizations in RIRO, and also a number of publications under Scopus affiliation profile (on a data of request, April 2021).

Scopus Affiliation IDs can be used for search queries via online UI or the API-service. The latter has few wrappers for python and R that make working with API more comfortable.

Please note that matching the Scopus affiliation profiles to RIRO organizations is based on the affiliation name and location. It does not guarantee that all the publications in the profile are assigned to it correctly. More details on how to edit the affiliation profiles in Scopus can be found on Elsevier web site.

Show code
t6 <- riro_files[6] %>% read_csv(col_types = cols(.default = col_character())) 

hierarchy %>% inner_join(t6, by = "code") %>% 
  arrange(parent_code) %>% select(-parent_code, - relation) %>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
            options = list(columnDefs = list(
              list(className = 'dt-center', targets = c(0,6)),
              list(width = '350px', targets = c(2:3)))))

Table 7 - Microsoft Academic

Microsoft Academic Graph (MAG) is a database created based on the information extracted with Bing-parsers from the publisher web sites and PDF files details). This approach is different from the one utilized by Web of Science and Scopus that receive a large part of information for indexation directly from the publishers.

In recent years MAG has become a source of information for many novel solutions like Lens, Semantic Scholar, Open Academic Graph, Unsub.

Even though the last news about MAG shocked us too, we decided to include the MAG Organization IDs into RORI. Few international companies committed to launching a new tool that may substitute MAG:

ANNOUNCING: We’re building a replacement for Microsoft Academic Graph. https://t.co/GXelkpt6Zc

— Our Research (@our_research) May 8, 2021

Thanks to @MSFTResearch for providing the Microsoft Academic Graph (MAG). We’ve been working with MSR and MAG since 2018, and we’ve been collaborating on this transition for some time. 1/3https://t.co/7aHTLio8uK

— Semantic Scholar (@SemanticScholar) May 11, 2021

Table 7 lists just the MAG organization IDs and names agains the RORI codes.

Show code
t7 <- riro_files[7] %>% read_csv(col_types = cols(.default = col_character())) 

hierarchy %>% inner_join(t7, by = "code") %>% 
  arrange(parent_code) %>% select(-parent_code, - relation) %>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
            options = list(autoWidth = FALSE,
              columnDefs = list(
                list(className = 'dt-center', targets = c(0:1)),
                list(width = '450px', targets = c(2)))))

Table 8 - InCites

InCites is an analytical solution build over Web of Science Core Collection. It allows to export the records, so the matched names can be used for further analysis.

The table 8 lists the official organization names in InCite and Web of Science Core Collection against the RORI codes.

Show code
t8 <- riro_files[8] %>% read_csv(col_types = cols(.default = col_character())) 

hierarchy %>% inner_join(t8, by = "code") %>% 
  arrange(parent_code) %>% select(-parent_code, - relation) %>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
            options = list(autoWidth = FALSE,
              columnDefs = list(
                list(className = 'dt-center', targets = c(0,3)),
                list(width = '500px', targets = c(1:2)))))

Table 9 - SciVal

SciVal is an analytical tool build over Scopus. Some Russian organizations have an access to SciVal API and could use the IDs matched against the RORI codes in the table 9.

Show code
t9 <- riro_files[9] %>% read_csv(col_types = cols(.default = col_character()))

hierarchy %>% inner_join(t9, by = "code") %>% 
  arrange(parent_code) %>% select(-parent_code, - relation) %>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
            options = list(autoWidth = FALSE,
              columnDefs = list(
                list(className = 'dt-center', targets = c(0:1)),
                list(width = '450px', targets = c(2)))))

Table 10 - Russian Universities Assessment System

This system governed by the Russian Ministry of Science & Higher Education, collects the various statistical reports from all Russian higher education institutions (excluding some schools under the Ministry of Defence and alike). Such reports contain a lot of useful information - from financial to enrollment data.

Table 10 lists the IDs that corresponds to the university’s web page on the portal, matched to the RORI codes.

Show code
t10 <- riro_files[10] %>% read_csv(col_types = cols(.default = col_character()))  

hierarchy %>% inner_join(t10, by = "code") %>% 
  arrange(parent_code) %>% select(-parent_code, - relation) %>% 
  datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border", 
            options = list(autoWidth = FALSE,
              columnDefs = list(list(className = 'dt-center', 
                                     targets = c(0:1)))))

Table 11 - Web of Science

Web of Science is by far the world’s oldest and most prominent citation index. At this moment Web of Science does not provide the organization IDs that could be used for search or data retrieval, but the search results have the orgaization names. The table 11 lists almost 4000 such names matched to the organizations in RIRO. This is not a complete list of known affiliation names for the Russian research organizations, but we hope to adjust this table in future releases of RIRO.

Show code
t11 <- riro_files[11] %>% read_csv(col_types = cols(.default = col_character())) 

hierarchy %>% inner_join(t11, by = "code") %>% 
  arrange(parent_code) %>% select(-parent_code, - relation) %>% 
  datatable(rownames = FALSE, filter = "none", 
            escape = FALSE, class = "row-border", 
            options = list(autoWidth = FALSE,
              columnDefs = list(
                list(className = 'dt-center', targets = c(0)),
                list(width = '500px', targets = c(1,3)))))

All IDs for 3 organizations

Now, as we have glanced at the identifiers found in each table for 3 selected organizations, we are ready for a wider picture.

An illustration below shows the identifiers (ROR, GRID, Scopus Affiliation ID, InCites ID, MAG, Wikidata, 1-Monitoring) matched to parent organizations, branches and predecessors - each organization in a separate section. The identifiers are placed along X-axis (by organization). The entities (RORI records) are placed along Y-axis – the existing organizations are shown as squares (the parent organizations are marked with a special sign), the predecessors as circles.

Show code
knitr::include_graphics(chart_filename)

This picture is not an example of clarity, I admit, so let me explain it more thoroughly. The identifiers for each organization are present in a separacte section (from left to right). Each section has its own number of horizontal rows corresponding to the parent organization, branches and predecessors (from top to bottom). Marked as squares are the existing organizations - either a parent one (also marked with a sign, on top) or the branches (under the parent). Marked as circles (with less saturated colours) are the identifiers corresponding to the predecessors (liquidated organizations).

Or in a more plain language - every circle is an organization identifier that corresponds to a predecessor, not to existing legal entity.

RIRO’s scope?

An UpSet diagram below shows the sets of organizations listen in RIRO (v.1.0) and matched to the various identifiers. Only the head and active (existing) organizations are counted here, so the identifiers referring to the branches or to the predecessors (liquidated via acquisition), are not counted.

Show code
knitr::include_graphics(chart_filename2)

RIRO v.1.0 lists 1774 existing parent organizations. Many of those have no external identifiers, but (as previous picture proved) may have a lot of identifiers corresponding to the branches or predecessors.

Additional Notes

Feedback

So far RIRO is developed by only 2 persons - me and Ivan Sterligov orcid - so we decided to start with a google form (in Russian) with 5 pre-defined scenarios of change requests:

The future RORI releases will include a log of made changes.

Acknowledgments

Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2021). rmarkdown: Dynamic Documents for R. R package version 2.7, <URL: https://github.com/rstudio/rmarkdown>.

Blondel E (2021). zen4R: Interface to ‘Zenodo’ REST API. R package version 0.4-3, <URL: https://github.com/eblondel/zen4R>.

Chang, W (2014). extrafont: Tools for using fonts. R package version 0.17, <URL: https://CRAN.R-project.org/package=extrafont>.

Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, <URL: https://CRAN.R-project.org/package=purrr>.

Krassowski M (2020). “ComplexUpset.” doi: 10.5281/zenodo.3700590 (URL: https://doi.org/10.5281/zenodo.3700590), <URL: https://doi.org/10.5281/zenodo.3700590>.

Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H (2014). “UpSet: Visualization of Intersecting Sets,.” IEEE Transactions on Visualization and Computer Graphics, 20(12), 1983–1992. doi: 10.1109/TVCG.2014.2346248 (URL: https://doi.org/10.1109/TVCG.2014.2346248), <URL: https://doi.org/10.1109/TVCG.2014.2346248>.

Wickham H (2020). tidyr: Tidy Messy Data. R package version 1.1.2, <URL: https://CRAN.R-project.org/package=tidyr>.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, <URL: https://ggplot2.tidyverse.org>.

Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=stringr>.

Wickham H, Francois R, Henry L, Muller K (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.3, <URL: https://CRAN.R-project.org/package=dplyr>.

Wickham H, Hester J (2020). readr: Read Rectangular Text Data. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=readr>.

Wickham H, Seidel D (2020). scales: Scale Functions for Visualization. R package version 1.1.1, <URL: https://CRAN.R-project.org/package=scales>.

Xie Y (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.30, <URL: https://yihui.org/knitr/>.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <URL: https://yihui.org/knitr/>.

Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595, <URL: http://www.crcpress.com/product/isbn/9781466561595>.

Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, <URL: https://bookdown.org/yihui/rmarkdown>.

Xie Y, Cheng J, Tan X (2021). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.17, <URL: https://CRAN.R-project.org/package=DT>.

Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, <URL: https://bookdown.org/yihui/rmarkdown-cookbook>.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Lutai & Sterligov (2021, May 25). ConviviaR Tools: RIRO - Russian Index of the Research Organizations. Retrieved from https://dwayzer.netlify.app/posts/2021-05-24-riro/

BibTeX citation

@misc{lutai2021riro,
  author = {Lutai, Aleksei and Sterligov, Ivan},
  title = {ConviviaR Tools: RIRO - Russian Index of the Research Organizations},
  url = {https://dwayzer.netlify.app/posts/2021-05-24-riro/},
  year = {2021}
}