Dataverse: v5.4 Release

Release date: April 5, 2021
Previous version: v5.3 (released December 22, 2020)
Magnitude: 1,224 Diff Delta
Contributors: 14 total committers

Top Contributors in v5.4

qqmyers
mheppler
pdurbin
landreev
djbrooke
kcondon
scolapasta
sekmiller
donsizemore
janvanmansum

Release Notes Published

Dataverse Software 5.4

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project. Please note that there is an API backwards compatibility issue in 5.4, and we recommend using 5.4.1 for any production environments.

Release Highlights

Deactivate Users API, Get User Traces API, Revoke Roles API

A new API has been added to deactivate users to prevent them from logging in, receiving communications, or otherwise being active in the system. Deactivating a user is an alternative to deleting a user, especially when the latter is not possible due to the amount of interaction the user has had with the Dataverse installation. In order to learn more about a user before deleting, deactivating, or merging, a new "get user traces" API is available that will show objects created, roles, group memberships, and more. Finally, the "remove all roles" button available in the superuser dashboard is now also available via API.
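For example (a sketch based on the Native API Guide, which documents the authoritative paths; localhost:8080, jdoe, and $API_TOKEN below are placeholders for your server, a username, and a superuser's API token):

# deactivate a user
curl -X POST http://localhost:8080/api/admin/authenticatedUsers/jdoe/deactivate

# list a user's traces (objects created, roles, group memberships, and more)
curl -H "X-Dataverse-key:$API_TOKEN" http://localhost:8080/api/users/jdoe/traces

# remove all of a user's assigned roles
curl -H "X-Dataverse-key:$API_TOKEN" -X POST http://localhost:8080/api/users/jdoe/removeRoles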

New File Access API

A new API offers a crawlable view of the folders and files within a dataset:

/api/datasets/<dataset id>/dirindex/

will output a simple HTML listing, based on the standard Apache directory index, with Access API download links for individual files and recursive calls to the API above for subfolders. Please see the Native API Guide for more information.

Using this API, wget --recursive (or a similar crawling client) can be used to download all the files in a dataset, preserving the file names and folder structure, without having to use the download-as-zip API. In addition to being faster (zipping is a relatively resource-intensive operation on the server side), this process can be restarted if interrupted (with wget --continue or equivalent), unlike zipped multi-file downloads, which always have to start from the beginning.
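For example, the following wget invocation (adapted from the Native API Guide; the dataset id 24 and localhost:8080 are placeholders) crawls an entire dataset. --cut-dirs=3 strips the api/datasets/24 path components and --content-disposition preserves the original file names:

wget -r -e robots=off -nH --cut-dirs=3 --content-disposition http://localhost:8080/api/datasets/24/dirindex/

The -e robots=off flag is needed because wget honors robots.txt by default.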

On a system that uses S3 with download redirects, the individual file downloads will be handled by S3 directly (with the exception of tabular files), without having to be proxied through the Dataverse application.

Restricted Files and DDI "dataDscr" Information (Summary Statistics, Variable Names, Variable Labels)

In previous releases, DDI "dataDscr" information (summary statistics, variable names, and variable labels, sometimes known as "variable metadata") for successfully ingested tabular files was available even if the files were restricted. This has been changed in the following ways:

  • At the dataset level, DDI exports no longer show "dataDscr" information for restricted files. There is only one version of this export and it is the version that's suitable for public consumption with the "dataDscr" information hidden for restricted files.
  • Similarly, at the dataset level, the DDI HTML Codebook no longer shows "dataDscr" information for restricted files.
  • At the file level, "dataDscr" information is no longer publicly available for restricted files. In practice, it was only possible to get this publicly via API (the download/access button was hidden).
  • At the file level, "dataDscr" (variable metadata) information can still be downloaded for restricted files if you have access to download the file.

Search with Accented Characters

Many languages include characters that have close analogs in ASCII, e.g. á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü. This release changes the default Solr configuration so that search matches words based on these associations, e.g. a search for Mercè would match the word Merce in a dataset, and vice versa. This should generally be helpful, but can result in false positives, e.g. "canon" will be found when searching for "cañon".
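Under the hood, the change is to the Solr field analyzers. As an illustrative sketch (not the exact 5.4 schema.xml; the field type name and surrounding filters here are assumptions), Solr's ASCIIFoldingFilterFactory folds accented characters to their ASCII equivalents at index and query time:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- fold accented characters to their ASCII analogs, e.g. Mercè to Merce -->
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>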

Java 11, PostgreSQL 13, and Solr 8 Support/Upgrades

Several of the core components of the Dataverse Software have been upgraded. Specifically:

  • The Dataverse Software now runs on and requires Java 11. This provides performance and security enhancements, allows developers to take advantage of new and updated Java features, and moves the project to a platform with better long-term support. This upgrade requires a few extra steps in the release process, outlined below.
  • The Dataverse Software has now been tested with PostgreSQL versions up to 13. Versions 9.6+ will still work, but this update is necessary to support the software beyond PostgreSQL 9.6's end of life later in 2021.
  • The Dataverse Software now runs on Solr 8.8.1, the latest available stable release in the Solr 8.x series.

Saved Search Performance Improvements

A refactoring has greatly improved Saved Search performance in the application. If your installation has multiple, potentially long-running Saved Searches in place, this greatly improves the probability that those search jobs will complete without timing out.

Worldmap/Geoconnect Integration Now Obsolete

As of this release, the Geoconnect/Worldmap integration is no longer available. The Harvard University Worldmap is going through a migration process, and instead of updating this code to work with the new infrastructure, the decision was made to pursue future Geospatial exploration/analysis through other tools, following the External Tools Framework in the Dataverse Software.

Guides Updates

The Dataverse Software Guides have been updated to follow recent changes to how different terms are used across the Dataverse Project. For more information, see Mercè's note to the community:

https://groups.google.com/g/dataverse-community/c/pD-aFrpXMPo

Conditionally Required Metadata Fields

Prior to this release, when defining metadata for compound fields (via their dataset field types), fields could either be optional or required, i.e. if required, you must always have (at least one) value for that field. For example, Author Name being required means you must have at least one Author with a nonempty Author Name.

In order to support more robust metadata (and specifically to resolve #7551), we need to allow a third case: Conditionally Required, that is, the field is required if and only if any of its "sibling" fields are entered. For example, Producer Name is now conditionally required in the citation metadata block. A user does not have to enter a Producer, but if they do, they have to enter a Producer Name.

Major Use Cases

Newly-supported major use cases in this release include:

  • Dataverse Installation Administrators can now deactivate users using a new API. (Issue #2419, PR #7629)
  • Superusers can remove all of a user's assigned roles using a new API. (Issue #2419, PR #7629)
  • Superusers can use an API to gather more information about actions a user has taken in the system in order to make an informed decision about whether or not to deactivate or delete a user. (Issue #2419, PR #7629)
  • Superusers will now be able to harvest from installations using ISO-639-3 language codes. (Issue #7638, PR #7690)
  • Users interacting with the workflow system will receive status messages. (Issue #7564, PR #7635)
  • Users interacting with prepublication workflows will see speed improvements. (Issue #7681, PR #7682)
  • API Users will receive Dataverse collection API responses in a deterministic order. (Issue #7634, PR #7708)
  • API Users will be able to access a list of crawlable URLs for file download, allowing for faster and easily resumable transfers. (Issue #7084, PR #7579)
  • Users will no longer be able to access summary stats for restricted files. (Issue #7619, PR #7642)
  • Users will now see truncated versions of long strings (primarily checksums) throughout the application. (Issue #6685, PR #7312)
  • Users will now be able to easily copy checksums, API tokens, and private URLs with a single click. (Issue #6039, Issue #6685, PR #7539, PR #7312)
  • Users uploading data through the Direct Upload API will now be able to use additional checksums. (Issue #7600, PR #7602)
  • Users searching for content will now be able to search using non-ASCII characters. (Issue #820, PR #7378)
  • Users can now replace files in draft datasets, a functionality previously only available on published datasets. (Issue #7149, PR #7337)
  • Dataverse Installation Administrators can now set subfields of compound fields as conditionally required, that is, the field is required if and only if any of its "sibling" fields are entered. For example, Producer Name is now conditionally required in the citation metadata block. A user does not have to enter a Producer, but if they do, they have to enter a Producer Name. (Issue #7606, PR #7608)

Notes for Dataverse Installation Administrators

Java 11 Upgrade

There are some things to note and keep in mind regarding the move to Java 11:

  • You should install the JDK/JRE following your usual methods, depending on your operating system. An example of this on a RHEL/CentOS 7 or RHEL/CentOS 8 system is:

    $ sudo yum remove java-1.8.0-openjdk java-1.8.0-openjdk-devel java-1.8.0-openjdk-headless

    $ sudo yum install java-11-openjdk-devel

    The remove command may provide an error message if -headless isn't installed. (A quick version check is shown after this list.)

  • We targeted and tested Java 11, but 11+ will likely work. Java 11 was targeted because of its long-term support.

  • If you're moving from a Dataverse installation that was previously running Glassfish 4.x (typically this would be Dataverse Software 4.x), you will need to adjust some JVM options in domain.xml as part of the upgrade process. We've provided these optional steps below. These steps are not required if your first installed Dataverse Software version was running Payara 5.x (typically Dataverse Software 5.x).
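After installing the JDK (the first item above), a quick sanity check confirms that an 11.x runtime is active:

$ java -version

The output should report an OpenJDK (or equivalent) version 11 runtime; the exact build string varies by distribution.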

PostgreSQL Versions Up To 13 Supported

Up until this release, our Installation Guide "strongly recommended" installing PostgreSQL 9.6. While that version is known to be very stable, it is nearing its end of life (November 2021). The Dataverse Software has now been tested with versions up to 13. If you decide to upgrade PostgreSQL, the tested and recommended way of doing so is as follows:

  • Export your current database with pg_dumpall;
  • Install the new version of PostgreSQL; (make sure it's running on the same port, etc. so that no changes are needed in the Payara configuration)
  • Re-import the database with psql, as the postgres user.

Consult the PostgreSQL upgrade documentation for more information, for example https://www.postgresql.org/docs/13/upgrading.html#UPGRADING-VIA-PGDUMPALL.
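A minimal sketch of those three steps (run as the postgres user; the backup path is a placeholder, and the old cluster must still be running for the export):

$ pg_dumpall > /tmp/dataverse_db_backup.sql

# stop the old cluster, install PostgreSQL 13, start it on the same port, then:

$ psql -f /tmp/dataverse_db_backup.sql postgres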

Solr Upgrade

With this release we upgrade to the latest available stable release in the Solr 8.x branch. We recommend a fresh installation of Solr 8.8.1 (the index will be empty) followed by an "index all".

Before you start the "index all", the Dataverse installation will appear to be empty because the search results come from Solr. As indexing progresses, partial results will appear until indexing is complete.

See http://guides.dataverse.org/en/5.4/installation/prerequisites.html#installing-solr for more information.
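Once Solr 8.8.1 is running and the application is deployed, the "index all" can be kicked off via the admin API (assuming the application listens on localhost:8080):

curl http://localhost:8080/api/admin/index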

Managing Conditionally Required Metadata Fields

Prior to this release, when defining metadata for compound fields (via their dataset field types), fields could either be optional or required, i.e. if required, you must always have (at least one) value for that field. For example, Author Name being required means you must have at least one Author with a nonempty Author Name.

In order to support more robust metadata (and specifically to resolve #7551), we need to allow a third case: Conditionally Required, that is, the field is required if and only if any of its "sibling" fields are entered. For example, Producer Name is now conditionally required in the citation metadata block. A user does not have to enter a Producer, but if they do, they have to enter a Producer Name.

This change required some modifications to how "required" is defined in the metadata .tsv files (for compound fields).

Prior to this release, the value of required for the parent compound field did not matter and so was set to false.

Going forward:

  • For optional, the parent compound field would be required = false and all children would be required = false.
  • For required, the parent compound field would be required = true and at least one child would be required = true.
  • For conditionally required, the parent compound field would be required = false and at least one child would be required = true. (An abbreviated example follows this list.)
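As an abbreviated illustration of the conditionally required case (most .tsv columns are elided here; only name, required, and parent are shown, and the alignment is for readability only):

#datasetField  name          ...  required  parent
               producer      ...  FALSE
               producerName  ...  TRUE      producer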

This release updates the citation .tsv file that is distributed with the software for the required parent compound fields (e.g. author), as well as sets Producer Name to be conditionally required. No other distributed .tsv files were updated, as they did not have any required compound values.

If you have created any custom metadata .tsv files, you will need to make the same (type of) changes there.

Citation Metadata Block Update

Due to the changes for Conditionally Required Metadata Fields, and a minor update in the citation metadata block to support extra ISO-639-3 language codes, a block upgrade is required. Instructions are provided below.

Retroactively Store Original File Size

Beginning in Dataverse Software 4.10, the size of the saved original file (for an ingested tabular datafile) was stored in the database. For files added before this change, we provide an API that retrieves and permanently stores the sizes for any already existing saved originals. See Datafile Integrity API for more information.
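As a sketch (the Datafile Integrity API documentation is authoritative for the exact method and path; localhost:8080 is a placeholder):

curl http://localhost:8080/api/admin/datafiles/integrity/fixmissingoriginalsizes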

This was documented as a step in previous release notes, but we are noting it in these release notes to give it more visibility.

DB Cleanup for Saved Searches

A previous version of the Dataverse Software changed the indexing logic so that when a user links a Dataverse collection, its children are also indexed as linked. This means that the children do not need to be separately linked, and in this version we removed the logic that creates a saved search to create those links when a Dataverse collection is linked.

We recommend cleaning up the database to a) remove these saved searches and b) remove the links for the objects. This can be done via a few queries, which are available in the folder here:

https://github.com/IQSS/dataverse/raw/develop/scripts/issues/7398/

There are four sets of queries available, and they should be run in this order (an example psql invocation follows the list):

  • ss_for_deletion.txt to identify the Saved Searches to be deleted
  • delete_ss.txt to delete the Saved Searches identified in the previous query
  • dld_for_deletion.txt to identify the linked datasets and Dataverse collections to be deleted
  • delete_dld.txt to delete the linked datasets and Dataverse collections identified in the previous queries
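For example, each file can be run against the database with psql (dvndb and dataverse below are placeholder database and role names; adjust for your installation):

psql -U dataverse -d dvndb -f ss_for_deletion.txt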

Note: removing these saved searches and links should not affect what users will see as linked due to the aforementioned indexing change. Similarly, not removing these saved searches and links should not affect anything, but is a cleanup of unnecessary rows in the database.

DB Cleanup for Superusers Releasing without Version Updates

In datasets where a superuser has run the Curate command and the update included a change to the fileaccessrequest flag, those changes would not be reflected appropriately in the published version. This should be a rare occurrence.

Instead of an automated solution, we recommend inspecting the affected datasets and correcting the fileaccessrequest flag as appropriate. You can identify the affected datasets via a query, which is available in the folder here:

https://github.com/IQSS/dataverse/raw/develop/scripts/issues/7687/

New JVM Options and Database Settings

For installations that were previously running on Dataverse Software 4.x, a number of new JVM options need to be added as part of the upgrade. The JVM Options are enumerated in the detailed upgrade instructions below.

Two new Database settings were added:

  • :InstallationName
  • :ExportInstallationAsDistributorOnlyWhenNotSet

For an overview of these new options, please see the Installation Guide.
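These can be set via the standard settings API, for example (the value "Example Installation" is a placeholder):

curl -X PUT -d "Example Installation" http://localhost:8080/api/admin/settings/:InstallationName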

Notes for Tool Developers and Integrators

UTF-8 Characters and Spaces in File Names

UTF-8 characters in filenames are now preserved when downloaded.

Dataverse installations will no longer replace spaces in file names of downloaded files with the + character. If your tool or integration has any special handling around this, you may need to make further adjustments to maintain backwards compatibility while also supporting Dataverse installations on 5.4+.

Note that this follows a change from 5.1 that only corrected this for installations running with S3 storage. This makes the behavior consistent across installations running all types of file storage.

Complete List of Changes

For the complete list of code changes in this release, see the 5.4 Milestone in GitHub.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email [email protected].

Installation

If this is a new installation, please see our Installation Guide.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.4.

1. Upgrade to Java 11.

2. Upgrade to Solr 8.8.1.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

3. Undeploy the previous version.

  • $PAYARA/bin/asadmin list-applications
  • $PAYARA/bin/asadmin undeploy dataverse<-version>

4. Stop Payara and remove the generated directory.

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

5. (only required for installations previously running Dataverse Software 4.x!) In other words, if you have a domain.xml that originated under Glassfish 4, the below JVM Options need to be added. If your Dataverse installation was first installed on the 5.x series, these JVM options should already be present.

In domain.xml:

Remove the following JVM options from the <config name="server-config"><java-config> section:

<jvm-options>-Djava.endorsed.dirs=/usr/local/payara5/glassfish/modules/endorsed:/usr/local/payara5/glassfish/lib/endorsed</jvm-options>

<jvm-options>-Djava.ext.dirs=${com.sun.aas.javaRoot}/lib/ext${path.separator}${com.sun.aas.javaRoot}/jre/lib/ext${path.separator}${com.sun.aas.instanceRoot}/lib/ext</jvm-options>

Add the following JVM options to the <config name="server-config"><java-config> section:

<jvm-options>[9|]--add-opens=java.base/jdk.internal.loader=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=jdk.management/com.sun.management.internal=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-exports=java.base/jdk.internal.ref=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.base/java.lang=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.base/java.net=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.base/java.nio=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.base/java.util=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.base/sun.nio.ch=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.management/sun.management=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.base/sun.net.www.protocol.jrt=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.base/sun.net.www.protocol.jar=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.naming/javax.naming.spi=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED</jvm-options>

<jvm-options>[9|]--add-opens=java.logging/java.util.logging=ALL-UNNAMED</jvm-options>

6. Start Payara.

  • service payara start

7. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-5.4.war

8. Restart Payara.

  • service payara stop
  • service payara start

9. Reload Citation Metadata Block:

wget https://github.com/IQSS/dataverse/releases/download/v5.4/citation.tsv

curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @citation.tsv -H "Content-type: text/tab-separated-values"

Additional Release Steps

1. Confirm that the schema.xml was updated with the new v5.4 version when you updated Solr.

2. Run the script updateSchemaMDB.sh to generate updated Solr schema files and preserve any other custom fields in your Solr configuration.

For example: (modify the path names as needed)

cd /usr/local/solr-8.8.1/server/solr/collection1/conf

wget https://github.com/IQSS/dataverse/releases/download/v5.4/updateSchemaMDB.sh

chmod +x updateSchemaMDB.sh

./updateSchemaMDB.sh -t .

See https://guides.dataverse.org/en/5.4/admin/metadatacustomization.html#updating-the-solr-schema for more information.

3. Do a clean reindex by first clearing and then indexing. Re-indexing is required to get full functionality from this change. Please refer to the guides on how to clear and index if needed.
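A sketch of the clear-and-index sequence via the admin API (assuming the application listens on localhost:8080; see the Admin Guide for details):

curl http://localhost:8080/api/admin/index/clear

curl http://localhost:8080/api/admin/index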

4. Upgrade Postgres.

  • Export your current database with pg_dumpall;
  • Install the new version of PostgreSQL; (make sure it's running on the same port, etc. so that no changes are needed in the Payara configuration)
  • Re-import the database with psql, as the postgres user.

Consult the PostgreSQL upgrade documentation for more information, for example https://www.postgresql.org/docs/13/upgrading.html#UPGRADING-VIA-PGDUMPALL.

5. Retroactively store original file size

Use the Datafile Integrity API to ensure that the sizes of all original files are stored in the database.

6. DB Cleanup for Superusers Releasing without Version Updates

In datasets where a superuser has run the Curate command and the update included a change to the fileaccessrequest flag, those changes would not be reflected appropriately in the published version. This should be a rare occurrence.

Instead of an automated solution, we recommend inspecting the affected datasets and correcting the fileaccessrequest flag as appropriate. You can identify the affected datasets via a query, which is available in the folder here:

https://github.com/IQSS/dataverse/raw/develop/scripts/issues/7687/

7. (Optional, but recommended) DB Cleanup for Saved Searches and Linked Objects

Perform the DB Cleanup for Saved Searches and Linked Objects, summarized in the "Notes for Dataverse Installation Administrators" section above.

8. Take a backup of the Worldmap links, if any.

9. (Only required if custom metadata blocks are used in your Dataverse installation) Update any custom metadata blocks:

In the .tsv for any custom metadata blocks, for any subfield that has a required value of TRUE, find the corresponding parent field and change its required value to TRUE.

Note: As there is an accompanying Flyway script that updates the values directly in the database, you do not need to reload these metadata .tsv files via API, unless you make additional changes, e.g. set some compound fields to be conditionally required.