Dataverse: v5.10 Release

Release date:
March 18, 2022
Previous version:
v5.9 (released December 15, 2021)
Magnitude:
5,751 Diff Delta
Contributors:
24 total committers
Data confidence:
Commits:

91 Features Released with v5.10

Top Contributors in v5.10

qqmyers
pdurbin
poikilotherm
landreev
lubitchv
donsizemore
sekmiller
eunices
mdmADA
haarli

Directory Browser for v5.10

We haven't yet finished calculating and confirming the files and directories changed in this release. Please check back soon.

Release Notes Published

Dataverse Software 5.10

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Multiple License Support

Users can now select from a set of configured licenses in addition to or instead of the previous Creative Commons CC0 choice or provide custom terms of use (if configured) for their datasets. Administrators can configure their Dataverse instance via API to allow any desired license as a choice and can enable or disable the option to allow custom terms. Administrators can also mark licenses as "inactive" to disallow future use while keeping that license for existing datasets. For upgrades, only the CC0 license will be preinstalled. New installations will have both CC0 and CC BY preinstalled. The Configuring Licenses section of the Installation Guide shows how to add or remove licenses.

Note: Datasets in existing installations will automatically be updated to conform to new requirements that custom terms cannot be used with a standard license and that custom terms cannot be empty. Administrators may wish to manually update datasets with these conditions if they do not like the automated migration choices. See the "Notes for Dataverse Installation Administrators" section below for details.

This release also makes the license selection and/or custom terms more prominent when publishing and viewing a dataset and when downloading files.

Ingest and File Upload Messaging Improvements

Messaging around ingest failure has been softened to prevent support tickets. In addition, messaging during file upload has been improved, especially with regard to showing size limits and providing links to the guides about tabular ingest. For screenshots and additional details see PR #8271.

Downloading of Guestbook Responses with Fewer Clicks

A download button has been added to the page that lists guestbooks. This saves a click but you can still download responses from the "View Responses" page, as before.

Also, links to the guides about guestbooks have been added in additional places.

Dynamically Request Arbitrary Metadata Fields from Search API

The Search API now allows arbitrary metadata fields to be requested when displaying results from datasets. You can request all fields from metadata blocks or pick and choose certain fields.

The new parameter is called metadata_fields and the Search API documentation contains details and examples: https://guides.dataverse.org/en/5.10/api/search.html

Solr 8 Upgrade

The Dataverse Software now runs on Solr 8.11.1, the latest available stable release in the Solr 8.x series.

PostgreSQL Upgrade

A PostgreSQL upgrade is not required for this release but is planned for the next release. See below for details.

Major Use Cases and Infrastructure Enhancements

Changes and fixes in this release include:

  • When creating or updating datasets, users can select from a set of licenses configured by the administrator (CC, CC BY, custom licenses, etc.) or provide custom terms (if the installation is configured to allow them). (Issue #7440, PR #7920)
  • Users can get better feedback on tabular ingest errors and more information about size limits when uploading files. (Issue #8205, PR #8271)
  • Users can more easily download guestbook responses and learn how guestbooks work. (Issue #8244, PR #8402)
  • Search API users can specify additional metadata fields to be returned in the search results. (Issue #7863, PR #7942)
  • The "Preview" tab on the file page can now show restricted files. (Issue #8258, PR #8265)
  • Users wanting to upload files from GitHub to Dataverse can learn about a new GitHub Action called "Dataverse Uploader". (PR #8416)
  • Users requesting access to files now get feedback that it was successful. (Issue #7469, PR #8341)
  • Users may notice various accessibility improvements. (Issue #8321, PR #8322)
  • Users of the Social Science metadata block can now add multiples of the "Collection Mode" field. (Issue #8452, PR #8473)
  • Guestbooks now support multi-line text area fields. (Issue #8288, PR #8291)
  • Guestbooks can better handle commas in responses. (Issue #8193, PR #8343)
  • Dataset editors can now deselect a guestbook. (Issue #2257, PR #8403)
  • Administrators with a large actionlogrecord table can read docs on archiving and then trimming it. (Issue #5916, PR #8292)
  • Administrators can list locks across all datasets. (PR #8445)
  • Administrators can run a version of Solr that doesn't include a version of log4j2 with serious known vulnerabilities. We trust that you have patched the version of Solr you are running now following the instructions that were sent out. An upgrade to the latest version is recommended for extra peace of mind. (PR #8415)
  • Administrators can run a version of Dataverse that doesn't include a version of log4j with known vulnerabilities. (PR #8377)

Notes for Dataverse Installation Administrators

Updating for Multiple License Support

Adding and Removing Licenses and How Existing Datasets Will Be Automatically Updated

As part of installing or upgrading an existing installation, administrators may wish to add additional license choices and/or configure Dataverse to allow custom terms. Adding additional licenses is managed via API, as explained in the Configuring Licenses section of the Installation Guide. Licenses are described via a JSON structure providing a name, URL, short description, and optional icon URL. Additionally licenses may be marked as active (selectable for new or updated datasets) or inactive (only allowed on existing datasets) and one license can be marked as the default. Custom Terms are allowed by default (backward compatible with the current option to select "No" to using CC0) and can be disabled by setting :AllowCustomTermsOfUse to false.

Further, administrators should review the following automated migration of existing licenses and terms into the new license framework and, if desired, should manually find and update any datasets for which the automated update is problematic. To understand the migration process, it is useful to understand how the multiple license feature works in this release:

"Custom Terms", aka a custom license, are defined through entries in the following fields of the dataset "Terms" tab:

  • Terms of Use
  • Confidentiality Declaration
  • Special Permissions
  • Restrictions
  • Citation Requirements
  • Depositor Requirements
  • Conditions
  • Disclaimer

"Custom Terms" require, at a minimum, a non-blank entry in the "Terms of Use" field. Entries in other fields are optional.

Since these fields are intended for terms/conditions that would potentially conflict with or modify the terms in a standard license, they are no longer shown when a standard license is selected.

In earlier Dataverse releases, it was possible to select the CC0 license and have entries in the fields above. It was also possible to say "No" to using CC0 and leave all of these terms fields blank.

The automated process will update existing datasets as follows.

  • "CC0 Waiver" and no entries in the fields above -> CC0 License (no change)
  • No CC0 Waiver and an entry in the "Terms of Use" field and possibly others fields listed above -> "Custom Terms" with the same entries in these fields (no change)
  • CC0 Waiver and an entry in some of the fields listed -> 'Custom Terms' with the following text preprended in the "Terms of Use" field: "This dataset is made available under a Creative Commons CC0 license with the following additional/modified terms and conditions:"
  • No CC0 Waiver and an entry in a field(s) other than the "Terms of Use" field -> "Custom Terms" with the following "Terms of Use" added: "This dataset is made available with limited information on how it can be used. You may wish to communicate with the Contact(s) specified before use."
  • No CC0 Waiver and no entry in any of the listed fields -> "Custom Terms" with the following "Terms of Use" added: "This dataset is made available without information on how it can be used. You should communicate with the Contact(s) specified before use."

Administrators who have datasets where CC0 has been selected along with additional terms, or datasets where the Terms of Use field is empty, may wish to modify those datasets prior to upgrading to avoid the automated changes above. This is discussed next.

Handling Datasets that No Longer Comply With Licensing Rules

In most Dataverse installations, one would expect the vast majority of datasets to either use the CC0 Waiver or have non-empty Terms of Use. As noted above, these will be migrated without any issue. Administrators may however wish to find and manually update datasets that specified a CC0 license but also had terms (no longer allowed) or had no license and no terms of use (also no longer allowed) rather than accept the default migrations for these datasets listed above.

Finding and Modifying Datasets with a CC0 License and Non-Empty Terms

To find datasets with a CC0 license and non-empty terms:

select CONCAT('doi:', dvo.authority, '/', dvo.identifier), v.alias as dataverse_alias, case when versionstate='RELEASED' then concat(dv.versionnumber, '.', dv.minorversionnumber) else versionstate END  as version, dv.id as datasetversion_id, t.id as termsofuseandaccess_id, t.termsofuse, t.confidentialitydeclaration, t.specialpermissions, t.restrictions, t.citationrequirements, t.depositorrequirements, t.conditions, t.disclaimer from dvobject dvo, termsofuseandaccess t, datasetversion dv, dataverse v where dv.dataset_id=dvo.id and dv.termsofuseandaccess_id=t.id and dvo.owner_id=v.id and t.license='CC0' and not (t.termsofuse is null and t.confidentialitydeclaration is null and t.specialpermissions is null and t.restrictions is null and citationrequirements is null and t.depositorrequirements is null and t.conditions is null and t.disclaimer is null);

The datasetdoi column will let you find and view the affected dataset in the Dataverse web interface. The version column will indicate which version(s) are relevant. The dataverse_alias will tell you which Dataverse collection the dataset is in (and may be useful if you want to adjust all datasets in a given collection). The termsofuseandaccess_id column indicates which specific entry in that table is associated with the dataset/version. The remaining columns show the values of any terms fields.

There are two options to migrate such datasets:

Option 1: Set all terms fields to null:

update termsofuseandaccess set termsofuse=null, confidentialitydeclaration=null, t.specialpermissions=null, t.restrictions=null, citationrequirements=null, depositorrequirements=null, conditions=null, disclaimer=null where id=<id>;

or to change several at once:

update termsofuseandaccess set termsofuse=null, confidentialitydeclaration=null, t.specialpermissions=null, t.restrictions=null, citationrequirements=null, depositorrequirements=null, conditions=null, disclaimer=null where id in (<comma separated list of termsanduseofaccess_ids>);

Option 2: Change the Dataset version(s) to not use the CCO waiver and modify the Terms of Use (and/or other fields) as you wish to indicate that the CC0 waiver was previously selected:

    update termsofuseandaccess set license='NONE', termsofuse=concat('New text. ', termsofuse) where id=<id>;

or

    update termsofuseandaccess set license='NONE', termsofuse=concat('New text. ', termsofuse) where id in (<comma separated list of termsanduseofaccess_ids>);
Finding and Modifying Datasets without a CC0 License and with Empty Terms

To find datasets with a without a CC0 license and with empty terms:

select CONCAT('doi:', dvo.authority, '/', dvo.identifier), v.alias as dataverse_alias, case when versionstate='RELEASED' then concat(dv.versionnumber, '.', dv.minorversionnumber) else versionstate END  as version, dv.id as datasetversion_id, t.id as termsofuseandaccess_id, t.termsofuse, t.confidentialitydeclaration, t.specialpermissions, t.restrictions, t.citationrequirements, t.depositorrequirements, t.conditions, t.disclaimer from dvobject dvo, termsofuseandaccess t, datasetversion dv, dataverse v where dv.dataset_id=dvo.id and dv.termsofuseandaccess_id=t.id and dvo.owner_id=v.id and t.license='NONE' and t.termsofuse is null;

As before, there are a couple options.

Option 1: These datasets could be updated to use CC0:

update termsofuseandaccess set license='CC0', confidentialitydeclaration=null, t.specialpermissions=null, t.restrictions=null, citationrequirements=null, depositorrequirements=null, conditions=null, disclaimer=null where id=<id>;

Option 2: Terms of Use could be added:

update termsofuseandaccess set termsofuse='New text. ' where id=<id>;

In both cases, the same where id in (<comma separated list of termsanduseofaccess_ids>); ending could be used to change multiple datasets/versions at once.

Standardizing Custom Licenses

If many datasets use the same set of Custom Terms, it may make sense to create and register a standard license including those terms. Doing this would include:

  • Creating and posting an external document that includes the custom terms, i.e. an HTML document with sections corresponding to the terms fields that are used.
  • Defining a name, short description, URL (where it is posted), and optionally an icon URL for this license
  • Using the Dataverse API to register the new license as one of the options available in your installation
  • Using the API to make sure the license is active and deciding whether the license should also be the default
  • Once the license is registered with Dataverse, making an SQL update to change datasets/versions using that license to reference it instead of having their own copy of those custom terms.

The benefits of this approach are:

  • usability: the license can be selected for new datasets without allowing custom terms and without users having to cut/paste terms or collection administrators having to configure templates with those terms
  • efficiency: custom terms are stored per dataset whereas licenses are registered once and all uses of it refer to the same object and external URL
  • security: with the license terms maintained external to Dataverse, users cannot edit specific terms and curators do not need to check for edits

Once a standardized version of your Custom Terms are registered as a license, an SQL update like the following can be used to have datasets use it:

UPDATE termsofuseandaccess
SET license_id = (SELECT license.id FROM license WHERE license.name = '<Your License Name>'), termsofuse=null, confidentialitydeclaration=null, t.specialpermissions=null, t.restrictions=null, citationrequirements=null, depositorrequirements=null, conditions=null, disclaimer=null 
WHERE termsofuseandaccess.termsofuse LIKE '%<Unique phrase in your Terms of Use>%';

Note that this information is also available in the Configuring Licenses section of the Installation Guide. Look for "Standardizing Custom Licenses".

PostgreSQL Version 10+ Required

If you are still using PostgreSQL 9.x, now is the time to upgrade. PostgreSQL 9.x is now EOL (no longer supported, as of January 2022), and in the next version of the Dataverse Software we plan to upgrade the Flyway library (used for database migrations) to a version that will no longer work with versions prior to PostgreSQL 10. See PR #8296 for more on this upcoming Flyway upgrade.

The Dataverse Software has been tested with PostgreSQL versions up to 13. The current stable version 13.5 is recommended. If that's not an option for reasons specific to your installation (for example, if PostgreSQL 13.5 is not available for the OS distribution you are using), any 10+ version should work.

See the upgrade section below for more information.

Providing S3 Storage Credentials via MicroProfile Config

With this release, you may use two new JVM options (dataverse.files.<id>.access-key and dataverse.files.<id>.secret-key) to pass an access key identifier and a secret access key for S3-based storage definitions without creating the files used by the AWS CLI tools (~/.aws/config & ~/.aws/credentials).

This has been added to ease setups using containers (Docker, Podman, Kubernetes, OpenShift) or testing and development installations. Find additional documentation and a word of warning in the Installation Guide.

New JVM Options and DB Settings

The following JVM settings have been added:

  • dataverse.files.<id>.access-key - S3 access key ID.
  • dataverse.files.<id>.secret-key - S3 secret access key.

See the JVM Options section of the Installation Guide for more information.

The following DB settings have been added:

  • :AllowCustomTermsOfUse (default: true) - allow users to provide Custom Terms instead of choosing one of the configured standard licenses.

See the Database Settings section of the Guides for more information.

Notes for Developers and Integrators

In the "Backward Incompatibilities" section below, note changes in the API regarding licenses and the native JSON format.

Backward Incompatibilities

With the change to support multiple licenses, which can include cases where CC0 is not an option, and the decision to prohibit two previously possible cases (no license and no entry in the "Terms of Use" field, a standard license and entries in "Terms of Use", "Special Permissions", and related fields), this release contains changes to the display, API payloads, and export metadata that are not backward compatible. These include:

  • "CC0 Waiver" has been replaced by "CC0 1.0" (the short name specified by Creative Commons) in the web interface, API payloads, and export formats that include a license name. (Note that installation admins can alter the license name in the database to maintain the original "CC0 Waiver" text, if desired.)
  • Schema.org metadata in page headers and the Schema.org JSON-LD metadata export now reference the license via URL (which should avoid the current warning from Google about an invalid license object in the page metadata).
  • Metadata exports and import methods (including SWORD) use either the license name (e.g. in the JSON export) or URL (e.g. in the OAI_ORE export) rather than a hardcoded value of "CC0" or "CC0 Waiver" currently (if the CC0 license is available, its default name would be "CC0 1.0").
  • API calls (e.g. for import, migrate) that specify both a license and custom terms will be considered an error, as would having no license and an empty/blank value for "Terms of Use".
  • Rollback. In general, one should not deploy an earlier release over a database that has been modified by deployment of a later release. (Make a db backup before upgrading and use that copy if you go back to a prior version.) Due to the nature of the db changes in this release, attempts to deploy an earlier version of Dataverse will fail unless the database is also restored to its pre-release state.

Also, note that since CC0 Waiver is no longer a hardcoded option, text strings that reference it have been edited or removed from Bundle.properties. This means that the ability to provide translations of the CC0 license name/description has been removed. The initial release of multiple license functionality doesn't include an alternative mechanism to provide translations of license names/descriptions, so this is a regression in capability (see #8346). The instructions and help information about license and terms remains internationalizable, it is only the name/description of the licenses themselves that cannot yet be translated.

An update in the metadata block Social Science changes the field CollectionMode to allow multiple values. This changes the way the field is encoded in the native JSON format. From

"typeName": "collectionMode",
"multiple": false,
"typeClass": "primitive",
"value": "some text"

to

"typeName": "collectionMode",
"multiple": true,
"typeClass": "primitive",
"value": ["some text", "more text"]

Complete List of Changes

For the complete list of code changes in this release, see the 5.10 Milestone in Github.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email [email protected].

Installation

If this is a new installation, please see our Installation Guide. Please also contact us to get added to the Dataverse Project Map if you have not done so already.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.10.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin list-applications
  • $PAYARA/bin/asadmin undeploy dataverse<-version>

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Start Payara

  • service payara start

4. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-5.10.war

5. Restart payara

  • service payara stop
  • service payara start

6. Update the Social Science metadata block

  • wget https://github.com/IQSS/dataverse/releases/download/v5.10/social_science.tsv
  • curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @social_science.tsv -H "Content-type: text/tab-separated-values"
  • Note that this update also requires an updated Solr schema. We strongly recommend that you upgrade Solr as part of this release, by installing the latest stable release from scratch (see below). In the process you will configure it with the latest version of the schema as distributed with this Dataverse release, so no further steps will be needed. If you have already upgraded, or have some very good reason to stay on the old version a little longer, please refer to https://guides.dataverse.org/en/5.10/admin/metadatacustomization.html#updating-the-solr-schema for information on updating your Solr schema in place.

7. Run ReExportall to update Exports

Following the directions in the Admin Guide

8. Upgrade Solr

See "Additional Release Steps" below for how to upgrade Solr.

Additional Release Steps

Solr Upgrade

With this release we upgrade to the latest available stable release in the Solr 8.x branch. We recommend a fresh installation of Solr (the index will be empty) followed by an "index all".

Before you start the "index all", the Dataverse installation will appear to be empty because the search results come from Solr. As indexing progresses, partial results will appear until indexing is complete.

See http://guides.dataverse.org/en/5.10/installation/prerequisites.html#installing-solr for more information.

Please note that after you have followed the instruction above you will have Solr installed with the default schema that lists all the fields in the standard Dataverse metadata blocks. If your installation uses any custom metadata blocks, please refer to https://guides.dataverse.org/en/5.10/admin/metadatacustomization.html#updating-the-solr-schema for information on updating your Solr schema to include these extra fields.

PostgreSQL Upgrade

The tested and recommended way of upgrading an existing database is as follows:

  • Export your current database with pg_dumpall.
  • Install the new version of PostgreSQL (make sure it's running on the same port, so that no changes are needed in the Payara configuration).
  • Re-import the database with psql, as the user postgres.

It is strongly recommended to use the versions of the pg_dumpall and psql from the old and new versions of PostgreSQL, respectively. For example, the commands below were used to migrate a database running under PostgreSQL 9.6 to 13.5. Adjust the versions and the path names to match your environment.

Back up/export:

/usr/pgsql-9.6/bin/pg_dumpall -U postgres > /tmp/backup.sql

Restore/import:

/usr/pgsql-13/bin/psql -U postgres -f /tmp/backup.sql

When upgrading the production database here at Harvard IQSS we were able to go from version 9.6 all the way to 13.3 without any issues.

You may want to try these backup and restore steps on a test server to get an accurate estimate of how much downtime to expect with the final production upgrade. That of course will depend on the size of your database.

Consult the PostgreSQL upgrade documentation for more information, for example https://www.postgresql.org/docs/13/upgrading.html#UPGRADING-VIA-PGDUMPALL.