Dataverse: v6.1 Release

Release date:
December 12, 2023
Previous version:
v6.0 (released September 8, 2023)
Magnitude:
17,831 Diff Delta
Contributors:
26 total committers

126 Features Released with v6.1

Top Contributors in v6.1

GPortas
qqmyers
pdurbin
landreev
bencomp
steven-winship-8c8c
sekmiller
jp-tosca
poikilotherm
Lehebax


Release Notes Published

Dataverse 6.1

Please note: To read these instructions in full, please go to https://github.com/IQSS/dataverse/releases/tag/v6.1 rather than the list of releases, which will cut them off.

This release brings new features, enhancements, and bug fixes to the Dataverse software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release highlights

Guestbook at request

Dataverse can now be configured (via the dataverse.files.guestbook-at-request option) to display any configured guestbook to users when they request restricted files (new functionality) or when they download files (previous behavior).

The global default defined by this setting can be overridden at the collection level on the collection page and at the individual dataset level by a superuser using the API. The default, showing guestbooks when files are downloaded, remains as it was in prior Dataverse versions.

For details, see dataverse.files.guestbook-at-request and PR #9599.
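
For instance, the global default can be set as a JVM option (a minimal sketch; adjust the Payara path to your installation):

  # Show guestbooks when users request access to restricted files
  /usr/local/payara6/bin/asadmin create-jvm-options "-Ddataverse.files.guestbook-at-request=true"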

Collection-level storage quotas

This release adds support for defining storage size quotas for collections. Please see the API guide for details. This is an experimental feature that has not yet been used in production on any real-life Dataverse instance, but we are planning to try it out at Harvard/IQSS.

Please note that this release includes a database update (via a Flyway script) that will calculate the storage sizes of all existing datasets and collections on first deployment. On a large production database with tens of thousands of datasets, this may add a couple of extra minutes to the initial deployment of Dataverse 6.1.

For details, see Storage Quotas for Collections in the Admin Guide.
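
As a rough sketch of using the new APIs (the exact endpoints and behavior are described in the API Guide; the /storage/quota paths below are an assumption based on it, and the collection alias and size are placeholders):

  export SERVER_URL=http://localhost:8080
  export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

  # Set a quota of 1 TB (in bytes) on the collection with alias "mycollection" (superuser only)
  curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/dataverses/mycollection/storage/quota/1000000000000"

  # Check the currently configured quota
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/mycollection/storage/quota"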

Globus support (experimental), continued

Globus support in Dataverse has been expanded to include support for file-based Globus endpoints, including the case where files are stored on tape and are not immediately accessible, as well as the case of referencing files stored on remote Globus endpoints. Support for using the Globus S3 Connector with an S3 store has been retained but requires changes to the Dataverse configuration. Please note:

  • Globus functionality remains experimental/advanced in that it requires significant setup, differs in multiple ways from other file storage mechanisms, and may continue to evolve with the potential for backward incompatibilities.
  • The functionality is configured per store and replaces the previous single-S3-Connector-per-Dataverse-instance model.
  • Adding files to a dataset and accessing files via the Dataverse user interface are supported through a separate dataverse-globus app.
  • The functionality is also accessible via APIs (combining calls to the Dataverse and Globus APIs).

Backward incompatibilities:

  • The configuration for use of a Globus S3 Connector has changed and is aligned with the standard store configuration mechanism.
  • The new functionality is incompatible with older versions of the dataverse-globus app; the Globus-related functionality in the UI will only function correctly if a Dataverse 6.1 compatible version of the dataverse-globus app is configured.

New JVM options:

  • A new "globus" store type and associated store-related options have been added. These are described in the File Storage section of the Installation Guide.
  • dataverse.files.globus-cache-maxage: specifies the number of minutes Dataverse will wait between when an initial request for a file transfer is made and when that transfer must begin.

Obsolete settings: the :GlobusBasicToken, :GlobusEndpoint, and :GlobusStores settings are no longer used.

Further details can be found in the Big Data Support section of the Developer Guide.

Alternative Title now allows multiple values

Alternative Title now allows multiple values. Note that JSON used to create a dataset with an Alternative Title must be changed accordingly. See "Backward incompatibilities" below and PR #9440 for details.
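
For example, an alternativeTitle field in the dataset JSON now takes an array of values (a sketch; the values are illustrative):

  {
    "typeName": "alternativeTitle",
    "multiple": true,
    "typeClass": "primitive",
    "value": ["Alternative Title1", "Alternative Title2"]
  }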

External tools: configure tools now available at the dataset level

Read/write "configure" tools (a type of external tool) are now available at the dataset level. They appear under the "Edit Dataset" menu. See External Tools in the Admin Guide and PR #9925.

S3 out-of-band upload

In some situations, direct upload might not work from the UI, e.g., when the S3 storage is not accessible from the internet. This release adds an option to allow direct uploads via API only, so that a third-party application can use direct upload from within the internal network while no direct download is available to users via the UI.

By default, Dataverse supports uploading files via the "add a file to a dataset" API. With S3 stores, a direct upload process can be enabled to allow sending the file directly to the S3 store (without any intermediate copies on the Dataverse server). With the upload-out-of-band option enabled, it is also possible for file upload to be managed manually or via third-party tools, with the "Adding the Uploaded File to the Dataset" API call (described in the Direct DataFile Upload/Replace API page) used to add metadata and inform Dataverse that a new file has been added to the relevant store.
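
A minimal sketch of this workflow (the store id "s3", the persistent identifier, and the storageIdentifier/checksum values are illustrative; see the Direct DataFile Upload/Replace API page for the authoritative examples):

  # Enable out-of-band upload on the store with driver id "s3" (restart required)
  /usr/local/payara6/bin/asadmin create-jvm-options "-Ddataverse.files.s3.upload-out-of-band=true"

  # After the file has been placed in the store by other means, register it with Dataverse
  export JSON_DATA='{"description":"My description.","storageIdentifier":"s3://demo-bucket:176e28068b0-1c3f80357c42","fileName":"file1.txt","mimeType":"text/plain","checksum":{"@type":"SHA-1","@value":"123456"}}'
  curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID" -F "jsonData=$JSON_DATA"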

JSON Schema for datasets

Functionality has been added to help validate dataset JSON prior to dataset creation. There are two new API endpoints in this release. The first takes in a collection alias and returns a custom dataset schema based on the required fields of the collection. The second takes in a collection alias and a dataset JSON file and does an automated validation of the JSON file against the custom schema for the collection. In this release functionality is limited to JSON format validation and validating required elements. Future releases will address field types, controlled vocabulary, etc. See Retrieve a Dataset JSON Schema for a Collection in the API Guide and PR #10109.
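
For instance (a sketch; the collection alias and file name are placeholders):

  # Retrieve the custom dataset schema for the collection "mycollection"
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/mycollection/datasetSchema"

  # Validate a dataset JSON file against that collection's schema
  curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/dataverses/mycollection/validateDatasetJson" -H "Content-type: application/json" --upload-file dataset.json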

OpenID Connect (OIDC) improvements

Using MicroProfile Config for provisioning

With this release it is possible to provision a single OIDC-based authentication provider by using MicroProfile Config instead of or in addition to the classic Admin API provisioning.

If you are using an external OIDC provider component as an identity management system and/or a broker to other authentication providers such as Google, eduGAIN SAML, and so on, this might make your life easier during instance setup and reconfiguration. You no longer need to generate the necessary JSON file.
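
A minimal sketch of such provisioning, using the new options listed under "New configuration options" below (the client id, secret, and provider URL are placeholders; note that colons in values must be escaped for asadmin):

  export PAYARA=/usr/local/payara6
  $PAYARA/bin/asadmin create-jvm-options "-Ddataverse.auth.oidc.enabled=1"
  $PAYARA/bin/asadmin create-jvm-options "-Ddataverse.auth.oidc.client-id=my-client"
  $PAYARA/bin/asadmin create-jvm-options "-Ddataverse.auth.oidc.client-secret=my-secret"
  $PAYARA/bin/asadmin create-jvm-options "-Ddataverse.auth.oidc.auth-server-url=https\://keycloak.example.edu/realms/test"
  # Optional: enable PKCE (see "Adding PKCE Support" below)
  $PAYARA/bin/asadmin create-jvm-options "-Ddataverse.auth.oidc.pkce.enabled=1"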

Adding PKCE Support

Some OIDC providers require PKCE as an additional security layer. As of this version, you can enable PKCE support for any OIDC provider you configure. (Note that OAuth2 providers have not been upgraded.)

For both features, see the OIDC section of the Installation Guide and PR #9273.

Solr improvements

As of this release, application-side support has been added for the "circuit breaker" mechanism in Solr that makes it drop requests more gracefully when the search engine is experiencing load issues.

Please see the Installing Solr section of the Installation Guide.

New release of Dataverse Previewers (including a Markdown previewer)

Version 1.4 of the standard Dataverse Previewers from https://github.com/gdcc/dataverse-previewers is available. The new version supports the use of signedUrls rather than API keys when previewing restricted files (including files in draft dataset versions). Upgrading is highly recommended. Please note:

  • SignedUrls can now be used with PrivateUrl access tokens, which allows PrivateUrl users to view previewers that are configured to use SignedUrls. See #10093.
  • Launching a dataset-level configuration tool will automatically generate an API token when needed. This is consistent with how other types of tools work. See #10045.
  • There is now a Markdown (.md) previewer.

New or improved APIs

The development of a new UI for Dataverse is driving the addition or improvement of many APIs.

New API endpoints

  • deaccessionDataset (/api/datasets/{id}/versions/{versionId}/deaccession): deaccessions a given dataset version through the API.
  • /api/files/{id}/downloadCount
  • /api/files/{id}/dataTables
  • /api/files/{id}/metadata/tabularTags: new endpoint to set tabular file tags.
  • canManageFilePermissions (/access/datafile/{id}/userPermissions): added for getting user permissions on a file.
  • getVersionFileCounts (/api/datasets/{id}/versions/{versionId}/files/counts): Given a dataset and its version, retrieves file counts based on different criteria (Total count, per content type, per access status and per category name).
  • setFileCategories (/api/files/{id}/metadata/categories): Updates the categories (by name) for an existing file. If the specified categories do not exist, they will be created.
  • userFileAccessRequested (/api/access/datafile/{id}/userFileAccessRequested): Returns true or false depending on whether or not the calling user has requested access to a particular file.
  • hasBeenDeleted (/api/files/{id}/hasBeenDeleted): reports whether a particular file that existed in a previous version of the dataset no longer exists in the latest version.
  • getZipDownloadLimit (/api/info/zipDownloadLimit): Get the configured zip file download limit. The response contains the long value of the limit in bytes.
  • getMaxEmbargoDurationInMonths (/api/info/settings/:MaxEmbargoDurationInMonths): Get the maximum embargo duration in months, if available, configured through the database setting :MaxEmbargoDurationInMonths.
  • getDatasetJsonSchema (/api/dataverses/{id}/datasetSchema): Get a dataset schema with the fields required by a given dataverse collection.
  • validateDatasetJsonSchema (/api/dataverses/{id}/validateDatasetJson): Validate that a dataset JSON file is in proper format and contains the required elements and fields for a given dataverse collection.
  • downloadTmpFile (/api/admin/downloadTmpFile): For testing purposes, allows files to be downloaded from /tmp.
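
As a quick illustration, a couple of these can be exercised as follows (a sketch; the dataset/file ids and the deaccession JSON body are illustrative, see the API Guide for the authoritative forms):

  # Download count for file 24
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/files/24/downloadCount"

  # Deaccession version 1.0 of dataset 24
  curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER_URL/api/datasets/24/versions/1.0/deaccession" -H "Content-type: application/json" -d '{"deaccessionReason":"Description of the deaccession reason."}'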

Pagination of files in dataset versions

  • Optional pagination has been added to /api/datasets/{id}/versions; it may be useful for datasets with a large number of versions.
  • A new flag, includeFiles, has been added to both /api/datasets/{id}/versions and /api/datasets/{id}/versions/{vid} (true by default), providing an option to drop the file information from the output.
  • When files are requested to be included, some database lookup optimizations have been added to improve performance on datasets with large numbers of files.

This is reflected in the Dataset Versions API section of the Guide.
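
For example (a sketch; the limit/offset parameter names follow that section of the Guide, and the dataset id is illustrative):

  # List versions of dataset 24 ten at a time, without the per-version file metadata
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/24/versions?includeFiles=false&limit=10&offset=0"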

DataFile API payload has been extended to include the following fields:

  • tabularData: Boolean field indicating whether the DataFile is of tabular type
  • fileAccessRequest: Boolean field indicating whether file access requests are enabled on the Dataset (the DataFile owner)
  • friendlyType: String

The getVersionFiles endpoint (/api/datasets/{id}/versions/{versionId}/files) has been extended to support pagination and ordering, as well as optional filtering by:

  • Access status: through the accessStatus query parameter, which supports the following values:
    • Public
    • Restricted
    • EmbargoedThenRestricted
    • EmbargoedThenPublic
  • Category name: through the categoryName query parameter, to return files to which the particular category has been added.
  • Content type: through the contentType query parameter, to return files matching the requested content type (for example, "image/png").
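
For example (a sketch; the dataset id and version are illustrative):

  # First 10 public PNG files in version 1.0 of dataset 24
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/24/versions/1.0/files?limit=10&offset=0&accessStatus=Public&contentType=image/png"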

Additional improvements to existing API endpoints

  • getVersionFiles (/api/datasets/{id}/versions/{versionId}/files): Extended to support optional filtering by search text through the searchText query parameter. The search will be applied to the labels and descriptions of the dataset files. Added tabularTagName to return files to which the particular tabular tag has been added. Added optional boolean query parameter "includeDeaccessioned", which, if enabled, causes the endpoint to consider deaccessioned versions when searching for versions to obtain files.
  • getVersionFileCounts (/api/datasets/{id}/versions/{versionId}/files/counts): Added optional boolean query parameter "includeDeaccessioned", which, if enabled, causes the endpoint to consider deaccessioned versions when searching for versions to obtain file counts. Added support for filtering by optional criteria query parameter:
    • contentType
    • accessStatus
    • categoryName
    • tabularTagName
    • searchText
  • getDownloadSize (/api/datasets/{identifier}/versions/{versionId}/downloadsize): Added optional boolean query parameter "includeDeaccessioned", which, if enabled, causes the endpoint to consider deaccessioned versions when searching for versions to obtain files. Added a new optional query parameter "mode", which applies a filter criterion to the operation and supports the following values:
    • All (default): includes both archival and original sizes for tabular files
    • Archival: includes only the archival size for tabular files
    • Original: includes only the original size for tabular files
  • /api/datasets/{id}/versions/{versionId} New query parameter includeDeaccessioned added to consider deaccessioned versions when searching for versions.
  • /api/datasets/{id}/userPermissions: Gets user permissions on a dataset. In particular, the permissions that this API call checks, returned as booleans, are the following:
    • Can view the unpublished dataset
    • Can edit the dataset
    • Can publish the dataset
    • Can manage the dataset permissions
    • Can delete the dataset draft
  • getDatasetVersionCitation (/api/datasets/{id}/versions/{versionId}/citation) endpoint now accepts a new boolean optional query parameter "includeDeaccessioned", which, if enabled, causes the endpoint to consider deaccessioned versions when searching for versions to obtain the citation.
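
For example (a sketch; the dataset id and version are illustrative):

  # Archival-only download size for version 1.0 of dataset 24, considering deaccessioned versions
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/24/versions/1.0/downloadsize?mode=Archival&includeDeaccessioned=true"

  # The calling user's permissions on dataset 24
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/24/userPermissions"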

Improvements for developers

  • Developers can enjoy a dramatically faster feedback loop when iterating on code if they are using Netbeans or IntelliJ IDEA Ultimate (with the Payara Platform Tools plugin). For details, see https://guides.dataverse.org/en/6.1/container/dev-usage.html#intellij-idea-ultimate-and-payara-platform-tools and the thread on the mailing list.
  • Developers can now test S3 locally by using the Dockerized development environment, which now includes both LocalStack and MinIO. API (end to end) tests are in S3AccessIT.
  • In addition, a new integration test class (not an API test; a new Testcontainers-based test launched with mvn verify) has been added at S3AccessIOLocalstackIT. It uses Testcontainers to spin up LocalStack for S3 testing and does not require Dataverse to be running.
  • With this release, we add a new type of testing to Dataverse: integration tests that are not end-to-end tests (like our API tests). Starting with OIDC authentication support, we now regularly test on CI that both OIDC login options (in the UI and via the API) are in working condition.
  • The testing and development Keycloak realm has been updated with more users and compatibility with Keycloak 21.
  • Support for setting JVM options during testing has been improved for developers. You may now add the @JvmSetting annotation to classes (including inner classes) and reference factory methods for values. This improvement also paves the way to manipulating JVM options during end-to-end tests on remote ends.
  • As part of these testing improvements, the code coverage report file for unit tests has moved from target/jacoco.exec to target/coverage-reports/jacoco-unit.exec.

Major use cases and infrastructure enhancements

Changes and fixes in this release not already mentioned above include:

  • Validation has been added for the Geographic Bounding Box values in the Geospatial metadata block. This will prevent improperly defined bounding boxes from being created via the edit page or metadata imports. This also fixes the issue where existing datasets with invalid geoboxes were quietly failing to get reindexed. See PR #10142.
  • Dataverse's OAI_ORE metadata export format and archival BagIT exports (which include the OAI-ORE metadata export file) have been updated to include information about the dataset version state, e.g., RELEASED or DEACCESSIONED, and to indicate which version of Dataverse was used to create the archival Bag.

As part of the latter, the current OAI_ORE metadata format has been given a 1.0.0 version designation. Any future changes to the OAI_ORE export format are expected to result in a version change, and tools such as DVUploader that can recreate datasets from archival Bags will start indicating which version(s) of the OAI_ORE format they can read.

Dataverse installations that have been using archival Bags may wish to update any existing archival Bags they have, e.g., by deleting existing Bags and using the Dataverse archival Bag export API to generate updated versions.

  • For BagIT export, it is now possible to configure the following information in bag-info.txt (previously, customization was possible by editing Bundle.properties, but this is no longer supported). For details, see https://guides.dataverse.org/en/6.1/installation/config.html#bag-info-txt and the sketch after this list.
    • Source-Organization from dataverse.bagit.sourceorg.name.
    • Organization-Address from dataverse.bagit.sourceorg.address.
    • Organization-Email from dataverse.bagit.sourceorg.email.
  • This release fixes several issues (#9952, #9953, #9957) where the Signposting output did not match the Signposting specification. These changes introduce backward incompatibility, but since Signposting support was added recently (in Dataverse 5.14 in PR #8981), we feel it's best to do this cleanup and not support the old implementation that was not fully compliant with the spec.
    • To fix #9952, we surround the license info with < and >.
    • To fix #9953, we no longer wrap the response in a {"status":"OK","data":{ JSON object. This has also been noted in the guides at https://dataverse-guide--9955.org.readthedocs.build/en/9955/api/native-api.html#retrieve-signposting-information
    • To fix #9957, we corrected the MIME/content type, changing it from json+ld to ld+json. For backward compatibility, we are still supporting the old one, for now.
  • It's now possible to configure the docroot, which holds collection logos and more. See dataverse.files.docroot in the Installation Guide and PR #9819.
  • We have started maintaining an API changelog of breaking changes: https://guides.dataverse.org/en/6.1/api/changelog.html. See also #10060.
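
As a sketch of the bag-info.txt configuration mentioned above (assuming the standard MicroProfile mapping of option names to environment variables; the values are placeholders):

  export DATAVERSE_BAGIT_SOURCEORG_NAME="Example University Libraries"
  export DATAVERSE_BAGIT_SOURCEORG_ADDRESS="123 Main St., Anytown, USA"
  export DATAVERSE_BAGIT_SOURCEORG_EMAIL="repository@example.edu"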

New configuration options

  • dataverse.auth.oidc.auth-server-url
  • dataverse.auth.oidc.client-id
  • dataverse.auth.oidc.client-secret
  • dataverse.auth.oidc.enabled
  • dataverse.auth.oidc.pkce.enabled
  • dataverse.auth.oidc.pkce.max-cache-age
  • dataverse.auth.oidc.pkce.max-cache-size
  • dataverse.auth.oidc.pkce.method
  • dataverse.auth.oidc.subtitle
  • dataverse.auth.oidc.title
  • dataverse.bagit.sourceorg.address
  • dataverse.bagit.sourceorg.email
  • dataverse.bagit.sourceorg.name
  • dataverse.files.docroot
  • dataverse.files.globus-cache-maxage
  • dataverse.files.guestbook-at-request
  • dataverse.files.{driverId}.upload-out-of-band

Backward incompatibilities

  • Since Alternative Title is now repeatable, the JSON you send to create or edit a dataset must be an array rather than a simple string. For example, instead of "value": "Alternative Title", you must send "value": ["Alternative Title1", "Alternative Title2"].
  • Several issues (#9952, #9953, #9957) where the Signposting output did not match the Signposting specification introduce backward-incompatibility. See above for details.
  • For BagIT export, if you were configuring values in bag-info.txt using Bundle.properties, you must switch to the new dataverse.bagit JVM options mentioned above. For details, see https://guides.dataverse.org/en/6.1/installation/config.html#bag-info-txt
  • See "Globus support" above for backward incompatibilies specific to Globus.

Complete list of changes

For the complete list of code changes in this release, see the 6.1 Milestone in GitHub.

Getting help

For help with upgrading, installing, or general questions, please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please follow our Installation Guide. Please don't be shy about asking for help if you need it!

Once you are in production, we would be delighted to update our map of Dataverse installations around the world to include yours! Please create an issue or email us at support@dataverse.org to join the club!

You are also very welcome to join the Global Dataverse Community Consortium (GDCC).

Upgrade instructions

Upgrading requires a maintenance window and downtime. Please plan ahead, create backups of your database, etc.

These instructions assume that you've already upgraded through all the 5.x releases and are now running Dataverse 6.0.

0. These instructions assume that you are upgrading from 6.0. If you are running an earlier version, the only safe way to upgrade is to progress through the upgrades to all the releases in between before attempting the upgrade to 6.1.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 6 is installed in /usr/local/payara6. If not, adjust as needed.

export PAYARA=/usr/local/payara6

(or setenv PAYARA /usr/local/payara6 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin undeploy dataverse-6.0

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Start Payara

  • service payara start

4. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-6.1.war

As noted above, deployment of the war file might take several minutes due to a database migration script required for the new storage quotas feature.

5. Restart Payara

  • service payara stop
  • service payara start

6. Update Geospatial Metadata Block (to improve validation of bounding box values)

  • wget https://github.com/IQSS/dataverse/releases/download/v6.1/geospatial.tsv
  • curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file geospatial.tsv

6a. Update Citation Metadata Block (to make Alternative Title repeatable)

  • wget https://github.com/IQSS/dataverse/releases/download/v6.1/citation.tsv
  • curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv

7. Update Solr schema.xml to allow multiple Alternative Titles to be used. See specific instructions below for installations without custom metadata blocks (7a) and those with custom metadata blocks (7b).

7a. For installations without custom or experimental metadata blocks:

  • Stop Solr instance (usually service solr stop, depending on Solr installation/OS, see the Installation Guide)

  • Replace schema.xml

    • wget https://github.com/IQSS/dataverse/releases/download/v6.1/schema.xml
    • cp schema.xml /usr/local/solr/solr-9.3.0/server/solr/collection1/conf
  • Start Solr instance (usually service solr start, depending on Solr/OS)

7b. For installations with custom or experimental metadata blocks:

  • Stop Solr instance (usually service solr stop, depending on Solr installation/OS, see the Installation Guide)

  • There are two ways to regenerate the schema:

    • Either collect the output of the Dataverse schema API and feed it to the update-fields.sh script that we supply, as in the example below (modify the command lines as needed):

      wget https://raw.githubusercontent.com/IQSS/dataverse/master/conf/solr/9.3.0/update-fields.sh
      chmod +x update-fields.sh
      curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-9.3.0/server/solr/collection1/conf/schema.xml

    • Or, alternatively, edit the following line in your schema.xml by hand (to indicate that alternative title is now multiValued="true"):

      <field name="alternativeTitle" type="text_en" multiValued="true" stored="true" indexed="true"/>

  • Restart Solr instance (usually service solr restart, depending on Solr/OS)

8. Run ReExportAll to update dataset metadata exports. Follow the directions in the Admin Guide.