Dataverse: v5.0 Release

Release date: August 18, 2020
Previous version: v4.20 (released March 30, 2020)

Dataverse 5.0

This release brings new features, enhancements, and bug fixes to Dataverse. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Please note that this is a major release and these are long release notes. We offer no apologies. :)

Release Highlights

Continued Dataset and File Redesign: Dataset and File Button Redesign, Responsive Layout

The buttons available on the Dataset and File pages have been redesigned. This change is to provide more scalability for future expanded options for data access and exploration, and to provide a consistent experience between the two pages. The dataset and file pages have also been redesigned to be more responsive and function better across multiple devices.

This is an important step in the incremental process of the Dataset and File Redesign project, following the release of on-page previews, filtering and sorting options, tree view, and other enhancements. Additional features in support of these redesign efforts will follow in later 5.x releases.

Payara 5

A major upgrade of the application server provides security updates, access to new features like MicroProfile Config API, and will enable upgrades to other core technologies.

Note that moving from Glassfish to Payara will be required as part of the move to Dataverse 5.

Download Dataset

Users can now more easily download all files in a dataset through both the UI and API. If this causes server instability, it's suggested that Dataverse Installation Administrators take advantage of the new Standalone Zipper Service described below.

Download All Option on the Dataset Page

In previous versions of Dataverse, downloading all files from a dataset meant several clicks to select files and initiate the download. The Dataset Page now includes a Download All option for both the original and archival formats of the files in a dataset under the "Access Dataset" button.

Download All Files in a Dataset by API

In previous versions of Dataverse, downloading all files from a dataset via API was a two-step process:

  • Find all the database ids of the files.
  • Download all the files, using those ids (comma-separated).

Now you can download all files from a dataset (assuming you have access to them) via API by passing the dataset persistent ID (PID such as DOI or Handle) or the dataset's database id. Versions are also supported, and you can pass :draft, :latest, :latest-published, or numbers (1.1, 2.0) similar to the "download metadata" API.
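
For example, a sketch of these calls (the DOI doi:10.5072/FK2/EXAMPLE and the database id 24 are placeholders; see the Data Access API section of the API Guide for the authoritative syntax):

# by persistent identifier:
curl -L -O -J "http://localhost:8080/api/access/dataset/:persistentId/?persistentId=doi:10.5072/FK2/EXAMPLE"
# by database id:
curl -L -O -J "http://localhost:8080/api/access/dataset/24"
# a specific version, using the same qualifiers as the "download metadata" API:
curl -L -O -J "http://localhost:8080/api/access/dataset/:persistentId/versions/:latest-published?persistentId=doi:10.5072/FK2/EXAMPLE"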

A Multi-File, Zipped Download Optimization

In this release we are offering an experimental optimization for the multi-file, download-as-zip functionality. If this option is enabled, instead of enforcing size limits, Dataverse attempts to serve all the files that the user requested (and is authorized to download), but the request is redirected to a standalone zipper service running as a CGI executable. This moves these potentially long-running jobs completely outside the Application Server (Payara) and prevents service threads from becoming locked while serving them. Since zipping is also a CPU-intensive task, the service can run on a different host system, freeing those cycles on the main Application Server. The system running the service needs access to the database as well as to the storage filesystem and/or S3 bucket.

Please consult scripts/zipdownload/README.md in the Dataverse 5 source tree for setup details.

The components of the standalone "zipper tool" can also be downloaded here:

https://github.com/IQSS/dataverse/releases/download/v5.0/zipper.zip
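
Once the zipper is deployed, Dataverse is pointed at it with the :CustomZipDownloadServiceUrl database setting described under New Database Settings below. A sketch, with the zipper URL being an example value:

curl -X PUT -d 'https://zipper.example.edu/cgi-bin/zipdownload' http://localhost:8080/api/admin/settings/:CustomZipDownloadServiceUrl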

Updated File Handling

Files without extensions can now be uploaded through the UI. This release also changes the way Dataverse handles duplicate files (same filename or checksum) in a dataset. Specifically:

  • Files with the same checksum can be included in a dataset, even if the files are in the same directory.
  • Files with the same filename can be included in a dataset as long as the files are in different directories.
  • If a user uploads a file to a directory where a file already exists with that directory/filename combination, Dataverse will adjust the file path and name by adding "-1" or "-2" as applicable. This change will be visible in the list of files being uploaded.
  • If the directory or name of an existing or newly uploaded file is edited in such a way that would create a directory/filename combination that already exists, Dataverse will display an error.
  • If a user attempts to replace a file with another file that has the same checksum, an error message will be displayed and the file will not be replaced.
  • If a user attempts to replace a file with a file that has the same checksum as a different file in the dataset, a warning will be displayed.

Pre-Publish DOI Reservation with DataCite

Dataverse installations using DataCite will be able to reserve the persistent identifiers for datasets with DataCite ahead of publishing time. This allows the DOI to be reserved earlier in the data sharing process and makes the step of publishing datasets simpler and less error-prone.

Primefaces 8

Primefaces, the open source UI framework upon which the Dataverse front end is built, has been updated to the most recent version. This provides security updates and bug fixes and will also allow Dataverse developers to take advantage of new features and enhancements.

Major Use Cases

Newly-supported use cases in this release include:

  • Users will be presented with a new workflow around dataset and file access and exploration. (Issue #6684, PR #6909)
  • Users will experience a UI appropriate across a variety of device sizes. (Issue #6684, PR #6909)
  • Users will be able to download an entire dataset without needing to select all the files in that dataset. (Issue #6564, PR #6262)
  • Users will be able to download all files in a dataset with a single API call. (Issue #4529, PR #7086)
  • Users will have DOIs reserved for their datasets upon dataset create instead of at publish time. (Issue #5093, PR #6901)
  • Users will be able to upload files without extensions. (Issue #6634, PR #6804)
  • Users will be able to upload files with the same name in a dataset, as long as those files are in different file paths. (Issue #4813, PR #6924)
  • Users will be able to upload files with the same checksum in a dataset. (Issue #4813, PR #6924)
  • Users will be less likely to encounter locks during the publishing process due to PID providers being unavailable. (Issue #6918, PR #7118)
  • Users will now have their files validated during publish, and in the unlikely event that anything has happened to the files between deposit and publish, they will be able to take corrective action. (Issue #6558, PR #6790)
  • Administrators will likely see more success with Harvesting, as many minor harvesting issues have been resolved. (Issues #7127, #7128, #4597, #7056, #7052, #7023, #7009, and #7003)
  • Administrators can now enable an external zip service that frees up application server resources and allows the zip download limit to be increased. (Issue #6505, PR #6986)
  • Administrators can now create groups based on users' email domains. (Issue #6936, PR #6974)
  • Administrators can now set date facets to be organized chronologically. (Issue #4977, PR #6958)
  • Administrators can now link harvested datasets using an API. (Issue #5886, PR #6935)
  • Administrators can now destroy datasets with mapped shapefiles. (Issue #4093, PR #6860)

Notes for Dataverse Installation Administrators

Glassfish to Payara

This upgrade requires a few extra steps. See the detailed upgrade instructions below.

Dataverse Installations Using DataCite: Upgrade Action Required

If you are using DataCite as your DOI provider you must add a new JVM option called "doi.dataciterestapiurlstring" with a value of "https://api.datacite.org" for production environments and "https://api.test.datacite.org" for test environments. More information about this JVM option can be found in the Installation Guide.

"doi.mdcbaseurlstring" should be deleted if it was previously set.

Dataverse Installations Using DataCite: Upgrade Action Recommended

For installations that are using DataCite, Dataverse v5.0 introduces a change in the process of registering the Persistent Identifier (DOI) for a dataset. Instead of registering it when the dataset is published for the first time, Dataverse will try to "reserve" the DOI when it's created (by registering it as a "draft", using DataCite terminology). When the user publishes the dataset, the DOI will be publicized as well (by switching the registration status to "findable"). This approach makes the process of publishing datasets simpler and less error-prone.

New APIs have been provided for finding any unreserved DataCite-issued DOIs in your Dataverse, and for reserving them (see below). While not required (the user can still attempt to publish a dataset with an unreserved DOI), having all the identifiers reserved ahead of time is recommended. If you are upgrading an installation that uses DataCite, we specifically recommend that you reserve the DOIs for all your pre-existing unpublished drafts as soon as Dataverse v5.0 is deployed, since none of them were registered at create time. This can be done using the following API calls:

  • /api/pids/unreserved will report the ids of the datasets
  • /api/pids/:persistentId/reserve reserves the assigned DOI with DataCite (this will need to be run on every id reported by the first API).

See the Native API Guide for more information.

Scripted, the whole process would look as follows (adjust as needed):

   API_TOKEN='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'

   curl -s -H "X-Dataverse-key:$API_TOKEN" http://localhost:8080/api/pids/unreserved |
   # the API outputs JSON; note the use of jq to parse it:
   jq '.data.count[].pid' | tr -d '"' |
   while read doi
   do
      # quote the URL so the shell does not treat "?" as a glob character
      curl -s -H "X-Dataverse-key:$API_TOKEN" -X POST "http://localhost:8080/api/pids/:persistentId/reserve?persistentId=$doi"
   done

Going forward, once all the DOIs have been reserved for the legacy drafts, you may still get an occasional dataset with an unreserved identifier. DataCite service instability would be a potential cause. There is no reason to expect that to happen often, but it is not impossible. You may consider running the script above (perhaps with some extra diagnostics added) regularly, from a cron job or otherwise, to address this preemptively.

Terms of Use Display Updates

In this release we’ve fixed an issue that would cause the Application Terms of Use to not display when the user's language is set to a language that does not match one of the languages for which terms were created and registered for that Dataverse installation. Instead of the expected Terms of Use, users signing up could receive the “There are no Terms of Use for this Dataverse installation” message. This could potentially result in some users signing up for an account without having the proper Terms of Use displayed. This will only affect installations that use the :ApplicationTermsOfUse setting.

Please note that there is not currently a native workflow in Dataverse to display updated Terms of Use to a user or to force re-agreement. This would only potentially affect users that have signed up since the upgrade to 4.17 (or a following release if 4.17 was skipped).

Datafiles Validation when Publishing Datasets

When a user requests to publish a dataset, Dataverse will now attempt to validate the physical files in the dataset by recalculating the checksums and verifying them against the values in the database. The goal is to prevent any corrupted files in published datasets. Most of the instances of actual damage to physical files that we've seen in the past happened while the datafiles were still in the Draft state (physical files become essentially read-only once published), so this is the logical place to catch any such issues.

If any files in the dataset fail the validation, the dataset does not get published, and the user is notified that they need to contact their Dataverse support in order to address the issue before another attempt to publish can be made. See the "Troubleshooting" section of the Guide on how to fix such problems.

This validation will be performed asynchronously, the same way as the registration of the file-level persistent ids. Similarly to the file PID registration, this validation process can be disabled on your system with the setting :FileValidationOnPublishEnabled. (A Dataverse admin may choose to disable it if, for example, they are already running an external auditing system to monitor the integrity of the files in their Dataverse and would prefer the publishing process to take less time.) See the Configuration section of the Installation Guide.
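
As an illustration, the validation can be switched off (or back on, with true) through the Database Settings API:

curl -X PUT -d false http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled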

Please note that we are not aware of any bugs in the current versions of Dataverse that would result in damage to users' files. But you may have some legacy files in your archive that were affected by some issue in the past, or perhaps by something outside Dataverse, so we are adding this feature out of an abundance of caution. An example of a problem we experienced in the early versions of Dataverse was a scenario where a user attempted to delete a Draft file from an unpublished version and the database transaction failed for whatever reason, but only after the physical file had already been deleted from the filesystem. This resulted in a datafile entry remaining in the dataset with the corresponding physical file missing. The fix for this case, since the user wanted to delete the file in the first place, is simply to confirm it and purge the datafile entity from the database.

The Setting :PIDAsynchRegFileCount is Deprecated as of 5.0

It used to specify the number of datafiles in a dataset that would warrant adding a lock during publishing. As of v5.0, all datasets are locked for the duration of the publishing process, and the setting will be ignored if present.
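
If the setting is present in your database, it can simply be removed; for example:

curl -X DELETE http://localhost:8080/api/admin/settings/:PIDAsynchRegFileCount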

Location Changes for Related Projects

The dataverse-ansible and dataverse-previewers repositories have been moved to the GDCC Organization on GitHub. If you have been referencing the dataverse-ansible repository from IQSS and the dataverse-previewers from QDR, please instead use them from their new locations:

https://github.com/GlobalDataverseCommunityConsortium/dataverse-ansible
https://github.com/GlobalDataverseCommunityConsortium/dataverse-previewers

Harvesting Improvements

Many updates have been made to address common Harvesting failures. You may see Harvests complete more often and have a higher success rate on a dataset-by-dataset basis.

New JVM Options and Database Settings

Several new JVM options and DB Settings have been added in this release. More documentation about each of these settings can be found in the Configuration section of the Installation Guide.

New JVM Options

  • doi.dataciterestapiurlstring: Set with a value of "https://api.datacite.org" for production environments and "https://api.test.datacite.org" for test environments. Must be set if you are using DataCite as your DOI provider.
  • dataverse.useripaddresssourceheader: If set, specifies an HTTP Header such as X-Forwarded-For to use to retrieve the user's IP address. This setting is useful in cases such as running Dataverse behind load balancers where the default option of getting the Remote Address from the servlet isn't correct (e.g. it would be the load balancer IP address). Note that unless your installation always sets the header you configure here, this could be used as a way to spoof the user's address. See the Configuration section of the Installation Guide for more information about proper use and security concerns.
  • http.request-timeout-seconds: To facilitate large file upload and download, the Dataverse installer bumps the Payara server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds setting from its default 900 seconds (15 minutes) to 1800 (30 minutes).
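
For reference, JVM options like these are typically managed with asadmin. A sketch, assuming the standard Payara path and X-Forwarded-For as the header of interest:

/usr/local/payara5/bin/asadmin create-jvm-options "\-Ddataverse.useripaddresssourceheader=X-Forwarded-For"
/usr/local/payara5/bin/asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=1800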

New Database Settings

  • :CustomZipDownloadServiceUrl: If defined, this is the URL of the zipping service outside the main application server where zip downloads should be directed (instead of /api/access/datafiles/).
  • :ShibAttributeCharacterSetConversionEnabled: By default, all attributes received from Shibboleth are converted from ISO-8859-1 to UTF-8. You can disable this behavior by setting to false.
  • :ChronologicalDateFacets: Facets with Date/Year are sorted chronologically by default, with the most recent value first. To have them sorted by number of hits, e.g. with the year with the most results first, set this to false.
  • :NavbarGuidesUrl: Set to a fully-qualified URL which will be used for the "User Guide" link in the navbar.
  • :FileValidationOnPublishEnabled: Toggles validation of the physical files in the dataset when it's published, by recalculating the checksums and comparing against the values stored in the DataFile table. By default this setting is absent and Dataverse assumes it to be true. If enabled, the validation will be performed asynchronously, similarly to how we handle assigning persistent identifiers to datafiles, with the dataset locked for the duration of the publishing process.
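
Database settings like these are managed through the Admin API's settings endpoint; for example (the values shown are illustrative):

curl -X PUT -d false http://localhost:8080/api/admin/settings/:ChronologicalDateFacets
curl -X PUT -d 'https://guides.example.edu/user-guide' http://localhost:8080/api/admin/settings/:NavbarGuidesUrl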

Custom Analytics Code Changes

You should update your custom analytics code to track the redesigned dataset and file buttons. There was also a fix to the analytics code that will now properly track downloads of tabular files.

For more information, see the documentation and sample analytics code snippet in Installation Guide > Configuration > Web Analytics Code, which has been updated to reflect the changes implemented in this version (#6938/#6684).

Tracking Users' IP Addresses Behind an Address-Masking Proxy

It is now possible to collect real user IP addresses in MDC logs and/or set up an IP group on a system running behind a proxy/load balancer that hides the addresses of incoming requests. See "Recording User IP Addresses" in the Configuration section of the Installation Guide.

Reload Astrophysics Metadata Block (if used)

Tooltips have been updated for the Astrophysics Metadata Block. If you'd like these updated Tooltips to be displayed to users of your installation, you should update the Astrophysics Metadata Block:

wget https://github.com/IQSS/dataverse/releases/download/v5.0/astrophysics.tsv
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @astrophysics.tsv -H "Content-type: text/tab-separated-values"

We've included this in the step-by-step instructions below.

Run ReExportall

We made changes to the JSON Export in this release. If you'd like these changes to be reflected in your JSON exports, you should run ReExportall as part of the upgrade process, following the steps in the Admin Guide.
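
Per the Admin Guide, the batch re-export can be triggered along these lines:

curl http://localhost:8080/api/admin/metadata/reExportAll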

We've included this in the step-by-step instructions below.

Notes for Tool Developers and Integrators

Complete List of Changes

For the complete list of code changes in this release, see the 5.0 Milestone on GitHub.

For help with upgrading, installing, or general questions please post to the Dataverse Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide.

Upgrade Instructions

Upgrade from Glassfish 4.1 to Payara 5

The instructions below describe the upgrade procedure based on moving an existing glassfish4 domain directory under Payara. We recommend this method instead of setting up a brand-new Payara domain using the installer because it appears to be the easiest way to recreate your current configuration and preserve all your data.

  1. Download Payara, v5.2020.2 as of this writing:

curl -L -O https://github.com/payara/Payara/releases/download/payara-server-5.2020.2/payara-5.2020.2.zip
sha256sum payara-5.2020.2.zip
# expected checksum:
# 1f5f7ea30901b1b4c7bcdfa5591881a700c9b7e2022ae3894192ba97eb83cc3e

  2. Unzip it somewhere (/usr/local is a safe bet)

sudo unzip payara-5.2020.2.zip -d /usr/local/

  3. Copy the Postgres driver to /usr/local/payara5/glassfish/lib

sudo cp /usr/local/glassfish4/glassfish/lib/postgresql-42.2.9.jar /usr/local/payara5/glassfish/lib/

  4. Move payara5/glassfish/domains/domain1 out of the way

sudo mv /usr/local/payara5/glassfish/domains/domain1 /usr/local/payara5/glassfish/domains/domain1.orig

  5. Undeploy the Dataverse web application (if deployed; version 4.20 is assumed in the example below)

sudo /usr/local/glassfish4/bin/asadmin list-applications
sudo /usr/local/glassfish4/bin/asadmin undeploy dataverse-4.20

  6. Stop Glassfish; copy domain1 to Payara

sudo /usr/local/glassfish4/bin/asadmin stop-domain
sudo cp -ar /usr/local/glassfish4/glassfish/domains/domain1 /usr/local/payara5/glassfish/domains/

  7. Remove the cache directories

sudo rm -rf /usr/local/payara5/glassfish/domains/domain1/generated/
sudo rm -rf /usr/local/payara5/glassfish/domains/domain1/osgi-cache/

  8. In domain.xml:

Replace the -XX:PermSize and -XX:MaxPermSize JVM options with -XX:MetaspaceSize and -XX:MaxMetaspaceSize.

<jvm-options>-XX:MetaspaceSize=256m</jvm-options>
<jvm-options>-XX:MaxMetaspaceSize=512m</jvm-options>

Add the below JVM options beneath the -Ddataverse settings:

<jvm-options>-Dfish.payara.classloading.delegate=false</jvm-options>
<jvm-options>-XX:+UseG1GC</jvm-options>
<jvm-options>-XX:+UseStringDeduplication</jvm-options>
<jvm-options>-XX:+DisableExplicitGC</jvm-options>

Also in domain.xml, replace the following element:

<jdbc-connection-pool datasource-classname="org.apache.derby.jdbc.EmbeddedXADataSource" name="__TimerPool" res-type="javax.sql.XADataSource"> <property name="databaseName" value="${com.sun.aas.instanceRoot}/lib/databases/ejbtimer"></property><property name="connectionAttributes" value=";create=true"></property></jdbc-connection-pool>

with

<jdbc-connection-pool datasource-classname="org.h2.jdbcx.JdbcDataSource" name="__TimerPool" res-type="javax.sql.XADataSource"> <property name="URL" value="jdbc:h2:${com.sun.aas.instanceRoot}/lib/databases/ejbtimer;AUTO_SERVER=TRUE"></property> </jdbc-connection-pool>

  9. Change any full pathnames /usr/local/glassfish4/... to /usr/local/payara5/... or whatever they are in your case. (Specifically, check the -Ddataverse.files.directory and -Ddataverse.files.file.directory JVM options.)

  10. In domain1/config/jhove.conf, change the hard-coded /usr/local/glassfish4 path, as above.

  11. (Optional) If you renamed your service account from glassfish to payara or appserver, update the ownership permissions. The Installation Guide recommends a service account of dataverse:

sudo chown -R dataverse /usr/local/payara5/glassfish/domains/domain1
sudo chown -R dataverse /usr/local/payara5/glassfish/lib

  12. Check that the service account has write permission on the files directory, if it is located outside the old Glassfish domain, and/or make sure the service account has the correct AWS credentials if you are using S3 for storage.

  13. Start Payara:

sudo -u dataverse /usr/local/payara5/bin/asadmin start-domain

  14. Deploy the Dataverse 5 warfile:

sudo -u dataverse /usr/local/payara5/bin/asadmin deploy /path/to/dataverse-5.0.war

  15. Then restart Payara:

sudo -u dataverse /usr/local/payara5/bin/asadmin stop-domain
sudo -u dataverse /usr/local/payara5/bin/asadmin start-domain

Additional Upgrade Steps

  1. Update Astrophysics Metadata Block (if used)

wget https://github.com/IQSS/dataverse/releases/download/v5.0/astrophysics.tsv
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @astrophysics.tsv -H "Content-type: text/tab-separated-values"

  2. (Recommended) Run ReExportall to update JSON Exports

http://guides.dataverse.org/en/5.0/admin/metadataexport.html?highlight=export#batch-exports-through-the-api

  3. (Required for installations using DataCite) Add the JVM option doi.dataciterestapiurlstring

For production environments:

/usr/local/payara5/bin/asadmin create-jvm-options "\-Ddoi.dataciterestapiurlstring=https\://api.datacite.org"

For test environments:

/usr/local/payara5/bin/asadmin create-jvm-options "\-Ddoi.dataciterestapiurlstring=https\://api.test.datacite.org"

The JVM option doi.mdcbaseurlstring should be deleted if it was previously set, for example:

/usr/local/payara5/bin/asadmin delete-jvm-options "\-Ddoi.mdcbaseurlstring=https\://api.test.datacite.org"

  4. (Recommended for installations using DataCite) Pre-register DOIs

Execute the script described in the section "Dataverse Installations Using DataCite: Upgrade Action Recommended" earlier in the Release Note.

Please consult the earlier sections of the Release Note for any additional configuration options that may apply to your installation.