Dataverse: v4.20 Release

Release date:
March 30, 2020
Previous version:
v4.19 (released January 21, 2020)
Magnitude:
408 Diff Delta
Contributors:
4 total committers
Data confidence:
Commits:

100 Features Released with v4.20

Top Contributors in v4.20

sekmiller
qqmyers
djbrooke
kcondon

Directory Browser for v4.20

We haven't yet finished calculating and confirming the files and directories changed in this release. Please check back soon.

Release Notes Published

Dataverse 4.20

This release brings new features, enhancements, and bug fixes to Dataverse. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Multiple Store Support

Dataverse can now be configured to store files in more than one place at the same time (multiple file, s3, and/or swift stores).

General information about this capability can be found below and in the <a href="http://guides.dataverse.org/en/4.20/installation/config.html">Configuration Guide</a> - File Storage section.

S3 Direct Upload support

S3 stores can now optionally be configured to support direct upload of files, as one option for supporting upload of larger files. In the current implementation, each file is uploaded in a single HTTP call. For AWS, this limits file size to 5 GB. With Minio the theoretical limit should be 5 TB and 50+ GB file uploads have been tested successfully. (In practice other factors such as network timeouts may prevent a successful upload a multi-TB file and minio instances may be configured with a < 5 TB single HTTP call limit.) No other S3 service providers have been tested yet. Their limits should be the lower of the maximum object size allowed and any single HTTP call upload limit.

General information about this capability can be found in the <a href="http://guides.dataverse.org/en/4.20/developers/big-data-support.html">Big Data Support Guide</a> with specific information about how to enable it in the <a href="http://guides.dataverse.org/en/4.20/installation/config.html">Configuration Guide</a> - File Storage section.

Integration Test Coverage Reporting

The percentage of code covered by the API-based integration tests is now shown on a badge at the bottom of the README.md file that serves as the homepage of Dataverse Github Repository.

New APIs

New APIs for Role Management and Dataset Size have been added. Previously, managing roles at the dataset and file level was only possible through the UI. API users can now also retrieve the size of a dataset through an API call, with specific parameters depending on the type of information needed.

More information can be found in the <a href="http://guides.dataverse.org/en/4.20/api">API Guide</a>.

Major Use Cases

Newly-supported use cases in this release include:

  • Users will now be able to see the number of linked datasets and dataverses accurately reflected in the facet counts on the Dataverse search page. (Issue #6564, PR #6262)
  • Users will be able to upload large files directly to S3. (Issue #6489, PR #6490)
  • Users will be able to see the PIDs of datasets and files in the Guestbook export. (Issue #6534, PR #6628)
  • Administrators will be able to configure multiple stores per Dataverse installation, which allow dataverse-level setting of storage location, upload size limits, and supported data transfer methods (Issue #6485, PR #6488)
  • Administrators and integrators will be able to manage roles using a new API. (Issue #6290, PR #6622)
  • Administrators and integrators will be able to determine a dataset's size. (Issue #6524, PR #6609)
  • Integrators will now be able to retrieve the number of files in a dataset as part of a single API call instead of needing to count the number of files in the response. (Issue #6601, PR #6623)

Notes for Dataverse Installation Administrators

Potential Data Integrity Issue

We recently discovered a potential data integrity issue in Dataverse databases. One manifests itself as duplicate DataFile objects created for the same uploaded file (https://github.com/IQSS/dataverse/issues/6522); the other as duplicate DataTable (tabular metadata) objects linked to the same DataFile (https://github.com/IQSS/dataverse/issues/6510). This issue impacted approximately .03% of datasets in Harvard's Dataverse.

To see if any datasets in your installation have been impacted by this data integrity issue, we've provided a diagnostic script here:

https://github.com/IQSS/dataverse/raw/develop/scripts/issues/6510/check_datafiles_6522_6510.sh

The script relies on the PostgreSQL utility psql to access the database. You will need to edit the credentials at the top of the script to match your database configuration.

If neither of the two issues is present in your database, you will see a message "... no duplicate DataFile objects in your database" and "no tabular files affected by this issue in your database".

If either, or both kinds of duplicates are detected, the script will provide further instructions. We will need you to send us the produced output. We will then assist you in resolving the issues in your database.

Multiple Store Support Changes

Existing installations will need to make configuration changes to adopt this version, regardless of whether additional stores are to be added or not.

Multistore support requires that each store be assigned a label, id, and type - see the <a href="http://guides.dataverse.org/en/4.20/installation/config.html">Configuration Guide</a> for a more complete explanation. For an existing store, the recommended upgrade path is to assign the store id based on it's type, i.e. a 'file' store would get id 'file', an 's3' store would have the id 's3'.

With this choice, no manual changes to datafile 'storageidentifier' entries are needed in the database. If you do not name your existing store using this convention, you will need to edit the database to maintain access to existing files.

The following set of commands to change the Glassfish JVM options will adapt an existing file or s3 store for this upgrade: For a file store:

./asadmin create-jvm-options "\-Ddataverse.files.file.type=file"
./asadmin create-jvm-options "\-Ddataverse.files.file.label=file"
./asadmin create-jvm-options "\-Ddataverse.files.file.directory=<your directory>"

For a s3 store:

./asadmin create-jvm-options "\-Ddataverse.files.s3.type=s3"
./asadmin create-jvm-options "\-Ddataverse.files.s3.label=s3"
./asadmin delete-jvm-options "-Ddataverse.files.s3-bucket-name=<your_bucket_name>"
./asadmin create-jvm-options "-Ddataverse.files.s3.bucket-name=<your_bucket_name>"

Any additional S3 options you have set will need to be replaced as well, following the pattern in the last two lines above - delete the option including a '-' after 's3' and creating the same option with the '-' replaced by a '.', using the same value you currently have configured.

Once these options are set, restarting the Glassfish service is all that is needed to complete the change.

Note that the "-Ddataverse.files.directory", if defined, continues to control where temporary files are stored (in the /temp subdir of that directory), independent of the location of any 'file' store defined above.

Also note that the :MaxFileUploadSizeInBytes property has a new option to provide independent limits for each store instead of a single value for the whole installation. The default is to apply any existing limit defined by this property to all stores.

Direct S3 Upload Changes

Direct upload to S3 is enabled per store by one new jvm option:

./asadmin create-jvm-options "\-Ddataverse.files.<id>.upload-redirect=true"

The existing :MaxFileUploadSizeInBytes property and dataverse.files.<id>.url-expiration-minutes jvm option for the same store also apply to direct upload.

Direct upload via the Dataverse web interface is transparent to the user and handled automatically by the browser. Some minor differences in file upload exist: directly uploaded files are not unzipped and Dataverse does not scan their content to help in assigning a MIME type. Ingest of tabular files and metadata extraction from FITS files will occur, but can be turned off for files above a specified size limit through the new dataverse.files.<id>.ingestsizelimit jvm option.

API calls to support direct upload also exist, and, if direct upload is enabled for a store in Dataverse, the latest DVUploader (v1.0.8) provides a'-directupload' flag that enables its use.

Solr Update

With this release we upgrade to the latest available stable release in the Solr 7.x branch. We recommend a fresh installation of Solr 7.7.2 (the index will be empty) followed by an "index all".

Before you start the "index all", Dataverse will appear to be empty because the search results come from Solr. As indexing progresses, results will appear until indexing is complete.

Dataverse Linking Fix

The fix implemented for #6262 will display the datasets contained in linked dataverses in the linking dataverse. The full reindex described above will correct these counts. Going forward, this will happen automatically whenever a dataverse is linked.

Google Analytics Download Tracking Bug

The button tracking capability discussed in the installation guide (http://guides.dataverse.org/en/4.20/installation/config.html#id88) relies on an analytics-code.html file that must be configured using the :WebAnalyticsCode setting. The example file provided in the installation guide is no longer compatible with recent Dataverse releases (>v4.16). Installations using this feature should update their analytics-code.html file by following the installation instructions using the updated example file. Alternately, sites can modify their existing files to include the one-line change made in the example file at line 120.

Run ReExportall

We made changes to the JSON Export in this release (Issue #6650, PR #6669). If you'd like these changes to reflected in your JSON exports, you should run ReExportall as part of the upgrade process. We've included this in the step-by-step instructions below.

New JVM Options and Database Settings

New JVM Options for file storage drivers

  • The JVM option dataverse.files.file.directory=<your directory> controls where temporary files are stored (in the /temp subdir of the defined directory), independent of the location of any 'file' store defined above.
  • The JVM option dataverse.files.<id>.upload-redirect enables direct upload of files added to a dataset to the S3 bucket. (S3 stores only!)
  • The JVM option dataverse.files.<id>.MaxFileUploadSizeInBytes controls the maximum size of file uploads allowed for the given file store.
  • The JVM option dataverse.files.<id>.ingestsizelimit controls the maximum size of files for which ingest will be attempted, for the given file store.

New Database Settings for Shibboleth

  • The database setting :ShibAffiliationAttribute can now be set to prevent affiliations for Shibboleth users from being reset upon each log in.

Notes for Tool Developers and Integrators

Integration Test Coverage Reporting

API-based integration tests are run every time a branch is merged to develop and the percentage of code covered by these integration tests is now shown on a badge at the bottom of the README.md file that serves as the homepage of Dataverse Github Repository.

Guestbook Column Changes

Users of downloaded guestbooks should note that two new columns have been added:

  • Dataset PID
  • File PID

If you are expecting column in the CSV file to be in a particular order, you will need to make adjustments.

Old columns: Guestbook, Dataset, Date, Type, File Name, File Id, User Name, Email, Institution, Position, Custom Questions

New columns: Guestbook, Dataset, Dataset PID, Date, Type, File Name, File Id, File PID, User Name, Email, Institution, Position, Custom Questions

API Changes

As reported in #6570, the affiliation for dataset contacts has been wrapped in parentheses in the JSON output from the Search API. These parentheses have now been removed. This is a backward incompatible change but it's expected that this will not cause issues for integrators.

Role Name Change

The role alias provided in API responses has changed, so if anything was hard-coded to "editor" instead of "contributor" it will need to be updated.

Complete List of Changes

For the complete list of code changes in this release, see the <a href="https://github.com/IQSS/dataverse/milestone/88?closed=1">4.20 milestone</a> in Github.

For help with upgrading, installing, or general questions please post to the <a href="https://groups.google.com/forum/#!forum/dataverse-community">Dataverse Google Group</a> or email [email protected].

Installation

If this is a new installation, please see our <a href="http://guides.dataverse.org/en/4.20/installation/">Installation Guide</a>.

Upgrade

  1. Undeploy the previous version.
  • <glassfish install path>/glassfish4/bin/asadmin list-applications
  • <glassfish install path>/glassfish4/bin/asadmin undeploy dataverse
  1. Stop glassfish and remove the generated directory, start.
  • service glassfish stop
  • remove the generated directory: rm -rf <glassfish install path>glassfish4/glassfish/domains/domain1/generated
  • service glassfish start
  1. Install and configure Solr v7.7.2

See http://guides.dataverse.org/en/4.20/installation/prerequisites.html#installing-solr

  1. Deploy this version.
  • <glassfish install path>/glassfish4/bin/asadmin deploy <path>dataverse-4.20.war
  1. The following set of commands to change the Glassfish JVM options will adapt an existing file or s3 store for this upgrade:

For a file store:

./asadmin create-jvm-options "\-Ddataverse.files.file.type=file"
./asadmin create-jvm-options "\-Ddataverse.files.file.label=file"
./asadmin create-jvm-options "\-Ddataverse.files.file.directory=<your directory>"

For a s3 store:

./asadmin create-jvm-options "\-Ddataverse.files.s3.type=s3"
./asadmin create-jvm-options "\-Ddataverse.files.s3.label=s3"
./asadmin delete-jvm-options "-Ddataverse.files.s3-bucket-name=<your_bucket_name>"
./asadmin create-jvm-options "-Ddataverse.files.s3.bucket-name=<your_bucket_name>"

Any additional S3 options you have set will need to be replaced as well, following the pattern in the last two lines above - delete the option including a '-' after 's3' and creating the same option with the '-' replaced by a '.', using the same value you currently have configured.

  1. Restart glassfish.

  2. Update Citation Metadata Block

  • wget https://github.com/IQSS/dataverse/releases/download/v4.20/citation.tsv
  • curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @citation.tsv -H "Content-type: text/tab-separated-values"
  1. Kick off full reindex

http://guides.dataverse.org/en/4.20/admin/solr-search-index.html

  1. (Recommended) Run ReExportall to update JSON Exports

http://guides.dataverse.org/en/4.20/admin/metadataexport.html?highlight=export#batch-exports-through-the-api