Polars: rs-0.38.0 Release

Release date: February 29, 2024
Previous version: rs-0.37.0 (released January 26, 2024)
Magnitude: 15,561 Diff Delta
Contributors: 47 total committers

308 Commits in this Release

Top Contributors in rs-0.38.0

stinodego
ritchie46
grinya007
orlp
mcrumiller
reswqa
MarcoGorelli
nameexhaustion
c-peters
alexander-beedie

Release Notes Published

🏆 Highlights

  • fast path for COUNT(*) queries (#14574)
  • Implemented tree formatting for LogicalPlan (#14221)

💥 Breaking changes

  • Infer values columns in DataFrame.pivot when values is None (#14477)
  • Mark DataFrame::new_no_checks and DataFrame::new_no_length_checks unsafe (#14443)
  • Remove DatetimeChunked::convert_time_zone (#14046)
  • Rename LiteralValue::to_anyvalue to LiteralValue::to_any_value (#14033)
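For callers, marking `DataFrame::new_no_checks` and `DataFrame::new_no_length_checks` as unsafe means that existing call sites must now be wrapped in an `unsafe` block, with the caller taking responsibility for the invariants the checked constructor would otherwise verify. The sketch below illustrates that pattern with a toy type. `Frame`, its fields, and `height` are hypothetical stand-ins written for this example; they are not the Polars API, and the invariant shown (equal column lengths) is an assumption about what such a constructor typically skips.

```rust
/// Hypothetical stand-in for a data frame: named columns that must share one length.
/// This is NOT the Polars `DataFrame`; it only illustrates the checked/unchecked split.
#[derive(Debug)]
pub struct Frame {
    columns: Vec<(String, Vec<i64>)>,
}

impl Frame {
    /// Checked constructor: verifies that every column has the same length.
    pub fn new(columns: Vec<(String, Vec<i64>)>) -> Result<Self, String> {
        if let Some((_, first)) = columns.first() {
            let expected = first.len();
            for (name, values) in &columns {
                let got = values.len();
                if got != expected {
                    return Err(format!(
                        "column {name:?} has length {got}, expected {expected}"
                    ));
                }
            }
        }
        Ok(Frame { columns })
    }

    /// Unchecked constructor: skips validation entirely.
    ///
    /// # Safety
    /// The caller must guarantee that all columns have the same length.
    pub unsafe fn new_no_checks(columns: Vec<(String, Vec<i64>)>) -> Self {
        Frame { columns }
    }

    /// Number of rows, taken from the first column (0 if there are no columns).
    pub fn height(&self) -> usize {
        self.columns.first().map_or(0, |(_, values)| values.len())
    }
}
```

A call such as `Frame::new_no_checks(cols)` no longer compiles on its own; it has to be written as `unsafe { Frame::new_no_checks(cols) }`, which makes the skipped validation visible at the call site.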

🚀 Performance improvements

  • auto-tune concurrency budget (#14753)
  • Don't materialize for broadcasting fill_null value and default value of replace (#14736)
  • Improve performance of boolean filters 1-100x. (#14746)
  • fix accidental quadratic utf8 validation in parquet (#14705)
  • fast path for COUNT(*) queries (#14574)
  • Elide the total order wrapper for non-(float/option) types (#14648)
  • add utf8-validation fast paths for utf8view (#14644)
  • don't reassign chunks back to df owner (#14633)
  • If there are many small chunks in write_parquet(), convert to a single chunk (#14484) (#14487)
  • Polars thread pool was not used properly in various functions (#14583)
  • use owned arithmetic in horizontal_sum (#14525)
  • Combine small chunks in sinks for streaming pipelines (#14346)
  • reduce heap allocs in expression/logical-plan iteration (#14440)
  • simplify and speed up cum_sum and cum_prod (#14409)
  • simplify negated predicates to improve row groups skipping (#14370)
  • prune parquet row groups when is_not_null is used (#14260)
  • use is_between to skip parquet row groups (#14244)
  • Use a compression API that is designed for this use case (#11699) (#14194)
  • Use UnitVec in polars-plan traversal (#14199)
  • use UnitVec in streaming joins (#14197)
  • improve ChunkId (#14175)
  • improve iteration performance (#14126)
  • elide unneeded work in window? (#14108)
  • run window functions more in parallel (#14095)
  • improve skip row group using statistics condition (#14056)
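Several of the items above (#14370, #14260, #14244, #14056) concern skipping parquet row groups based on column statistics. The general idea can be sketched as follows: if a row group's min/max range cannot overlap the range requested by an `is_between` predicate, or the group contains only nulls when `is_not_null` is requested, the group never needs to be decoded. The `RowGroupStats` struct and the function names here are hypothetical and written only for this illustration; the actual Polars implementation is not shown in these notes and may differ.

```rust
/// Hypothetical per-row-group statistics for a single integer column.
/// `min` and `max` are only meaningful when the group has at least one non-null value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowGroupStats {
    pub min: i64,
    pub max: i64,
    pub null_count: usize,
    pub num_rows: usize,
}

/// Returns false only when no row in the group can satisfy `lo <= x <= hi`.
pub fn may_match_between(stats: &RowGroupStats, lo: i64, hi: i64) -> bool {
    // A group made up entirely of nulls cannot satisfy a range predicate.
    if stats.null_count == stats.num_rows {
        return false;
    }
    // The group can be skipped when its [min, max] range misses [lo, hi].
    stats.max >= lo && stats.min <= hi
}

/// Returns false only when every value in the group is null.
pub fn may_match_is_not_null(stats: &RowGroupStats) -> bool {
    stats.null_count < stats.num_rows
}

/// Indices of the row groups that still have to be read for `lo <= x <= hi`.
pub fn row_groups_to_read(groups: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_match_between(stats, lo, hi))
        .map(|(index, _)| index)
        .collect()
}
```

The check is conservative by design: statistics can only prove that a group cannot match, so any group that passes still has to be filtered row by row.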

✨ Enhancements

  • Change default for maximum number of Series items printed to 10 to match DataFrame (#14703)
  • Infer values columns in DataFrame.pivot when values is None (#14477)
  • fast path for COUNT(*) queries (#14574)
  • let rolling accept index_column of type UInt32 or UInt64 (#14669)
  • Treat float -0.0 == 0.0 and -NaN == NaN in group-by, joins and unique (#14617)
  • Properly cache object-stores (#14598)
  • Mark DataFrame::new_no_checks and DataFrame::new_no_length_checks unsafe (#14443)
  • flatten aliases (#14512)
  • Make formatting more consistent in DOT graphs (#14486)
  • add flush operator to streaming operators (#14500)
  • Increase verbosity of duplicate column error message (#11899)
  • change print to warn in reading csv from python file like object (#14469)
  • Raise if pivot would introduce duplicate column names (#14431)
  • apply negate in simplify expression pass (#14436)
  • restrict more cloud interop to semaphore budget (#14435)
  • Implement min/max for categorical dtype (#14112)
  • add boolean rle decoding for parquet (#14403)
  • Allow brackets in SQL join conditions (#14263)
  • Improve panic message for missing struct feature in DataType::from_arrow (#14392)
  • Implement the IntoLazy trait for LazyFrame (#14323)
  • Implemented tree formatting for LogicalPlan (#14221)
  • Implement mean_horizontal expression (#14369)
  • support decimal comparison (#14338)
  • Implements arr.shift (#14298)
  • Implements list.n_unique (#14306)
  • Do not panic when casting from an empty Series to pl.Decimal (#14330)
  • add u8/i8/u16/i16 parsers to CSV reader (#14241)
  • Implements list.gather_every (#14253)
  • Implements prefix/suffix_fields (#14251)
  • Polish decimal arithmetic (#14172)
  • Introduce arr.to_struct (#14202)
  • Supports map fields name of struct (#14203)
  • make IdxVec generic as UnitVec (#14196)
  • add new arithmetic kernels (#14026)
  • Supports unique and hash_rows for null column (#14111)
  • Implement arithmetic operations for Null columns (#14107)
  • Add strict/non-strict construction of Boolean/Binary series (#14073)
  • Improve Series::from_any_values logic (#14052)
  • Adapt extend_constant to function expr architecture and expressify it (#14058)
  • add integer negation (#14049)
  • list & array measures of dispersion (#13245)
  • gc binview when writing ipc (#14035)
  • When calling convert_time_zone on time-zone-naive datetime, convert as if converting from UTC (#13960)
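The float-equality change in #14617 (treating `-0.0 == 0.0` and `-NaN == NaN` in group-by, joins and unique) can be illustrated with a small canonicalization step applied before hashing. This is a minimal sketch of the semantics using only the standard library; `canonical_key` and `group_counts` are hypothetical names, and the sketch makes no claim about how Polars implements the behavior internally.

```rust
use std::collections::HashMap;

/// Maps a float to a hashable key so that -0.0 and 0.0 share one key
/// and every NaN (regardless of sign or payload) shares another.
pub fn canonical_key(x: f64) -> u64 {
    if x.is_nan() {
        // Collapse all NaN bit patterns to a single canonical NaN.
        f64::NAN.to_bits()
    } else if x == 0.0 {
        // -0.0 == 0.0 compares true, so both take the bits of positive zero.
        0.0_f64.to_bits()
    } else {
        x.to_bits()
    }
}

/// Counts occurrences per canonical key, mimicking a group-by count on a float column.
pub fn group_counts(values: &[f64]) -> HashMap<u64, usize> {
    let mut counts = HashMap::new();
    for &value in values {
        *counts.entry(canonical_key(value)).or_insert(0) += 1;
    }
    counts
}
```

Without the canonicalization step, hashing raw bit patterns would place `0.0` and `-0.0` (and differently signed NaNs) in separate groups even though users generally expect them to be grouped together.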

🐞 Bug fixes

  • fix hashing specialization (#14754)
  • Sum after filter in aggregation context sometimes returned NULL (#14752)
  • Allow list.contains() for list of categoricals (#14744)
  • Fix bug where alias was ignored in COUNT(*) optimization (#14738)
  • Fix DataFrame.sum for decimals (#14732)
  • Fix parallel strategy for LazyFrame not being applied (#14696)
  • Block slice pushdown past non-literal projections or when the projection doesn't contain any columns from the input (#14684)
  • Fix number of rows printed in DataFrame/Series repr (edge cases) (#14548)
  • Fix contention panics in file gc threads (#14690)
  • Fix feature combination (#14688)
  • Only push predicates depending on the subset columns past unique() (#14668)
  • Reading RLE_DICTIONARY-encoded parquet incorrectly coalesced NULL to empty string in some cases (#14670)
  • use correct flooring division/modulo operator in literal optimizer and const_lhs <> series ops (#14671)
  • Enable is_in for string in categorical/enum (#14576)
  • Polars thread pool was not used properly in various functions (#14583)
  • Semi-join and multiple keys outer-join did not respect POLARS_MAX_THREADS (#14571)
  • Correct sorted flag of chunked gather (#14570)
  • ensure the streaming dispatcher can replace placeholders in unions (#14537)
  • Ensure series are contiguous prior to transpose (#14527)
  • write csv header if necessary when finishing sinks (#14518)
  • fix logical dtypes in take_chunked (#14517)
  • fix binary-offset row-encode (#14514)
  • race conditions in OOC writing (#14510)
  • don't gc after variadic buffers are written (#14473)
  • Increase verbosity of duplicate column error message (#11899)
  • Return appropriate data type for duration mean and median (#14376)
  • change print to warn in reading csv from python file like object (#14469)
  • regression in out-of-core group-by by new string-type (#14464)
  • DataFrame.pivot was returning incorrect results when multiple columns were passed to index and one of them was Struct (#14438)
  • remove literal Series from projection state (#14437)
  • pivot was producing incorrect results when (single) index was Struct (#14308)
  • Error on some invalid clip inputs (#14416)
  • Series.hist panicking on empty/all-null (#14407)
  • rechunk series when apply_lambda (#14406)
  • don't make column from filenames, don't ignore directories with (.) (#14317)
  • Remove duplicated content in error messages (#8107)
  • Fix set_operation when the input is sliced and broadcast (#14303)
  • Wrap par_iter in list.to_struct by POOL.install (#14304)
  • Do not panic when casting from an empty Series to pl.Decimal (#14330)
  • Preserve name when casting to Enum (#14320)
  • list.get does not work on list of decimals (#14276)
  • relax precision when up scaling (#14270)
  • Allow format object series with registry (#14272)
  • deduplicate recursive growables (#14264)
  • Fix glimpse overload signature (#14258)
  • allow set operations on list of categoricals (#14110)
  • any/all_horizontal with single input has incorrect type (#14256)
  • load numpy array with np array values #14237 (#14238)
  • Fix join validation for String types (#14229)
  • make csv parser more robust to edge cases (#14210)
  • Fix for set_operations of binary dtype (#14152)
  • fix read_csv date/datetime inference and parsing (#14113)
  • don't see files as hive partitions (#14128)
  • allow eval on list of categoricals (#14132)
  • add missing conditional compile flag for StringFunction::Find (#14129)
  • Forbid casting from Date to Time and vice versa (#14127)
  • preserve old naming convention for multi-value pivot (this will change in 1.0 to no longer redundantly have the column name in the middle) (#14120)
  • Implements gt/lt cmp for null dtype (#14119)
  • ignore comments at beginning of csv if schema provided (#14115)
  • fix pivot when multiple columns are passed. Output is now aligned with what tidyverse / pandas.pivot_table would do (#14048)
  • some temporal conversion errors for datetimes earlier than 1970-01-01 (#14050)
  • Preserve name when casting from categorical (#14085)
  • fix cse bug when window function is nested (#14070)
  • Fix melt panic when there are no value vars (#14057)
  • json_encode should respect the logical type (#14063)
  • improve skip row group using statistics condition (#14056)
  • Raise for .dt.epoch and .dt.timestamp for Duration dtype (#13962)
  • handle SliceSink with empty data (#14025)
  • correct field type schema inference (using read_csv) (#14042)
  • Map AnyValue::Null to datatype Null (#14045)
  • Use int formatter for unsigned ints (#14043)
  • quick fix for multiple chunks binary reverse (#14024)
  • count matches on list categorical (#14021)
  • list.min/max with empty and/or None elements (#14018)
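The flooring division/modulo fix (#14671) touches a distinction that is easy to get wrong in Rust: the built-in `/` and `%` operators on integers truncate toward zero, while flooring division rounds toward negative infinity, so the two disagree whenever the operands have opposite signs and the division is inexact. The helper functions below are a hypothetical sketch of that difference, not the code from the fix.

```rust
/// Flooring integer division: rounds the quotient toward negative infinity.
/// Panics on division by zero and on `i64::MIN / -1`, like the built-in `/`.
pub fn floor_div(a: i64, b: i64) -> i64 {
    let quotient = a / b; // truncates toward zero
    // Adjust only when the division is inexact and the signs differ.
    if a % b != 0 && ((a < 0) != (b < 0)) {
        quotient - 1
    } else {
        quotient
    }
}

/// Flooring modulo: the result is zero or has the same sign as the divisor.
/// Panics on division by zero and on `i64::MIN % -1`, like the built-in `%`.
pub fn floor_mod(a: i64, b: i64) -> i64 {
    let remainder = a % b; // has the sign of the dividend
    if remainder != 0 && ((remainder < 0) != (b < 0)) {
        remainder + b
    } else {
        remainder
    }
}
```

For example, `-7 / 2` evaluates to `-3` with the built-in operator, whereas `floor_div(-7, 2)` gives `-4` and `floor_mod(-7, 2)` gives `1`, so that `floor_div(a, b) * b + floor_mod(a, b) == a` still holds.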

📖 Documentation

  • Link to plugins tutorial more prominently (#14727)
  • Separate "writing a plugin" from "registering an expression" in user guide, add some extra links, don't use deprecated _register_plugin (#14621)
  • Remove some outdated information in polars crate docs (#14608)
  • Fix code block path for group by example in getting started guide (#14612)
  • Add missing 'string' column in reading-writing Rust example to match Python example (#14597)
  • Fix typo of "Cartesian" product (#14585)
  • Mention in contributing guide that PR titles should start with an uppercase letter (#14584)
  • Fix markdown newline for rendering function description in VSCode (#14567)
  • Clarify doc summary of upsample_stable (#13623)
  • Clean up grammar and capitalization in README.md (#14488)
  • Fix typo in plugins section (#14402)
  • Add debugging section to contributing docs (#10576)
  • Fix some typos (#14394)
  • Realign file structure of user guide (#14360)
  • Rust examples for data structures in user guide (#14339)
  • Add deprecation period policy example for post-1.0.0 (#14184)
  • Fix capitalization of user guide references (#14291)
  • fix code block in user-guide/lazy/schemas (#14228)
  • Fix typo in contributing guide (#14181)
  • Small improvements Ecosystem page (#14176)
  • fix code blocks in user-guide/concepts/data-structures (#14146)
  • Fix bullet point formatting in CI contributing guide (#14117)
  • Remove outdated reference to horizontal concat feature (#14105)
  • Replace alternatives page with more objective comparison (#13784)

📦 Build system

  • update ahash (#14731)
  • Limit CMake threads to fix crash compiling libz-ng-sys on macOS (#14715)
  • Fix json feature for polars-sql crate (#14501)
  • Enable feature nightly with optional sql feature (#14222)

🛠️ Other improvements

  • update ahash (#14731)
  • replace transmute with bytemuck cast (#14747)
  • Limit CMake threads to fix crash compiling libz-ng-sys on macOS (#14715)
  • Refactor AnyValue casting logic (#13140)
  • update rustc (#14678)
  • remove redundant imports in all crates (#14662)
  • remove redundant imports up to polars-io, polars-time, polars-ops (#14658)
  • remove redundant imports (up until polars-core) (#14646)
  • Simplify compressed_chunk_size calculation and leave comments to explain for rle encode (#14634)
  • Rename coverage file (#14607)
  • Format safety sections in Rust docstrings (#14446)
  • Refactor code coverage workflow (#14563)
  • Disable status from code coverage (#14545)
  • Add code coverage CI (#14532)
  • Format safety comments (#14447)
  • Bump release drafter to v6 (#14429)
  • Bump setup-graphviz action to v2 (#14418)
  • Update make clean command (#14408)
  • Minor refactor to satisfy clippy (#14364)
  • make gather_chunked completely generic (#14195)
  • Add .cargo directory to .gitignore (#14191)
  • take_chunked to polars-ops (#14185)
  • Enable clippy lint to warn on debug macros (#14178)
  • Run cargo update (#14160)
  • merge take kernels (#14137)
  • improve From<Ca> -> Vec (#14123)
  • hoist boolean -> string cast (#14122)
  • Remove DatetimeChunked::convert_time_zone (#14046)
  • More generic way to present an expression tree diagram (#14020)
  • Rename LiteralValue::to_anyvalue to LiteralValue::to_any_value (#14033)

Thank you to all our contributors for making this release possible! @BGR360, @CBell045, @CaselIT, @FBruzzesi, @JulianCologne, @Kylea650, @MarcoGorelli, @Migi, @NedJWestern, @Object905, @Vincenthays, @Wainberg, @alexander-beedie, @apcamargo, @braaannigan, @bsubei, @c-peters, @dannyfriar, @deanm0000, @dependabot, @dependabot[bot], @dpinol, @eLVas, @edavisau, @eitsupi, @engdoreis, @flisky, @grinya007, @i-aki-y, @ion-elgreco, @itamarst, @janosh, @jdanford, @kalekundert, @lukemanley, @mbuhidar, @mcrumiller, @nameexhaustion, @orlp, @petrosbar, @r-brink, @rben01, @reswqa, @rijkvp, @ritchie46, @stinodego, @taki-mekhalfa and @thomasfrederikhoeck