Polars: rs-0.45.0.1 Release

Release date: December 8, 2024
Previous version: rs-0.44.2 (released November 1, 2024)
Magnitude: 58,720 Diff Delta
Contributors: 47 total committers

578 Commits in this Release


Top Contributors in rs-0.45.0.1

coastalwhite
nameexhaustion
rodrigogiraoserrao
alexander-beedie
ritchie46
itamarst
orlp
adamreeve
MarcoGorelli
mcrumiller


Release Notes Published

💥 Breaking changes

  • Remove dedicated sink_(parquet/ipc)_cloud functions (#20164)
  • Experimental cloud write support (#20129)

🚀 Performance improvements

  • Add fast paths for series.arg_sort and dataframe.sort (#19872)
  • Utilize the RangedUniqueKernel for Enum/Categorical (#20150)
  • Reduce memory copy when scanning from Python objects (#20142)
  • Don't instantiate validity mask when unneeded in Parquet (#20149)
  • Expand more filters (#20022)
  • Cache the DataFrame schema in get_column_index (#20021)
  • Reduce the size of row encoding UTF-8 (#19911)
  • Memoize duplicates in rolling-gb-dyn (#19939)
  • More efficient row encoding for pl.List (#19907)
  • Halve the size of Booleans in row encoding (#19927)
  • Rolling 'iter_lookbehind' breeze through duplicates (#19922)
  • Initially trim leading and trailing filtered rows (#19850)
  • Increase default async thread count for low core count systems (#19829)
  • Move row group decode off async thread for local streaming parquet scan (#19828)
  • Support use of Duration in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix (#19697)
  • Improve DataFrame.sort().limit/top_k performance (#19731)
  • Improve cloud scan performance (#19728)
  • Fix quadratic 'with_columns' behavior (#19701)
  • Improve hive partition pruning with datetime predicates from SQL (#19680)
  • Allow for arbitrary skips in Parquet Dictionary Decoding (#19649)
  • Reorder conditions in is_leap_year (#19602)
  • Rechunk in DataFrame.rows if needed (#19628)
  • Dispatch Parquet Primitive PLAIN decoding to faster kernels when possible (#19611)
  • Use faster iteration in 'starts_with'/'ends_with' (#19583)
  • Branchless Parquet Prefiltering (#19190)
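
Item #19602 above concerns the order of the tests in a leap-year check. The Polars change itself is not reproduced here; what follows is only a minimal Python sketch, with a hypothetical function name, of why the ordering can matter: testing divisibility by 4 first lets three out of four years exit after a single comparison.

```python
def is_leap_year(year: int) -> bool:
    """Gregorian leap-year test (illustrative only, not the Polars code)."""
    # Three out of four years fail this test, so checking it first
    # lets most inputs return after a single modulo operation.
    if year % 4 != 0:
        return False
    # Among multiples of 4, only century years need the 400 rule.
    return year % 100 != 0 or year % 400 == 0


print([y for y in (1900, 2000, 2023, 2024) if is_leap_year(y)])
# prints [2000, 2024]
```

The result is identical for every ordering of the three conditions; only the average number of comparisons changes.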

✨ Enhancements

  • Retry with reloaded credentials on cloud error (#20185)
  • Support reading Enum dtype from csv (#20188)
  • Allow sorting of lists and arrays (#20169)
  • Add maintain_order parameter to joins (#20026)
  • Allow for to_datetime / strftime to automatically parse dates with single-digit hour/minute/second (#20144)
  • Experimental cloud write support (#20129)
  • Allow setting and reading custom schema-level IPC metadata (#20066)
  • Add optimized row encoding for Decimals (#20050)
  • Add drop_nans method to DataFrame and LazyFrame (#20029)
  • Catch use of 'polars' in to_string for non-Duration dtypes and raise an informative error (#19977)
  • Add AhoCorasick backed 'find_many' (#19952)
  • Speed up starts_with for small prefixes (#19904)
  • Auto-enable hive partitioning if hive_schema was given (#19902)
  • Add pl.concat_arr to concatenate columns into an Array column (#19881)
  • Support both "iso" and "iso:strict" format options for dt.to_string (#19840)
  • Add rounding for Decimal type (#19760)
  • Improved array arithmetic support (#19837)
  • Raise informative error on Unknown unnest (#19830)
  • Support use of Duration in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix (#19697)
  • Allow specification of chunk_size on LazyCsvReader.read_options (#19819)
  • Add an is_literal method to expression meta namespace (#19773)
  • A different approach to warning users of fork() issues with Polars (#19197)
  • Add dylib (#19759)
  • Add IPC source node for new streaming engine (#19454)
  • Implement max/min methods for dtypes (#19494)
  • Improve hive partition pruning with datetime predicates from SQL (#19680)
  • Parallel IPC sink for the new streaming engine (#19622)
  • Add SQL support for RIGHT JOIN, fix an issue with wildcard aliasing (#19626)
  • Add show_graph to display a GraphViz plot for expressions (#19365)

🐞 Bug fixes

  • Don't trigger length check in array construction (#20205)
  • Allow row encoding for 32-bit architectures (e.g. WASM) (#20186)
  • Properly project unordered column in parquet prefiltered (#20189)
  • CSV: stop SIMD cache if EOL char is hit (#20199)
  • Estimated size for object (#20191)
  • Respect parallel argument in parquet (#20187)
  • Only validate UTF-8 for selected items when all below len 128 (#20183)
  • Serialize categories of Enum in arrow metadata (#20181)
  • Don't use RLE encoding for Parquet Boolean (#20172)
  • Invalid bitwise_xor for ScalarColumn (#20140)
  • Add temporal feature gate in is_elementwise_top_level (#20177)
  • Column name mismatch or not found in Parquet scan with filter (#20178)
  • Raise if apply returns different types (#20168)
  • Deal with masked out list elements (#20161)
  • Fix index out of bounds in uniform_hist_count (#20133)
  • Implement arg_sort for Null series (#20135)
  • Handle slice pushdown in PythonUDF GroupBy (#20132)
  • Check shape for *_horizontal functions (#20130)
  • Properly coerce types in lists (#20126)
  • Incorrect aggregation of empty groups after slice (#20127)
  • DataFrame .get_column after drop_in_place (#20120)
  • Subtraction with underflow on empty FixedSizeBinaryArray (#20109)
  • Materialize smallest dyn ints to use feature gate for i8/i16 (#20108)
  • Return null instead of 0. for rolling_std when window contains a single element and ddof=1 and there are nulls elsewhere in the Series (#20077)
  • Only slice after sort when slice is smaller than frame length (#20084)
  • Preserve Series name in __rpow__ operation (#20072)
  • Allow nested is_in() in when()/then() for full-streaming (#20052)
  • Fix datetime cast behavior for pre-epoch times (#19949)
  • Improve hist binning around breakpoints (#20054)
  • Fix invalid len due to projection pushdown selection of scalar (#20049)
  • Fix empty scalar agg type (#20051)
  • Improve binning in Series.hist with bin_count when all values are the same (#20034)
  • Less intrusive forking warnings (#20032)
  • Reading nullable sliced / masked Categoricals from Parquet (#20024)
  • Regression in hist panicking on out of bounds index (#20016)
  • Fix starts_with out of bounds (#20006)
  • Fix incorrect column order for parquet scan with hive columns in file (#19996)
  • Incorrectly gave list.len() for masked-out rows (#19999)
  • Bug fix in existing fast path for sorted series (#20004)
  • Incorrect collect_schema() for fill_null() after an aggregation expression in group-by context (#19993)
  • Fix Decimal type fill_null (#19981)
  • Fix panic on schema merge for prefiltering (#19972)
  • Fix lazy frame join expression (#19974)
  • Fix gather_every for Scalar (#19964)
  • Toggle 'fast_unique' on new_from_index (#19956)
  • Raise proper error message when too small interval is passed to datetime_range (#19955)
  • Fix scalar object (#19940)
  • Raise InvalidOperationError for invalid float to decimal casts (e.g. Inf, NaN) (#19938)
  • Fix panic with combination of hive and parquet prefiltering (#19905)
  • Fix panic when joining with empty frame (debug only) (#19896)
  • Fix incorrect result from inequality filter after join on LazyFrame (#19898)
  • Misleading ShapeError error message on dataframe creation (#19901)
  • Fix panic with empty delta scan, or empty parquet scan with a provided schema (#19884)
  • Ensure type object of inputs for cached any-value conversion functions are kept alive (#19866)
  • Fix panic using scan_parquet().with_row_index() with hive partitioning enabled (#19865)
  • Improve histogram bin logic (#18761)
  • Raise informative error instead of panicking for list arithmetic on some invalid dtypes (#19841)
  • Properly handle Zero-Field Structs in row encoding (#19846)
  • Incorrect explode schema for LazyFrame.explode() (#19860)
  • Ensure List element truncation ellipses respect ASCII* table formats (#19835)
  • Validate subnodes in validate IR (#19831)
  • Raise if merge non-global categoricals in unpivot (#19826)
  • Type hints for window_size incorrectly included timedelta in some rolling functions (#19827)
  • Don't panic if column not found (#19824)
  • Fix gather of Scalar null + idx w/ validity (#19823)
  • Fix object chunked gather (#19811)
  • Fix inconsistency between code and comment (#19810)
  • Fix filter scalar nulls (#19786)
  • Altair tooltip was being incorrectly applied to plots which did not accept it (#19789)
  • Fix scanning google cloud with service account credentials file (#19782)
  • Fix incorrect filter after right-join on LazyFrame (#19775)
  • Fix incorrect lazy schema for explode on array columns (#19776)
  • Fix incorrect lazy schema for aggregations (#19753)
  • Fix validation for inner and left join when join_nulls is unflagged (#19698)
  • SQL ELSE clause should be implicitly NULL when omitted (#19714)
  • In group_by_dynamic, period and every were getting applied in reverse order for the window upper boundary (#19706)
  • Only allow list.to_struct to be elementwise when width is fixed (#19688)
  • Make Array arithmetic ops fully elementwise (#19682)
  • Update line-splitting logic in batched CSV reader (#19508)
  • Fix incorrect lazy schema for explode() in agg() (#19629)
  • Fix filter incorrectly pushed past struct unnest when unnested column name matches upper column name (#19638)
  • Ensure mean_horizontal raises on non-numeric input (#19648)
  • Reorder conditions in is_leap_year (#19602)
  • Copy height in .vstack() for empty dataframes (#19641) (#19642)
  • Run join type coercion with correct schemas active (#19625)
  • Correct wildcard and input expansion for some more functions (#19588)
  • Allow .struct.with_fields inside list.eval (#19617)
  • Sortedness was incorrectly being preserved in dt.offset_by when offsetting by non-constant durations in the timezone-naive case (#19616)
  • Fix incorrect scan_parquet().with_row_index() with non-zero slice or with streaming collect (#19609)
  • Fix mask and validity confusion in Parquet String decoding (#19614)
  • Parquet decoding of nested dictionary values (#19605)
  • Do not attempt to load default credentials when credential_provider is given (#19589)
  • Fix gather len in group-by state (#19586)
  • Added input validation for explode operation in the array namespace (#19163)
  • Improve error message (#19546)
  • Fix predicate pushdown into inequality joins (#19582)

📖 Documentation

  • Add more Rust examples to User Guide (#20194)
  • Expand plotting docs (#19719)
  • Fix Rust examples in user guide (#20075)
  • Update by param description for rolling_*_by functions (#19715)
  • Fix inconsistency between code and comment (#20070)
  • Correct supported compression formats (#20085)
  • Specify strictness in cast (#20067)
  • Fix broken links to user guide (#19989)
  • Minor doc fixes and cleanup (#19935)
  • Complete parameters description and add an example for clip() (#19875)
  • Fix some warnings during docs build (#19848)
  • Change dprint config (#19747)
  • Fix formatting of nested list (#19746)
  • Add meta.is_column to API docs (#19744)
  • Fix join API reference links (#19745)
  • Revise and rework user-guide/expressions (#19360)
  • Update Excel page of user guide to refer to fastexcel as the default engine (#19691)
  • Alter examples for round_sig_figs to make behaviour clearer (#19667)
  • Assorted fixes to Rust API docs (#19664)
  • Improve replace and replace_all docstring explanation of the "$" character with reference to capture groups (vs use as a literal) (#19529)

📦 Build system

  • Upgrade sqlparser-rs from version 0.49 to 0.52 (#20110)
  • Bump memmap2 to version 0.9 (#20105)
  • Bump object_store to version 0.11 (#20102)
  • Bump fs4 to version 0.12 (#20101)
  • Fix path to polars-dylib crate in workspace (#20103)
  • Bump thiserror to version 2 (#20097)
  • Bump atoi_simd to version 0.16 (#20098)
  • Bump chrono-tz to 0.10 (#20094)
  • Update Rust dependency ndarray to 0.16 (#20093)
  • Bump Rust toolchain to nightly-2024-11-28 (#20064)
  • Pin maturin (#20063)
  • Use public windows runners in python release (#19982)
  • Add windows-aarch64 to python binaries (#19966)

🛠️ Other improvements

  • Deprecate ddof parameter for correlation coefficient (#20197)
  • Move Bitwise aggregations to FunctionExpr (#20193)
  • Add ragged lines test (#20182)
  • Remove dedicated sink_(parquet/ipc)_cloud functions (#20164)
  • Move new-streaming parquet and CSV sources to under io_sources/ (#20160)
  • Move horizontal methods to polars-ops (#20134)
  • Remove useless SeriesTrait::get implementations (#20136)
  • Add a bunch more automated row encoding sortedness tests (#20056)
  • Replace custom PushNode trait with Extend (#20107)
  • Update AWS doc dependencies (#20095)
  • Move cast from polars-arrow to polars-compute (#19967)
  • Implement nested row encoding / decoding (#19874)
  • Remove use of cast in ArrowArray::new (#19899)
  • Switch back to PyO3 0.22 (#19851)
  • Make chunked gathers generic over chunk bit width (#19856)
  • Add proper tests for row encoding (#19843)
  • Add ToField context for common args (#19833)
  • Add new streaming CSV source (#19694)
  • Add BytesIndexMap and use in RowEncodedHashGrouper (#19817)
  • Use HashKeys abstraction (#19785)
  • Migrate polars-expr AggregationContext to use Column (#19736)
  • Add InMemoryJoin to new-streaming engine (#19741)
  • Use Column for the {try,}_apply_columns{_par,} functions on DataFrame (#19683)
  • Remove more @scalar-opt (#19666)
  • Move Series bitops to std::ops::Bit... (#19673)
  • Mark test_parquet.py test_dict_slices as slow (#19675)
  • Get Column into polars-expr (#19660)
  • Remove unused file (#19661)
  • Delegate feature flags for polars-stream (#19659)
  • Streamline internal SQL join condition processing (#19658)
  • Factor out logic for re-use by new streaming CSV source (#19637)
  • Configure grouped Dependabot updates (#19604)
  • Share source token between all sender tasks of source nodes in new-streaming engine (#19593)
  • Fix PyO3 error in CI (#19545)
  • Update nightly compiler version (#19590)
  • Added input validation for explode operation in the array namespace (#19163)
  • Remove MutableStructArray (#19587)
  • Fix lint (#19584)
  • Add a Column::Partitioned variant (#19557)
  • Move to fast-float2 (#19578)
  • Only run remote bench on rust changes (#19581)

Thank you to all our contributors for making this release possible! @3tilley, @DzenanJupic, @MarcoGorelli, @TNieuwdorp, @YichiZhang0613, @alexander-beedie, @barak1412, @braaannigan, @cmdlineluser, @coastalwhite, @corwinjoy, @dependabot, @dependabot[bot], @eitsupi, @engylemure, @etiennebacher, @flowlight0, @gab23r, @henryharbeck, @iharthi, @iliya-malecki, @ion-elgreco, @itamarst, @jackxxu, @janpipek, @jqnatividad, @letkemann, @lukapeschke, @lukemanley, @max-muoto, @mcrumiller, @mhogervo, @nameexhaustion, @orlp, @ptiza, @ritchie46, @rodrigogiraoserrao, @siddharth-vi, @sn0rkmaiden, @stijnherfst, @stinodego, @wence- and @wsyxbcl