🏆 Highlights
- Add new Enum categorical data type which allows a fixed set of categories (#11822)
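To illustrate what "a fixed set of categories" implies, here is a minimal plain-Python sketch. It is not Polars code and does not reflect how Polars implements Enum; the class name `FixedCategories` and its `encode`/`decode` methods are hypothetical, and passing nulls through unchanged is an assumption made for the example.

```python
class FixedCategories:
    """Hypothetical helper: maps a fixed, ordered set of categories to integer codes."""

    def __init__(self, categories):
        if len(set(categories)) != len(categories):
            raise ValueError("categories must be unique")
        self.categories = list(categories)
        # Each category gets the code of its position in the declared list.
        self._codes = {cat: i for i, cat in enumerate(self.categories)}

    def encode(self, values):
        """Encode values as integer codes; None stays None, unknown values raise."""
        out = []
        for v in values:
            if v is None:
                out.append(None)
            elif v in self._codes:
                out.append(self._codes[v])
            else:
                # The category set is fixed, so new values are rejected, not added.
                raise ValueError(f"{v!r} is not one of the fixed categories")
        return out

    def decode(self, codes):
        """Map integer codes back to their category labels."""
        return [None if c is None else self.categories[c] for c in codes]


sizes = FixedCategories(["small", "medium", "large"])
codes = sizes.encode(["large", "small", None, "medium"])
```

Here `codes` is `[2, 0, None, 1]`, and `sizes.encode(["huge"])` raises a `ValueError` because "huge" is not in the declared set.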
💥 Breaking changes
- Rename Utf8 data type to String (#13224)
- Rename set_at_idx to scatter (#12687)
- Preserve left and right join keys in outer joins (#12963)
- Implement dtype parameter for int_range on Rust side (#12940)
- Update Expr.count to ignore null values by default (#12934)
- Change value_counts resulting column name from counts to count (#12506)
- Change default join behavior with regard to nulls, add join_nulls parameter to keep existing behavior (#12840)
- Smaller integer data types for datetime components (#12070)
- Fix NaN ordering to make NaNs compare greater than any other float, and equal to themselves (#12721)
- Rename frame_equal/series_equal to equals (#12663)
- Rename not_ expression to not on the Rust side (#12587)
- Rename str.json_extract to str.json_decode (#12586)
- Rename DataFrame column index methods (#12542)
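The NaN ordering change above (NaNs compare greater than any other float, and equal to themselves) can be sketched in plain Python. This is an illustration of the stated rule only, not Polars' implementation; `total_cmp` and `total_eq` are hypothetical helper names.

```python
import math
from functools import cmp_to_key


def total_cmp(a: float, b: float) -> int:
    """Return -1, 0 or 1; NaN equals NaN and is greater than any other float."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # Both NaN -> 0; only a is NaN -> 1; only b is NaN -> -1.
        return (a_nan > b_nan) - (a_nan < b_nan)
    return (a > b) - (a < b)


def total_eq(a: float, b: float) -> bool:
    """Equality under the ordering above, so NaN is equal to itself."""
    return total_cmp(a, b) == 0


values = [float("nan"), 1.0, float("inf"), -2.5]
ordered = sorted(values, key=cmp_to_key(total_cmp))
```

With this comparator, `ordered` is `-2.5, 1.0, inf, nan`, and `total_eq(nan, nan)` is `True`, whereas the plain IEEE comparison `nan == nan` is `False`.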
🚀 Performance improvements
- optimize set bit count (#13317)
- speed up .dt.truncate for large numbers of years (#13310)
- don't eagerly evaluate error branches (#13311)
- don't needlessly allocate validity in concat/rechunk (#13288)
- add fast path to count_bits_set_by_offsets (#13253)
- make .dt.truncate('*mo') more than 3x faster (#13192)
- ensure single expression evaluation for replace (#13147)
- Elide allocation in outer join materialization (#12992)
- Ensure we reduce for any/all_horizontal (#12976)
- Add fast paths for UTC in truncate (#12965)
- Improve rolling_median algorithm (#12704)
- Use fast path for non-null data in new SQL-like null matching (#12874)
- improve merge_local_rhs_categorical traversal (#12660)
- make values_size estimate correct for sliced arrays (#12658)
- improve parquet utf8 validation (#12655)
- parquet pre-allocate buffer in binary plain encode (#12652)
- optimize dict binary decoding in parquet (#12648)
- ensure we only check the values within bounds (#12633)
- parquet; elide recursion in hot path (#12625)
- improve cov/corr algorithm (#12590)
- apply left side predicate pushdown also to right side on semi join (#12565)
- ensure streaming parquet download remains concurrent ~7x (#12552)
- speed up parquet download of streaming engine (#12544)
✨ Enhancements
- support negative indices in gather in group_by context (#13373)
- support negative indexing in gather (select context) (#13343)
- support min_periods for temporal rolling aggregations (#13342)
- support REGEXP and RLIKE pattern matching in SQL engine (#13359)
- gracefully handle panics in plugins (#13329)
- Implement unique/n_unique/unique_counts/is_unique/is_duplicated for Null series (#13307)
- support common variant spelling STDEV in the SQL engine (in addition to STDDEV) (#13303)
- change doc links to new url docs.pola.rs (#13290)
- support horizontal concatenation of LazyFrames (#13139)
- Impl serde for array dtype (#13168)
- dispatch strict_cast via cast (#13255)
- Impl any/all for array type (#13250)
- add cancellable queries (#13178)
- add offset parameter to gather_every (#13156)
- Support Array dtype AnyValue Series construction (#12817)
- Allow step parameter in int_ranges to take an expression (#13148)
- Implement count for DataFrame/LazyFrame (#13153)
- Move from GA to more privacy friendly framework (#13155)
- Rename set_at_idx to scatter (#12687)
- prune all/any_horizontals with single inputs (#13146)
- ensure we get cleaner logical plans with any/all_horizontal (#13144)
- Add str.contains_any and str.replace_many (Aho-Corasick algorithms) (#13073)
- Auto-infer credentials from .aws folder (#13062)
- Support private cloud S3 storage in scan_parquet (#13060)
- Allow order operators (<,>,>=,<=) on Enum types (#12982)
- Reimplement replace expression on the Rust side (#13002)
- Use tokio semaphore for concurrency handling (#13026)
- Improve and expressify hist (#13014)
- Preserve left and right join keys in outer joins (#12963)
- Allow end before start in date/time_range (#12964)
- Implement group-tuples for Null dtype (#12975)
- Implement dtype parameter for int_range on Rust side (#12940)
- Cast to an enum from int (#12954)
- Move categorical ordering into dtype (#12911)
- Update Expr.count to ignore null values by default (#12934)
- Enable partial predicate pushdown past window expressions (#12710)
- Add str.reverse (#12878)
- Change value_counts resulting column name from counts to count (#12506)
- Implement std and var for Duration columns (#12865)
- Change default join behavior with regard to nulls, add join_nulls parameter to keep existing behavior (#12840)
- Preserve base dtype when raising to UInt power (#10446)
- Smaller integer data types for datetime components (#12070)
- Support SQL subqueries for JOIN and FROM (#12819)
- parquet support required deltabyte encoding (#12836)
- Add new Enum categorical data type which allows a fixed set of categories (#11822)
- support nested null in vstack/append/extend/concat (#12771)
- Improve error messages on attempted Arrow conversions involving incompatible/unknown dtypes (#12421)
- determine mode parallelism depending on current tasks (#12764)
- enable slice push down past with_columns (#12742)
- implement From<LazyGroupBy> for LazyFrame (#12562)
- Rename frame_equal/series_equal to equals (#12663)
- Join operations on local categoricals (#12657)
- use RLE_DICTIONARY for integers in parquet (#12647)
- Add configuration option for where Polars spills to disk (#12595)
- implement RLE_DICT encoding for utf8/binary columns (reduced parquet file size) (#12623)
- implement 'DeltaByteArray' decoding for parquet (#12602)
- warn if by column is not sorted in rolling aggregations (as opposed to raising), add warn_if_unsorted argument (#12398)
- struct -> json encoding expression (#12583)
- Implement support for multi-character comments in read_csv (#12519)
- Implement LazyFrame.sink_ndjson (#10786)
- improve concurrency parameters (#12567)
- Adds sink_ipc_cloud (#12556)
- Adds sink_ipc_cloud (#11008)
- In explain(), rename PIPELINE to STREAMING so it's clearer what it means (#12547)
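Reading the join_nulls entry above together with the "SQL-like null matching" performance entry, the new default appears to be that null join keys no longer match each other, while join_nulls restores the earlier behavior. The plain-Python sketch below illustrates that reading on lists of (key, value) rows; it is not Polars code, and the `inner_join` helper is hypothetical.

```python
def inner_join(left, right, join_nulls=False):
    """Inner-join two lists of (key, value) rows on key.

    join_nulls=False: a None key never matches, not even another None (SQL-like).
    join_nulls=True:  None keys match each other.
    """
    out = []
    for lkey, lval in left:
        for rkey, rval in right:
            if lkey is None or rkey is None:
                # Null keys only pair up when both are null and join_nulls is set.
                matched = join_nulls and lkey is None and rkey is None
            else:
                matched = lkey == rkey
            if matched:
                out.append((lkey, lval, rval))
    return out


left = [(1, "a"), (None, "b")]
right = [(1, "x"), (None, "y")]

sql_like = inner_join(left, right)
nulls_match = inner_join(left, right, join_nulls=True)
```

Here `sql_like` contains only `(1, "a", "x")`, while `nulls_match` also contains `(None, "b", "y")`.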
🐞 Bug fixes
- range/ranges output name should follow lhs rule (#13369)
- updated Display trait for enum categoricals (#13331)
- nested dtypes: export logical type in plugins (#13325)
- fix invalid dtype setting in array (#13327)
- fix csv parser error when commented-out rows precede the header row (#13318)
- invalid schema outer join after projection pd (#13315)
- invalid predicate optimization (#13313)
- Account for null values in categorical unique/n_unique (#13308)
- fix schema when subtracting (#13309)
- broadcasting of unit LHS in string operations (#12737)
- casting list/arr to arr/list shouldn't convert chunks to logical type (#13259)
- sorting categorical lexically bugs on null values (#13271)
- improve replace on categoricals (#13223)
- round trip to JSON and back should preserve Enum type (#13267)
- enable and fix SIMD in polars-compute (#13251)
- match_chunks shouldn't change the dtype (#13222)
- sink_csv deadlock (#13239)
- is_in operator for categoricals (#13205)
- Better handle mismatched dtypes in replace (#13213)
- Fix replace fast path by casting old input to the right data type (#13176)
- ndjson nested null schema inference (#13206)
- slice for NullChunked no longer forces a single chunk (#13174)
- don't cast to unknown dtypes (#13197)
- Allow casting nullable list to array (#13196)
- maintain old join behavior in window expression (#13179)
- Fix comparison of categoricals (#13137)
- Use the name of the leftmost expression in horizontal operations (#13143)
- any_value should support cast to boolean (#13125)
- Update offsets of null value correctly for all from_iter_xxx_trusted_len (#13132)
- fix neq for series cmp str (#13128)
- fix category list builder append series with multiple chunks (#13116)
- repeat_by should not raise if by contains nulls (#13105)
- [csv] raise on single quote char (#13104)
- Raise if scan zstd compressed csv file (#13102)
- Don't check map length if input is literal (#13098)
- use FunctionExpr's scalar return type for is_in (#13091)
- rolling_quantile can get incorrect state (#13088)
- Fix off-by-one error in quantile(method="nearest") (#13058)
- Fix incorrect schema inference on nested columns (#13057)
- Don't raise for datetime_range if starting on ambiguous datetime and earliest was specified (#13050)
- add cast safety to literals (#12983)
- Parse json_decode per max buffer length (#13029)
- Parse 00:00 time zone as UTC (#13034)
- Fix timeout errors in concurrent downloads (#13023)
- Fix SQL substring indexing (#13016)
- Allow broadcasting in ranges (#11900)
- Prevent deadlock in sink_csv (#12991)
- Don't get mutable if buffer is sliced (#12979)
- Dataframes with Decimal columns cannot be pickled (#12955)
- Fix truncate when truncating by multiple weeks (#12948)
- Fix segfault / memory corruption after plugins return Err result (#12953)
- Don't panic when ambiguous parameter is not Utf8 (#12913)
- don't panic on empty df in merge_sort (#12923)
- Patch rolling_var/rolling_std numerical stability (#12909)
- Fix incorrect Int16 min/max due to incorrect SIMD mask construction (#12908)
- Fix OOB error in list set operations on empty frame (#12845)
- Fix repr of Expr.gather (which was still showing deprecated take) (#12864)
- Fix nan_min/max incorrectly aggregating chunks with addition (#12848)
- write only one dict page per row group (#12831)
- incorrect values from parquet RLE decoding (#12818)
- Handle aggregation for all-NaN groups in group_by (#12304)
- Use total float ordering in is_in (#12800)
- Fix NaN ordering to make NaNs compare greater than any other float, and equal to themselves (#12721)
- don't use streaming engine if aggregate is unknown (#12769)
- hold align_chunks_invariant (#12738)
- allow leading zero and plus in integer parsing (#12744)
- csv lines iter, always return remainder (#12739)
- fix oob in set operations (#12736)
- undo regression in ability to read certain parquet files (#12731)
- corr return nan if denominator is invalid (#12708)
- parquet decimal statistics and schema (#12705)
- support append/extend with null series (#11824) (#12686)
- fix carrying over infinity into other windows (#12685)
- json null inference (#12677)
- cov/corr respect f32 type (#12676)
- fix ternary zip_with null broadcast (#12668)
- support negative slice on eager frame (#12644)
- fix concurrency budget assertion (#12641)
- fix oob in set operations (#12640)
- Rename not_ expression to not on the Rust side (#12587)
- panic reading parquet nested struct column (#12614)
- features: performant,lazy,random (#12600)
- error when invalid list to array is given (#12584)
- parquet: do not extend existing nested that is already complete (#12569)
- accidental panic if predicate selects no files (#12575)
- fix lazy parquet slice with nested columns (#12558)
- ensure stats-evaluator exists (#12566)
- list schema of list eval (#12563)
- ensure concurrency budget never locks (#12555)
- Fix lazy schema for group_by_dynamic and rolling (#12551)
- address overflow on vec capacity calculation for int_ranges with negative step (#12548)
🛠️ Other improvements
- Update CODEOWNERS (#13292)
- Change base url of docs/guide to docs.pola.rs (#13281)
- Add note about Rust examples versioning in user guide (#13280)
- split-up file_sink module (#13256)
- Rename Utf8 data type to String (#13224)
- update rustc (#13219)
- fix horizontal concatenation documentation (#13141)
- Set minimum version for bytemuck to 1.11 (#13191)
- bump sysinfo from 0.29.11 to 0.30.0 (#13188)
- Remove polars-algo reference in Cargo.toml (#13187)
- Use the name of the leftmost expression in horizontal operations (#13143)
- make pre_agg generic (#13150)
- move StaticArray to polars-arrow (#13106)
- ensure we get cleaner logical plans with any/all_horizontal (#13144)
- Update auto_explode param name to returns_scalar (#13119)
- don't compile polars-ops by default (#13100)
- update user-defined-functions for 0.19.x (#13071)
- Linting updates (#13069)
- take pl.concat out of StringCache context manager in "mismatched string cache" error message (#13076)
- add Enum to dtype list (#13080)
- further use TotalOrd (#13046)
- Minor typo fix (#13003)
- use new MinMax kernels (#12961)
- Refer to arrow crate unambiguously from polars-parquet (#12939)
- Fix issue with docs for group_by_dynamic (#12906)
- Fix failing tests (#12859)
- Update make check to only check polars crate (#12834)
- apply TotalOrd in more places (#12810)
- Use latest atoi_simd release (#12748)
- simplify rolling_median update (#12745)
- move nan_cmp and IsFloat to polars_utils (#12691)
- remove utf8 code in favor of binary (#12604)
- update custom allocator instructions to include macOS (#12593)
- Rename str.json_extract to str.json_decode (#12586)
- parquet refactors (#12574)
- convert all recursive parquet deserialize to iterative (#12560)
- Rename DataFrame column index methods (#12542)
Thank you to all our contributors for making this release possible!
@0siride, @MarcoGorelli, @Object905, @PierreAttard, @Qqwy, @RoDmitry, @SeanTroyUWO, @TNieuwdorp, @Yerachmiel-Feltzman, @adamreeve, @alexander-beedie, @c-peters, @cardoso, @cjfuller, @dependabot, @dependabot[bot], @dmitrybugakov, @eitsupi, @fernandocast, @gab23r, @ion-elgreco, @itamarst, @jankislinger, @jeroenboeye, @kszlim, @mcrumiller, @nameexhaustion, @oli-clive-griffin, @orlp, @paddymul, @petrosbar, @r-brink, @rancomp, @reswqa, @ritchie46, @rob-sil, @robvanmieghem, @romanovacca, @stinodego, @tkarabela, @uchiiii and @xuestrange