🏆 Highlights
- Out-of-core unique (#8573)
⚠️ Breaking changes
- Rename `concat_lst` to `concat_list` (#8597)
- Schema improvements (#8286)
- don't create duplicate pivot names (#8002)
- rename `toggle_string_cache` to `enable_string_cache` (#7970)
- change top_k(descending) -> bottom_k (#7969)
- in `sort`, `top_k`, `sort_by`, and `arg_sort_by`, raise if `descending` is a sequence and its length doesn't match the number of columns to sort by (#7957)
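
The `descending` length check from #7957 can be pictured with a small pure-Python sketch. This is not the Polars implementation: `normalize_descending` is a hypothetical helper, and broadcasting a single boolean to every sort column is an assumption made for illustration.

```python
from typing import List, Sequence, Union


def normalize_descending(
    by: Sequence[str], descending: Union[bool, Sequence[bool]]
) -> List[bool]:
    """Hypothetical helper: return one `descending` flag per sort column."""
    if isinstance(descending, bool):
        # Assumption: a single flag applies to every sort column.
        return [descending] * len(by)
    flags = list(descending)
    if len(flags) != len(by):
        # Mirrors the behaviour described in the entry: a sequence whose
        # length differs from the number of sort columns is an error.
        raise ValueError(
            f"got {len(flags)} `descending` flags for {len(by)} sort columns"
        )
    return flags


if __name__ == "__main__":
    print(normalize_descending(["a", "b"], True))
    print(normalize_descending(["a", "b"], [True, False]))
```

Passing `[True]` together with two sort columns raises `ValueError` in this sketch.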
🚀 Performance improvements
- elide function calls in AnyValue::eq (#8725)
- add fused multiply add optimization for expressions (#8690)
- use expression for dot product (#8686)
- improve nested grouptuples related code (#8618)
- buffer spill partitions in ooc sort. `~10/20%` (#8616)
- improve OOC sort performance during partition phase (#8590)
- remove some unnecessary calls and matches (#8490)
- less naive count (#8473)
- parallelize almost all flattens (#8468)
- optimize horizontal min/max (#8463)
- reinstate old behavior in numeric group-tuples (#8445)
- remove false sharing in perfect hash table `>2x` (#8432)
- further optimised conversions to python date/datetime (#8417)
- optimize join inner materialization of single keys (#8405)
- parallelize sorted group tuple materialization (#8387)
- improve materialization of huge cardinality group tuples (#8382)
- improve group_tuples materialization (#8375)
- use online variance kernel for aggregation (#8306)
- add specialized boolean aggregation for min/max (#8294)
- fail fast on non-inferable strings in strptime if no `fmt` is provided (#8111)
- make chunks search more resilient (#8229)
- SIMD accelerated `arg_min`/`arg_max` (via `argminmax`) (#8074)
- speed up csv parsing for slower datetimes formats (#8213)
- `arr.eval` run on groupby expression engine when possible (#8199)
- `FromParallelIterator<Option<str>> for Utf8Chunked` `~1.9x` (#8058)
- speed up from_par_iter Option<bool> `~2.5x` (#8057)
- parallelize numeric ChunkedArray materialization `~2x`. (#8053)
- parallelize `into_groups` materialization `~-25%` (#8036)
- use a trusted anyvalue builder (#8001)
- numeric grouptuples with nulls hash in single pass `~25%` (#7980)
- use perfect hash table for categoricals (#7951)
- improve group_tuples of high cardinality data `~10%` (#7938)
- use streaming instead of partitioned groupby (#7907)
- don't auto-stream groupby (#7906)
- rechunk before aggs (#7903)
- don't re-allocate groups in sorted to_dummies (#7897)
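
As background for the "online variance kernel" entry (#8306), here is a minimal sketch of one well-known single-pass approach, Welford's algorithm. Whether Polars uses exactly this recurrence is an assumption. The function name, the `ddof` parameter, and the `None` result for too few values are choices made for this sketch only.

```python
from typing import Iterable, Optional


def online_variance(values: Iterable[float], ddof: int = 1) -> Optional[float]:
    """Single-pass variance using Welford's recurrence (illustrative only)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        # Not enough observations for the requested degrees of freedom;
        # returning None is a choice of this sketch.
        return None
    return m2 / (count - ddof)


if __name__ == "__main__":
    print(online_variance([1.0, 2.0, 3.0, 4.0]))
```

The appeal of an online kernel is that it needs only one pass over the data and avoids the cancellation problems of the naive "sum of squares minus square of sums" formula.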
✨ Enhancements
- add support for `DISTINCT` keyword in SQL select clauses (#8740)
- support any day of the week in 'start_by' in groupby_dynamic (#8720)
- add support for `USING` clause in SQL join operations (#8731)
- add support for `HAVING` clause to SQL `GROUP BY` operations (#8704)
- streaming unions (#8676)
- expression cache (#8674)
- rolling covariance and correlation (#8671)
- Add `dt.to_string` alias for `dt.strftime` (#8290)
- use temp dir for ooc spills (#8614)
- make ooc-sort resilient against chunk_size (#8588)
- Set `strptime` default `strict/exact=true` (#8587)
- Out-of-core unique (#8573)
- Add `to_date`, `to_datetime`, `to_time` to String namespace (#8579)
- more detailed error message on failure to cast `List` dtype (#8583)
- don't trigger unreachable code if no dtype is set (#8532)
- accept expressions in `groupby_dynamic/rolling` (#8528)
- expose quantile/mean for duration (#8491)
- require explicitly sorted flag for upsample (#8488)
- allow for _saturating suffix in duration strings (#8479)
- let duration string accept "1mo_saturating" (#8469)
- add dt.month_start and dt.month_end (#8435)
- add SQL support for cumulative functions (#8457)
- add `str_slice` method to `StringNameSpace` (#8427)
- allow negative 'arange' expression (#8413)
- warn if argument is not explicitly sorted (#8409)
- Schema improvements (#8286)
- add support for SQL "IN" expr (#8396)
- cli output mode & sql read_json (#8336)
- rename 'csv-file' to 'csv' (#8101)
- preserve time zone in combine (#8263)
- add `use_earliest` argument to `replace_time_zone` for dealing with ambiguous datetimes (#8087)
- SQL CTE's (#8208)
- add duration cumsum and remainder (#8219)
- better algorithm for streaming unique (#8003)
- Add approx distinct count via `approx_unique()` (#7937)
- adopt `FunctionExpr` for `cat` namespace (#8173)
- `DatetimeArgs` ergonomics (#8133)
- Remove Seek constraint from IpcStreamReader and SerReader (#8166)
- implement `FunctionExpr` for bound and round methods (#8172)
- display skipped row if same number of rows (#8170)
- move all boolean expressions into `BooleanFunction` enum (#8132)
- rewrite log expressions to make them serializable (#8126)
- make unique expr serde and cmp (#8153)
- adopt `FunctionExpr` for `abs` to allow for serialization (#8129)
- adopt `FunctionExpr` for `cum*` functions (#8130)
- support negative index in `pct_change` (#8137)
- add `log1p` to list of mathematical functions (#8102)
- expand list of tz-aware formats which can be auto-inferred (#8085)
- clearer error message if strptime without a fmt specified fails (#8086)
- infer tz-aware formats with try_parse_dates in read_csv (#8084)
- feat(python, rust)! make 'mo' interval raise if the target date does not exist (#8078)
- auto-infer fmt for tz-aware date strings (#7405)
- multiple sql contexts & optional sql highlighting in cli (#8072)
- implement arg_sort for struct dtype (#8051)
- support struct in df.unique (#7976)
- change top_k(descending) -> bottom_k (#7969)
- optimize away nested unions in lp (#7861)
- Add seed argument to rank for random (#7913)
- auto-infer detecting time-zone-awareness of fmt argument in strptime; deprecate tz_aware argument (#7886)
- deal with null values in cut/qcut (#7878)
- support datetime/date subclasses (e.g. FreezeGun) (#7819)
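
The month-offset entries above (#8078, #8469, #8479) can be pictured with a small standard-library sketch. It assumes that a plain month offset raises when the target day does not exist, and that the `_saturating` suffix clamps to the last valid day of the target month. `add_months` and its `saturating` flag are hypothetical names, not the Polars API.

```python
import calendar
from datetime import date


def add_months(d: date, months: int, saturating: bool = False) -> date:
    """Hypothetical helper: shift `d` by a number of calendar months."""
    # Work with a zero-based month index so negative offsets wrap correctly.
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    if d.day > last_day:
        if not saturating:
            # Assumed analogue of a plain month offset: the target date
            # (e.g. 31 February) does not exist, so raise.
            raise ValueError(f"{year}-{month:02d}-{d.day:02d} does not exist")
        # Assumed analogue of the saturating variant: clamp to month end.
        return date(year, month, last_day)
    return date(year, month, d.day)


if __name__ == "__main__":
    print(add_months(date(2023, 1, 31), 1, saturating=True))
```

With `saturating=False`, the same call raises `ValueError`, because 31 February 2023 does not exist.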
🐞 Bug fixes
- groupby_dynamic was unnecessarily failing on ambiguous local datetime (#8737)
- ensure count aggregation has proper length when spilling (#8735)
- fix return value of std for single-element sequence with ddof=1 (#8730)
- don't take logical plan during streaming fmt (#8711)
- Don't upcast in round() for f32 when decimal is 0 (#8706)
- block predicate containing shifts and windows after sort (#8670)
- ensure perfect hash table processes the nulls (#8668)
- Reading more tiny CSVs than workers in parallel will deadlock (#8441)
- respect maintain_order in partitioned groupby (#8653)
- fix explode null series (#8654)
- fix categorical agg type (#8645)
- allow list<null> -> list<cat> (#8636)
- maintain sorted info on top-k and empty sort (#8615)
- maintain sortedness in date -> datetime cast (#8606)
- fix determining of supertype for tz-aware and tz-naive datetimes (#8585)
- fix csv reader with new line in header (#8580)
- correct for nested offsets in json serialization (#8584)
- fix wrong dtype init in streaming groupby (#8574)
- fix categorical/string_cache fill_null panic (#8562)
- fix window function contention in binary expression (#8544)
- fix StructChunked `not_equal` comparator/operator (#8547)
- fix struct pyarrow ffi (#8543)
- don't trigger unreachable code if no dtype is set (#8532)
- keep sorted info on agg_first and simple singleton… (#8526)
- unset fast_unique coming from arrow (#8521)
- correct sign-reversed scale on DecimalChunked to Python Decimal conversion (fixes #8423) (#8508)
- don't error on cast if column is not projected (#8495)
- ensure window function succeeds on empty frame (#8492)
- don't set verbose on union (#8487)
- check literal/group length before claiming agg sta… (#8486)
- fix error message of offset_by if offsetting by negative number of months (#8464)
- fix sorted warning (#8462)
- fix features serde and dtype-struct not compiling together (#8439)
- respect dtype in anonymous list builder in case of… (#8428)
- infer supertype in json serde (#8411)
- duration on empty df (#8403)
- don't inadvertently set `Series` initialised with nested tuple data as `Object` dtype (#8401)
- use physical in streaming unique global table (#8390)
- recursively bubble up all dtypes in list cast (#8386)
- is_in struct logical types (#8378)
- fix nested null parquet read (#8372)
- fix logical type in ListChunked::new_from_index (#8367)
- bubble up logical type in recursive list cast (#8356)
- implement clone_inner for all series (#8357)
- fix fill_null for categorical (#8353)
- time.cast(str) as strftime (#8351)
- fix logical dtypes in parallel list collection (#8349)
- improve logical types of explode operation (#8348)
- logical type in anonymous list builders (#8346)
- escape csv header names if they contain special chars (#8331)
- nested struct/list/categorical logical/physical (#8334)
- fix deserialize empty list (#8326)
- fix coalesce schema (#8324)
- don't do null propagation (#8322)
- ensure invalid list eval raises (#8317)
- pass name to struct construction in aggregation (#8299)
- Use three slashes for doc comments (#8284)
- improve nested list construction (#8278)
- Fix DataFrame.sum returning empty column names (#8283)
- always sort in `top_k` fast path (#8275)
- don't use fast paths for sorted join if there are … (#8272)
- fix boolean par materialization (#8257)
- improve null/empty list construction (#8255)
- fix offsets in parallel utf8 materialization (#8254)
- nested struct logical type consistency (#8249)
- keep literal state if elementwise function is applied (#8195)
- decimal ensure backed arrow arrays have correct dtype (#8193)
- ensure cached nodes are initialized once (#8103)
- validate `map` lengths (#8147)
- fix row-wise init of `UInt64` values that exceed `Int64` upper bound (#8146)
- implement list<null> constructor (#8143)
- add all primitives to av_buffer builder (#8140)
- struct `is_in` (#8139)
- fix wrong display name of binary expressions (#8131)
- lazy: fix boolean sum schema (#8108)
- don't exponentially grow error messages (partial fix). (#8081)
- check element count in multi-column explode (#8050)
- set lower limit for chunk_size (#8048)
- impl to_static for struct (#8037)
- all/any empty sets (#8012)
- struct null_count, cast string, transpose and describe (#8009)
- fix pivot and transpose of struct data (#8005)
- don't create duplicate pivot names (#8002)
- fix chunked literals in expression engine (#7973)
- in `sort`, `top_k`, `sort_by`, and `arg_sort_by`, raise if `descending` is a sequence and its length doesn't match the number of columns to sort by (#7957)
- concat object types (#7958)
- fix decimal conversion alignment (#7954)
- Fix lazy encode schema (#7912)
- respect skip_nulls in apply for temporal types (#7908)
- fix lit agg (#7904)
- disable ooc groupby (#7901)
- fix abs logical type (#7895)
- fix boolean min/max output type and null handling (#7894)
- validate groupby_dynamic inputs (#7876)
- correct for chunks in arg_where (#7873)
- fix nested logical/physical list (#7872)
- fix arbitrary nested logical types (#7869)
- don't use fxhash in sink_sorted fast path (#7849)
- parquet stats & all kernel (#7846)
🛠️ Other improvements
- remove unnecessary feature flag requirement for start_by=monday in groupby_dynamic (#8716)
- remove some branches (#8688)
- streaming pipeline creation (#8656)
- simplify replace_time_zone (#8644)
- make slice attribute in UnionOptions consistent with … (#8639)
- document the dispatcher (#8637)
- Rename `concat_lst` to `concat_list` (#8597)
- remove unreachable/duplicated code in get_supertype (#8592)
- change partition strategy (#8561)
- remove some unnecessary calls and matches (#8490)
- improve sorted warning/ fix tests (#8484)
- bubble up time_iter errors (#8467)
- Minor update to `strptime` (#8345)
- use `concat_owned_array_unchecked` when possible (#8274)
- Rename `strptime`/`strftime` args (#8221)
- change sampling ratio for groupby strategy (#8223)
- Rename `Expr.list` to `implode` (#8165)
- introduce `FieldsMapper` utility class for obtaining `FunctionExpr` schema (#8175)
- don't panic on err in offset_by (#8210)
- remove unused list_construction (#8197)
- split dsl paragraph header (#8162)
- feature flag guards (#8117)
- use `map_private` where applicable to reduce code duplication (#8128)
- remove unnecessary to_string (#8083)
- docs(rust) Add note about `-1` to show all rows. (#8080)
- Fixed a bunch of clippy warnings (#7967)
- rename `toggle_string_cache` to `enable_string_cache` (#7970)
- Include license files in polars-error and polars-row crates (#7930)
- quantile typo in qcut (#7936)
- Improve `Duration::parse` docs (#7918)
- improve shift and fill performance in case of periods >= ca.len() (#7843)
Thank you to all our contributors for making this release possible!
@DeflateAwning, @JoonHong-Kim, @LdRoW, @MarcoGorelli, @Newtoniano, @StefanBRas, @alexander-beedie, @alonme, @ankane, @avimallu, @ayemjay, @borchero, @cgevans, @chitralverma, @clickingbuttons, @dependabot, @dependabot[bot], @ghuls, @grantmcdermott, @jonashaag, @josh, @jvdd, @lorentzenchr, @mcrumiller, @mzjp2, @n8henrie, @pgimalac, @rben01, @ritchie46, @stinodego, @uchiiii, @universalmind303, @utkarshgupta137, @zaynetro and @zundertj