🏆 Highlights
- postfix `rolling` expression as a special case of window functions. (#11445)
- support 'hive partitioning' aware readers (#11284)
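
For readers unfamiliar with the term, "hive partitioning" (#11284) refers to the directory convention in which column values are encoded in path segments such as `year=2023/month=10/`. The sketch below is a plain-Python illustration of that convention only. It is not Polars' implementation, and the helper name `parse_hive_partitions` and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a hive-style path."""
    partitions = {}
    # Only directory components can carry partition values; skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical example path, used purely for illustration.
example = "dataset/year=2023/month=10/part-0.parquet"
print(parse_hive_partitions(example))
```

A partition-aware reader can expose such keys as ordinary columns, which allows a filter on a partition column to skip whole directories.
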
💥 Breaking changes
- Rename `.list.lengths` and `.str.lengths` (#11613)
- Rename `write_csv` parameter `quote` to `quote_char` (#11583)
- Add `disable_string_cache` (#11020)
🚀 Performance improvements
- fix regression non-null asof join (#11984)
- drastically improve performance of limit on async parquet datasets (#11965)
- support multiple files in a single scan parquet node. (#11922)
- fix accidental quadratic behavior; cache null_count (#11889)
- fix quadratic behavior in append sorted check (#11893)
- properly push down slice before left/asof join (#11854)
- Improve performance of `cot` (cotangent) (#11717)
- rechunk before grouping on multiple keys (#11711)
- process parquet statistics before downloading row-group (#11709)
- push down predicates that refer to group_by keys (#11687)
- slightly faster float equality (#11652)
- actually use projection information in async parquet reader (#11637)
- improve performance and fix panic in async parquet reader (#11607)
- use try_binary_elementwise over try_binary_elementwise_values (#11596)
- skip empty chunks in concat (#11565)
- improve sparse sample performance (#11544)
- early return in replace_time_zone if target and source time zones match (#11478)
- greatly improve parquet cloud reading (#11479)
- ensure we download row-groups concurrently. (#11464)
- don't load N metadata files when globbing N files (#11422)
- remove double memcopy (#11365)
- address perf regression (#11354)
- improve dynamic_groupby_iter (#11341)
- improve and fix rolling windows by linear scanning (#11326)
- improve outer join materialization (#11241)
- use ryu and itoa for primitive serialization (#11193)
- use try-binary-elementwise instead of try-binary-elementwise-values in dt_truncate (#11189)
- Using cache for str.contains regex compilation (#11183)
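
One way to read "linear scanning" in the rolling-window item above (#11326): when window bounds only move forward over sorted data, both edges can be advanced in a single pass instead of being searched for on every row. The following is a minimal plain-Python sketch of that general two-pointer technique, assuming sorted integer timestamps, a positive `period`, and a window of `(t - period, t]`. The function name and the window convention are illustrative assumptions, not taken from the Polars source.

```python
def rolling_sum_by(times, values, period):
    """Sum values whose time lies in (t - period, t] for each t in times.

    Assumes `times` is sorted ascending. The left edge `start` only ever
    moves forward, so the whole pass is O(n).
    """
    out = []
    start = 0
    running = 0
    for end, t in enumerate(times):
        running += values[end]
        # Drop rows that fell out of the window on the left.
        while start <= end and times[start] <= t - period:
            running -= values[start]
            start += 1
        out.append(running)
    return out


print(rolling_sum_by([1, 2, 3, 7, 8], [10, 20, 30, 40, 50], period=3))
```
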
✨ Enhancements
- optimize asof_join and allow null/string keys (#11712)
- limit concurrent downloads in async parquet (#11971)
- sample fraction can take an expr (#11943)
- Add `infer_schema_length` to `pl.read_json` (#11724)
- improve error handling in scan_parquet and deal with file limits (#11938)
- support multiple files in a single scan parquet node. (#11922)
- error instead of panic in unsupported sinks (#11915)
- Introduce list.sample (#11845)
- don't require empty config for cloud scan_parquet (#11819)
- Expressify pct_change and move to ops (#11786)
- add `DATE` function for SQL (#11541)
- right-align numeric columns (#7475)
- Add config setting to control how many List items are printed (#11409)
- allow specifying schema in `pl.scan_ndjson` (#10963)
- easier arrow2/arrow-rs conversion (#11666)
- support multiple sources in scan_file (#11661)
- allow coalesce in streaming (#11633)
- Implement `schema`, `schema_override` for `pl.read_json` with array-like input (#11492)
- add SQL support for `UNION [ALL] BY NAME`, add "diagonal_relaxed" strategy for `pl.concat` (#11597)
- improve performance and fix panic in async parquet reader (#11607)
- add time_unit argument to duration, default to "us" (#11586)
- elide overflow checks on i64 (#11563)
- add `INITCAP` string function for SQL (#9884)
- Use IPC for (un)pickling dataframes/series (#11507)
- support left and right anti/semi joins from the SQL interface (#11501)
- expressify peak_min/peak_max (#11482)
- `IN(subquery)` and SQL Subquery Infrastructure (#11218)
- Format null arrays in Series (#11289)
- postfix `rolling` expression as a special case of window functions. (#11445)
- allow for "by" column to be of dtype Date in rolling_* functions (#11004)
- support 'abfss' for azure (#11413)
- multi-threaded async runtime (#11411)
- async parquet. (#11403)
- fail fast when invalid cloud settings; introduce retries arg (#11380)
- modernize CPU features (#11351)
- introduce 'label' instead of 'truncate' in group_by_dynamic, which can take `label='right'` (#11337)
- Expressify list.shift (#11320)
- add gather_skip_nulls implementation (#11329)
- top_k and bottom_k support passing an expr (#11344)
- support 'hive partitioning' aware readers (#11284)
- str.strip_chars supports taking an expr argument (#11313)
- sample n can take an expr (#11257)
- Add `disable_string_cache` (#11020)
- clip supports expr arguments and physical numeric dtype (#11288)
- Introduce list.drop_nulls (#11272)
- str.splitn and split_exact can take an expr for the by argument (#11275)
- introduce ambiguous option for dt.round (#11269)
- improve binary helper so we don't need to rechunk. (#11247)
- Adds `NULLIF` and `COALESCE` SQL functions (#11124)
- better `tree-formatting` representation (#11176)
- Support `duration + date` (#11190)
- binary search and rechunk in chunked gather (#11199)
- Expressify str.strip_prefix & suffix (#11197)
- sql udfs (#10957)
- run cloud parquet reader in default engine (#11196)
- list.join's separator can be expression (#11167)
- argument every of datetime.truncate can be expression (#11155)
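
The `UNION [ALL] BY NAME` and "diagonal" concatenation items above (#11597) both combine tables by matching column names rather than column positions. Below is a minimal plain-Python sketch of that idea over lists of row dictionaries, assuming missing columns are filled with `None`. The function `union_all_by_name` is hypothetical; it does not show how Polars implements the feature, nor how the "relaxed" variant reconciles differing dtypes.

```python
def union_all_by_name(*tables):
    """Stack tables (lists of dict rows) by column name, keeping all rows."""
    # Column order: first appearance across the inputs.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-emit every row with the full column set; absent values become None.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


left = [{"a": 1, "b": "x"}]
right = [{"b": "y", "c": 2.5}]
for row in union_all_by_name(left, right):
    print(row)
```
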
🐞 Bug fixes
- fix streaming multi-column/multi-dtype sort (#11981)
- ensure streaming parquet datasets deal with limits (#11977)
- implement proper hash for identifier in cse (#11960)
- fix take return dtype in group context. (#11949)
- sql In should work without specific ops (#11947)
- construct list series from any values subject to dtype (#11944)
- avoid integer overflow in offsets_to_groups when bigidx is enabled (#11901)
- `read_csv` for empty lines (#11924)
- predicate push-down remove predicate refers to alias for more branch (#11887)
- use physical append (#11894)
- recursively apply `cast_unchecked` in lists (#11884)
- recursively check allowed streaming dtypes (#11879)
- fix project pushdown for double projection contains count (#11843)
- series.to_numpy fails with dtype=Null (#11858)
- panic on hive scan from cloud (#11847)
- Propagate validity when cast primitive to list (#11846)
- Edge cases for list count formatting (#11780)
- remove flag inconsistency 'map_many' (#11817)
- ensure projections containing only hive columns are projected (#11803)
- patch broken aHash AES intrinsics on ARM (#11801)
- fix key in object-store cache (#11790)
- handle logical types in plugins (#11788)
- make `PyLazyGroupby` reusable (#11769)
- only exclude final output names of group_by key expressions (#11768)
- fix ambiguity wrt list aggregation states (#11758)
- Correctly process subseconds in `pl.duration` (#11748)
- LazyFrame.drop_columns overflow issue when columns.len()>schema.len() (#11716)
- index_to_chunked_index's fast path is not correct (#11710)
- use actual number of read rows for hive materialization (#11690)
- return float dtype in interpolate (for method="linear") for numeric dtypes (#11624)
- fix seg fault in concat_str of empty series (#11704)
- Fix match on last item for `join_asof` with `strategy="nearest"` (#11673)
- fix display str for peak_max and top_k (#11657)
- Fix input replacement logic for slice (#11631)
- slice expr can be taken in cse (#11628)
- ensure nested logical types are converted to physical (#11621)
- correctly convert nullability of nested parquet fields to arrow (#11619)
- improve performance and fix panic in async parquet reader (#11607)
- expand all literals before group_by (#11590)
- mark take_group_last function as unsafe (#11587)
- handle unary operators applied to numbers used in SQL `IN` clauses (#11574)
- Align new_columns argument for `scan_csv` and `read_csv` (#11575)
- don't conflate supported UNION ops in the SQL parser with (currently) unsupported UNION "BY NAME" variations (#11576)
- incomplete reading of list types from parquet (#11578)
- respect identity in horizontal sum (#11559)
- bug in BitMask::get_u32 (#11560)
- take slice into account in parallel unions (#11558)
- correct schema empty df in hive partitioning read (#11557)
- ensure ListChunked::full_null uses physical types (#11554)
- respect 'hive_partitioning' argument in parquet (#11551)
- fix parquet deserialization Overflow error by using i64 offset types when promoting Arrow Lists to LargeLists (#11549)
- streamline `is_in` handling of mismatched dtypes and fix a minor regression (#11533)
- catch use of non equi-joins in SQL interface and raise appropriate error (#11526)
- rework SQL join constraint processing to properly account for all `USING` columns (#11518)
- literal hash (#11508)
- Fix lazy schema for `cut`/`qcut` when `allow_breaks=True` (#11287)
- correct output schema of hive partition and projection at scan (#11499)
- correct projection pushdown in hive partitioned read (#11486)
- fix for `write_csv` when using non-default "quote" char (#11474)
- fix deserialization of parquets with large string list columns causing stack overflow (#11471)
- Fix SQL `ANY` and `ALL` behaviour (#10879)
- address multiple issues caused by implicit casting of `is_in` values to the column dtype being searched (#11427)
- raise on invalid sort_by group lengths (#11423)
- fix outer join on bools (#11417)
- fix categorical collect (#11414)
- Free bitmap when slicing into a non-null array (#11405)
- async parquet. (#11403)
- Fix edge-case where the Array dtype could (internally) be considered numeric (#11398)
- Fix empty check when building a list (#11378)
- more cloud urls (#11361)
- ensure cloud globbing can deal with spaces (#11360)
- recognize more cloud urls (#11357)
- Fix `Series.__contains__` for None values and implement `is_in` for null Series (#11345)
- don't panic on multi-nodes in streaming conversion (#11343)
- ensure trailing quote is written for temporal data when CSV `quote_style` is non-numeric (#11328)
- fix empty Series construction edge-case with Struct dtype (#11301)
- add missing feature flags on tests (#11305)
- set partitions independent of thread pool (#11304)
- parse sign for decimal properly (#11302)
- consume duplicates in rolling_by window (#11261)
- handle url encoded paths in objectpath creation (#11240)
- use POOL when writing csv (#11222)
- is_in for bool evaluate has_false incorrectly (#11217)
- fix nullable filter mask in group_by (#11207)
- replace n-th in filter (#11206)
- fix translation of Series-nested datetime/date values for `scan_pyarrow` predicates (#11195)
- impl hash for more function expr (#11182)
- list.join's separator can be expression (#11167)
- Add some missing expr type hint for series (#11171)
- Make pl.struct serializable (#11169)
- Fix rust test for logical plan optimizer for categoricals (#11135)
- propagate null value for str/binary starts/ends_with and contains (#11141)
🛠️ Other improvements
- optimize asof_join and allow null/string keys (#11712)
- Add `Development` and `Releases` sections to the documentation (#11932)
- use ahash from crates.io release (#11964)
- move unique_counts to ops (#11963)
- fix take return dtype in group context. (#11949)
- move moment to ops (#11941)
- fix some typos and add polars-business to curated plugin list (#11916)
- prepare for multiple files in a node (#11918)
- load 40x40 avatar from github and add loading=lazy attribute. (#11886)
- Fix Cargo warning for parquet2 dependency (#11882)
- Allow manual trigger for docs deployment (#11881)
- rename new_from_owned_with_null_bitmap (#11828)
- add section about plugins (#11855)
- fix incorrect example of valid time zones (#11873)
- Bump docs dependencies (#11852)
- add missing polars-ops tests to CI (#11859)
- Update doc comments for with_column to reflect that columns can be updated (#11840)
- Move round to ops (#11838)
- arrow: remove unused arithmetic code and remove doctests (#11820)
- Move diff to polars-ops (#11818)
- remove redundant if branch in nested parquet (#11814)
- Move ewma to polars-ops (#11794)
- Make some functions in dsl::mod non-anonymous (#11799)
- Move cum_agg to polars-ops (#11770)
- more granular polars-ops imports (#11760)
- Make all emw function expr non-anonymous (#11638)
- clarify polars-arrow <=> arrow2 license (#11755)
- Version `polars-arrow` with the other crates (#11738)
- fill missing fill_null strategies (#11751)
- Minor fix in code example in section Coming from Pandas (#11745)
- Update group_by_dynamic example (#11737)
- merge nano-arrow/polars-arrow (#11719)
- Improving the documentation of the SQL expressions (#11708)
- *_horizontal dependent on reduce_expr to expression architecture (#11685)
- update document of folds (#11705)
- update rustc and fix future (#11696)
- better align `help` command output following addition of some longer options (#11681)
- sum_horizontal to expression architecture (#11659)
- Cleanup the match block for date inference (#11677)
- Adding feature annotation (#11671)
- add note about use of `polars-lts-cpu` for macOS x86-64/rosetta (#11660)
- improve rank implementation, especially around nulls (#11651)
- Bring cloud monikers in line with the ones in `is_cloud_url` (#11629)
- Rename `.list.lengths` and `.str.lengths` (#11613)
- Make backwardfill and forwardfill function expr non-anonymous (#11630)
- Make all expr in dt namespace non-anonymous (#11627)
- Fix changelog for language-specific breaking changes (#11617)
- avoid nightly rust for case conversion (#11610)
- Make value_counts and unique_counts function expr non-anonymous (#11601)
- Make arg_min(max), diff in list namespace non-anonymous (#11602)
- Rename `write_csv` parameter `quote` to `quote_char` (#11583)
- use a generic consistent total ordering, also for floats (#11468)
- Move mode operation from core to ops crate (#11543)
- fix lints (#11555)
- use single threaded take under certain values size (#11539)
- fix some features (#11529)
- move (hor_)str_concat to polars-ops (#11488)
- minor changes in peak-min/max (#11491)
- align cloud url regex in rust and python (#11481)
- move AnonymousScan into Scan node (#11502)
- move `repeat_by` to polars-ops (#11461)
- upgrade to nightly-10-02 (#11460)
- Update contributing guide to include memory requirement (#11458)
- remove unused order_by attribute (#11434)
- cleanup sort_by expression impl (#11431)
- large windows runner for release (#11370)
- Fix error message reference to `infer_schema_length` (#11358)
- move rank to polars-ops (#11349)
- unify display for namespaced function expr (#11342)
- Fix some cargo manifest warnings (#11327)
- Use `GITHUB_TOKEN` to get contributor information for docs (#11321)
- Add `disable_string_cache` (#11020)
- remove default auto-explode for map_many_private (#11270)
- Add API links for Rust user guide examples (#11294)
- update a few dependencies (#11283)
- move scan helpers to separate module (#11279)
- update sponsors (#11271)
- bump chrono to 0.4.31 (#11258)
- bind all remaining method in StringNameSpace to function expr (#11229)
- Make some list function expr non-anonymous (#11230)
- remove lz4_flex feature (#11253)
- remove unnecessary transmute (#11250)
- move (almost) all join related code from polars-core to polars-ops. (#11228)
- Mention the `performant` feature only once (#11223)
- remove unneeded indirection (#11233)
- remove unneeded mutex around object-store (#11224)
- bind struct.rename_fields to function expr (#11215)
- fix un-compilable rust example in user guide. (#11214)
- add various missing expression doc-comments (#11213)
- Fix user_guide of str.split (#11185)
- New take implementation (#11138)
- Fix rust test for logical plan optimizer for categoricals (#11135)
Thank you to all our contributors for making this release possible!
@ByteNybbler, @Cheukting, @Fokko, @Hofer-Julian, @JulianCologne, @LaurynasMiksys, @MarcoGorelli, @Rohxn16, @SeanTroyUWO, @TheDataScientistNL, @Walnut356, @aberres, @alexander-beedie, @alicja-januszkiewicz, @andysham, @billylanchantin, @bowlofeggs, @c-peters, @cmdlineluser, @dannyvankooten, @dependabot, @dependabot[bot], @ewoolsey, @jhorstmann, @jonashaag, @jrycw, @mcrumiller, @messense, @nameexhaustion, @orlp, @petrosbar, @ptiza, @rancomp, @reswqa, @ritchie46, @rjthoen, @romanovacca, @sd2k, @shenker, @squnit, @stinodego, @svaningelgem, @thomasjpfan, @uchiiii, @universalmind303 and Romano Vacca