Polars: rs-0.34.0 Release

Release date: October 24, 2023
Previous version: rs-0.33.0 (released September 17, 2023)
Magnitude: 42,629 Diff Delta
Contributors: 45 total committers

399 Commits in this Release

Top Contributors in rs-0.34.0

ritchie46
stinodego
alexander-beedie
reswqa
orlp
MarcoGorelli
c-peters
svaningelgem
SeanTroyUWO
Fokko

Release Notes Published

🏆 Highlights

  • postfix rolling expression as a special case of window functions. (#11445)
  • support 'hive partitioning' aware readers (#11284)
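
To illustrate what "hive partitioning" refers to in #11284: in a hive-style layout, column values are encoded in directory names as key=value segments, and a partition-aware reader can recover them from the path alone. The following is a minimal conceptual sketch in plain Python, not the Polars implementation; the function name parse_hive_partitions and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value directory segments from a hive-style path.

    Hypothetical helper for illustration only; values are kept as strings.
    """
    partitions = {}
    # Only directory components are inspected; the file name itself is skipped.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Assumed example layout: dataset/<key>=<value>/.../<file>.parquet
example = "dataset/year=2023/month=10/part-0.parquet"
print(parse_hive_partitions(example))
```

A reader that knows these values can attach them as columns and skip whole directories when a filter excludes them, without opening the files.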

💥 Breaking changes

  • Rename .list.lengths and .str.lengths (#11613)
  • Rename write_csv parameter quote to quote_char (#11583)
  • Add disable_string_cache (#11020)

🚀 Performance improvements

  • fix regression non-null asof join (#11984)
  • drastically improve performance of limit on async parquet datasets (#11965)
  • support multiple files in a single scan parquet node. (#11922)
  • fix accidental quadratic behavior; cache null_count (#11889)
  • fix quadratic behavior in append sorted check (#11893)
  • properly push down slice before left/asof join (#11854)
  • Improve performance of cot (cotangent) (#11717)
  • rechunk before grouping on multiple keys (#11711)
  • process parquet statistics before downloading row-group (#11709)
  • push down predicates that refer to group_by keys (#11687)
  • slightly faster float equality (#11652)
  • actually use projection information in async parquet reader (#11637)
  • improve performance and fix panic in async parquet reader (#11607)
  • use try_binary_elementwise over try_binary_elementwise_values (#11596)
  • skip empty chunks in concat (#11565)
  • improve sparse sample performance (#11544)
  • early return in replace_time_zone if target and source time zones match (#11478)
  • greatly improve parquet cloud reading (#11479)
  • ensure we download row-groups concurrently. (#11464)
  • don't load N metadata files when globbing N files (#11422)
  • remove double memcopy (#11365)
  • address perf regression (#11354)
  • improve dynamic_groupby_iter (#11341)
  • improve and fix rolling windows by linear scanning (#11326)
  • improve outer join materialization (#11241)
  • use ryu and itoa for primitive serialization (#11193)
  • use try-binary-elementwise instead of try-binary-elementwise-values in dt_truncate (#11189)
  • Using cache for str.contains regex compilation (#11183)
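
As a rough illustration of the idea behind the last entry (#11183), caching compiled regular expressions avoids recompiling the same pattern on every call. The sketch below uses Python's standard library and is only an analogy under stated assumptions: the helper names compiled and contains are hypothetical, and the actual Polars change lives in its Rust code.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=128)
def compiled(pattern: str) -> re.Pattern:
    """Compile a pattern once; repeated calls with the same pattern hit the cache."""
    return re.compile(pattern)


def contains(values, pattern):
    """Return True/False per value, or None where the input value is None."""
    regex = compiled(pattern)
    return [None if v is None else regex.search(v) is not None for v in values]


print(contains(["apple", None, "kiwi"], "pp"))
print(compiled.cache_info())
```

The cache is keyed on the pattern string, so the benefit shows up when the same pattern is applied many times, for example once per chunk or per group.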

✨ Enhancements

  • optimize asof_join and allow null/string keys (#11712)
  • limit concurrent downloads in async parquet (#11971)
  • sample fraction can take an expr (#11943)
  • Add infer_schema_length to pl.read_json (#11724)
  • improve error handling in scan_parquet and deal with file limits (#11938)
  • support multiple files in a single scan parquet node. (#11922)
  • error instead of panic in unsupported sinks (#11915)
  • Introduce list.sample (#11845)
  • don't require empty config for cloud scan_parquet (#11819)
  • Expressify pct_change and move to ops (#11786)
  • add DATE function for SQL (#11541)
  • right-align numeric columns (#7475)
  • Add config setting to control how many List items are printed (#11409)
  • allow specifying schema in pl.scan_ndjson (#10963)
  • easier arrow2/arrow-rs conversion (#11666)
  • support multiple sources in scan_file (#11661)
  • allow coalesce in streaming (#11633)
  • Implement schema, schema_override for pl.read_json with array-like input (#11492)
  • add SQL support for UNION [ALL] BY NAME, add "diagonal_relaxed" strategy for pl.concat (#11597)
  • improve performance and fix panic in async parquet reader (#11607)
  • add time_unit argument to duration, default to "us" (#11586)
  • elide overflow checks on i64 (#11563)
  • add INITCAP string function for SQL (#9884)
  • Use IPC for (un)pickling dataframes/series (#11507)
  • support left and right anti/semi joins from the SQL interface (#11501)
  • expressify peak_min/peak_max (#11482)
  • IN(subquery) and SQL Subquery Infrastructure (#11218)
  • Format null arrays in Series (#11289)
  • postfix rolling expression as a special case of window functions. (#11445)
  • allow for "by" column to be of dtype Date in rolling_* functions (#11004)
  • support 'abfss' for azure (#11413)
  • multi-threaded async runtime (#11411)
  • async parquet. (#11403)
  • fail fast when invalid cloud settings; introduce retries arg (#11380)
  • modernize CPU features (#11351)
  • introduce 'label' instead of 'truncate' in group_by_dynamic, which can take label='right' (#11337)
  • Expressify list.shift (#11320)
  • add gather_skip_nulls implementation (#11329)
  • top_k and bottom_k support passing an expr (#11344)
  • support 'hive partitioning' aware readers (#11284)
  • str.strip_chars supports taking an expr argument (#11313)
  • sample n can take an expr (#11257)
  • Add disable_string_cache (#11020)
  • clip supports expr arguments and physical numeric dtype (#11288)
  • Introduce list.drop_nulls (#11272)
  • str.splitn and split_exact can take an expr for the by argument (#11275)
  • introduce ambiguous option for dt.round (#11269)
  • improve binary helper so we don't need to rechunk. (#11247)
  • Adds NULLIF and COALESCE SQL functions (#11124)
  • better tree-formatting representation (#11176)
  • Support duration + date (#11190)
  • binary search and rechunk in chunked gather (#11199)
  • Expressify str.strip_prefix & suffix (#11197)
  • sql udfs (#10957)
  • run cloud parquet reader in default engine (#11196)
  • list.join's separator can be expression (#11167)
  • argument every of datetime.truncate can be expression (#11155)
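
The "binary search ... in chunked gather" entry (#11199) refers to a general technique: when data is stored as several chunks, a global row index can be mapped to a (chunk, local offset) pair by binary-searching the chunks' starting offsets. The sketch below shows only that lookup in plain Python, as an assumption-laden illustration rather than the Polars code; the function name gather and the list-of-lists representation are hypothetical, and the rechunking part of the change is not modeled.

```python
from bisect import bisect_right
from itertools import accumulate


def gather(chunks, indices):
    """Pick elements by global index from a list of chunks (lists)."""
    lengths = [len(c) for c in chunks]
    total = sum(lengths)
    # offsets[i] is the global index of the first element of chunk i.
    offsets = [0] + list(accumulate(lengths))[:-1]
    out = []
    for idx in indices:
        if not 0 <= idx < total:
            raise IndexError(f"index {idx} out of bounds for length {total}")
        # Last chunk whose starting offset is <= idx; empty chunks are skipped
        # naturally because they share an offset with the chunk that follows.
        chunk_id = bisect_right(offsets, idx) - 1
        out.append(chunks[chunk_id][idx - offsets[chunk_id]])
    return out


print(gather([[10, 20], [], [30, 40, 50]], [4, 0, 2]))
```

Each lookup costs O(log k) for k chunks instead of a linear walk over the chunks, which matters when a column has been built from many small appends.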

🐞 Bug fixes

  • fix streaming multi-column/multi-dtype sort (#11981)
  • ensure streaming parquet datasets deal with limits (#11977)
  • implement proper hash for identifier in cse (#11960)
  • fix take return dtype in group context. (#11949)
  • sql In should work without specific ops (#11947)
  • construct list series from any values subject to dtype (#11944)
  • avoid integer overflow in offsets_to_groups when bigidx is enabled (#11901)
  • read_csv for empty lines (#11924)
  • predicate push-down remove predicate refers to alias for more branch (#11887)
  • use physical append (#11894)
  • recursively apply cast_unchecked in lists (#11884)
  • recursively check allowed streaming dtypes (#11879)
  • fix project pushdown for double projection contains count (#11843)
  • series.to_numpy fails with dtype=Null (#11858)
  • panic on hive scan from cloud (#11847)
  • Propagate validity when cast primitive to list (#11846)
  • Edge cases for list count formatting (#11780)
  • remove flag inconsistency 'map_many' (#11817)
  • ensure projections containing only hive columns are projected (#11803)
  • patch broken aHash AES intrinsics on ARM (#11801)
  • fix key in object-store cache (#11790)
  • handle logical types in plugins (#11788)
  • make PyLazyGroupby reusable (#11769)
  • only exclude final output names of group_by key expressions (#11768)
  • fix ambiguity wrt list aggregation states (#11758)
  • Correctly process subseconds in pl.duration (#11748)
  • LazyFrame.drop_columns overflow issue when columns.len()>schema.len() (#11716)
  • index_to_chunked_index's fast path is not correct (#11710)
  • use actual number of read rows for hive materialization (#11690)
  • return float dtype in interpolate (for method="linear") for numeric dtypes (#11624)
  • fix seg fault in concat_str of empty series (#11704)
  • Fix match on last item for join_asof with strategy="nearest" (#11673)
  • fix display str for peak_max and top_k (#11657)
  • Fix input replacement logic for slice (#11631)
  • slice expr can be taken in cse (#11628)
  • ensure nested logical types are converted to physical (#11621)
  • correctly convert nullability of nested parquet fields to arrow (#11619)
  • improve performance and fix panic in async parquet reader (#11607)
  • expand all literals before group_by (#11590)
  • mark take_group_last function as unsafe (#11587)
  • handle unary operators applied to numbers used in SQL IN clauses (#11574)
  • Align new_columns argument for scan_csv and read_csv (#11575)
  • don't conflate supported UNION ops in the SQL parser with (currently) unsupported UNION "BY NAME" variations (#11576)
  • incomplete reading of list types from parquet (#11578)
  • respect identity in horizontal sum (#11559)
  • bug in BitMask::get_u32 (#11560)
  • take slice into account in parallel unions (#11558)
  • correct schema empty df in hive partitioning read (#11557)
  • ensure ListChunked::full_null uses physical types (#11554)
  • respect 'hive_partitioning' argument in parquet (#11551)
  • fix parquet deserialization Overflow error by using i64 offset types when promoting Arrow Lists to LargeLists (#11549)
  • streamline is_in handling of mismatched dtypes and fix a minor regression (#11533)
  • catch use of non equi-joins in SQL interface and raise appropriate error (#11526)
  • rework SQL join constraint processing to properly account for all USING columns (#11518)
  • literal hash (#11508)
  • Fix lazy schema for cut/qcut when allow_breaks=True (#11287)
  • correct output schema of hive partition and projection at scan (#11499)
  • correct projection pushdown in hive partitioned read (#11486)
  • fix for write_csv when using non-default "quote" char (#11474)
  • fix deserialization of parquets with large string list columns causing stack overflow (#11471)
  • Fix SQL ANY and ALL behaviour (#10879)
  • address multiple issues caused by implicit casting of is_in values to the column dtype being searched (#11427)
  • raise on invalid sort_by group lengths (#11423)
  • fix outer join on bools (#11417)
  • fix categorical collect (#11414)
  • Free bitmap when slicing into a non-null array (#11405)
  • async parquet. (#11403)
  • Fix edge-case where the Array dtype could (internally) be considered numeric (#11398)
  • Fix empty check when building a list (#11378)
  • more cloud urls (#11361)
  • ensure cloud globbing can deal with spaces (#11360)
  • recognize more cloud urls (#11357)
  • Fix Series.__contains__ for None values and implement is_in for null Series (#11345)
  • don't panic on multi-nodes in streaming conversion (#11343)
  • ensure trailing quote is written for temporal data when CSV quote_style is non-numeric (#11328)
  • fix empty Series construction edge-case with Struct dtype (#11301)
  • add missing feature flags on tests (#11305)
  • set partitions independent of thread pool (#11304)
  • parse sign for decimal properly (#11302)
  • consume duplicates in rolling_by window (#11261)
  • handle url encoded paths in objectpath creation (#11240)
  • use POOL when writing csv (#11222)
  • is_in for bool evaluate has_false incorrectly (#11217)
  • fix nullable filter mask in group_by (#11207)
  • replace n-th in filter (#11206)
  • fix translation of Series-nested datetime/date values for scan_pyarrow predicates (#11195)
  • impl hash for more function expr (#11182)
  • list.join's separator can be expression (#11167)
  • Add some missing expr type hint for series (#11171)
  • Make pl.struct serializable (#11169)
  • Fix rust test for logical plan optimizer for categoricals (#11135)
  • propagate null value for str/binary starts/ends_with and contains (#11141)

🛠️ Other improvements

  • optimize asof_join and allow null/string keys (#11712)
  • Add Development and Releases sections to the documentation (#11932)
  • use ahash from crates.io release (#11964)
  • move unique_counts to ops (#11963)
  • fix take return dtype in group context. (#11949)
  • move moment to ops (#11941)
  • fix some typos and add polars-business to curated plugin list (#11916)
  • prepare for multiple files in a node (#11918)
  • load 40x40 avatar from github and add loading=lazy attribute. (#11886)
  • Fix Cargo warning for parquet2 dependency (#11882)
  • Allow manual trigger for docs deployment (#11881)
  • rename new_from_owned_with_null_bitmap (#11828)
  • add section about plugins (#11855)
  • fix incorrect example of valid time zones (#11873)
  • Bump docs dependencies (#11852)
  • add missing polars-ops tests to CI (#11859)
  • Update doc comments for with_column to reflect that columns can be updated (#11840)
  • Move round to ops (#11838)
  • arrow: remove unused arithmetic code and remove doctests (#11820)
  • Move diff to polars-ops (#11818)
  • remove redundant if branch in nested parquet (#11814)
  • Move ewma to polars-ops (#11794)
  • Make some functions in dsl::mod non-anonymous (#11799)
  • Move cum_agg to polars-ops (#11770)
  • more granular polars-ops imports (#11760)
  • Make all emw function expr non-anonymous (#11638)
  • clarify polars-arrow <=> arrow2 license (#11755)
  • Version polars-arrow with the other crates (#11738)
  • fill missing fill_null strategies (#11751)
  • Minor fix in code example in section Coming from Pandas (#11745)
  • Update group_by_dynamic example (#11737)
  • merge nano-arrow/polars-arrow (#11719)
  • Improving the documentation of the SQL expressions (#11708)
  • *_horizontal dependent on reduce_expr to expression architecture (#11685)
  • update document of folds (#11705)
  • update rustc and fix future (#11696)
  • better align help command output following addition of some longer options (#11681)
  • sum_horizontal to expression architecture (#11659)
  • Cleanup the match block for date inference (#11677)
  • Adding feature annotation (#11671)
  • add note about use of polars-lts-cpu for macOS x86-64/rosetta (#11660)
  • improve rank implementation, especially around nulls (#11651)
  • Bring cloud monikers in line with the ones in is_cloud_url (#11629)
  • Rename .list.lengths and .str.lengths (#11613)
  • Make backwardfill and forwardfill function expr non-anonymous (#11630)
  • Make all expr in dt namespace non-anonymous (#11627)
  • Fix changelog for language-specific breaking changes (#11617)
  • avoid nightly rust for case conversion (#11610)
  • Make value_counts and unique_counts function expr non-anonymous (#11601)
  • Make arg_min(max), diff in list namespace non-anonymous (#11602)
  • Rename write_csv parameter quote to quote_char (#11583)
  • use a generic consistent total ordering, also for floats (#11468)
  • Move mode operation from core to ops crate (#11543)
  • fix lints (#11555)
  • use single threaded take under certain values size (#11539)
  • fix some features (#11529)
  • move (hor_)str_concat to polars-ops (#11488)
  • minor changes in peak-min/max (#11491)
  • align cloud url regex in rust and python (#11481)
  • move AnonymousScan into Scan node (#11502)
  • move repeat_by to polars-ops (#11461)
  • upgrade to nightly-10-02 (#11460)
  • Update contributing guide to include memory requirement (#11458)
  • remove unused order_by attribute (#11434)
  • cleanup sort_by expression impl (#11431)
  • large windows runner for release (#11370)
  • Fix error message reference to infer_schema_length (#11358)
  • move rank to polars-ops (#11349)
  • unify display for namespaced function expr (#11342)
  • Fix some cargo manifest warnings (#11327)
  • Use GITHUB_TOKEN to get contributor information for docs (#11321)
  • Add disable_string_cache (#11020)
  • remove default auto-explode for map_many_private (#11270)
  • Add API links for Rust user guide examples (#11294)
  • update a few dependencies (#11283)
  • move scan helpers to separate module (#11279)
  • update sponsors (#11271)
  • bump chrono to 0.4.31 (#11258)
  • bind all remaining method in StringNameSpace to function expr (#11229)
  • Make some list function expr non-anonymous (#11230)
  • remove lz4_flex feature (#11253)
  • remove unnecessary transmute (#11250)
  • move (almost) all join related code from polars-core to polars-ops. (#11228)
  • Mention the performant feature only once (#11223)
  • remove unneeded indirection (#11233)
  • remove unneeded mutex around object-store (#11224)
  • bind struct.rename_fields to function expr (#11215)
  • fix un-compilable rust example in user guide. (#11214)
  • add various missing expression doc-comments (#11213)
  • Fix user_guide of str.split (#11185)
  • New take implementation (#11138)
  • Fix rust test for logical plan optimizer for categoricals (#11135)

Thank you to all our contributors for making this release possible! @ByteNybbler, @Cheukting, @Fokko, @Hofer-Julian, @JulianCologne, @LaurynasMiksys, @MarcoGorelli, @Rohxn16, @SeanTroyUWO, @TheDataScientistNL, @Walnut356, @aberres, @alexander-beedie, @alicja-januszkiewicz, @andysham, @billylanchantin, @bowlofeggs, @c-peters, @cmdlineluser, @dannyvankooten, @dependabot, @dependabot[bot], @ewoolsey, @jhorstmann, @jonashaag, @jrycw, @mcrumiller, @messense, @nameexhaustion, @orlp, @petrosbar, @ptiza, @rancomp, @reswqa, @ritchie46, @rjthoen, @romanovacca, @sd2k, @shenker, @squnit, @stinodego, @svaningelgem, @thomasjpfan, @uchiiii, @universalmind303 and Romano Vacca