Polars: rs-0.29.0 Release

Release date: May 8, 2023
Previous version: rs-0.28.0 (released March 29, 2023)
Magnitude: 51,370 Diff Delta
Contributors: 33 total committers
Commits:

430 Commits in this Release

Top Contributors in rs-0.29.0

ritchie46
stinodego
alexander-beedie
MarcoGorelli
universalmind303
borchero
rben01
ghuls
zundertj
josh

Release Notes Published

🏆 Highlights

  • Out-of-core unique (#8573)
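
The general idea behind out-of-core deduplication can be sketched in a few lines of plain Python. This is not Polars's implementation, and the notes do not describe how #8573 works internally. The function name `spill_unique`, the partition count, and the use of CRC32 as the partitioning hash are all assumptions made for illustration. Values are hashed into partitions that are spilled to a temporary directory, and each partition is then deduplicated on its own, so only one partition needs to fit in memory at a time.

```python
import os
import tempfile
import zlib
from contextlib import ExitStack


def spill_unique(values, n_partitions=4):
    """Deduplicate newline-free strings by spilling hash partitions to disk.

    Hypothetical helper for illustration only; not a Polars API.
    """
    result = []
    with tempfile.TemporaryDirectory() as spill_dir:
        paths = [
            os.path.join(spill_dir, f"part-{i}.txt") for i in range(n_partitions)
        ]
        # Partition phase: equal values always hash to the same file.
        with ExitStack() as stack:
            files = [
                stack.enter_context(open(p, "w", encoding="utf-8")) for p in paths
            ]
            for value in values:
                bucket = zlib.crc32(value.encode("utf-8")) % n_partitions
                files[bucket].write(value + "\n")
        # Dedup phase: load one partition at a time and keep distinct values.
        for path in paths:
            with open(path, encoding="utf-8") as fh:
                seen = {line.rstrip("\n") for line in fh}
            result.extend(seen)
    return result
```

Because duplicates can never land in different partitions, concatenating the per-partition results yields the global set of distinct values. The output order is not meaningful in this sketch.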

⚠️ Breaking changes

  • Rename concat_lst to concat_list (#8597)
  • Schema improvements (#8286)
  • don't create duplicate pivot names (#8002)
  • rename toggle_string_cache to enable_string_cache (#7970)
  • change top_k(descending) -> bottom_k (#7969)
  • in sort, top_k, sort_by, and arg_sort_by, raise if descending is a sequence and its length doesn't match the number of columns to sort by (#7957)
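
The last breaking change above (#7957) amounts to a length check on the `descending` argument. The following is a minimal sketch of that kind of validation in plain Python. The helper name `normalize_descending` and the error message are hypothetical and are not taken from the Polars code base.

```python
def normalize_descending(descending, n_columns):
    """Return one descending flag per sort column.

    Hypothetical helper for illustration only; not a Polars API.
    A single bool is broadcast to every column. A sequence must have
    exactly one entry per column, otherwise an error is raised.
    """
    if isinstance(descending, bool):
        return [descending] * n_columns
    flags = list(descending)
    if len(flags) != n_columns:
        raise ValueError(
            f"descending has {len(flags)} entries but "
            f"{n_columns} columns were given to sort by"
        )
    return flags
```

With this check, `normalize_descending([True], 2)` raises instead of silently guessing a direction for the second column.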

🚀 Performance improvements

  • elide function calls in AnyValue::eq (#8725)
  • add fused multiply add optimization for expressions (#8690)
  • use expression for dot product (#8686)
  • improve nested grouptuples related code (#8618)
  • buffer spill partitions in ooc sort. ~10/20% (#8616)
  • improve OOC sort performance during partition phase (#8590)
  • remove some unnecessary calls and matches (#8490)
  • less naive count (#8473)
  • parallelize almost all flattens (#8468)
  • optimize horizontal min/max (#8463)
  • reinstate old behavior in numeric group-tuples (#8445)
  • remove false sharing in perfect hash table >2x (#8432)
  • further optimised conversions to python date/datetime (#8417)
  • optimize join inner materialization of single keys (#8405)
  • parallelize sorted group tuple materialization (#8387)
  • improve materialization of huge cardinality group tuples (#8382)
  • improve group_tuples materialization (#8375)
  • use online variance kernel for aggregation (#8306)
  • add specialized boolean aggregation for min/max (#8294)
  • fail fast on non-inferable strings in strptime if no fmt is provided (#8111)
  • make chunks search more resilient (#8229)
  • SIMD accelerated arg_min/arg_max (via argminmax) (#8074)
  • speed up csv parsing for slower datetimes formats (#8213)
  • arr.eval run on groupby expression engine when possible (#8199)
  • FromParallelIterator<Option<str>> for Utf8Chunked ~1.9x (#8058)
  • speed up from_par_iter Option<bool> ~2.5x (#8057)
  • parallelize numeric ChunkedArray materialization ~2x. (#8053)
  • parallelize into_groups materialization ~-25% (#8036)
  • use a trusted anyvalue builder (#8001)
  • numeric grouptuples with nulls hash in single pass ~25% (#7980)
  • use perfect hash table for categoricals (#7951)
  • improve group_tuples of high cardinality data ~10% (#7938)
  • use streaming instead of partitioned groupby (#7907)
  • don't auto-stream groupby (#7906)
  • rechunk before aggs (#7903)
  • don't re-allocate groups in sorted to_dummies (#7897)
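
One of the items above mentions an online variance kernel (#8306). The notes do not say which algorithm is used. Welford's method is a well-known online approach and is sketched here as an assumption, in plain Python. The function name `online_variance` and the choice to return `None` when there are too few values are illustrative only.

```python
def online_variance(values, ddof=1):
    """Single-pass variance using Welford's update.

    Illustrative sketch; not the Polars kernel. Returns None when
    there are not enough values for the requested ddof.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        return None
    return m2 / (count - ddof)
```

The update touches each value once and keeps only three running numbers, which is why this style of kernel suits aggregation over large or chunked inputs.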

✨ Enhancements

  • add support for DISTINCT keyword in SQL select clauses (#8740)
  • support any day of the week in 'start_by' in groupby_dynamic (#8720)
  • add support for USING clause in SQL join operations (#8731)
  • add support for HAVING clause to SQL GROUP BY operations (#8704)
  • streaming unions (#8676)
  • expression cache (#8674)
  • rolling covariance and correlation (#8671)
  • Add dt.to_string alias for dt.strftime (#8290)
  • use temp dir for ooc spills (#8614)
  • make ooc-sort resilient against chunk_size (#8588)
  • Set strptime default strict/exact=true (#8587)
  • Out-of-core unique (#8573)
  • Add to_date, to_datetime, to_time to String namespace (#8579)
  • more detailed error message on failure to cast List dtype (#8583)
  • don't trigger unreachable code if no dtype is set (#8532)
  • accept expressions in groupby_dynamic/rolling (#8528)
  • expose quantile/mean for duration (#8491)
  • require explicitly sorted flag for upsample (#8488)
  • allow for _saturating suffix in duration strings (#8479)
  • let duration string accept "1mo_saturating" (#8469)
  • add dt.month_start and dt.month_end (#8435)
  • add SQL support for cumulative functions (#8457)
  • add str_slice method to StringNameSpace (#8427)
  • allow negative 'arange' expression (#8413)
  • warn if argument is not explicitly sorted (#8409)
  • Schema improvements (#8286)
  • add support for SQL "IN" expr (#8396)
  • cli output mode & sql read_json (#8336)
  • rename 'csv-file' to 'csv' (#8101)
  • preserve time zone in combine (#8263)
  • add use_earliest argument to replace_time_zone for dealing with ambiguous datetimes (#8087)
  • SQL CTEs (#8208)
  • add duration cumsum and remainder (#8219)
  • better algorithm for streaming unique (#8003)
  • Add approx distinct count via approx_unique() (#7937)
  • adopt FunctionExpr for cat namespace (#8173)
  • DatetimeArgs ergonomics (#8133)
  • Remove Seek constraint from IpcStreamReader and SerReader (#8166)
  • implement FunctionExpr for bound and round methods (#8172)
  • display skipped row if same number of rows (#8170)
  • move all boolean expressions into BooleanFunction enum (#8132)
  • rewrite log expressions to make them serializable (#8126)
  • make unique expr serde and cmp (#8153)
  • adopt FunctionExpr for abs to allow for serialization (#8129)
  • adopt FunctionExpr for cum* functions (#8130)
  • support negative index in pct_change (#8137)
  • add log1p to list of mathematical functions (#8102)
  • expand list of tz-aware formats which can be auto-inferred (#8085)
  • clearer error message if strptime without a fmt specified fails (#8086)
  • infer tz-aware formats with try_parse_dates in read_csv (#8084)
  • make 'mo' interval raise if the target date does not exist (#8078)
  • auto-infer fmt for tz-aware date strings (#7405)
  • multiple sql contexts & optional sql highlighting in cli (#8072)
  • implement arg_sort for struct dtype (#8051)
  • support struct in df.unique (#7976)
  • change top_k(descending) -> bottom_k (#7969)
  • optimize away nested unions in lp (#7861)
  • Add seed argument to rank for random (#7913)
  • auto-infer detecting time-zone-awareness of fmt argument in strptime; deprecate tz_aware argument (#7886)
  • deal with null values in cut/qcut (#7878)
  • support datetime/date subclasses (e.g. FreezeGun) (#7819)
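
Two of the enhancements above concern month offsets: a plain 'mo' interval raises if the target date does not exist (#8078), while the `_saturating` suffix is accepted in duration strings (#8469, #8479). The sketch below illustrates the difference in plain Python. It assumes that "saturating" means clamping to the last valid day of the target month. The helper `add_months` is hypothetical and is not a Polars function.

```python
import calendar
from datetime import date


def add_months(d, months, saturating=False):
    """Offset a date by whole months.

    Hypothetical helper for illustration only. If the target day does
    not exist (e.g. Feb 31), either raise or, when saturating is True,
    clamp to the last day of the target month.
    """
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    if d.day > last_day:
        if not saturating:
            raise ValueError(
                f"{year:04d}-{month:02d}-{d.day:02d} does not exist"
            )
        return date(year, month, last_day)
    return date(year, month, d.day)
```

For example, adding one month to January 31, 2023 raises in the strict mode and gives February 28, 2023 in the saturating mode.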

🐞 Bug fixes

  • groupby_dynamic was unnecessarily failing on ambiguous local datetime (#8737)
  • ensure count aggregation has proper length when spilling (#8735)
  • fix return value of std for single-element sequence with ddof=1 (#8730)
  • don't take logical plan during streaming fmt (#8711)
  • Don't upcast in round() for f32 when decimal is 0 (#8706)
  • block predicate containing shifts and windows after sort (#8670)
  • ensure perfect hash table processes the nulls (#8668)
  • Reading more tiny CSVs than workers in parallel will deadlock (#8441)
  • respect maintain_order in partitioned groupby (#8653)
  • fix explode null series (#8654)
  • fix categorical agg type (#8645)
  • allow list<null> -> list<cat> (#8636)
  • maintain sorted info on top-k and empty sort (#8615)
  • maintain sortedness in date -> datetime cast (#8606)
  • fix determining of supertype for tz-aware and tz-naive datetimes (#8585)
  • fix csv reader with new line in header (#8580)
  • correct for nested offsets in json serialization (#8584)
  • fix wrong dtype init in streaming groupby (#8574)
  • fix categorical/string_cache fill_null panic (#8562)
  • fix window function contention in binary expression (#8544)
  • fix StructChunked not_equal comparator/operator (#8547)
  • fix struct pyarrow ffi (#8543)
  • don't trigger unreachable code if no dtype is set (#8532)
  • keep sorted info on agg_first and simple singleton… (#8526)
  • unset fast_unique coming from arrow (#8521)
  • correct sign-reversed scale on DecimalChunked to Python Decimal conversion (fixes #8423) (#8508)
  • don't error on cast if column is not projected (#8495)
  • ensure window function succeeds on empty frame (#8492)
  • don't set verbose on union (#8487)
  • check literal/group length before claiming agg sta… (#8486)
  • fix error message of offset_by if offsetting by negative number of months (#8464)
  • fix sorted warning (#8462)
  • fix features serde and dtype-struct not compiling together (#8439)
  • respect dtype in anonymous list builder in case of… (#8428)
  • infer supertype in json serde (#8411)
  • duration on empty df (#8403)
  • don't inadvertently set Series initialised with nested tuple data as Object dtype (#8401)
  • use physical in streaming unique global table (#8390)
  • recursively bubble up all dtypes in list cast (#8386)
  • is_in struct logical types (#8378)
  • fix nested null parquet read (#8372)
  • fix logical type in ListChunked::new_from_index (#8367)
  • bubble up logical type in recursive list cast (#8356)
  • implement clone_inner for all series (#8357)
  • fix fill_null for categorical (#8353)
  • time.cast(str) as strftime (#8351)
  • fix logical dtypes in parallel list collection (#8349)
  • improve logical types of explode operation (#8348)
  • logical type in anonymous list builders (#8346)
  • escape csv header names if they contain special chars (#8331)
  • nested struct/list/categorical logical/physical (#8334)
  • fix deserialize empty list (#8326)
  • fix coalesce schema (#8324)
  • don't do null propagation (#8322)
  • ensure invalid list eval raises (#8317)
  • pass name to struct construction in aggregation (#8299)
  • Use three slashes for doc comments (#8284)
  • improve nested list construction (#8278)
  • Fix DataFrame.sum returning empty column names (#8283)
  • always sort in top_k fast path (#8275)
  • don't use fast paths for sorted join if there are … (#8272)
  • fix boolean par materialization (#8257)
  • improve null/empty list construction (#8255)
  • fix offsets in parallel utf8 materialization (#8254)
  • nested struct logical type consistency (#8249)
  • keep literal state if elementwise function is applied (#8195)
  • decimal ensure backed arrow arrays have correct dtype (#8193)
  • ensure cached nodes are initialized once (#8103)
  • validate map lengths (#8147)
  • fix row-wise init of UInt64 values that exceed Int64 upper bound (#8146)
  • implement list<null> constructor (#8143)
  • add all primitives to av_buffer builder (#8140)
  • struct is_in (#8139)
  • fix wrong display name of binary expressions (#8131)
  • lazy: fix boolean sum schema (#8108)
  • don't exponentially grow error messages (partial fix). (#8081)
  • check element count in multi-column explode (#8050)
  • set lower limit for chunk_size (#8048)
  • impl to_static for struct (#8037)
  • all/any empty sets (#8012)
  • struct null_count, cast string, transpose and describe (#8009)
  • fix pivot and transpose of struct data (#8005)
  • don't create duplicate pivot names (#8002)
  • fix chunked literals in expression engine (#7973)
  • in sort, top_k, sort_by, and arg_sort_by, raise if descending is a sequence and its length doesn't match the number of columns to sort by (#7957)
  • concat object types (#7958)
  • fix decimal conversion alignment (#7954)
  • Fix lazy encode schema (#7912)
  • respect skip_nulls in apply for temporal types (#7908)
  • fix lit agg (#7904)
  • disable ooc groupby (#7901)
  • fix abs logical type (#7895)
  • fix boolean min/max output type and null handling (#7894)
  • validate groupby_dynamic inputs (#7876)
  • correct for chunks in arg_where (#7873)
  • fix nested logical/physical list (#7872)
  • fix arbitrary nested logical types (#7869)
  • don't use fxhash in sink_sorted fast path (#7849)
  • parquet stats & all kernel (#7846)
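
One of the fixes above corrects a sign-reversed scale when converting decimals to Python `Decimal` (#8508). The details of that fix are not given here. The sketch below only shows why the sign of the scale matters, using the standard `decimal` module. The helper `to_python_decimal` and the representation as an unscaled integer plus a scale are assumptions for illustration.

```python
from decimal import Decimal


def to_python_decimal(unscaled, scale):
    """Build a Decimal from an unscaled integer and a decimal scale.

    Illustrative helper; not Polars code. A scale of 2 means two digits
    after the decimal point, so the exponent must be shifted by -scale.
    """
    return Decimal(unscaled).scaleb(-scale)


correct = to_python_decimal(12345, 2)   # 123.45
reversed_sign = Decimal(12345).scaleb(2)  # 1234500, the sign-reversed result
```

Shifting the exponent in the wrong direction changes the value by a factor of 10 to the power of twice the scale, which is the kind of error a sign-reversed scale produces.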

🛠️ Other improvements

  • remove unnecessary feature flag requirement for start_by=monday in groupby_dynamic (#8716)
  • remove some branches (#8688)
  • streaming pipeline creation (#8656)
  • simplify replace_time_zone (#8644)
  • make slice attribute in UnionOptions consistent with … (#8639)
  • document the dispatcher (#8637)
  • Rename concat_lst to concat_list (#8597)
  • remove unreachable/duplicated code in get_supertype (#8592)
  • change partition strategy (#8561)
  • remove some unnecessary calls and matches (#8490)
  • improve sorted warning/ fix tests (#8484)
  • bubble up time_iter errors (#8467)
  • Minor update to strptime (#8345)
  • use concat_owned_array_unchecked when possible (#8274)
  • Rename strptime/strftime args (#8221)
  • change sampling ratio for groupby strategy (#8223)
  • Rename Expr.list to implode (#8165)
  • introduce FieldsMapper utility class for obtaining FunctionExpr schema (#8175)
  • don't panic on err in offset_by (#8210)
  • remove unused list_construction (#8197)
  • split dsl paragraph header (#8162)
  • feature flag guards (#8117)
  • use map_private where applicable to reduce code duplication (#8128)
  • remove unnecessary to_string (#8083)
  • Add note about -1 to show all rows (#8080)
  • Fixed a bunch of clippy warnings (#7967)
  • rename toggle_string_cache to enable_string_cache (#7970)
  • Include license files in polars-error and polars-row crates (#7930)
  • quantile typo in qcut (#7936)
  • Improve Duration::parse docs (#7918)
  • improve shift and fill performance in case of periods >= ca.len() (#7843)

Thank you to all our contributors for making this release possible! @DeflateAwning, @JoonHong-Kim, @LdRoW, @MarcoGorelli, @Newtoniano, @StefanBRas, @alexander-beedie, @alonme, @ankane, @avimallu, @ayemjay, @borchero, @cgevans, @chitralverma, @clickingbuttons, @dependabot, @dependabot[bot], @ghuls, @grantmcdermott, @jonashaag, @josh, @jvdd, @lorentzenchr, @mcrumiller, @mzjp2, @n8henrie, @pgimalac, @rben01, @ritchie46, @stinodego, @uchiiii, @universalmind303, @utkarshgupta137, @zaynetro and @zundertj