Polars: rs-0.43.0 Release

Release date: September 11, 2024
Previous version: rs-0.42.0 (released August 14, 2024)
Magnitude: 36,752 Diff Delta
Contributors: 44 total committers

578 Commits in this Release


Top Contributors in rs-0.43.0

orlp
nameexhaustion
coastalwhite
ritchie46
MarcoGorelli
stinodego
mcrumiller
adamreeve
alexander-beedie
deanm0000


Release Notes Published

🏆 Highlights

  • Add support for IO[bytes] and bytes in scan_{...} functions (#18532)
  • Add IEJoin algorithm for non-equi joins and support Full non-equi joins (#18365)

🚀 Performance improvements

  • Back arrow arrays with SharedStorage which can have non-refcounted static slices (#18666)
  • Don't traverse file list twice for extension validation (#18620)
  • Remove cloning of ColumnChunkMetadata (#18615)
  • Add upfront partitioning in ColumnChunkMetadata (#18584)
  • Enable Parquet parallel=prefiltered for auto (#18514)
  • Change PlSmallStr impl from Arc<str> to compact_str (#18508)
  • Added optimizer rules for is_null().all() and similar expressions to use null_count() (#18359)
  • Parquet do not copy uncompressed pages (#18441)
  • Several large parquet optimizations (#18437)
  • Batch Plain Parquet UTF-8 verification (#18397)
  • Partition metadata for parquet statistic loading (#18343)
  • Fix accidental quadratic parquet metadata (#18327)
  • Lazy decompress Parquet pages (#18326)
  • Don't rechunk aligned chunks in owned_binary_chunk_align (#18314)
  • Batch DELTA_LENGTH_BYTE_ARRAY decoding (#18299)
  • Slice pushdown for SimpleProjection (#18296)
  • Use direct path for time/timedelta literals (#18223)
  • Speedup ndjson reader ~40% (#18197)
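The optimizer rule for `is_null().all()` (#18359) rests on a simple equivalence: if a column's null count is already known, the answer follows from comparing two integers instead of scanning a boolean mask. The plain-Python sketch below shows only that equivalence. It is not Polars code, the helper names are hypothetical, and treating `is_null().any()` as one of the "similar expressions" is an assumption.

```python
from typing import Optional, Sequence


def all_null_scan(values: Sequence[Optional[int]]) -> bool:
    """Reference answer: inspect every element (O(n))."""
    return all(v is None for v in values)


def all_null_from_count(length: int, null_count: int) -> bool:
    """Same answer from metadata only: every element is null (O(1))."""
    return null_count == length


def any_null_from_count(null_count: int) -> bool:
    """Assumed sibling rewrite: at least one element is null (O(1))."""
    return null_count > 0


all_nulls = [None, None, None]
mixed = [1, None, 3]

for column in (all_nulls, mixed, []):
    # Stand-in for a null count that a column would already track.
    null_count = sum(v is None for v in column)
    assert all_null_scan(column) == all_null_from_count(len(column), null_count)
```

The empty column is included because both formulations agree there as well: `all` over nothing is true, and `0 == 0`.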

✨ Enhancements

  • Add support for IO[bytes] and bytes in scan_{...} functions (#18532)
  • Add IEJoin algorithm for non-equi joins and support Full non-equi joins (#18365)
  • Make expressions containing Python UDFs serializable (#18135)
  • Support Serde for IRPlan (#18433)
  • Respect input time zone if input is pandas Timestamp (#18346)
  • Add POLARS_BACKTRACE_IN_ERR for debugging (#18333)
  • IR serde (#18298)
  • Improve decimal_comma error message (#18269)
  • Support pre-signed URLs for cloud scan (#18274)
  • Support empty structs (#18249)
  • Allow float in interpolate_by by column (#18015)

🐞 Bug fixes

  • Scalar checks (#18627)
  • Scanning hive partitioned files where hive columns are partially included in the file (#18626)
  • Enable "polars-json/timezones" feature from "polars-io" (#18635)
  • Use Buffer<T> in ObjectSeries, fixes variety of offset bugs (#18637)
  • Properly slice validity mask on pl.Object series (#18631)
  • Indicative error in list.gather when wrong indices type is supplied (#18611)
  • Fix group first value after group-by slice (#18603)
  • Functions for streaming require streaming feature (#18602)
  • Allow for date/datetime subclasses (e.g. pd.Timestamp, FreezeGun) in pl.lit (#18497)
  • Fix UnitVec inline clone and with_capacity (#18586)
  • Ensure result name of pow matches schema in grouped context (#18533)
  • Decimal mean agg dtype was incorrect in IR (#18577)
  • Fix output type for list.eval in certain cases (#18570)
  • Fix map_elements for List return dtypes (#18567)
  • Do not remove double-sort if maintain_order=True (#18561)
  • Empty any_horizontal should be false, not true (#18545)
  • Fix type inference error in map_elements for List types (#18542)
  • Added proper handling of file.write for large remote csv files (#18424)
  • Handle Parquet projection pushdown with only row index (#18520)
  • Properly raise on invalid selector expressions (#18511)
  • Wrong output column name in or and xor operations (#18512)
  • Various schema corrections (#18474)
  • Don't drop objects on empty buffers (#18469)
  • Add missing chunk align in pipe sink (#18457)
  • Expr.sign should preserve dtype (#18446)
  • Enable CSE in eager if struct are expanded (#18426)
  • Treat explode as gather (#18431)
  • Fencepost error in debug assertion in splitfields (#18423)
  • Unsoundness in CSV SplitFields (#18413)
  • Parquet nested values that span several pages (#18407)
  • Support reading empty parquet files (#18392)
  • Recurse on map field during type conversion (#15075)
  • Allow search_sorted on boolean series (#18387)
  • Mark Expr.(lower|upper)_bound as returning scalar (#18383)
  • Fix broken feature gate for ParquetReader (#18376)
  • Fix compressed ndjson row count (#18371)
  • Use correct column names when there are no value columns in unpivot (#18340)
  • Parquet several smaller issues (#18325)
  • Fix group-by slice on all keys (#18324)
  • Compute joint null mask before calling rolling corr/cov stats (#18246)
  • Several scan_parquet(parallel='prefiltered') problems (#18278)
  • Json feature flag missing imports (#18305)
  • Check groups in group-by filter (#18300)
  • Make json readers ignore BOM character (#18240)
  • Parquet delta encoding for 0-bitwidth miniblocks (#18289)
  • Arguments for upsample only have to be sorted within groups (#18264)
  • Use appropriate bins in hist when bin_count specified (#16942)
  • Raise suitable error on unsupported SQL set op syntax (#18205)
  • Fix invalid state due to cached IR (#18262)
  • Fix failed AWS credential load from '~/.aws/credentials' due to formatting (#18259)
  • Fix panic streaming parquet scan from cloud with slice (#18202)
  • Consistently round half-way points down in dt.round (#18245)
  • Fix duplicate column output and panic for include_file_paths (#18255)
  • Fix unit null rank (#18252)
  • Use physical for row-encoding (#18251)
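The `any_horizontal` fix (#18545) matches the usual convention for a logical OR over zero inputs: the result is the identity element of OR, which is false. The sketch below shows that convention in plain Python. It is not Polars code, and `any_horizontal_row` is a hypothetical helper that folds the values of a single row.

```python
import operator
from functools import reduce
from typing import Iterable


def any_horizontal_row(values: Iterable[bool]) -> bool:
    """OR together the values of one row, starting from OR's identity (False)."""
    return reduce(operator.or_, values, False)


# With no input columns there is nothing that could make the result true.
empty_result = any_horizontal_row([])

# Python's built-in any() follows the same convention.
builtin_result = any([])

# Ordinary rows behave as expected.
mixed_result = any_horizontal_row([False, True, False])
all_false_result = any_horizontal_row([False, False])
```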

📖 Documentation

  • Fix multiprocessing docs regarding fork method check (#18563)
  • Pre-compute plugin_path before defining plugin (#18503)
  • Fix BinViewChunkedBuilder arguments (#17277) (#18439)
  • Add date_range and datetime_ranges examples without eager=True (#18379)
  • Document POLARS_BACKTRACE_IN_ERR env var (#18354)
  • Document DataFrame.__getitem__ and Series.__getitem__ (#18309)
  • Improve decimal_comma error message (#18269)
  • Clarify coalesce behaviour in join_asof (#18273)
  • Add note to Expr.shuffle differentiating from df method (#18266)

📦 Build system

  • Remove extension-module from polars-python (#18554)
  • Bump Rust toolchain to nightly-2024-08-26 (#18370)

🛠️ Other improvements

  • Push down max row group height calc to file metadata (#18674)
  • Re-use already decoded metadata for first path (new-parquet-source) (#18656)
  • Remove duplicate byte range calc from new parquet source (#18655)
  • Fix a bunch of tests for new-streaming (#18659)
  • Rename MemSlice::from_slice -> MemSlice::from_static (#18657)
  • Don't raise on multiple same names in ie_join (#18658)
  • Split parquet_source.rs in new-streaming (#18649)
  • Check predicates in join_where (#18648)
  • Feature gate iejoin (#18646)
  • Scan from BytesIO in new-streaming parquet source (#18643)
  • Rename MetaData -> Metadata (#18644)
  • Change join_where semantics (#18640)
  • Fix unimplemented panics to give todo!s for AUTO_NEW_STREAMING (#18628)
  • Remove extra schema traits (#18616)
  • One simplify expression module and keep utility local (#18621)
  • Check number of binary comparisons in join_where predicates (#18608)
  • Raise on suffixed predicate in join_where (#18607)
  • Fix Python docs build (#18605)
  • Fix nan-ignoring max/min in new-streaming (#18593)
  • Correctly support more types in new-streaming sum (#18580)
  • Bump NodeTraverser major version (#18576)
  • Fix mean reduction in new-streaming (#18572)
  • Rename data_type -> dtype (#18566)
  • Refactor ArrowSchema to use polars_schema::Schema<D> (#18564)
  • Remove NotifyReceiver from new-streaming parquet source (#18540)
  • Refactor Schema to use generic struct from new polars-schema crate (#18539)
  • Temporarily pin NumPy in CI to address dependency resolving issue (#18544)
  • Fix and extend AnyValue comparison (#18534)
  • Remove top-level metadata from ArrowSchema (#18527)
  • Add FromIterator impls for PlSmallStr (#18509)
  • Update PlSmallStr comment (#18518)
  • Change PlSmallStr impl from Arc<str> to compact_str (#18508)
  • Make expressions containing Python UDFs serializable (#18135)
  • Allow polars to pass cargo check on windows (#18498)
  • Remove From<&&str> for PlSmallStr (#18507)
  • Change naming to new benchmark setup (#18473)
  • More refactor for PlSmallStr (#18456)
  • Split Reduction into it plus ReductionState (#18460)
  • Remove a string allocation in Parquet (#18466)
  • Unify internal string type (#18425)
  • Remove network call in hf docs (#18454)
  • Remove old streaming flag if we're going into new streaming (#18438)
  • Address spurious hypothesis test failure (#18434)
  • Add pl.length() reduction and small new-streaming fixes (#18429)
  • Fencepost error in debug assertion in splitfields (#18423)
  • Group arguments in conversion in a Context (#18418)
  • Turn all Binary/Utf8 into BinaryView/Utf8View in Parquet (#18331)
  • Recursively evaluate is_elementwise for function expressions (#18385)
  • Various small fixes for the new streaming engine (#18384)
  • Temporarily add ability to disable parquet source node (#18378)
  • Improve dot formatting of new-streaming parquet source (#18367)
  • Fix the required version of rust in README.md (#18357)
  • Only instantiate used portion of graph (#18337)
  • Fix new_streaming parameter (#18342)
  • Add parquet source node to new streaming engine (#18152)
  • Disable common sub-expr elim for new streaming engine (#18330)
  • Remove unused Parquet indexes (#18329)
  • Lower arbitrary expressions in the new streaming engine (#18315)
  • Expose many more function expressions to python IR (#18317)
  • Add Graphviz physical plan visualization for new streaming engine (#18307)
  • Add DataFrame::new_with_broadcast and simplify column uniqueness checks (#18285)
  • Add output_schema to all PhysNodes (#18272)
  • Change fn schema to fn collect_schema (#18261)
  • Add multiplexer node to new streaming engine (#18241)
  • Add feature gates for polars-python crate (#18232)
  • Split py-polars crate (#18204)
  • Update the required version of rust in README.md (#18203)
  • Add itertools in utils (#18213)
  • Use or_else for raising (#18206)
  • Remove unused Parquet source files (#18193)

Thank you to all our contributors for making this release possible! @0xbe7a, @BartSchuurmans, @ChayimFriedman2, @MarcoGorelli, @StepfenShawn, @WbaN314, @adamreeve, @agossard, @alexander-beedie, @alonme, @barak1412, @cgbur, @coastalwhite, @corwinjoy, @deanm0000, @dependabot, @dependabot[bot], @eitsupi, @henryharbeck, @ion-elgreco, @jqnatividad, @krasnobaev, @liufeimath, @markxwang, @mcrumiller, @megaserg, @nameexhaustion, @orlp, @philss, @r-brink, @ritchie46, @skellys, @squnit, @stinodego, @sunadase, @thomascamminady and @wence-