Polars: rs-0.39.0 Release

Release date: April 14, 2024
Previous version: rs-0.38.3 (released March 19, 2024)
Magnitude: 17,909 Diff Delta
Contributors: 37 total committers

219 Commits in this Release

Top Contributors in rs-0.39.0

ritchie46
orlp
stinodego
alexander-beedie
reswqa
MarcoGorelli
CanglongCl
mcrumiller
itamarst
JamesCE2001

Release Notes Published

πŸ† Highlights

  • Full plan CSE (#15264)

💥 Breaking changes

  • Rename memmap -> memory_map, as in Python (#15642)
  • Unify sort with SortOptions and SortMultipleOptions (#15590)
  • Update the argument name from dims to dimensions in reshape (#15561)
  • Allow specifying Hive schema in read/scan_parquet (#15434)
  • Raise error when schema_overrides contains nonexistent column name (#15290)
  • Rename Chunk to RecordBatch (#15298)
  • Refactor AnyValue supertype logic (#15280)
  • Rename group_by_rolling to rolling and improve related error messages (#14765)
  • Rename ChunkedArray.try_apply to try_apply_values (#14947)
  • Rename parameter by to group_by in DataFrame.upsample/group_by_dynamic/rolling (#14840)

🚀 Performance improvements

  • Fix cross join batch size when one of the DataFrames is tiny (#14347)
  • Fix binview growable complexity O(n*m) -> O(n) (#15628)
  • Remove extra thread spawn from row group fetcher (#15626)
  • Use vertical parallelism if input is chunked for Filter,Select,WithColumns (#15608)
  • read_ipc memory usage tests, and writing fix (#15599)
  • Refactor CSV serialization to not go through AnyValue (#15576)
  • don't use dynamic dispatch in visitors (#15607)
  • Improve Bitmap construction performance (#15570)
  • join by row-encoding (#15559)
  • Replace std::thread spawn with tokio block_in_place (#15517)
  • speed up offset_by when a single offset is passed (#15493)
  • Avoid allocation in the hot path for struct JSON serialization (#15449)
  • avoid double-allocation in rolling_apply_agg_window (#15423)
  • Make LogicalPlan immutable (#15416)
  • Add non-order preserving variable row-encoding (#15414)
  • Use row-encoding for multiple key group by (#15392)
  • load bits one word at a time for BitmapIter (#15333)
  • Ipc exec multiple paths (#15040)
  • add SIMD support for if-then-else kernels (#15131)
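
Several entries above (#15559, #15414, #15392) rely on row-encoding. As a rough illustration of the general idea, and not of Polars's actual encoding format, the plain-Python sketch below packs each row of integer keys into one byte string, so that a multi-column group-by becomes a single-key lookup and byte-wise comparison matches tuple ordering. The function names and the fixed-width integer layout are assumptions made for this example.

```python
import struct
from collections import defaultdict
from typing import Dict, List, Sequence


def encode_i64(value: int) -> bytes:
    """Encode a signed 64-bit integer so that byte order matches numeric order."""
    raw = bytearray(struct.pack(">q", value))  # big-endian two's complement
    raw[0] ^= 0x80  # flip the sign bit so negatives sort before positives
    return bytes(raw)


def encode_row(row: Sequence[int]) -> bytes:
    """Concatenate the fixed-width encodings of all key columns in one row."""
    return b"".join(encode_i64(v) for v in row)


def group_by_rows(rows: Sequence[Sequence[int]]) -> Dict[bytes, List[int]]:
    """Group row indices by their encoded key (hypothetical helper)."""
    groups: Dict[bytes, List[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[encode_row(row)].append(idx)
    return dict(groups)


rows = [(1, -5), (0, 7), (1, -5), (-3, 2)]
encoded = [encode_row(r) for r in rows]
groups = group_by_rows(rows)
```

Because every key has the same width here, sorting the encoded byte strings yields the same order as sorting the original tuples; a real encoder also has to handle nulls, variable-width values, and descending order.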

✨ Enhancements

  • add Expr.dt.add_business_days and Series.dt.add_business_days (#15595)
  • Add str.head and str.tail (#14425)
  • Extended BytecodeParser to handle additional math functions, and imports from the global namespace (#15627)
  • Push down is_between expressions to Arrow (#15180)
  • add holidays argument to business_day_count (#15580)
  • change default to write parquet statistics (#15597)
  • Expressify to_integer (#15604)
  • Optimizer; remove double SORT and redundant projections (#15573)
  • Add null_on_oob parameter to expr.array.get (#15426)
  • support weekend argument in business_day_count (#15544)
  • Enable is_first/last_distinct for not nested non-numeric list (#15552)
  • Turn off cse if cache node found (#15554)
  • Tag concat list as elementwise (#15545)
  • Support list group-by of non numeric lists (#15540)
  • add business_day_count function (#15512)
  • Add SQL support for MEDIAN aggfunc (#15519)
  • Implement string, boolean and binary dtype in top_k (#15488)
  • Add SQL support for TRUNCATE TABLE command (#15513)
  • Add SQL support for GREATEST and LEAST (#15511)
  • Allow specifying Hive schema in read/scan_parquet (#15434)
  • Implements agg_list for NullChunked (#15439)
  • Supports explode_by_offsets for decimal (#15417)
  • Add null_on_oob parameter to expr.list.get (#15395)
  • CSV-writer escape carriage return (#15399)
  • Remove 'FileCacher' optimization (#15357)
  • check input type in entropy (#15351)
  • Implements arr.n_unique (#15296)
  • CSE don't scan share if predicate pushdown predicates don't match (#15328)
  • Remove cached nodes when finished (#15310)
  • Full plan CSE (#15264)
  • Add IR for expressions. (#15168)
  • Warn if map_elements is called without return_dtype specified (#15188)
  • Rename group_by_rolling to rolling and improve related error messages (#14765)
  • Rename ChunkedArray.try_apply to try_apply_values (#14947)
  • Implement strict AnyValue construction for temporal types (#15146)
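
The business-day entries above (#15512, #15544, #15580) add a business_day_count function along with weekend and holidays arguments. The sketch below is a plain-Python illustration of what such a count means, not the Polars implementation: it assumes a half-open range [start, end), a start date no later than the end date, and a hypothetical week_mask tuple indexed from Monday to Sunday.

```python
from datetime import date, timedelta
from typing import Iterable, Sequence

# Assumed default: Monday to Friday are business days.
DEFAULT_WEEK_MASK = (True, True, True, True, True, False, False)


def business_day_count(
    start: date,
    end: date,
    week_mask: Sequence[bool] = DEFAULT_WEEK_MASK,
    holidays: Iterable[date] = (),
) -> int:
    """Count business days in [start, end), skipping masked days and holidays."""
    holiday_set = set(holidays)
    count = 0
    day = start
    while day < end:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        if week_mask[day.weekday()] and day not in holiday_set:
            count += 1
        day += timedelta(days=1)
    return count


# 2024-04-01 is a Monday, so the range below spans exactly two full weeks.
two_weeks = business_day_count(date(2024, 4, 1), date(2024, 4, 15))
with_holiday = business_day_count(
    date(2024, 4, 1), date(2024, 4, 15), holidays=[date(2024, 4, 10)]
)
```

A day-by-day loop keeps the semantics easy to read; it is not meant to reflect how a vectorized implementation would compute the result.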

🐞 Bug fixes

  • Return appropriate data type for time mean and median (#14471)
  • Support index upsampling (#13621)
  • Fix issue in write_excel that could lead to incorrect spanning range determination (#15631)
  • Output correct dtype for mean_horizontal on a single column (#15118)
  • Recompute RowIndex schema after projection pd (#15625)
  • Mean of boolean in streaming group_by incorrectly always gave NULL (#15616)
  • Include cloud creds in cache key (#15609)
  • Fix elementwise-apply if any input is AggregatedScalar (#15606)
  • Explode list should take validity into account (#15572)
  • use larger recursive stack in debug mode (#15593)
  • SQL interface "off-by-one" indexing error with GROUP BY clauses that use position ordinals (#15584)
  • Enable missing features in polars-time (#15558)
  • Handle quoted identifiers when registering CTEs in the SQL engine (#15564)
  • Decompress moved out of schema initialization (#15550)
  • Turn off cse if cache node found (#15554)
  • Resolve function names and prune all aliases. (#15522)
  • list.get should take validity into account (#15516)
  • block decimal in streaming (#15520)
  • group_by partitioned with literal Series panic (#15487)
  • Initialize validity for GroupsProxy::Slice windows (#15509)
  • Fix struct name resolving (#15507)
  • pow return type evaluation (#15506)
  • Allow selectors inside frame-level .filter() (#15445)
  • Don't prune alias in AnonymousFunction subtree (#15453)
  • Fix deadlock in async parquet scan (#15440)
  • datetime operations (e.g. .dt.year) were raising when null values were backed by out-of-range integers (#15420)
  • Ensure Binary -> Binview cast doesn't overflow the buffer size (#15408)
  • Don't prune alias in function subtree (#15406)
  • Return 0 for n_unique() in group-by context when group is empty (#15289)
  • Unset UpdateGroups after group-sensitive expression (#15400)
  • to_any_value should support all LiteralValue types (#15387)
  • Hash failure combining hash of two numeric columns containing equal values (#15397)
  • Add FixedSizeBinary to arrow field conversion (#15389)
  • Conversion of expr_ir in partition fast path (#15388)
  • sort for series with unsupported dtype should raise instead of panic (#15385)
  • Return correct dtype for s.clear() when dtype is Object (#15315)
  • ensure first datapoint is always included in group_by_dynamic (#15312)
  • Non-exhaustive patterns: arrow-schema::DataType in polars-arrow (#15250)
  • use dynamic stacks for problematic recursive functions (#15355)
  • Raise error when schema_overrides contains nonexistent column name (#15290)
  • Fix cache dot visualization (#15311)
  • Properly propagate strict flag when constructing a Struct Series from any values (#15302)
  • ensure eq for BinaryViewArray checks all elements (#15268)
  • Raise when join projects name with suffix that doesn't exist (#15256)
  • fix kurtosis/skew (#15137)
  • Ensure ooc_start is set (#15255)
  • Fix bug where rolling operations were ignoring check_sorted in some cases (#15227)
  • Fix lazy schema for rle expression (#15248)
  • incorrect negative offset in multi-byte string slicing (#15140)
  • do not clamp negative offsets to start of array prematurely (#15242)
  • allow null index in list.get and array.get (#15239)
  • properly support nulls_last + descending (#15212)
  • Block rounding/truncating to negative durations (#15175)
  • Make parse_url work on windows with object_store (#15191)
  • divide by zero in download speed computation (#15182)

📖 Documentation

  • Add legacy CPU install instructions in user guide (#13676)
  • Various minor updates to User Guide's SQL intro section (#15557)
  • Add outer_coalesce join strategy in the user guide (#15405)
  • Improve docs for Series::new with AnyValue input (#15306)
  • Fix formatting in Series::from_any_values_and_dtype docs (#15244)
  • Correct the definition of an expression in the user guide (#14750)

📦 Build system

  • Fix a feature gate for lz4 compression in polars-parquet (#15565)
  • Update Rust toolchain (#15353)

🛠️ Other improvements

  • Rename memmap -> memory_map, as in Python (#15642)
  • fixup failing test due to offset deprecation in upsample (#15636)
  • use bound api (#15630)
  • Don't run streaming group-by in partitionable gb (#15611)
  • Unify sort with SortOptions and SortMultipleOptions (#15590)
  • remove try_binary_elementwise_values (#15592)
  • remove raw pointers from visitors. (#15579)
  • rename to IR (#15571)
  • Update the argument name from dims to dimensions in reshape (#15561)
  • Rename ALogicalPlan to FullAccessIR (#15553)
  • Set up CodSpeed (#15537)
  • make dsl immutable and cheap to clone (#15394)
  • use recursive crate, add missing recursive tag (#15393)
  • Update CODEOWNERS (polars-sql) (#15384)
  • Update Rust toolchain (#15353)
  • Update CODEOWNERS (#15352)
  • remove try_apply_values (#15336)
  • always use non-legacy float_sum for mean (#15343)
  • remove legacy bitmap module (#15335)
  • More clippy in Makefile (#15340)
  • Rename Cache[count] to Cache[cache_hits] (#15300)
  • Cleanup file_caching optimization call (#15299)
  • Rename Chunk to RecordBatch (#15298)
  • Refactor AnyValue supertype logic (#15280)
  • reuse message parsing in IPC (#15265)
  • remove 'fast-projection' node (#15253)
  • cleanup column names in optimizer (#15252)
  • remove left_most_input_name from expr ir (#15251)
  • add AlignedBitmapSlice (#15171)
  • Refactor AnyValue construction for Categorical/Enum dtype (#15220)
  • Move ConsecutiveCountState into support module (#15186)
  • Run non-benchmark tests in benchmark workflow (#15207)
  • Add wrapping_abs to arithmetic kernel (#15210)
  • remove raw buffers from BinViewArray (#15206)
  • Enable RUST_BACKTRACE=1 in the CI test suite (#15204)
  • Rename parameter by to group_by in DataFrame.upsample/group_by_dynamic/rolling (#14840)
  • Set dual license for polars-arrow and polars-parquet (#15173)
  • remove parts of legacy bit_util (#15169)
  • remove legacy arrow compute (#15164)

Thank you to all our contributors for making this release possible! @CanglongCl, @ChayimFriedman2, @Fokko, @JamesCE2001, @MarcoGorelli, @NedJWestern, @Sol-Hee, @TrevorWinstral, @alexander-beedie, @braaannigan, @c-peters, @cmdlineluser, @cojmeister, @deanm0000, @dependabot, @dependabot[bot], @douglas-raillard-arm, @eitsupi, @filabrazilska, @henryharbeck, @i-aki-y, @itamarst, @kszlim, @leoforney, @mbuhidar, @mcrumiller, @mickvangelderen, @nameexhaustion, @orlp, @ozgrakkurt, @petrosbar, @reswqa, @ritchie46, @rob-sil, @sportfloh, @stinodego, @thomaslin2020 and @yutannihilation