Points of Interest: User Defined Aggregate Functions In Spark Dataframes

PlaceIQ > Blog  > Points of Interest: User Defined Aggregate Functions In Spark Dataframes

Points of Interest: User Defined Aggregate Functions In Spark Dataframes

By Paul Brenner, Data Scientist, PlaceIQ

 

Why hello there friend. Here at PlaceIQ there is nothing that the Data Science team loves more than Scala, Spark, and Zeppelin notebooks [1]. From there, things get murkier. Some of our data scientists fear the glorious shining light of the future (DataFrames) and still swim primarily in the muddy pools of the past (RDDs). No thank you. This post is exclusively about DataFrames.

 

According to a quick sampling of the internet, the normal thing to do is to write an introductory post that is broadly applicable to anyone starting out with a technology. Guess what? This post isn’t that. If you need an intro to DataFrames you are just going to have to go here, here, maybe here, or I don’t know, basically anywhere else on the Google except this post. Instead I want to dive headfirst into the dark scary depths and introduce a topic that is not well documented: User Defined Aggregate Functions aka UDAFs.

 

UDAFs are functions that can be called during a groupBy to calculate… something… about the rows in each group. The benefit of learning to write UDAFs is obvious: it allows you to use UDAFs [2]. Actually, for most people that is not the benefit of learning to write UDAFs. Honestly,