This flatten_df version flattens the DataFrame at every layer level, using a stack to avoid recursive calls.
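As a rough illustration of the stack-based idea, here is a plain-Python sketch over nested dicts (not the actual PySpark function; `flatten` and the sample row are made-up names for the example):

```python
def flatten(record):
    """Flatten nested dicts into one flat dict with dot-joined keys,
    using an explicit stack instead of recursion."""
    flat = {}
    stack = [("", record)]  # (prefix, nested mapping) pairs
    while stack:
        prefix, mapping = stack.pop()
        for key, value in mapping.items():
            name = f"{prefix}{key}"
            if isinstance(value, dict):
                # Defer nested structs instead of recursing into them.
                stack.append((name + ".", value))
            else:
                flat[name] = value
    return flat

row = {"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9}}}
flat_row = flatten(row)
```

The stack replaces the call stack of the recursive variants, so arbitrarily deep schemas cannot hit a recursion limit.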
You can do this elegantly with a recursive function that builds the select(..) statement by walking through the DataFrame.schema. The function takes a StructType and returns an Array[Column]. Every time the function hits a nested StructType, it calls itself and appends the returned Array[Column] to its own Array[Column].
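That recursive walk can be sketched without a Spark cluster; in this illustration, nested dicts stand in for StructType and dotted strings stand in for Column (`flatten_schema` and the sample schema are hypothetical, not Spark API):

```python
def flatten_schema(schema, prefix=None):
    """Walk a nested mapping (standing in for a StructType) and return
    the flat list of dotted column names (standing in for Array[Column])."""
    columns = []
    for name, dtype in schema.items():
        col_name = name if prefix is None else f"{prefix}.{name}"
        if isinstance(dtype, dict):
            # Hit a nested struct: recurse and append its columns to ours.
            columns.extend(flatten_schema(dtype, col_name))
        else:
            columns.append(col_name)
    return columns

schema = {"id": "long", "name": {"first": "string", "last": "string"}}
cols = flatten_schema(schema)  # the names you would pass to df.select(...)
```

In real Spark code the leaf case would build a Column via col(colName) instead of a plain string.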
If you run df.select('Parent.Child'), you get back a DataFrame with the values of the child column, and the column is named Child. But if you have identical attribute names under different parent structures, you lose the information about the parent and can end up with identical column names, which you can no longer access by name because they are ambiguous. Instead of calling flattenSchema and select() separately, which would return a DataFrame whose columns are named after the child of the last level, I mapped the original column names to themselves as strings, so that after selecting the Parent.Child column it is renamed Parent.Child instead of just Child (I also replaced dots with underscores for my convenience).

I added a DataFrame#flattenSchema method to the open source spark-daria project. Here's how you can use the flattenSchema() method.
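The parent-preserving renaming described above can be sketched in plain Python (hypothetical names; in Spark you would build col(path).as(alias) expressions instead of string pairs):

```python
def flatten_with_aliases(schema, prefix=None):
    """Return (source, alias) pairs: select the dotted path, then rename it
    to the full path with dots replaced by underscores, so that identically
    named children of different parents never collide."""
    pairs = []
    for name, dtype in schema.items():
        col_name = name if prefix is None else f"{prefix}.{name}"
        if isinstance(dtype, dict):
            pairs.extend(flatten_with_aliases(dtype, col_name))
        else:
            pairs.append((col_name, col_name.replace(".", "_")))
    return pairs

# Two parents each have a child named "city"; the aliases stay distinct.
schema = {"home": {"city": "string"}, "work": {"city": "string"}}
pairs = flatten_with_aliases(schema)
```

Keeping the full path in the alias is what preserves the parent information that a bare select('Parent.Child') would throw away.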
I typically reserve explode for flattening a list. For example, if you have a column idList that is a list of Strings, you could explode it into a new column flattenedId whose values are the individual Strings (no longer a list).
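To illustrate the row-level effect of explode, here is a plain-Python simulation (not the Spark API; `explode_rows` is a made-up helper): each input row yields one output row per list element.

```python
def explode_rows(rows, list_col, new_col):
    """Simulate Spark's explode: emit one row per element of row[list_col],
    storing the element under new_col and dropping the list column."""
    out = []
    for row in rows:
        for item in row[list_col]:
            new_row = {k: v for k, v in row.items() if k != list_col}
            new_row[new_col] = item
            out.append(new_row)
    return out

rows = [{"user": "a", "idList": ["x", "y"]}, {"user": "b", "idList": ["z"]}]
exploded = explode_rows(rows, "idList", "flattenedId")
```

Note that, like Spark's explode, this drops rows whose list is empty; that is why explode suits list columns rather than struct columns, where no rows should multiply or disappear.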