I would like to know the most efficient way to generate an index column that uniquely identifies each record within its group of label:
+-------+-------+-------+
| label | value | index |
+-------+-------+-------+
| a     | v1    | 0     |
| a     | v2    | 1     |
| a     | v3    | 2     |
| a     | v4    | 3     |
| b     | v5    | 0     |
| b     | v6    | 1     |
+-------+-------+-------+
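For reference, a small DataFrame matching the table above could be built like this (a minimal sketch for illustration; the SparkSession setup and literal values are assumptions, my real data is much larger):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny sample mirroring the table above
df = spark.createDataFrame(
    [("a", "v1"), ("a", "v2"), ("a", "v3"), ("a", "v4"), ("b", "v5"), ("b", "v6")],
    ["label", "value"],
)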
My actual data is very large, and each label group has the same number of records. The index column will be used for a pivot. I could do the usual sort, then loop with an incrementing counter that resets whenever the current label differs from the previous one, but a faster and more efficient way is always welcome.
EDIT: I got my answer from the suggested question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each row within its label group, ordered by value
df = df.withColumn(
    "index",
    F.row_number().over(Window.partitionBy("label").orderBy("value")),
)
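Note that row_number() starts at 1, so to match the 0-based index in the example table you can subtract 1. The pivot step might then look roughly like this (a sketch only; the F.first aggregation is an assumption about what the pivot should produce):

# row_number() is 1-based; subtract 1 for a 0-based index
df = df.withColumn(
    "index",
    F.row_number().over(Window.partitionBy("label").orderBy("value")) - 1,
)

# Sketch of the pivot: one row per index, one column per label
pivoted = df.groupBy("index").pivot("label").agg(F.first("value"))
pivoted.show()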
Thank you for all your help!