I have a task to write SQL to sanitise and de-duplicate an imported data-set.
pk1 | pk2 | timestamp | value1 | value2 | value3 | etc
-----+-----+-----------+--------+--------+--------+-----
1 | 2 | 123 | 1 | 2 | 5 | ...
1 | 2 | 124 | 1 | 2 | 4 | ...
1 | 2 | 125 | 1 | 2 | 3 | ... Either this row
1 | 2 | 125 | 1 | 2 | 2 | ... Or this row (arbitrary)
3 | 2 | 123 | 1 | 2 | 5 | ...
3 | 2 | 123 | 1 | 2 | 4 | ...
3 | 2 | 124 | 1 | 2 | 3 | ...
3 | 2 | 125 | 1 | 2 | 2 | ... Only this row
The two pk fields are the composite primary key.
The timestamp field identifies when the data was generated.
I need one row per pk1, pk2, with highest timestamp taking priority. There can still be duplicates (1, 2, 125 appears twice in the set above), at which point an arbitrary row should be selected and a field set so signify that this was an arbitrary selection.
I have answers for MySQL and RDBMS that support ANALYTICAL_FUNCTIONS()...
MySQL :
SELECT
import.*,
CASE WHEN COUNT(*) = 1 THEN 0 ELSE 1 END AS AS duplicate_warning
FROM
import
INNER JOIN
(
SELECT pk1, pk2, MAX(timestamp) AS timestamp
FROM import
GROUP BY pk1, pk2
)
AS import_lookup
ON import_lookup.pk1 = import_lookup.pk1
AND import_lookup.pk2 = import_lookup.pk2
AND import_lookup.timestamp = import_lookup.timestamp
GROUP BY
import.pk1,
import.pk2
ANALYTICAL_FUNCTIONS() :
SELECT
sorted_import.*
FROM
(
SELECT
import.*,
CASE WHEN
COUNT(*) OVER (PARTITION BY pk1, pk2, timestamp) = 1
AND
MAX(timestamp) OVER (PARTITION BY pk1, pk2) = timestamp
THEN
0
ELSE
ROW_NUMBER() OVER (PARTITION BY pk1, pk2 ORDER BY timestamp DESC)
END AS duplicate_warning
FROM
import
)
AS sorted_import
WHERE
sorted_import.duplicate_warning IN (0, 1)
How can this be accomplished using SQLite?
One restriction (I don't make these rules) : Can't use temp tables or auto-increment fields.