2
votes

Generating unique / random serial numbers with group_by in R using dplyr

I would like to generate unique numbers (serial or random), grouped by certain columns, using R.

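A sample dataset is provided below:

fact_code  style_         item             buyer
1206       -23            LADIES TANK TOP  652
1206       -23            LADIES TANK TOP  652
1206       -23            LADIES TANK TOP  652
1214       593935_592435  SS T-SHIRT       254
1214       593935_592435  SS T-SHIRT       254
1214       593935_592435  SS T-SHIRT       254
7022       1572472        T-SHIRT          338
7022       1572472        T-SHIRT          338
7022       1572472        T-SHIRT          338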

Using the data above, I would like to create a variable, say style_serial, that looks like this:

fact_code  style_         item             buyer style_serial
1206       -23            LADIES TANK TOP  652   1
1206       -23            LADIES TANK TOP  652   1
1206       -23            LADIES TANK TOP  652   1
1214       593935_592435  SS T-SHIRT       254   2
1214       593935_592435  SS T-SHIRT       254   2
1214       593935_592435  SS T-SHIRT       254   2
7022       1572472        T-SHIRT          338   3
7022       1572472        T-SHIRT          338   3
7022       1572472        T-SHIRT          338   3

That is, create a variable that takes one unique value per group, where the groups are defined by the columns fact_code, style_, item and buyer. I tried the following R code using the dplyr package:

df <- df %>%
  dplyr::group_by(fact_code, style_, buyer) %>%
  dplyr::mutate(style_serial = 1:n())

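where df is the name of the sample data frame above. But this gives me unexpected output:

fact_code  style_         item             buyer style_serial
1206       -23            LADIES TANK TOP  652   1
1206       -23            LADIES TANK TOP  652   2
1206       -23            LADIES TANK TOP  652   3
1214       593935_592435  SS T-SHIRT       254   1
1214       593935_592435  SS T-SHIRT       254   2
1214       593935_592435  SS T-SHIRT       254   3
7022       1572472        T-SHIRT          338   1
7022       1572472        T-SHIRT          338   2
7022       1572472        T-SHIRT          338   3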

I wouldn't mind if style_serial were a random set of integers, so that the data look like this:

fact_code  style_         item             buyer style_serial
1206       -23            LADIES TANK TOP  652   10
1206       -23            LADIES TANK TOP  652   10
1206       -23            LADIES TANK TOP  652   10
1214       593935_592435  SS T-SHIRT       254   2
1214       593935_592435  SS T-SHIRT       254   2
1214       593935_592435  SS T-SHIRT       254   2
7022       1572472        T-SHIRT          338   100
7022       1572472        T-SHIRT          338   100
7022       1572472        T-SHIRT          338   100

To generate the table above, I ran the following R code:

df <- df %>%
  dplyr::group_by(fact_code, style_, buyer) %>%
  dplyr::mutate(style_serial = sample(1:6000, n(), replace = FALSE))

However, I am not able to get the desired output.

The main goal is to create a variable, in this case style_serial, that takes one unique value per group, where the groups are defined by a number of columns, here fact_code, style_, item and buyer.

Any help would be appreciated.

r dplyr dataframe

0 comments

4 Answers:

2
votes

We can use group_indices from dplyr:

library(dplyr)
df %>%
   mutate(style_serial = sample(6000)[group_indices(., fact_code, style_, buyer)])
# fact_code        style_            item buyer style_serial
#1      1206           -23 LADIES TANK TOP   652         5778
#2      1206           -23 LADIES TANK TOP   652         5778
#3      1206           -23 LADIES TANK TOP   652         5778
#4      1214 593935_592435      SS T-SHIRT   254          998
#5      1214 593935_592435      SS T-SHIRT   254          998
#6      1214 593935_592435      SS T-SHIRT   254          998
#7      7022       1572472         T-SHIRT   338         3018
#8      7022       1572472         T-SHIRT   338         3018
#9      7022       1572472         T-SHIRT   338         3018

NOTE: the numbers are random because of sample; if that is not needed, remove the sample part:

df %>%
  mutate(style_serial = group_indices(., fact_code, style_, buyer))

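Or using base R:

# build one key per row from the first three columns (fact_code, style_, item)
v1 <- with(df, do.call(paste, df[1:3]))
# number the keys in order of first appearance
df$style_serial <- match(v1, unique(v1))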

data

df <- structure(list(fact_code = c(1206L, 1206L, 1206L, 1214L, 1214L, 
1214L, 7022L, 7022L, 7022L), style_ = c("-23", "-23", "-23", 
"593935_592435", "593935_592435", "593935_592435", "1572472", 
"1572472", "1572472"), item = c("LADIES TANK TOP", "LADIES TANK TOP", 
"LADIES TANK TOP", "SS T-SHIRT", "SS T-SHIRT", "SS T-SHIRT", 
"T-SHIRT", "T-SHIRT", "T-SHIRT"), buyer = c(652L, 652L, 652L, 
254L, 254L, 254L, 338L, 338L, 338L)), class = "data.frame", row.names = c(NA, 
-9L))
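
In dplyr 1.0.0 and later, passing grouping columns directly to group_indices() is deprecated, and cur_group_id() inside a grouped mutate() is the suggested replacement. The following is a minimal sketch of that variant, not part of the original answer; it uses a shortened, hypothetical copy of the sample data so that it runs on its own.

```r
library(dplyr)

# Shortened, hypothetical copy of the question's data
df <- data.frame(
  fact_code = c(1206L, 1206L, 1214L, 1214L, 7022L),
  style_    = c("-23", "-23", "593935_592435", "593935_592435", "1572472"),
  buyer     = c(652L, 652L, 254L, 254L, 338L)
)

# cur_group_id() returns one integer per group, numbered in sorted key order
out <- df %>%
  group_by(fact_code, style_, buyer) %>%
  mutate(style_serial = cur_group_id()) %>%
  ungroup()

out$style_serial
# 1 1 2 2 3
```

Indexing a shuffled vector with these ids, as in sample(6000)[style_serial], would give the random variant, provided there are no more than 6000 groups.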

1 comment

Adding group_indices to my list of "extremely situational functions". I had no idea it existed, very elegant.

1
votes

You can use rleid from data.table, i.e.

library(dplyr)
df %>% 
 mutate(style = data.table::rleid(fact_code, style_, item))
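
One caveat worth illustrating: rleid numbers runs of consecutive identical values, so it yields one id per group only when the rows of each group are adjacent, as they are in the sample data. The sketch below uses a hypothetical vector, key, to show the difference from a first-appearance numbering; it is an illustration, not part of the original answer.

```r
library(data.table)

# Hypothetical key in which group "a" appears in two separate runs
key <- c("a", "a", "b", "a")

run_ids   <- rleid(key)               # a new id for every run
group_ids <- match(key, unique(key))  # one id per distinct value

run_ids    # 1 1 2 3
group_ids  # 1 1 2 1
```

If the rows of a group may be scattered, sorting by the grouping columns first, or using a group-based id instead, avoids the split ids.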

0 comments

1
votes

One way with dplyr, without additional packages:

df %>%
  mutate(
    # add 1 to the counter whenever the key differs from the previous row's key;
    # coalesce() turns the NA produced by lag() on the first row into 1
    style_serial = cumsum(
      coalesce(as.numeric(paste0(fact_code, style_, buyer) != lag(paste0(fact_code, style_, buyer))), 1)
      )
  )

With data.table only:

library(data.table)

setDT(df)[, style_serial := .GRP, by = .(fact_code, style_, buyer)]

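Result in both cases:

   fact_code        style_          item buyer style_serial
1:      1206           -23 LADIESTANKTOP   652            1
2:      1206           -23 LADIESTANKTOP   652            1
3:      1206           -23 LADIESTANKTOP   652            1
4:      1214 593935_592435     SST-SHIRT   254            2
5:      1214 593935_592435     SST-SHIRT   254            2
6:      1214 593935_592435     SST-SHIRT   254            2
7:      7022       1572472       T-SHIRT   338            3
8:      7022       1572472       T-SHIRT   338            3
9:      7022       1572472       T-SHIRT   338            3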
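
A possible pitfall with the paste0() key used in the dplyr approach of this answer: concatenating columns without a separator can make two different groups produce the same string. The sketch below uses hypothetical values to show the collision and one way around it. The separator "|" is an assumption and must not occur in the data.

```r
# Hypothetical values: two different (fact_code, style_) pairs
fact_code <- c(12L, 1L)
style_    <- c("3", "23")

key_plain <- paste0(fact_code, style_)           # both become "123"
key_sep   <- paste(fact_code, style_, sep = "|") # "12|3" and "1|23"

key_plain[1] == key_plain[2]  # TRUE: the two groups collide
key_sep[1] == key_sep[2]      # FALSE: the groups stay distinct
```

With the sample data in the question the keys happen not to collide, so the result shown is unaffected.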

0 comments

2
votes

A fully dplyr solution: create a lookup table and join it to the base table.

# lookup table: one row per group, numbered in order
serial_df <- df %>%
  dplyr::group_by(fact_code, style_, buyer) %>% 
  dplyr::summarise() %>% 
  dplyr::ungroup() %>% 
  dplyr::mutate(style_serial = dplyr::row_number())

dplyr::left_join(df, serial_df)
#> Joining, by = c("fact_code", "style_", "buyer")
#>   fact_code        style_            item buyer style_serial
#> 1      1206           -23 LADIES TANK TOP   652            1
#> 2      1206           -23 LADIES TANK TOP   652            1
#> 3      1206           -23 LADIES TANK TOP   652            1
#> 4      1214 593935_592435      SS T-SHIRT   254            2
#> 5      1214 593935_592435      SS T-SHIRT   254            2
#> 6      1214 593935_592435      SS T-SHIRT   254            2
#> 7      7022       1572472         T-SHIRT   338            3
#> 8      7022       1572472         T-SHIRT   338            3
#> 9      7022       1572472         T-SHIRT   338            3

If you want it as a "one-liner":

df <- df %>% dplyr::left_join(
  df %>%
    dplyr::group_by(fact_code, style_, buyer) %>% 
    dplyr::summarise() %>% 
    dplyr::ungroup() %>% 
    dplyr::mutate(style_serial = dplyr::row_number())
  )
#> Joining, by = c("fact_code", "style_", "buyer")
df
#>   fact_code        style_            item buyer style_serial
#> 1      1206           -23 LADIES TANK TOP   652            1
#> 2      1206           -23 LADIES TANK TOP   652            1
#> 3      1206           -23 LADIES TANK TOP   652            1
#> 4      1214 593935_592435      SS T-SHIRT   254            2
#> 5      1214 593935_592435      SS T-SHIRT   254            2
#> 6      1214 593935_592435      SS T-SHIRT   254            2
#> 7      7022       1572472         T-SHIRT   338            3
#> 8      7022       1572472         T-SHIRT   338            3
#> 9      7022       1572472         T-SHIRT   338            3

Created on 2019-02-06 by the reprex package (v0.2.1)
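
The lookup-table idea also seems to cover the random variant from the question: draw the serials once, on the table that has one row per group, then join. The following is a minimal sketch under assumptions, not part of the original answer. It uses a shortened, hypothetical copy of the data, distinct() in place of group_by() plus summarise(), and the question's upper bound of 6000.

```r
library(dplyr)

# Shortened, hypothetical copy of the question's data
df <- data.frame(
  fact_code = c(1206L, 1206L, 1214L, 1214L, 7022L),
  style_    = c("-23", "-23", "593935_592435", "593935_592435", "1572472"),
  buyer     = c(652L, 652L, 254L, 254L, 338L)
)

set.seed(1)  # only to make the draw reproducible

# One row per group; sample() without replacement gives distinct serials
lookup <- df %>%
  distinct(fact_code, style_, buyer) %>%
  mutate(style_serial = sample(6000, n()))

# Every row of a group receives that group's serial
out <- left_join(df, lookup, by = c("fact_code", "style_", "buyer"))
```

Because sample() draws without replacement by default, the serials are unique across groups as long as there are at most 6000 groups.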

0 comments