1
votes

How to split a DataFrame column value on line breaks and create a new column with the last 2 items (lines)

I would like to split a column value on line breaks and create a new column containing the last two items (lines).

df.withColumn('last_2', split(df.s, '\r\n')[-2])

This does not work (it returns no value):

df1 = spark.createDataFrame([
  ["001\r\nLuc  Krier\r\n2363  Ryan Road, Long Lake South Dakota"],
  ["002\r\nJeanny  Thorn\r\n2263 Patton Lane Raleigh North Carolina"],
  ["003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin"],
  ["004\r\nPhilippe  Schauss\r\n1 Im Oberdorf Allemagne"],
 ["005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam"]
]).toDF("s")

pyspark apache-spark-sql

2 comments

Any sample data?

Added a simple sample. The original data contains more lines.

3 Answers:

-1
votes

Perhaps this is useful -

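import org.apache.spark.sql.functions.{col, slice, split}
import spark.implicits._  // needed for .toDF on a Seq

// Triple-quoted Scala strings are raw, so \r\n below is literal text, not a real line break
val sDF = Seq("""001\r\nLuc  Krier\r\n2363  Ryan Road, Long Lake South Dakota""",
  """002\r\nJeanny  Thorn\r\n2263 Patton Lane Raleigh North Carolina""",
  """003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin""",
  """004\r\nPhilippe  Schauss\r\n1 Im Oberdorf Allemagne""",
  """005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam""").toDF("s")

// The regex \\r\\n matches that literal text; slice(..., -2, 2) keeps the last two elements
val processedDF = sDF.withColumn("col1", slice(split(col("s"), """\\r\\n"""), -2, 2))
processedDF.show(false)
processedDF.printSchema()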

    /**
      * +--------------------------------------------------------------------+-------------------------------------------------------------+
      * |s                                                                   |col1                                                         |
      * +--------------------------------------------------------------------+-------------------------------------------------------------+
      * |001\r\nLuc  Krier\r\n2363  Ryan Road, Long Lake South Dakota        |[Luc  Krier, 2363  Ryan Road, Long Lake South Dakota]        |
      * |002\r\nJeanny  Thorn\r\n2263 Patton Lane Raleigh North Carolina     |[Jeanny  Thorn, 2263 Patton Lane Raleigh North Carolina]     |
      * |003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin|[Teddy E Beecher, 2839 Hartland Avenue Fond Du Lac Wisconsin]|
      * |004\r\nPhilippe  Schauss\r\n1 Im Oberdorf Allemagne                 |[Philippe  Schauss, 1 Im Oberdorf Allemagne]                 |
      * |005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam              |[Meindert I Tholen, Hagedoornweg 138 Amsterdam]              |
      * +--------------------------------------------------------------------+-------------------------------------------------------------+
      *
      * root
      * |-- s: string (nullable = true)
      * |-- col1: array (nullable = true)
      * |    |-- element: string (containsNull = true)
      */

1 comment

Downvoter ... please explain your reasons before voting against an answer

1
votes

You can achieve this simply by using the substring_index function, like this:

import pyspark.sql.functions as f
df1.withColumn('last2',f.substring_index('s','\r\n',-2)).drop('s').show(10,False)

+-----------------------------------------------------------+
|last2                                                      |
+-----------------------------------------------------------+
|Luc  Krier
2363  Ryan Road, Long Lake South Dakota        |
|Jeanny  Thorn
2263 Patton Lane Raleigh North Carolina     |
|Teddy E Beecher
2839 Hartland Avenue Fond Du Lac Wisconsin|
|Philippe  Schauss
1 Im Oberdorf Allemagne                 |
|Meindert I Tholen
Hagedoornweg 138 Amsterdam              |
+-----------------------------------------------------------+

I hope this helps.
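
Note that substring_index returns a single string that still contains the line break, not an array, which would explain why each cell in the output spans two lines. As a rough plain-Python model of what a negative count does (an illustrative sketch only, not Spark code; the helper name last_n_segments is made up here):

```python
def last_n_segments(s: str, delim: str, n: int) -> str:
    """Keep everything after the n-th delimiter counted from the right.

    Hypothetical helper that mimics substring_index(s, delim, -n) for n > 0.
    If the string has fewer than n delimiters, it is returned unchanged.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    parts = s.rsplit(delim, n)       # at most n splits, scanning from the right
    return delim.join(parts[-n:])    # re-join the last n pieces with the delimiter


row = "001\r\nLuc  Krier\r\n2363  Ryan Road, Long Lake South Dakota"
last2 = last_n_segments(row, "\r\n", 2)

print(repr(last2))          # one string, line break still inside
print(last2.split("\r\n"))  # split again if a list of the two lines is needed
```

If an array column is needed rather than one string, the result would have to be split again, as in the last line of the sketch.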

0 comments

0
votes

Yes, I ran into the same problem with negative indexing, although positive indexing works. I tried the slice function and it worked fine. Can you try this:

import pyspark.sql.functions as F
df1 = sqlContext.createDataFrame([ ["001\r\nLuc Krier\r\n2363 Ryan Road, Long Lake South Dakota"], ["002\r\nJeanny Thorn\r\n2263 Patton Lane Raleigh North Carolina"], ["003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin"], ["004\r\nPhilippe Schauss\r\n1 Im Oberdorf Allemagne"], ["005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam"] ]).toDF("s")
df_r = df1.withColumn('spl',F.split(F.col('s'),'\r\n'))
# start=-2, length=2 keeps the last two lines (-1, 1 would keep only the last one)
df_res = df_r.withColumn("res",F.slice(F.col("spl"),-2,2))
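
For what it is worth, the difference seems to come from how Spark indexes arrays. As far as I understand, the bracket operator (getItem) is 0-based and does not count from the end, so [-2] yields null under default non-ANSI settings, whereas slice is 1-based and accepts a negative start. The plain-Python model below is only a sketch of those assumed semantics, with made-up helper names; it is not Spark code.

```python
from typing import List, Optional


def get_item(arr: List[str], index: int) -> Optional[str]:
    """Model of arr[index] on a Spark array column (assumed semantics):
    0-based, and a negative or out-of-range index gives None (null)."""
    if 0 <= index < len(arr):
        return arr[index]
    return None


def spark_slice(arr: List[str], start: int, length: int) -> List[str]:
    """Model of slice(arr, start, length) (assumed semantics):
    start is 1-based, or counted from the end when negative."""
    if start == 0:
        raise ValueError("start must not be 0 (indices are 1-based)")
    if length < 0:
        raise ValueError("length must not be negative")
    begin = start - 1 if start > 0 else len(arr) + start
    if begin < 0 or begin >= len(arr):
        return []  # assumption: a start outside the array gives an empty array
    return arr[begin:begin + length]


parts = "001\r\nLuc Krier\r\n2363 Ryan Road, Long Lake South Dakota".split("\r\n")

print(get_item(parts, -2))         # None: no counting from the end
print(spark_slice(parts, -2, 2))   # the last two lines
print(spark_slice(parts, -1, 1))   # only the last line
```

In this model, a start of -2 with length 2 returns the last two lines, while -1 with length 1 returns only the final line.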

0 comments