2
votes

How to replace all HTML tags with an empty string in Golang

I'm trying to replace all HTML tags such as <div> </div> ... with an empty string ("") in Golang, using the regex pattern ^[^.\/]*$/g to match all closing tags, e.g. </div>

My solution:

package main

import (
    "fmt"
    "regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
    r := regexp.MustCompile(Template)
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := r.ReplaceAllString(s, "")
    fmt.Println(res)
}

But it prints the same source string. What's wrong? Please help. Thanks.

Expected result: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"

3 comments

What do you think this pattern matches?

Your regex makes no sense and matches zero times, so nothing is replaced. Using a regular expression to match HTML tags is a bad idea anyway.

@AdamSmith sorry, this pattern is meant to match any string that is not an HTML closing tag. regex101.com/r/Qvg9cx/5 .
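
The point made in the comments can be checked directly. Here is a minimal sketch, assuming Go's standard regexp package (RE2 syntax). The trailing /g is JavaScript-style delimiter and flag notation; Go treats it as two literal characters that would have to appear after the end of the text ($), so the pattern can never match. The helper name countMatches and the pattern </?[a-zA-Z][^>]*> are illustrative assumptions, not part of the question, and that pattern only handles simple, well-formed tags.

```go
package main

import (
    "fmt"
    "regexp"
)

// countMatches is a hypothetical helper: it reports how many
// non-overlapping matches of pattern occur in s.
func countMatches(pattern, s string) int {
    return len(regexp.MustCompile(pattern).FindAllStringIndex(s, -1))
}

func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    // Pattern from the question: "/g" is literal text required after
    // the end of the input, so there is nothing to replace.
    fmt.Println(countMatches(`^[^.\/]*$/g`, s)) // 0

    // Without "/g" the anchors demand that the whole string contain
    // no '.' or '/', which fails because of the '/' in "</div>".
    fmt.Println(countMatches(`^[^.\/]*$`, s)) // 0

    // An assumed pattern for simple opening and closing tags.
    fmt.Println(countMatches(`</?[a-zA-Z][^>]*>`, s)) // 2
}
```

In Go, replacing every match is already the behavior of ReplaceAllString, so no global flag is needed.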

3 Answers:

5
votes

If you want to remove all HTML tags, use a library that strips HTML tags.

Using a regex to match HTML tags is not a good idea.

package main

import (
    "fmt"

    // The package is named strip, which differs from the import path, so alias it explicitly.
    strip "github.com/grokify/html-strip-tags-go"
)

func main() {
    text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    stripped := strip.StripTags(text)

    fmt.Println(text)
    fmt.Println(stripped)
}
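
As a small illustration of why a regex is fragile here, the sketch below uses only the standard library and a naive pattern, <.*?>, on a tag whose attribute value contains a > character. The function name naiveStrip is hypothetical and exists only for this example.

```go
package main

import (
    "fmt"
    "regexp"
)

// naiveTag matches from a '<' to the nearest following '>'.
// It knows nothing about quoted attribute values.
var naiveTag = regexp.MustCompile(`<.*?>`)

// naiveStrip is a hypothetical helper that removes every naiveTag match.
func naiveStrip(s string) string {
    return naiveTag.ReplaceAllString(s, "")
}

func main() {
    // The '>' inside the title attribute ends the match too early,
    // so part of the attribute leaks into the result.
    fmt.Printf("%q\n", naiveStrip(`<a title="1 > 0">link</a>`)) // " 0\">link"
}
```

A tokenizer-based library avoids this class of problem because it tracks quoting while scanning the tag.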

0 comments

2
votes

For those who came here looking for a quick solution, there is a library that does this: bluemonday.

The bluemonday package provides a way of describing a whitelist of HTML elements and attributes as a policy, and for that policy to be applied to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist will be stripped.

package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // Do this once for each unique policy, and use the policy for the life of the program
    // Policy creation/editing is not safe to use in multiple goroutines
    p := bluemonday.StripTagsPolicy()

    // The policy can then be used to sanitize lots of input and it is safe to use the policy in multiple goroutines
    html := p.Sanitize(
        `<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
    )

    // Output:
    // Google
    fmt.Println(html)
}

https://play.golang.org/p/jYARzNwPToZ

0 comments

1
votes

The problem with RegEx

This is a very simple RegEx replace method that removes HTML tags from well-formed HTML in a string.

strip_html_regex.go

package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
    r := regexp.MustCompile(regex)
    return r.ReplaceAllString(s, "")
}

Note: this does not work well with malformed HTML. Don't use it.

A better way

Since a string in Go can be treated as a slice of bytes, it is easy to walk through the string and find the parts that are not inside an HTML tag. When we identify a valid part of the string, we can simply take a slice of that part and append it using a strings.Builder.

strip_html.go

package main

import (
    "strings"
    "unicode/utf8"
)

const (
    htmlTagStart = 60 // Unicode `<`
    htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    end := 0    // The index just past the previous end tag character `>`.

    for i, c := range s {
        // Keep going if the character is not `<` or `>`.
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only write text when we are not already in a tag.
            // This makes sure we strip out `<<br>`, not just `<br>`,
            // and that the text before the tag is written only once.
            if !in {
                builder.WriteString(s[end:i])
            }
            in = true
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }

    // Keep any trailing text that follows the last tag. Doing this after
    // the loop also works when the string ends in a multi-byte character.
    if !in {
        builder.WriteString(s[end:])
    }
    return builder.String()
}

If we run these two functions with the OP's text and some malformed HTML, you'll see that the results are not consistent.

main.go

package main

import "fmt"

func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

    res := stripHtmlTags(s)
    fmt.Println(res)

    // Malformed HTML examples
    fmt.Println("\n:: stripHTMLTags ::\n")

    fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
    fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))

    // Regex Malformed HTML examples
    fmt.Println("\n:: stripHtmlRegex ::\n")

    fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
    fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}

Output:

afsdf4534534!@@!!#345345afsdf4534534!@@!!#

:: stripHTMLTags ::

Do something bold.
I broke this
This is broken link.
start this tag

:: stripHtmlRegex ::

Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.

Note: the RegEx method does not remove all HTML tags consistently. To be honest, I'm not good enough at RegEx to write a pattern that handles HTML stripping properly.

Benchmarks

Besides being safer and more aggressive in stripping malformed HTML tags, stripHtmlTags is about 4 times faster than stripHtmlRegex.

> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8          51516             22726 ns/op
BenchmarkStripHtmlTags-8          230678              5135 ns/op
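
Note that stripHtmlRegex compiles its pattern on every call, which adds work to each invocation. A common Go idiom is to compile the pattern once at package level and reuse it. Here is a minimal sketch, assuming the same <.*?> pattern; the names tagPattern and stripHTMLRegexPrecompiled are hypothetical, and this variant was not part of the benchmark run reported in this answer, so no speedup is claimed for it.

```go
package main

import (
    "fmt"
    "regexp"
)

// tagPattern is compiled once, when the package is initialized.
// A *regexp.Regexp is safe for concurrent use by multiple goroutines.
var tagPattern = regexp.MustCompile(`<.*?>`)

// stripHTMLRegexPrecompiled is a hypothetical variant of stripHtmlRegex
// that reuses the precompiled pattern instead of recompiling it per call.
func stripHTMLRegexPrecompiled(s string) string {
    return tagPattern.ReplaceAllString(s, "")
}

func main() {
    fmt.Println(stripHTMLRegexPrecompiled("afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"))
    // Malformed input is still handled inconsistently; only the
    // compilation cost changes, not the matching behavior.
    fmt.Println(stripHTMLRegexPrecompiled("h1>I broke this</h1>"))
}
```

The matching behavior is unchanged, so the caveats about malformed HTML still apply. By default . does not match a newline in Go, so a tag that spans several lines would also be left in place by this pattern.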

0 comments