parse_html adds unwanted tags like <html><head>...<body></html> #583

@qknight

I want to use parse_document to create DOM/vdom patches, but parse_document(...) keeps adding <html> and <body>. Is there an option to fine-tune the error-correction level? I do like that it adds the missing </title> in the example below.

But when creating a virtual-DOM patch for a <div id="here">, it is awkward to have to filter the added html tags out afterwards. My current workaround:

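use std::{cell::Cell, rc::Rc};
use html5ever::{parse_document, tendril::TendrilSink};
use markup5ever_rcdom::{Handle, NodeData, RcDom};

/// Parse a non-escaped HTML string such as "Hello world!" into a node tree (see also raw_html(...)).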
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let dom: RcDom = parse_document(RcDom::default(), Default::default()).one(html);
    if let Some(body) = find_body(&dom.document) {
        let new_document = Rc::new(markup5ever_rcdom::Node {
            data: NodeData::Document,
            parent: Cell::new(None),
            children: body.children.clone(),
        });
        process_handle(&new_document)
    } else {
        Err(ParseError::NoBodyInParsedHtml)
    }
}

// Recursively find the <body> element
fn find_body(handle: &Handle) -> Option<Handle> {
    match &handle.data {
        NodeData::Element { name, .. } if name.local.as_ref() == "body" => Some(handle.clone()),
        _ => {
            for child in handle.children.borrow().iter() {
                if let Some(body) = find_body(child) {
                    return Some(body);
                }
            }
            None
        }
    }
}

However, I also want to parse HTML that contains an explicit <html>...</html> tag, and with this workaround that tag gets removed as well.
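One possible way to tell the two cases apart is to inspect the input string before parsing and only unwrap the <body> children when the input does not look like a complete document. The sketch below is an assumption of mine, not an html5ever feature: looks_like_full_document and its two helpers are hypothetical names, and the check is a heuristic that only looks at the first markup after leading whitespace and comments.

```rust
/// Hypothetical helper (not part of html5ever): guess whether `html` is a
/// complete document rather than a fragment, by checking whether the first
/// markup after leading whitespace and comments is a doctype or an <html> tag.
pub fn looks_like_full_document(html: &str) -> bool {
    let mut rest = html.trim_start();
    // Skip any leading comments of the form <!-- ... -->.
    while let Some(after_open) = rest.strip_prefix("<!--") {
        match after_open.find("-->") {
            Some(end) => rest = after_open[end + 3..].trim_start(),
            None => return false, // unterminated comment: treat as a fragment
        }
    }
    starts_with_ignore_case(rest, "<!doctype") || starts_with_tag(rest, "html")
}

/// ASCII case-insensitive prefix test that works on bytes, so it cannot
/// panic on a multi-byte character boundary.
fn starts_with_ignore_case(s: &str, prefix: &str) -> bool {
    s.len() >= prefix.len()
        && s.as_bytes()[..prefix.len()].eq_ignore_ascii_case(prefix.as_bytes())
}

/// True if `s` starts with an opening tag whose name is exactly `tag`.
fn starts_with_tag(s: &str, tag: &str) -> bool {
    let after_lt = match s.strip_prefix('<') {
        Some(r) => r,
        None => return false,
    };
    if !starts_with_ignore_case(after_lt, tag) {
        return false;
    }
    // The tag name must end here, so that <htmlfoo> does not match.
    match after_lt.as_bytes().get(tag.len()) {
        Some(b) => b.is_ascii_whitespace() || *b == b'>' || *b == b'/',
        None => false,
    }
}
```

In parse_html this could be used to branch: pass dom.document to process_handle unchanged when the check returns true, and fall back to the find_body path otherwise. Input that starts directly with <head> or <body> is still treated as a fragment by this heuristic.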

html-driver.rs test

#[test]
fn from_utf8() {
    let dom = driver::parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .one("<title>Test".as_bytes());
    let mut serialized = Vec::new();
    let document: SerializableHandle = dom.document.clone().into();
    serialize::serialize(&mut serialized, &document, Default::default()).unwrap();
    assert_eq!(
        String::from_utf8(serialized).unwrap().replace(' ', ""),
        "<html><head><title>Test</title></head><body></body></html>"
    );
}

Update:

parse_fragment also adds unwanted HTML.
