Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace autop() (Trac #1479) #1479

Closed
elgg-gitbot opened this issue Feb 16, 2013 · 37 comments
Closed

Replace autop() (Trac #1479) #1479

elgg-gitbot opened this issue Feb 16, 2013 · 37 comments

Comments

@elgg-gitbot
Copy link

Original ticket http://trac.elgg.org/ticket/1479 on 40089705-04-01 by brettp, assigned to brettp.

Elgg version: 1.6

autop() needs to be replaced with an in-house function.

@elgg-gitbot
Copy link
Author

brettp wrote on 40766214-04-24

Upgrading this because of license concerns.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 40824331-09-17

Isn't WordPress GPLed? Assuming this does need to be created, I think the first step would be a set of unit tests. Does this mean we can't borrow wptexturize either?

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 40824338-03-05

[http://svn.automattic.com/wordpress-tests/wp-testcase/test_includes_formatting.php Testcases!](near bottom)

@elgg-gitbot
Copy link
Author

brettp wrote on 40825765-03-30

Elgg is dual licensed GPL2 and MIT. We can't use any GPL2-only code. Thanks for the test cases, though!

@elgg-gitbot
Copy link
Author

trac user hellekin wrote on 41024129-09-17

autop() uses regular expressions to parse HTML. That means it's bound to fail against some misformatted, malicious or simply too complex code. I added some tests for the function, one showing how it breaks on improperly nested tags.

Elgg bundles HTMLawed (~ 428KB) but it seems less efficient than HTML Purifier (~1.1MB.) One or the other should provide a saner way to handle this functionality. I didn't review the code fully. but HTML Purifier seems to integrate a proper parser that will recover from badly written HTML and output standard XHTML.

@elgg-gitbot
Copy link
Author

brettp wrote on 41157837-09-30

Hate to do this, but I'm pushing this to 1.8.1. I've had a few goes at this with regex and agree with hellekin that it's probably best to use a true HTML parser.

This means the MIT version of 1.8 will ship without autop().

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.1 by brettp on 41157837-09-30

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 41158117-05-03

This may go without saying: elgg_autop's primary concern should remain paragraph-wrapping and leave markup sanitization to separate functions. We want plugin authors to know they always must use the HTML sanitizer (whichever chosen) even if some formatting functions appear to "fix" some HTML.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 41161652-11-17

I've created a more http://code.google.com/p/mrclay/source/browse/trunk/php/MrClay/AutoP.php resilient autop implementation that Elgg is free to use. It avoids the dangerous regex operations by doing most work in a DOMDocument object.

This is undoubtedly slower than wpautop (and may choke on extremely invalid HTML), but happily handles markup that the current wpautop breaks horribly. Since there's no real spec for wpautop, it's hard to duplicate the style of markup it outputs when it does work, but we can make more rigorous test cases.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 41164318-08-31

I've made several improvements and new testcases. I think my next step is moving (old) autop to a class so we can test output/performance side-by-side. No matter the implementation, the trick is going to finding the right balance of speed/safety.

@elgg-gitbot
Copy link
Author

brettp wrote on 41164337-03-04

Thanks for all the research and patches for this! I had looked briefly into a DomDocument approach. Since it's enabled by default I think it's a reasonable requirement. Other thoughts?

As this is an output function speed is a larger concern than if it were an sanitation function. autop() was very regex heavy, using some pretty complicated ones. Have you done any profiling to see if there's a realistic performance difference between the two?

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 41165179-10-10

Replying to brettp:

As this is an output function speed is a larger concern

Absolutely.

Have you done any profiling to see if there's a realistic performance difference between the two?

I'm considering tapping into our site's autop function, storing all inputs for 24 hours, then converting the content characters to gibberish for privacy's sake. I'll probably do the same on our WordPress site. I think it's the only way to get a realistic set of inputs.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 41167134-11-04

Encouragingly, on a [http://code.google.com/p/mrclay/source/browse/trunk/tests/php/MrClay/AutoP/typical-post.in.html "typical" post] test case I made, the AutoP class is a very tiny bit faster and uses less memory than this [http://code.google.com/p/mrclay/source/browse/trunk/php/MrClay/AutoP/WordPress.php wpautop class (ported from WP trunk)]. On less typical material (heavily nested elements, corner test cases) it takes about twice as long, but a) doesn't bork the markup, and b) generally adds only a few thousands of a second.

I think with some optimization work (using XPath to descend more = less node walking, less recursion) and more rigorous spec definition, this could come out a clear winner overall.

@elgg-gitbot
Copy link
Author

cash wrote on 41497448-08-02

This has been sitting for 4 months. Are holding back mainly because of performance concerns?

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 41498284-11-30

Replying to cash:

holding back mainly because of performance concerns?

Yeah, I'm sorry. I think the last realization I had was that I'd have to parse the entire fragment twice to get better compat, and I'm a bit concerned that DOMDocument is an "unproven" solution for this kind of thing.

But generally my frustration with it was lack of decent spec of which malformations frequently caused by wpautop() should be considered "features", and basically what the hell the replacement should exactly do.

@elgg-gitbot
Copy link
Author

cash wrote on 41499002-11-21

Ok, we need to come up with a set of requirements, test cases (which we already have a good start on), and performance testing. Obviously, saying it should work like wpautop is not a good requirement.

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.9 by cash on 41499002-11-21

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 41499361-01-24

I've got a PHPUnit [http://code.google.com/p/mrclay/source/browse/trunk/tests/php/MrClay/AutoPTest.php test case], that runs through [http://code.google.com/p/mrclay/source/browse/trunk/tests/php/MrClay/AutoP/ these tests](*.in.html vs *.exp.html).

Also there's a script to test arbitrary fragments directly, and scripts to compare speed and mem-usage w/ wpautop along the test cases.

Right now you'd have to check out my svn trunk (and have PHPUnit setup) to run these, but I can certainly move them to github. Where should this work live?

@elgg-gitbot
Copy link
Author

brettp wrote on 41502129-09-17

Github is our preferred place now because it's so easy for co-development and code review, so if you're willing to migrate to there that'd help a lot!

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.9.0 by cash on 41784041-08-31

@elgg-gitbot
Copy link
Author

trac user boroda wrote on 42224879-11-01

I tested this class [http://code.google.com/p/mrclay/source/browse/trunk/php/MrClay/AutoP.php] at my local environment and works good for me and placed it to path "/engine/classes/Autop.php". Can you verified this class that will substitute function autop()?

@elgg-gitbot
Copy link
Author

trac user boroda wrote on 42224882-01-20

Patch attached.

@elgg-gitbot
Copy link
Author

Attachment added by trac user boroda on 42224883-08-21: output.php.patch

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42225080-12-18

boroda the problem is it's not functionally equivalent to autop(); it doesn't write invalid markup in the same way, very likely doesn't handle invalid markup in the same way, and undoubtedly uses more resources. I wrote this [http://www.mrclay.org/wpautop-functional-specification/ spec for what I think autop should do], but I'm not sure where to go. autop is a very crude hammer that breaks input in just the right way, but no one knows exactly what it does in every situation.

@elgg-gitbot
Copy link
Author

trac user boroda wrote on 42241027-08-06

Replying to mrclay:

boroda the problem is it's not functionally equivalent to autop(); it doesn't write invalid markup in the same way, very likely doesn't handle invalid markup in the same way, and undoubtedly uses more resources. I wrote this [http://www.mrclay.org/wpautop-functional-specification/ spec for what I think autop should do], but I'm not sure where to go. autop is a very crude hammer that breaks input in just the right way, but no one knows exactly what it does in every situation.

But I don't have a choice. Because I need substitute autop() to MIT license function with the same equivalent and when I tested your solution it works for me. For now it's the best solution.

@elgg-gitbot
Copy link
Author

brettp wrote on 42277496-06-25

boroda - The only MIT release is 1.7.4. You will not have an MIT release by replacing the autop() function in GPL releases. The releases need to come from elgg.org to have the MIT license.

@elgg-gitbot
Copy link
Author

trac user boroda wrote on 42277542-07-12

Replying to brettp:

boroda - The only MIT release is 1.7.4. You will not have an MIT release by replacing the autop() function in GPL releases. The releases need to come from elgg.org to have the MIT license.

As I understand next MIT release will be 1.9.0. Am I right?

@elgg-gitbot
Copy link
Author

brettp wrote on 42277594-09-08

Nothing is set in stone for the next MIT release. It all depends on how much time we have. Patches like what you're doing will help get it done faster, though.

@elgg-gitbot
Copy link
Author

trac user boroda wrote on 42277608-09-10

Replying to brettp:

Nothing is set in stone for the next MIT release. It all depends on how much time we have. Patches like what you're doing will help get it done faster, though.
I understand, thanks for your answer.

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42600452-02-07

I'll move forward and make a 1.9 PR for my solution and work the [http://code.google.com/p/mrclay/source/browse/trunk/tests/php/MrClay/AutoP/ tests I already have] into units. It works, and with more valid results than wpautop, I'm just nervous about the perf cost of DomDocument for all IO.

@elgg-gitbot
Copy link
Author

trac user srokap wrote on 42845626-08-25

I've put Steve's solution into pull request #420

@elgg-gitbot
Copy link
Author

trac user mrclay wrote on 42846334-08-26

Can't wait to test this. Thanks

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.x by trac user mrclay on 42846334-08-26

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.10 by brettp on 42929035-07-29

@elgg-gitbot
Copy link
Author

Milestone changed to Elgg 1.8.12 by brettp on 42934480-07-19

@elgg-gitbot
Copy link
Author

trac user Brett Profitt <brett.profitt@... wrote on 42948155-11-20

In [changeset:"43a395ae735777bfb5474c4f6a37dc1cd0818a37/elgg"]:

#!CommitTicketReference repository="elgg" revision="43a395ae735777bfb5474c4f6a37dc1cd0818a37"
Fixes #1479. Added ElggAutoP. Removing [\n\r] from test strings before compare to deal with differing whitespace between tags among PHP versions.

@elgg-gitbot
Copy link
Author

trac user srokap wrote on 42948156-03-19

In [changeset:"3bf72994688ad9292bf37444d80ab5ab1a002748/elgg"]:

#!CommitTicketReference repository="elgg" revision="3bf72994688ad9292bf37444d80ab5ab1a002748"
Fixes #1479 - Replaces WP autop with implementation from Steve Clay.

brettp added a commit to brettp/Elgg that referenced this issue Feb 21, 2013
…efore compare to deal with differing whitespace between tags among PHP versions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

1 participant