Wikipedia:PHP script tech talk


Simply put, isn't this page dead? I can't see what is still pending and what is fixed. Also, the intro says "do not add bugs and feature requests here", yet most of the content is about features and bugs.

Yup, this page is dead. Those curious should direct themselves to the Wikitech-l mailing list or set something up on meta under m:How to become a Wikipedia hacker.


This is the place to discuss bug fixes and planned features on a more "technical" level. (See also the new wikitech-l mailing list.)

Please, do not add bugs and feature requests here: instead, see Wikipedia:PHP script for more details of how to report bugs.


Serious bugs

Things that should be repaired ASAP.


diff won't work

I can't get diffs. It could be the cache, if it is running here already. Somebody please fix this! --Magnus Manske

Yes, it's a cache bug. I checked a fix into CVS last night, but forgot to mention it here. (2002/2/8) --Brion VIBBER

#REDIRECTs that end in an eternal edit conflict

I'm probably stating the obvious here, but this appears to be caused by the page trying to redirect to a different page than has been specified. If this page doesn't exist (as is usually the case) then the page goes into edit mode when you click Save, which gives an edit conflict. If the page does exist, then no edit conflict occurs, but the redirect does not go to the expected place (as for Zundark/Old_Talk, which is redirected to user:Zundark but actually ends up at user:Zundark/Old_Talk). --Zundark, 2002 Feb 3
(2002/02/03 20:43 PST) Fix is in CVS for the problem when the redirected page does exist. (s/$this->$subPageTitle/$this->subPageTitle/) Doesn't seem to have fixed the doesn't-exist problem, I'll look at it some more. --Brion VIBBER



Volunteers wanted

These tasks need volunteers to hack 'em!

Mask minor edits on Recent Changes

Might be fixed by a patch from Brion VIBBER and myself --Magnus Manske

Fix the Recent Changes "(# changes)" counter

Might be fixed by a patch from Brion VIBBER and myself --Magnus Manske

  • I just looked in CVS (that's 2002/2/4 00:02 Amsterdam time) and it seems you still add a variable $addoriginal to the count. But I think that is silly, because you should never count the current page if you are counting the changes. So just remove $addoriginal and the problem is solved. -- Jan Hidders (PS. Wouldn't it be nice if the sign-shortcut ~ ~ ~ would always be replaced with name and time? :-))
Hey, that's a good idea! Especially for the bug-report pages... Brion VIBBER (2002/2/4 15:18 PST)
Yes, unfortunately it's the only thing that made sense in my remark. :-/ What I should have said was the following. The variable $addoriginal should be 0 if the page did not already exist the previous day and current page is a minor edit and the user does not want to see minor edits. -- Jan Hidders (2001/2/5 8:45 GMT+1)

Fixing some parser bugs

Especially the <pre> tags.
I've replaced removeHTMLtags() with behavior more like the old usemod version; ie instead of forbidding a few tags, it allows only a small number. Thus, no <span>, <object> etc. However it still needs to be able to strip out unknown elements/parameters; I can still write naughty things like this. {this is a 'link' that isn't a link, and runs some JavaScript code} (2002/02/04 20:48 PST) --Brion VIBBER
Also, I commented out the line in subParseContents that makes &amp; followed by text that could be an entity into the entity. I suspect it was put in to fix pages that were getting over-escaped during editing, but that bug seems to be gone now and it just makes it hard to write the name of an entity. Ie, "&amp;" should *not* appear as just an ampersand, but an ampersand followed by "amp;". --BV
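
The whitelist approach described above (allow a small set of tags, and strip unknown elements and parameters) can be sketched as follows. This is only an illustration in Python, not the script's removeHTMLtags(); the tag list and the function name strip_unsafe_html are assumptions made up for the example.

```python
import re
from html import escape

# Hypothetical whitelist; the actual list of allowed tags is not given above.
ALLOWED_TAGS = {"b", "i", "p", "br", "pre"}

# Matches an opening or closing tag: optional slash, tag name, then any attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def strip_unsafe_html(text):
    """Keep whitelisted tags with all attributes dropped; escape any other tag."""

    def replace(match):
        closing, name = match.group(1), match.group(2).lower()
        if name in ALLOWED_TAGS:
            # Re-emit the bare tag, so onclick= and similar parameters disappear.
            return "<" + closing + name + ">"
        # Unknown element: show it as literal text instead of markup.
        return escape(match.group(0), quote=False)

    return TAG_RE.sub(replace, text)
```

Re-emitting the bare tag rather than trying to filter individual parameters is the simplest way to guarantee that no script-carrying attribute survives; a real version would probably want to let a few harmless parameters through.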

Brainstorming

Ideas for solutions needed here.

Speeding up the PHP script

  • Taking apart "specialPages.php"

This file is getting quite large, resulting in high compilation times. I suggest two steps:

  1. make pages like "special_userlogout.php" for each function ("userlogout", in this case)
  2. after that, change the include statement so it only includes the needed function. This, in turn, can include other shared functions

I started doing this now. --Magnus Manske

  • Caching of pages for reading only (Jimbo's idea).
    • Could be tricky. Would have to adapt to viewing preferences and newly created pages (red/blue links).
It may be possible to cache a 'common' almost-final version, which can then have a regexp run over it to set the link color and paragraph justification, and then inserted into the header/footer; this would at least save parsing the wiki page every time. Still need to deal with new pages though... Simplest way might be to run the "which pages link to this" check on a newly created page and expire the cached versions of anything that does. --Brion VIBBER
A lot of the regexp work could be skipped by using a quick preprocessor (ideally one that slaps in the header/footer without even looking at the text) and/or CSS. --Uriyan
Yup. CSS won't handle the difference between red links and [classic links]? for new pages, though. I recommend we change or eliminate one or the other. --Brion VIBBER

I take that back, CSS should do fine there. How does this sound:

   This is a <span class="newlinkedge">[</span><a href="foo" class="newlink">new
   link</a><span class="newlinkedge">]<a href="foo">?</a></span>.

where we define either:

   a.newlink { color: red; }
   .newlinkedge { display: none; }

or

   a.newlink { color: black; text-decoration: none; }
   .newlinkedge { }

in the style sheet? The text portion will still be clickable in the old-style case, though that could probably be "fixed" if desired. --Brion VIBBER

(2002/02/03 15:05 PST) I've changed the CVS version to use style sheets for the link colors, paragraph justification, and text/background color. (Try it at my test server, if you can find your way around the partially Esperanto-localized interface.) Keeps down the number of things that need to be changed if somebody wants to change the styles further, and should make the HTML-ized page guts cacheable.

Caching proposal

  1. Create a new field in the cur table named cur_cache (MEDIUMTEXT), empty by default
  2. When a page is saved after edit, the cache is cleared
  3. When a page is viewed,
    1. and the cache has been used X times, it is cleared (enforced up-to-date)
    2. and the cache entry contains text, the cache is adapted to current user settings and displayed
    3. and the cache is empty, the text is rendered, displayed and stored in the cache field
  4. When a new page is created, the cache of all pages that link to the new page is cleared
  5. Pages containing {{...}} variables are not cached

--Magnus Manske
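
For anyone who wants to see the proposal's steps 2 to 4 in one place, here is a rough sketch in Python rather than PHP. Everything in it is an assumption for illustration: the Page and Wiki classes, the render callback, and the value of MAX_CACHE_USES (the "X" in step 3.1). Adapting the cached text to user settings (step 3.2) and the exclusion in step 5 are left out.

```python
from dataclasses import dataclass, field

MAX_CACHE_USES = 3  # the "X" of step 3.1; this value is arbitrary


@dataclass
class Page:
    text: str
    cache: str = ""  # stands in for the proposed cur_cache field
    cache_uses: int = 0
    links_to: set = field(default_factory=set)


class Wiki:
    def __init__(self, render):
        self.pages = {}
        self.render = render  # hypothetical function: wikitext -> HTML

    def save(self, title, text, links_to=()):
        is_new = title not in self.pages
        page = self.pages.setdefault(title, Page(text))
        page.text = text
        page.links_to = set(links_to)
        page.cache, page.cache_uses = "", 0  # step 2: clear on save
        if is_new:
            # Step 4: pages linking to the new page must redraw their links.
            for other in self.pages.values():
                if title in other.links_to:
                    other.cache, other.cache_uses = "", 0

    def view(self, title):
        page = self.pages[title]
        if page.cache_uses >= MAX_CACHE_USES:  # step 3.1: enforced refresh
            page.cache, page.cache_uses = "", 0
        if not page.cache:  # step 3.3: render and store
            page.cache = self.render(page.text)
        else:  # step 3.2: serve from the cache
            page.cache_uses += 1
        return page.cache
```

Note that in this sketch the cache is filled on the first view after a save, not during the save itself.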

Why not simply update the cache field every time the page is edited? You have to parse the page then anyway because it is presented after the edit. -- Jan Hidders PS. I know I'm getting annoying but can I say again that we first should measure which pages are eating up the server CPU time? Otherwise the implementation of caching might be a waste of time and effort that unnecessarily complicates the code.
On viewing an uncached page, the content is rendered anyway. The result can be slightly altered and stored as the cache. Generating the cache upon saving means it will have to be rendered especially for that purpose, thus wasting resources. Also, when the cache is flushed, the page won't be cached again until after the next edit.
That said, we should of course check the special pages and improve their speed. I already (kinda) cached the Most Wanted. The other candidate is Orphans, but I don't know how to cache that; anyway, with de-orphanising progressing, the orphans list will get shorter, and the popularity of that page might drop.
Also, the Main Page has to run a database request each time it is viewed, to keep an up-to-date article count.
Editing this page, connected with 10MBit to the Internet, wikipedia performs quite well; I know this will change once the US goes online ;) --Magnus Manske
I must be misunderstanding something. After submitting I am always redirected to the page I just edited, right? So are you now suggesting that you would allow that I would then see an older cached version without the changes I just made? Isn't that a bit confusing for the writer? -- Jan Hidders
I suspect the sane behavior would be to clear the cache field when a page is saved (and the cache fields of any pages that link to it, if it's a new page). Then, when the page is loaded up again (for the edited page, that would be immediately), the empty cache is noticed and the page is rerendered and stored.
Yes, that's what I meant. Sorry for being unclear. I added a line to the proposal. --Magnus Manske
But why then do you need the rule for "enforce up-to-date"? -- Jan Hidders
Just a safety mechanism to ensure every page is updated once in a while. Might not be necessary, but it won't do much harm if set to a high value. --Magnus Manske
As far as the mainpage... how many pages actually use those {{blahblah}} things? Should we not cache pages that contain them? -- Brion VIBBER
No other pages have these, AFAIK. We could count the number of "{{" occurrences before and after variable replacement, and if it is unchanged, the page can be cached, otherwise not. --Magnus Manske
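
The "{{" counting check could look roughly like this. It is a Python sketch, not the script's code; the variable name ARTICLECOUNT and both function names are made up for the example.

```python
def substitute_variables(text, variables):
    """Replace each {{NAME}} with its current value."""
    for name, value in variables.items():
        text = text.replace("{{" + name + "}}", str(value))
    return text


def is_cacheable(text, variables):
    """A page is cacheable if variable replacement leaves the "{{" count unchanged."""
    before = text.count("{{")
    after = substitute_variables(text, variables).count("{{")
    return before == after


# Hypothetical variable name, for illustration only.
variables = {"ARTICLECOUNT": 20000}
print(is_cacheable("We have {{ARTICLECOUNT}} articles.", variables))  # False
print(is_cacheable("Just some text with {{ in it.", variables))       # True
```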

What eats server time

I suspect the RecentChanges viewings eat tremendous resources and are the bottleneck. The web server and the PHP script are usually snappy, because previewing, which doesn't need db access, is fast. Everybody working on the site calls RecentChanges every couple of minutes. The server sped up noticeably when Jimbo changed the default from 250 down to 50 pages. It seems that to create RecentChanges, we search through the whole cur table and then sort and present the latest n changes, correct? This means as the database grows, it will only get slower. How about this suggestion: add a new table recent_edits, which only stores information about the recent edits (page title, user name, comment, timestamp), so that we only need to dump out the first n entries from that table for every RecentChanges view (hopefully without need to sort them)? Maybe recent_edits doesn't even have to be a mysql table, just a list somehow that we prune down every once in a while. Or maybe even keep a ready made RecentChanges HTML page always up to date and serve it statically. 2/7/2002 AxelBoldt

I think this is a good idea; after all, the list of recent changes doesn't exactly change completely every time it's loaded; new things pop up on the top, and old things drop out of the range of interest on the bottom or, occasionally, in the middle. I'll try implementing this tonight... Brion VIBBER
On second thought, is this really necessary? Can't we index the table by cur_timestamp as well as by title/id? The database could then easily ignore all entries that had not been modified up to a certain point. I'm not really a MySQL guru though, I don't know how to implement this (or if there's some good reason why it can't be done). Caching the default display of RecentChanges is easy enough though, and ought to save a few cycles -- there are probably more views than there are changes. Brion VIBBER
Implemented caching for RecentChanges default settings. On my small test database (457 pages) I see a roughly 100% speedup on loading special:Recentchanges. (2002-02-08 00:33) Brion VIBBER
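
As a concrete picture of the recent_edits suggestion above, here is a small sketch using Python's built-in sqlite3 in place of MySQL. The column set follows the suggestion (page title, user name, comment, timestamp); the id column, the function names, and the choice to prune on every insert instead of "every once in a while" are assumptions made to keep the example short.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE recent_edits ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " title TEXT, user TEXT, comment TEXT, timestamp TEXT)"
    )
    return db


def record_edit(db, title, user, comment, timestamp, keep=500):
    """Append one edit, then drop everything older than the newest `keep` rows."""
    db.execute(
        "INSERT INTO recent_edits (title, user, comment, timestamp)"
        " VALUES (?, ?, ?, ?)",
        (title, user, comment, timestamp),
    )
    db.execute(
        "DELETE FROM recent_edits"
        " WHERE id <= (SELECT MAX(id) FROM recent_edits) - ?",
        (keep,),
    )


def recent_changes(db, n=50):
    """Newest n edits; ids grow with insertion order, so no scan of page data."""
    return db.execute(
        "SELECT title, user, comment, timestamp FROM recent_edits"
        " ORDER BY id DESC LIMIT ?",
        (n,),
    ).fetchall()
```

Because the table only ever holds a bounded number of small rows, the cost of a RecentChanges view no longer grows with the size of the article table.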

(2002/02/07 22:29 PST) I've added a 5-minute minimum wait between refreshes of WantedPages (see revision 1.2 of special_wantedpages.php and 1.14 of wikiTextEn.php). Good idea? Bad idea? Shouldn't affect legitimate users, but makes it slightly more difficult for malicious or accidental over-refreshing to overwhelm the server. --Brion VIBBER
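
One way to picture a minimum wait between refreshes is the sketch below. It is Python, not the code in special_wantedpages.php, and it assumes that a request arriving inside the wait period is served the previously built copy; the class name and the build callback are hypothetical.

```python
import time

MIN_INTERVAL = 5 * 60  # seconds; the 5-minute minimum wait


class ThrottledPage:
    def __init__(self, build, clock=time.time):
        self.build = build  # hypothetical expensive function producing the page
        self.clock = clock  # injectable so the behavior can be tested
        self.last_built = None
        self.cached = None

    def get(self):
        now = self.clock()
        if self.last_built is None or now - self.last_built >= MIN_INTERVAL:
            # Enough time has passed (or first request): rebuild the page.
            self.cached = self.build()
            self.last_built = now
        # Otherwise fall through and serve the stored copy.
        return self.cached
```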


How about decentralizing the Wiki? If it was relatively easy to get different Wiki servers to talk to each other and do some automatic linking then it should speed things up significantly. I find the Wikipedia a great idea and I am prepared to provide some serverspace (because of my interests preferably the medical bit). --Mathis



  • Database operations & efficiency
    • I notice there are a lot of mysql_connect()/mysql_close() pairs in the code; depending on the page loaded, the database can be opened and closed from 5 to 12 times. This seems excessive to me... mysql_connect() doesn't open a new connection if one is left open, and the connection is automatically closed when the script finishes. Surely the overhead of opening and closing multiple connections is worse than the overhead of having one connection open for the whole fraction of a second that it takes to run the several database operations needed by each page? Taking them out doesn't seem to affect performance significantly on my test machine, but I have a small database and I'm the only one using it. (Indeed, it may even make sense to use persistent connections.) --Brion VIBBER (2002/02/06 02:37 PST)
      • I can't access the CVS from here. Why don't you comment out the mysql_close() lines, and see what happens when Jimbo actually updates the running version (in a month or so;). If that works out, we could try the persistent connections, if not, we just remove the # again. --Magnus Manske
        • Okay, done. Guess we'll see what happens... mwoooh haa haa haa... Also, turns out I missed a lot of them in my initial count; revise that to "the database can be opened and closed from 10 to 21 times" per page. Brion VIBBER

We definitely should use persistent database connections (mysql_pconnect instead of mysql_connect). There's no point in constantly transmitting the same passwords and usernames. mysql_pconnect is a faster drop-in replacement for mysql_connect. It only speeds things up if php is running as an apache module, and I assume that's the case. 8/2/2002 AxelBoldt
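
The general idea in this thread (open the connection once and hand the same one to every caller, instead of a connect/close pair around each query) can be shown with a few lines of Python. This is only an analogy using sqlite3; it is not mysql_pconnect, and get_connection is a name made up for the sketch.

```python
import sqlite3

_connection = None  # shared by every caller within this process


def get_connection(path=":memory:"):
    """Return the already-open connection, opening it only on first use."""
    global _connection
    if _connection is None:
        _connection = sqlite3.connect(path)
    return _connection


# Two lookups in the same request reuse one connection.
first = get_connection()
second = get_connection()
print(first is second)  # True
```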


  • Browser-specific page layout
I notice though, that the tables in the page layout are setting their border properties based on whether the user agent is Internet Explorer. This explains why the tables have thin black borders in Internet Explorer and no borders in Mozilla... Magnus, is there any reason for this? I'd prefer to replace the table with some CSS markup in any case. --Brion VIBBER
The reason is that I like the thin black lines in IE, but other browsers don't support that, they draw all lines black, which looks ugly (try it!). If you know how to change it, go right ahead :) --Magnus Manske
Done, checked in. Looks ever so slightly different in IE and Mozilla, but approximately the same as the previous behavior in IE. Also looks okay in Konqueror 2.2.1, ugly but visible in Opera 6 (some beta version I have), but still doesn't show in Netscape 4.x. Brion VIBBER 2002/02/04 15:21 PST
  • Optimize slow code parts (where? why?)
    • Do we actually know what the slow parts are? My gut feeling is that the Recent Changes page is the slowest, but it would be nice if we could do some actual measurements on a server that is serving only one client but has a large database. (Does somebody have a big SQL dump?) Anyway, presuming that it is, I looked at the code and I think it can be made much more efficient by combining the two SQL queries into one. Right now it computes a JOIN and a GROUP BY in PHP which can be done far more efficiently by the database. However, it should then be possible to do a GROUP BY on the day, which is now hidden in the time stamp. So we would split this column into a day column and a time column. Do I have your permission to attempt this? (But first I would like to know if it is worth it, i.e., if the Recent Changes page plays a major part in the slowing down. I thought Magnus said something about a memory leak in Apache, so perhaps we should try to find that first.) -- Jan Hidders
      • My (old) suggestion for Recent Changes optimization was to make it a separate table containing the last 5000+ changes. (The table would store only the RC-related data, not the actual page contents. It could be simply added to by the edit function, and trimmed in daily/weekly maintenance.) This table would eliminate the need for each RC page to search the *entire* DB looking for the most-recently-updated pages. --Clifford Adams
        • It doesn't, it uses the indexes to do that. That's the whole point of using a database; they are usually very clever at these things. :-) I would like to add the remark that letting the database do the joins for you does often lead to a performance improvement of orders of magnitude. If I'm right this could be a major boost. -- Jan Hidders (2001/2/5 8:48 GMT+1)
          • But right now, as far as I can see, we don't index on cur_timestamp and we actually do search through the whole database for every RecentChanges request. That needs to be fixed asap. I agree that the RecentChanges code could be written a lot more elegantly using SQL joins. 2/8/2002 AxelBoldt
  • Eliminate the access count function ("This page has been accessed 6 times"). I have a feeling that the huge number of writes to the database may be killing performance. A much less expensive way to do this would be to process the Apache log files daily or hourly with a separate script. --Clifford Adams
    • I think it'd be cool to make a new statistics module that logs all accesses in the db. I believe this can be done as an Apache module: instead of sending hits to the log, you send 'em to a db and use that for the hit counter. To keep db access sane, using INSERT DELAYED would be a good compromise between realtime stats and efficiency. If you added database replication, it'd be one cool, scalable beast.
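
The log-processing alternative for the access counter might look something like this when run as a separate daily or hourly script. It is a Python sketch that assumes the server writes Common Log Format lines; the sample paths and addresses are invented, and count_hits is a hypothetical name.

```python
import re
from collections import Counter

# Request and status fields of a Common Log Format line,
# e.g. "GET /wiki/Foo HTTP/1.0" 200
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+" (\d{3})')


def count_hits(log_lines):
    """Count successful GET requests per path."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "200":
            hits[match.group(1)] += 1
    return hits


# Invented sample lines, for illustration only.
sample = [
    '192.0.2.1 - - [08/Feb/2002:00:33:00 -0800] "GET /wiki/Foo HTTP/1.0" 200 1234',
    '192.0.2.2 - - [08/Feb/2002:00:34:00 -0800] "GET /wiki/Foo HTTP/1.0" 200 1234',
    '192.0.2.2 - - [08/Feb/2002:00:35:00 -0800] "GET /wiki/Bar HTTP/1.0" 404 0',
]
print(count_hits(sample)["/wiki/Foo"])  # 2
```

The totals could then be written to the database in one batch per run, replacing one write per page view.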

Resolved issues

These could probably be deleted or moved to a separate "what were we thinking? / what did we do when this broke before?" page.

Pages with "wiki.phtml" subpages

  • Does anyone have an idea why this is? I couldn't reproduce it with my local copy. --Magnus Manske
I haven't seen any of these lately. I *think* it was fixed by using an absolute path for $THESCRIPT. --Brion VIBBER

130.94.122.xxx bug

  • This is serious, because of its potential for masking vandalism.
Possible fix submitted; I suspect that there's some kind of proxying going on at the server end. But I could be totally wrong. --Brion VIBBER
We'll see, it is in the mail now... --Magnus Manske
Fixed as of Feb 4 02.


Change password does not work

For more details, see Wikipedia:Bug Reports. Until this is fixed any users who change their password cannot log in (like me). --User:Chuck Smith

Strange. It works fine on my local copy. Tried to log in with your old password? --Magnus Manske
Apparently fixed sometime before 2002/2/8. Brion VIBBER