WikiDiff

I have put up a service that shows the changes made to the wiki in the last 24 hours. It runs every day at 10:15 am CET. You can look at it at http://pascal.scheffers.net/wikidiff/ but I'd like to put it right here in the wiki if that is possible. For that, the wikit would need some extra formatting rules to make use of the different colours.

Source of the wikidiff software is at http://pascal.scheffers.net/wikidiff/wikidiff.tcl.txt , and you'll also need: [L1 ], [L2 ], [L3 ].

Right now it just parses the previous day's CVS diffs and writes the differences as HTML to stdout, which becomes the page. The page uses a slightly extended wiki.css, so it looks the same as the wiki! Comments please.

-- PS

Brian Theado - Very nice! I especially like the all diffs in one page approach. I encountered changes that interest me that never would have drawn me in if I had just seen the page title on the recent changes page.

I wrote some functionality to display diffs for an old version of wikit. The date at the bottom of each page is a hyperlink that when clicked shows the most recent change of the page. This functionality can be seen at http://tkoutline.sourceforge.net . Now, this old version of wikit stores changes within the wikit database. The newest versions of wikit store the various versions of a page in an env(WIKIT_HIST) directory.

Based on my desire to upgrade my tkoutline wikit to the latest version without losing the diff functionality and on my desire to see similar functionality here on the Tcl'ers wiki, I have started some code (see below). It makes use of KBK's code from diff in tcl. All that's left is to figure out how to specify via the URL what diffs to display (The code I have in the tkoutline wiki uses the "^" symbol appended to the page URL to display the most recent change. I'd like a way to specify more than just the most recent change).

More comments at bottom of page...

 package require Diff ;# i.e. the diff in tcl code from https://wiki.tcl-lang.org/3108
 catch {namespace import list::longestCommonSubsequence::compare}

 # Helper function for the diff callbacks
 proc appendDiff {mode value} {
    variable diff
    set lastMode [lindex $diff end-1]
    if {$lastMode == $mode} {
        set oldValue [lindex $diff end]
        set diff [lreplace $diff end end $oldValue\n$value]
    } else {
        lappend diff $mode $value
    }
 }

 # The following three functions are callbacks for the diff function
 proc removed { index value } {
     variable diff
     appendDiff removed $value
 }
 proc added { index value } {
     variable diff
     appendDiff added $value
 }
 proc matched { index1 index2 value } {
    variable diff
    appendDiff matched $value
 }

 # Returns the contents of the given file
 proc getFile {fileName} {
    set fd [open $fileName]
    set contents [read $fd]
    close $fd
    return $contents
 }

 # Converts a version as specified below into a list index
 proc getVersionIndex {version} {
    if {[string index $version 0] == "-"} {
        return end$version
    } else {
        return $version
    }
 }
 # Versions can be specified as an absolute positive version number
 # starting at zero and counting up.  A version relative to the most
 # recent can be specified with a negative number.
 # e.g. to see the most recent change: getWikiPageDiff $id -1 -0
 # This function returns a list in the format of chunktype text pairs
 # where chunktype is one of matched, added, removed
 #
 # TODO: It would be nice to be able to express "give me the difference
 # between the page as it is now and how it was 24 hours before the
 # most recent change"
 proc getWikiPageDiff {id version1 {version2 -0}} {
    variable diff
    set ewh $::env(WIKIT_HIST)
    set versIdx1 [getVersionIndex $version1]
    set versIdx2 [getVersionIndex $version2]
    set versions [lsort [glob $ewh/$id*]]
    set list1 [split [getFile [lindex $versions $versIdx1]] \n]
    set list2 [split [getFile [lindex $versions $versIdx2]] \n]
    set diff {}
    compare $list1 $list2 matched removed added
    return $diff
 }
 package require cgi
 proc displayHtmlDiff {id version1 {version2 -0}} {
    # Special colors for the various diff pieces
    array set options {
        added bgcolor=\"#ffffaf\"
        removed bgcolor=\"#cfffcf\"
        matched {}
    }

    # Legend
    cgi_table width=200 size=-1 {
        cgi_table_row $options(removed) {
            cgi_td [cgi_font size=-1 "Removed"]
        }
        cgi_table_row $options(added) {
            cgi_td [cgi_font size=-1 "Added"]
        }
    }
    cgi_hr noshade

    # Display the entire page with the differences embedded within
    cgi_table width="600" {
        foreach {mode value} [getWikiPageDiff $id $version1 $version2] {
            cgi_table_row $options($mode) { 
                cgi_td [lindex [Wikit::Expand_HTML $value] 0]
            }
        }
    }
    return
 }

22nov01 jcw - Agree with Brian - great to see these things happen now. From a brief email exchange with Pascal some thoughts (no more than that, really):

  • Yes, all-in-one-page really makes it easy to skim for what's important.
  • Idea: omit all diffs over say 25 lines, that keeps the page nicely limited, it may even entice people to stick to small(er) more concise comments.
  • How far back should the summary diff go? I'd think that a summary, listing diffs with what was on the page 3 days ago, makes it easy to track things and bridge the weekend. Only one diff per page number, summarizing multiple changes all in one, might work IMO.
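The 25-line cut-off could be sketched as below, assuming the diff comes as the {chunktype text ...} list that getWikiPageDiff returns above. The proc names countChangedLines and diffFitsSummary are made up for this illustration, and counting only added and removed lines (not matched ones) is an assumption about what "over 25 lines" should mean.

```tcl
# Count the changed (added or removed) lines in a diff list of the
# form {mode text mode text ...}; each text may hold several lines
# joined by newlines.
proc countChangedLines {diff} {
    set n 0
    foreach {mode text} $diff {
        if {$mode ne "matched"} {
            incr n [llength [split $text \n]]
        }
    }
    return $n
}

# Hypothetical helper: true when the diff is small enough to be
# shown in the summary page.
proc diffFitsSummary {diff {maxLines 25}} {
    expr {[countChangedLines $diff] <= $maxLines}
}
```

A summary generator would then skip (or merely link to) every page for which diffFitsSummary returns false.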

That sort of raises the issue how much diffing is needed in all. While access to all diffs on each page is technically feasible, I'd be inclined to think it would confuse/overwhelm/distract more than offering just a bit of diffs. Last day, week, month - perhaps? Three links per page?

The history is now a separate subsystem on mini.net - the wiki stores the latest page version only, while all changes get archived and simmer down into the CVS historical archive once a day. It's sort of a collect-and-sweep daily cron job. What seems to work well is that latest and daily-snapshot versions are both efficiently available (static page accesses in fact - though "current" is HTML, whereas CVS daily-snapshot files are in raw wiki input format).

Whatever diff mechanism we come up with ought to keep those two sides of this wiki as loosely coupled as possible IMO.

Let me also add that there is the start of a remote sync/update mechanism. If you get a copy of the wiki and run it locally, there is the option of updating it from the CVS daily snapshot, and that mechanism is quite efficient, so it ought to scale once fully ready. That means one can have a local copy of the Tclers' Wiki, and easily track it, while using it as a local Tk app (with much snappier search capability than the web can offer). To try it, get a copy of wikit.tkd (from the usual wikit.gz url), and do:

   tclkitsh wikit.kit wikit.tkd -update http://mini.net/tclhist

I'm mentioning this here (it was also mentioned on the tclerswiki mailing list), to emphasize that we need to plan so things remain open-ended when diffs get brought into the picture. This update mechanism, for example, *only* uses the tclhist/ area.


LV After playing with this, it appears to be a download only mode - that is, changes to one's local wikit are not synchronized back to the wikit. Is that in fact the intent?


23nov02 ps - Well, after a night processing ideas, there are two things I certainly want to do. The first is not use the diff output of tclhist, but grab the current and the previous version(s!) and run... the next bit I thought of. Namely, a lot of changes on the wiki are typo changes. Those should be displayed inside the line, not the way it is now - remove entire line, add entire line. That would combine nicely with the diff in tcl code. I am going to try that today.

That would also make it less bound to tclhist, because I'll go grab all pages through [getPage pageId revisionId]. Simply reimplement that function, and [getPageVersionList], which should return a list like tclhist/index: {{pageId revisionId lastChange} {...}}. I walk over the previous versions until I find the one that was online three days ago (or a similar period). I could also base the backward traversal on how many lines of diff are produced, stopping at either 20 lines, three (or more?) days, etc. I agree that three days is a good timespan, otherwise the page would only be interesting for the super-regular wiki users.
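Finding the revision that was online at some earlier moment could look roughly like this. It is only a sketch: it assumes the version list really is a list of {pageId revisionId lastChange} triples with lastChange in Unix seconds, and the proc name revisionOnlineAt is made up here.

```tcl
# Return the revision of pageId that was online at time 'when'
# (Unix seconds): the newest revision whose lastChange is not later
# than 'when'. Returns an empty string if the page did not exist yet.
# Assumption: versions is a list of {pageId revisionId lastChange}
# triples, in any order.
proc revisionOnlineAt {versions pageId when} {
    set best ""
    set bestTime -1
    foreach v $versions {
        foreach {id rev changed} $v break
        if {$id ne $pageId} continue
        if {$changed <= $when && $changed > $bestTime} {
            set best $rev
            set bestTime $changed
        }
    }
    return $best
}
```

For the three-day window the cut-off would be something like [expr {[clock seconds] - 3*24*60*60}].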

Talking about scaling, seeing how [cvs commit oneSpecificFile] is very much a lightweight call on a local repository, it may be a good idea to call [cvs commit] for each page alteration (possibly spawning a separate process to do that). That would make it possible to implement a 'roll back' function. But more importantly, today anyway, give the possibility to do even nicer change reporting, especially on the busy pages.

As for the 'over 25 lines', I have already added that to my local version. By the looks of it, we'll probably want to keep that.

A diff history per page could be a good thing, although I agree that it may be a bad idea to provide yet more links at the bottom of each page. I find the tkoutline 'click on the last update' counterintuitive; I expected that it would bring me to the recent changes page... I would prefer a [page history] link.

I'd also like to propose keeping a historical archive for these pages, or maybe even a special version that lists diffs per day (or week?) and provide those for the community. These can, quite trivially, be generated starting from the first day that tclhist was started. I think it'd be nice to see how the wiki actually evolved. As in http://www.equi4.com/docs/vancouver/page-edits.png ? -jcw

Pascal, have a look at [L4 ] as an example of a version summary that can help you pick the right CVS version without iterated fetching - the second item is the unix-seconds modtime... jcw

23nov02 jcw - As to immediate CVS updates - yes, but there is a reason for the daily approach: the update mechanism I mentioned. If we update instantly, then people will start to update many times a day, which may bring down the server (or hit my - large, but finite - bandwidth allowance). If we keep updates as a daily thing, it'll be useless to hit update all the time. It may sound silly - but I really believe in limiting focus with each tool we have. The wiki is a repository, not a discussion forum. As I'm proving by typing this - it does work as such, but I still think we should see the wiki as a knowledge base, the chat for intense discussion, email for other exchanges, comp.lang.tcl for one-to-many posts, etc. The wiki is at about 750 hits/day on page 4 right now. Given that it is a public resource with no obstacles to use, what happens if it hits 10x or 100x that activity? I'm sort of trying to outrun that rat race, by trying to have a local mode copy which works well as resource before things get out of hand. One reason for creating the chat, was to maintain more focus on the wiki as repository - maybe we need more such subsystems...

Another way out is to work towards a set of mirrors. All I want to point out is that we need to be cautious with making the wiki work phenomenally well for everyone and every purpose - it could kill the whole thing overnight. So... yes, all for more frequent CVS, immediate one day if possible, but perhaps we can evolve slowly towards it?

I admit that having CVS lag a bit makes diffs less than perfect. Then again, on a page which I follow closely for a while, I tend to pick out changes easily. It's the dozens of pages which I only read occasionally (yet do have an interest in) where your diff summary really is a great step forward. Well, my .02 ...

23Nov02 Brian Theado - From my perspective, there are two different issues. One is providing diff functionality for the Tcler's Wiki and the other is providing diff functionality for people who are using wikit for their own wiki. The first issue faces challenges like high traffic, a large number of changes, a large number of pages, and the desire to be a repository and not a discussion forum. The second issue has the challenge of making the functionality easy to distribute and install. The rate of change and amount of traffic are likely small. Maybe the solution isn't the same for both. There probably should be synergies between them, though. Is the tclhist code available somewhere?

jcw - Sure... here it is:

 #! /usr/bin/env tclkit
 cd /sites/mini.net/tclhist
 set dir /data/whist
 set old /data/warch
 foreach x [lsort -dictionary [glob -nocomplain $dir/*]] {
   set y [file tail $x]
   lassign [split $y -] id date who
   puts -nonewline "  $y\t"
   file copy -force $x $id
   file mtime $id $date
   if {![file exists /cvs/twhist/$id,v]} { catch { exec cvs add $id } }
   exec cvs ci -m $y $id
   file rename $x $old
   file delete index
   puts OK
 }

Brian Theado - I was actually referring to the cgi code that drives http://mini.net/tclhist . Whoops, bit too long for here - emailed... jcw

''I admit that having CVS lag a bit makes diffs less than perfect.'' - One thing that would make it easier to live with IMO, would be to provide a link to the diff summary on the Recent Changes page and place the link at the position in the list where its diffs begin. In other words, make it plain to the user that the changes above the diff summary link are not included within the diff summary, but the changes below are.

jcw - Great idea!


PS 24nov02 - Well, I've been hacking at the diffing engine. Sadly, the diff in tcl is just too slow. I wanted to get word level diffing, and after some serious hacking (like 10 hours or so) it works. Mostly... Go to my page [L5 ]. There are some issues with added and removed white space not being displayed correctly, and I can't find the problem. I still do only one day of changes, not three as suggested. I think the page would just become too long. Let's just keep an archive and point people to that for previous changes.

My new implementation uses the external unix diff(1) command (see code [L6 ]) to seek out the changes. For that to work on a word level, I take the wiki page and put each word on a single line. Diff then tells me which words have changed. And I highlight those. That is somewhat easier said than done, though :)
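The word-per-line trick can be sketched roughly as follows. This is not the code behind [L6 ]: the proc names wordsPerLine and wordDiff are made up for illustration, using an empty line as the end-of-line marker is an assumption, and [file tempfile] needs Tcl 8.6.

```tcl
# Put every whitespace-separated word on its own line. An empty line
# marks the end of each original line, so the line structure survives.
proc wordsPerLine {text} {
    set out {}
    foreach line [split $text \n] {
        foreach word [regexp -all -inline {\S+} $line] {
            lappend out $word
        }
        lappend out ""
    }
    return [join $out \n]
}

# Run the external diff(1) on the word-per-line forms of two texts
# and return its raw output.
proc wordDiff {oldText newText} {
    set names {}
    foreach text [list $oldText $newText] {
        set fd [file tempfile name]
        puts -nonewline $fd [wordsPerLine $text]
        close $fd
        lappend names $name
    }
    # diff exits with status 1 when the inputs differ; exec reports
    # that as an error and appends a note to the output, so catch the
    # error and strip the note.
    if {[catch {exec diff [lindex $names 0] [lindex $names 1]} output]} {
        regsub {\nchild process exited abnormally$} $output {} output
    }
    file delete {*}$names
    return $output
}
```

The line numbers in the diff output are then word positions, which still have to be mapped back onto the page text before anything can be highlighted.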

There are probably bugs in that code that will hide some types of changes.

What's next?

Brian Theado - I think the next step should be to make changes to the code for wikit's Recent Changes page. If the CVS sweeping job is executed around 0:00 GMT and the diff summary job is launched immediately after, then a link to the diff summary can be placed on the Recent Changes page next to each date. With the one-day lag that CVS has, every date on the page except for today's date would have a link.

Another idea, which may add too much clutter to the Recent Changes page is to add anchors to the diff summary page. Then for each page in the Recent Changes have a link that will jump directly to that page's diff.

25nov02 jcw - Pascal's page is shaping up nicely! The least I can do is put the job and page as is on this site - and run right after the CVS job (currently 3am Central, 9am GMT). Probably best done only when things are stable, otherwise we'd just keep each other busy with updates.

The recent changes page tagging ideas are easy if it can be done with what wiki markup handles today. More sophisticated tricks (#tag's) would be a bit more involved.

25nov02 ps - All the code should be reachable from my url, especially don't forget to get the wikit.css file, which has the required span.* and pre.* entries. However, those colours are only possible if you can trigger <span> tags from the wiki markup (or with plain, untreated html, obviously). One way to do this could be something like this:

 Some changed text ''''old words'''':old ''''and something new'''':new

The four quotes trigger a <span> tag, and the :new indicates the 'class' for the <span class="new">. This will probably not break any existing formatting, a search for four consecutive quotes will probably only find this page. The only thing that leaves out is preformatted blocks, where wiki formatting does not take effect. I don't know how hard it would be to exempt this formatting rule from the 'wiki doesn't touch the preformatted section' rule (if desirable).
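As a rough illustration of the proposed rule (not wikit's actual formatter), a single regsub could do the expansion. The proc name expandSpans is made up, and the pattern assumes the quoted text itself contains no apostrophes.

```tcl
# Turn ''''text'''':class into <span class="class">text</span>.
# Assumption: the text between the quote runs has no apostrophes.
proc expandSpans {text} {
    regsub -all {''''([^']+)'''':(\w+)} $text {<span class="\2">\1</span>} text
    return $text
}
```

Applied to the example line above, this yields one span with class "old" and one with class "new".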

Another thing to note, though, is that because <span> can be nested and can change the font-face to fixed width, we don't _need_ to trigger the preformatted code rule - except it is easier to do it that way.

I don't know which is best, I didn't write the wiki...

jcw - Why hyper-generalize? Why not generate the page as a normal static one, just as you do?

ps - Well, it would be extremely nice to also have the ability to show a page in full, formatted by the wiki with colourised differences between two specific versions. But, come to think of it, that can also be done if the wiki first generates the html output for both versions and hands that to the diffing engine. But apart from that, I guess that just generating a static page is best and simplest.


LV Would you consider in the cases where the page diff is too large to display on the main page, creating a hyperlink to a page containing the differences marked up? That way someone can still see the differences in the nice markup if they wish?

Ooo ooo - I know - how about the wikidiff code available as a stand alone application that someone could invoke, with the URL of a wikit page, and regardless of length, the resulting diff generated?

Another change to consider might be if the changed text exceeds some number of characters (I'm looking at the huge diff with the fractal mountain page right now...) that it too turns into a hyperlink to its own page. Wow - I just realized - wikidiff just saved us! That large output was trying to tell me that the page was truncated at change 4! It looks like someone's web browser truncates long pages...

PS There are a lot of interesting tricks possible, and most are fairly trivial to do - including showing a fully formatted/normally wiki page with differences between arbitrary versions. That will especially be useful on the really large changes, as those will never look good on the wikidiff page, only on the annotated full page.

And, as a matter of fact, wikidiff is a stand alone application right now. It uses /tclhist/ to get its information. I will probably make myself a playground on my own server where you can request the diffs between specific page versions and what not.

And I'll add that number of characters limit too. That is a bit more sane than only lines.

26Nov02 Brian Theado - I notice the wikidiff code currently only displays the last modification for each page. Since it is a daily summary, I think it should show all the changes in the last day for each page. I think it is frequently the case that a page sees multiple edits in a day.

escargo - I have noticed the same thing, and express the same desire.

ps - Yes, I know. Already working on that.

escargo - Looking at the latest differences, I noticed that where there are multiple changes, they are highlighted, but at a guess only the originator of the last change has his or her IP address attached to the name of the page. It will be interesting to see what the best way to align changes with the originator where there are multiple changes.

ps - Yes, I also know that. Developing in public so much feels like having people watch over your shoulders ;) I guess I should either list all updating IPs or none at all.

escargo - Developing code in a fishbowl can be interesting. Let me add my voice to LV for a way to get to diffs that are deemed too large to show directly? There have been a couple of instances where I have wanted to know in more detail than what the summary, valuable as it is, currently shows. 8 Dec 2003 - I ran into that same issue again. Even if differences of more than XXX lines are not displayed in the summary page, it would still be very nice to see them in an optional extended diff page. Sometimes the very fact that so many lines have changed indicates that something significant is different. It becomes more pressing to see what really changed.


RS 2003-01-10 - Bug note: the following clipping contains remnants of gt; entities that seem to have been incompletely substituted:

 Most of the seven hardware keys are intercepted by CE, so not usable in Tk
 ... bindings - except for the big center navigation key (over the speaker) which
 ... produces gt; and then two key
 ... events on each push: gt; when pushed centrally,
 ... and gt; centrally; gt;/gt;/gt;/gt;
 ... in any of the cursor directions. directions,
 ... plus another nondeterministic event from gt;,gt;, and some
 ... accented Eurolatin letters. I use the command 

PS Fixed. [regsub -all {>} $text {&gt;} text] should have been: [regsub -all {>} $text {\&gt;} text]
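The difference is easy to reproduce in isolation: in the substitution spec of regsub an unescaped & stands for the matched text, so it has to be escaped to get a literal ampersand. The variable names below are only for illustration.

```tcl
set text "a > b"

# Unescaped: & is replaced by the match itself (the ">" character).
regsub -all {>} $text {&gt;} wrong    ;# wrong is now "a >gt; b"

# Escaped: \& yields a literal ampersand.
regsub -all {>} $text {\&gt;} right   ;# right is now "a &gt; b"
```

Where no regular expression is needed, [string map {& &amp; < &lt; > &gt;} $text] sidesteps the issue, since string map gives & no special meaning.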


Lars H, 2006-06-08: In the Synchronizing System Time page, the <s for standard> seems to make it verbatim through WikiDiff and thus select a strikethrough font in the HTML. </s> (Let's see if that turns it off.)

Also, is it possible to improve the identification of what has been inserted and what has been removed? In lists such as

  • A
  • C
  • D

if one inserts a "* B" item between A and C, it is common that WikiDiff thinks "B (newline) *" has been inserted instead; in a way it's equivalent, but harder to read.


[ Category Wikit | Category Tcler's Wiki ]