User:Msm/extract.pl

From Wikipedia, the free encyclopedia

info

Extract.pl is a tool for extracting a single namespace from a Wikipedia SQL dump.

I modified the script from wp:Database_download#Importing_sections_of_a_dump. It is now generic and easier to use.

usage

usage ./extract.pl namespace_nr [-p prefix] < infile > outfile
extracts wikipedia namespace from database dump
        namespace_nr - namespace number
        prefix - database tables prefix, for table names with prefix

examples

extracting namespace 4

bzip2 -dc <date>_cur_table.sql.bz2 | ./extract.pl 4 > help_4.sql

extracting namespace 12 for a configuration that uses the table prefix mw_ (as of 6.2.2005, for use with MediaWiki betas)

bzip2 -dc <date>_cur_table.sql.bz2 | ./extract.pl 12 -p mw_ > help_12.sql

version history

  • v0.1 - initial version
  • v0.2 - corrected a typo that caused only namespace 4 to be extracted (thanks 216.123.160.18)

extract.pl

#!/usr/bin/perl
# v0.2
# 
# modified script from http://en.wikipedia.org/wiki/Wikipedia:Database_download#Importing_sections_of_a_dump
#
# http://en.wikipedia.org/wiki/User:Msm/extract.pl
#
# history:
# v0.2 - corrected a typo that caused only namespace 4 to be extracted
# 
$table = 'cur';

if ($ARGV[0] eq '-h') {
	print "usage $0 namespace_nr [-p prefix] < infile > outfile\n";
	print "extracts wikipedia namespace from database dump\n";
	print "\tnamespace_nr - namespace number\n";
	print "\tprefix - database tables prefix, for table names with prefix\n";
	exit;
}


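# The namespace must be all digits; the anchors reject values such as "4x".
# Errors go to STDERR so they do not end up in the redirected outfile.
if (!defined $ARGV[0] || $ARGV[0] !~ /^\d+$/) {
	print STDERR "first parameter must be namespace number, see $0 -h\n";
	exit 1;
}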

$namespace = $ARGV[0];

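# Optional "-p prefix": prepend the prefix to the target table name.
if (defined $ARGV[1] && $ARGV[1] eq '-p') {
	$prefix = $ARGV[2];
	# The prefix may contain only letters, digits and underscores.
	if (!defined $prefix || $prefix !~ /^\w+$/) {
		print STDERR "bad prefix, see $0 -h\n";
		exit 1;
	}
	$table = $prefix . $table;
}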


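while (<STDIN>) {
	# Drop the statement header; only the row tuples are needed.
	s/^INSERT INTO cur VALUES //gi;
	# Strip the newline from every other input line; split /\n/ below
	# discards a trailing newline in any case.
	s/\n// if (($j++ % 2) == 0);
	# Put each row on its own line: break where a tuple ending in two
	# quoted numbers is followed by a tuple starting with two bare numbers.
	s/(\'\d+\',\'\d+\'\)),(\(\d+,\d+,)/$1\;\n$2/gs;
	foreach (split /\n/) {
		# Keep only rows whose second field is the requested namespace.
		next unless (/^\(\d+,$namespace,\'/);
		# Replace the leading "(id,namespace," with an INSERT header that
		# names every column except the id.
		s/^\(\d+,\d+,/INSERT INTO $table \(cur_namespace,cur_title,cur_text,cur_comment,cur_user,
		cur_user_text,cur_timestamp,cur_restrictions,cur_counter,cur_is_redirect,cur_minor_edit,
		cur_is_new,cur_random,cur_touched,inverse_timestamp\) VALUES \($namespace,/;
		# Remove the line breaks and indentation that the header above introduced.
		s/\n\s+//g;
		s/$/\n/;
		print;
	}
}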
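
The row-splitting and namespace-filtering steps of the script can be tried in isolation. The following is only a minimal sketch: the sample line is invented and much shorter than a real cur row, and the helper names split_rows and rows_in_namespace are hypothetical and do not appear in extract.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented two-row sample (assumption): real cur rows have many more
# columns before the two trailing quoted numbers.
my $line = "INSERT INTO cur VALUES (1,0,'Foo','text','20050206','1'),(2,4,'Bar','text','20050206','2');\n";

# Split one multi-row INSERT line into single-row tuples, using the same
# pattern as extract.pl: break where a tuple ending in two quoted numbers
# is followed by a tuple starting with two bare numbers.
sub split_rows {
    my ($text) = @_;
    $text =~ s/^INSERT INTO cur VALUES //i;
    $text =~ s/('\d+','\d+'\)),(\(\d+,\d+,)/$1;\n$2/g;
    return split /\n/, $text;
}

# Keep only the tuples whose second field equals the wanted namespace.
sub rows_in_namespace {
    my ($namespace, @rows) = @_;
    return grep { /^\(\d+,$namespace,'/ } @rows;
}

my @rows = split_rows($line);
print "$_\n" for rows_in_namespace(4, @rows);
```

With the sample line above, only the second tuple (namespace 4) is printed. Note that the split relies on every row ending in two quoted numbers, so it may not carry over to dumps with a different column layout.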