CLI
Usage
waybacktweets
Retrieves archived tweets CDX data from the Wayback Machine, performs necessary parsing, and saves the data.
USERNAME: The Twitter username without @.
waybacktweets [OPTIONS] USERNAME
Options
- -c, --collapse <collapse>
Collapse results based on a field, or a substring of a field. For timestamp:XX, XX ranges from 1 to 14 and is the number of leading digits of the timestamp field that are compared. A value of 4 or more is recommended, so that captures are compared at least by year.
- Options:
urlkey | digest | timestamp:XX
- -f, --from <DATE>
Filter results from this date onward. Format: YYYYmmdd
- -t, --to <DATE>
Filter results up to this date. Format: YYYYmmdd
- -l, --limit <INTEGER>
Limits the number of query results.
- -o, --offset <INTEGER>
Offsets the start of the results, allowing a simple way to scroll through them.
- -mt, --matchtype <matchtype>
Return results matching a certain prefix, a certain host, or all subdomains.
- Options:
exact | prefix | host | domain
- -v, --verbose
Shows the log.
Arguments
- USERNAME
Required argument
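As an illustration of how the options above combine, the following minimal Python sketch assembles a waybacktweets command line from keyword arguments. The helper name build_command and the username jack are placeholders invented for this example and are not part of the package; only the flags listed above are used.

```python
import shlex


def build_command(username, collapse=None, date_from=None, date_to=None,
                  limit=None, offset=None, matchtype=None, verbose=False):
    """Return an argv list for the waybacktweets CLI (hypothetical helper)."""
    argv = ["waybacktweets"]
    # Each pair maps a documented long option to the value supplied by the caller.
    for flag, value in (
        ("--collapse", collapse),
        ("--from", date_from),
        ("--to", date_to),
        ("--limit", limit),
        ("--offset", offset),
        ("--matchtype", matchtype),
    ):
        if value is not None:
            argv += [flag, str(value)]
    if verbose:
        argv.append("--verbose")
    # USERNAME is the single required argument, given without the leading @.
    argv.append(username)
    return argv


# "jack" is a placeholder username used only for illustration.
command = build_command("jack", collapse="timestamp:8",
                        date_from="20150101", date_to="20191231", limit=250)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run, assuming the waybacktweets executable is installed and on the PATH.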
Collapsing
The Wayback Tweets command line tool recommends collapsing on one of three fields: urlkey, digest, or timestamp.
- urlkey: (str) A canonical transformation of the URL you supplied, for example, org,eserver,tc)/. Such keys are useful for indexing.
- digest: (str) The SHA1 hash digest of the content, excluding the headers. It’s usually a base-32-encoded string.
- timestamp: (datetime) A 14 digit date-time representation in the YYYYMMDDhhmmss format. We recommend YYYYMMDD.
However, it is possible to use collapse with other options. The text below is extracted from the official Wayback CDX Server API (Beta) documentation.
Note
A new form of filtering is the option to “collapse” results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines where all captures after the first one that are duplicate are filtered out. This is useful for filtering out captures that are “too dense” or when looking for unique captures.
To use collapsing, add one or more collapse=field or collapse=field:N where N is the first N characters of field to test.
Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since first 10 digits 2013022601 match, the 2nd capture will be filtered out: http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10
The calendar page at web.archive.org uses this filter by default: http://web.archive.org/web/*/archive.org
Ex: Only show unique captures by digest (note that only adjacent digest are collapsed, duplicates elsewhere in the cdx are not affected).
Ex: Only show unique urls in a prefix query (filtering out captures except first capture of a given url). This is similar to the old prefix query in wayback (note: this query may be slow at the moment).
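The adjacency rule described in the note can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the CDX server's actual implementation; the function name collapse, the dictionary layout, and the digest values AAA and BBB are placeholders made up for the example. The two timestamps in the same hour are the ones quoted above.

```python
def collapse(rows, field, n=None):
    """Drop rows whose (possibly truncated) field equals that of the previous row."""
    kept = []
    previous_key = None
    for row in rows:
        # collapse=field compares the whole value, collapse=field:N the first N characters.
        key = row[field] if n is None else row[field][:n]
        if key != previous_key:
            kept.append(row)
        # Only the immediately preceding line is compared, so duplicates
        # that are not adjacent survive.
        previous_key = key
    return kept


captures = [
    {"timestamp": "20130226010000", "digest": "AAA"},
    {"timestamp": "20130226010800", "digest": "BBB"},
    {"timestamp": "20130226020000", "digest": "AAA"},
]

hourly = collapse(captures, "timestamp", 10)   # like collapse=timestamp:10
by_digest = collapse(captures, "digest")       # like collapse=digest
print([row["timestamp"] for row in hourly])
print(len(by_digest))
```

With timestamp:10 the second capture is dropped because it shares the prefix 2013022601 with the first. Collapsing on digest keeps all three rows, since the two AAA rows are not adjacent.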
URL Match Scope
The CDX Server can return results matching a certain prefix, a certain host or all subdomains by using the matchType param.
The package waybacktweets uses the pathname /status followed by the wildcard ‘*’ at the end of the URL to retrieve only tweets. However, if a value is provided for this parameter, the search will be made from the URL twitter.com/<USERNAME>.
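The choice between the two URL forms can be sketched as follows. This is a hedged illustration rather than the package's actual code: the function name build_cdx_url is hypothetical, the exact shape twitter.com/<USERNAME>/status/* is an assumption inferred from the description above, and jack is a placeholder username. The endpoint is the CDX search URL used in the Wayback documentation examples.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(username, matchtype=None):
    """Build a CDX query URL for a username (hypothetical helper)."""
    if matchtype is None:
        # Assumed default: restrict the query to tweet URLs via /status/*.
        params = {"url": f"twitter.com/{username}/status/*"}
    else:
        # With an explicit match type, query from the profile URL instead.
        params = {"url": f"twitter.com/{username}", "matchType": matchtype}
    # Keep '/' and '*' unescaped so the query stays readable.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe='/*')}"


print(build_cdx_url("jack"))
print(build_cdx_url("jack", matchtype="prefix"))
```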
The text below is extracted from the official Wayback CDX Server API (Beta) documentation.
Note
For example, if given the url: archive.org/about/ and:
- matchType=exact (default if omitted) will return results matching exactly archive.org/about/
- matchType=prefix will return results for all results under the path archive.org/about/
- matchType=host will return results from host archive.org
- matchType=domain will return results from host archive.org and all subhosts *.archive.org
The matchType may also be set implicitly by using wildcard ‘*’ at end or beginning of the url:
If url ends in ‘/*’, eg url=archive.org/* the query is equivalent to url=archive.org/&matchType=prefix
If url starts with ‘*.’, eg url=*.archive.org/ the query is equivalent to url=archive.org/&matchType=domain
(Note: The domain mode is only available if the CDX is in SURT-order format.)
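The two implicit wildcard rules can be expressed as a small Python function. This is a sketch of the equivalences stated above, not code taken from the CDX server or from waybacktweets, and the name resolve_match_type is made up for the example.

```python
def resolve_match_type(url, match_type="exact"):
    """Return the (url, matchType) pair equivalent to a wildcard url."""
    if url.endswith("/*"):
        # url=archive.org/*  is equivalent to  url=archive.org/&matchType=prefix
        return url[:-1], "prefix"
    if url.startswith("*."):
        # url=*.archive.org/  is equivalent to  url=archive.org/&matchType=domain
        return url[2:], "domain"
    # No wildcard: keep the url and the explicit (or default) match type.
    return url, match_type


print(resolve_match_type("archive.org/*"))
print(resolve_match_type("*.archive.org/"))
print(resolve_match_type("archive.org/about/"))
```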