API

Request

Requests data from the Wayback Machine API.

class waybacktweets.api.request.WaybackTweets(username, collapse=None, timestamp_from=None, timestamp_to=None, limit=None, offset=None, matchtype=None)

Class responsible for requesting data from the Wayback CDX Server API.

Parameters:
  • username (str) – The username associated with the tweets.

  • collapse (str, optional) – The field to collapse duplicate lines on.

  • timestamp_from (str, optional) – The timestamp to start retrieving tweets from.

  • timestamp_to (str, optional) – The timestamp to stop retrieving tweets at.

  • limit (int, optional) – The maximum number of results to return.

  • offset (int, optional) – The number of lines to skip in the results.

  • matchtype (str, optional) – The match scope for the query: results matching a certain prefix, a certain host, or all subdomains.

get()

Sends a GET request to the Internet Archive’s CDX API to retrieve archived tweets.

Return type:

Optional[Dict[str, Any]]

Returns:

The response from the CDX API in JSON format, if successful. Otherwise, None.
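
As a rough illustration of how the constructor arguments might map onto a CDX query, the sketch below builds a parameter dictionary. The helper name build_cdx_params, the URL pattern, and the exact mapping are assumptions made for illustration and are not taken from the library's source; the keys (url, output, collapse, from, to, limit, offset, matchType) are query parameters of the Wayback CDX Server API.

```python
from typing import Any, Dict, Optional


def build_cdx_params(
    username: str,
    collapse: Optional[str] = None,
    timestamp_from: Optional[str] = None,
    timestamp_to: Optional[str] = None,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
    matchtype: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: map constructor arguments to CDX query parameters."""
    # Assumed URL pattern for a user's tweets; the real library may differ.
    params: Dict[str, Any] = {
        "url": f"https://twitter.com/{username}/status/*",
        "output": "json",
    }
    optional = {
        "collapse": collapse,
        "from": timestamp_from,
        "to": timestamp_to,
        "limit": limit,
        "offset": offset,
        "matchType": matchtype,
    }
    # Only include the filters the caller actually set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params
```

Leaving unset filters out of the dictionary keeps the request URL short and lets the server apply its own defaults.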

Parse

Parses the returned data from the Wayback CDX Server API.

class waybacktweets.api.parse.TweetsParser(archived_tweets_response, username, field_options)

This class is responsible for the overall parsing of archived tweets.

Parameters:
  • archived_tweets_response (List[str]) – The response from the archived tweets.

  • username (str) – The username associated with the tweets.

  • field_options (List[str]) – The fields to be included in the parsed data. For more details on each option, visit Field Options.

_add_field(key, value)

Appends a value to a list in the parsed data structure.

Parameters:
  • key (str) – The key in the parsed data structure.

  • value (Any) – The value to be appended.

Return type:

None
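
The parsed data is described as a dictionary of lists (see the return type of parse()). A minimal sketch of the append step, assuming one list per requested field option, might look like this; the function name add_field, the key check, and the example field names are illustrative assumptions.

```python
from typing import Any, Dict, List

# Example field names, assumed for illustration only.
field_options = ["archived_timestamp", "original_tweet_url"]

# One empty list per requested field.
parsed: Dict[str, List[Any]] = {option: [] for option in field_options}


def add_field(parsed: Dict[str, List[Any]], key: str, value: Any) -> None:
    """Append a value to the list stored under key, if that field was requested."""
    if key in parsed:
        parsed[key].append(value)


add_field(parsed, "archived_timestamp", "20230115103045")
add_field(parsed, "not_requested", "ignored")
```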

_process_response(response)

Processes the archived tweet’s response and adds the relevant CDX data.

Parameters:

response (List[str]) – The response from the archived tweet.

Return type:

None

parse(print_progress=False)

Parses the archived tweets CDX data and structures it.

Parameters:

print_progress (bool) – A boolean indicating whether to print progress or not.

Return type:

Dict[str, List[Any]]

Returns:

The parsed tweets data.

class waybacktweets.api.parse.TwitterEmbed(tweet_url)

This class is responsible for parsing tweets using the Twitter Publish service.

Parameters:

tweet_url (str) – The URL of the tweet to be parsed.

embed()

Parses the archived tweets when they are still available.

This function goes through each archived tweet and checks if it is still available. If the tweet is available, it extracts the necessary information and adds it to the respective lists. The function returns a tuple of three lists:

  • The first list contains the tweet texts.

  • The second list contains boolean values indicating whether each tweet is still available.

  • The third list contains the URLs of the tweets.

Return type:

Optional[Tuple[List[str], List[bool], List[str]]]

Returns:

A tuple of three lists containing the tweet texts, availability statuses, and URLs, respectively. If no tweets are available, returns None.

class waybacktweets.api.parse.JsonParser(archived_tweet_url)

This class is responsible for parsing tweets when the mimetype is application/json.

Note: This class is in an experimental phase.

Parameters:

archived_tweet_url (str) – The URL of the archived tweet to be parsed.

parse()

Parses the archived tweets in JSON format.

Return type:

str

Returns:

The parsed tweet text.

Export

Exports the parsed archived tweets.

class waybacktweets.api.export.TweetsExporter(data, username, field_options)

Class responsible for exporting parsed archived tweets.

Parameters:
  • data (Dict[str, List[Any]]) – The parsed archived tweets data.

  • username (str) – The username associated with the tweets.

  • field_options (List[str]) – The fields to be included in the exported data. For more details on each option, visit Field Options.

_create_dataframe()

Creates a DataFrame from the transposed data.

Return type:

DataFrame

Returns:

The DataFrame representation of the data.

static _datetime_now()

Returns the current datetime, formatted as a string.

Return type:

str

Returns:

The current datetime.

static _transpose_matrix(data, fill_value=None)

Transposes a matrix, filling in missing values with a specified fill value if needed.

Parameters:
  • data (Dict[str, List[Any]]) – The matrix to be transposed.

  • fill_value (Optional[Any]) – The value to fill in missing values with.

Return type:

List[List[Any]]

Returns:

The transposed matrix.
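
One way to transpose a dictionary of columns into a list of rows, padding short columns, is itertools.zip_longest. This is a sketch of the technique under that assumption, not necessarily how the library implements it; the name transpose_matrix is illustrative.

```python
from itertools import zip_longest
from typing import Any, Dict, List, Optional


def transpose_matrix(
    data: Dict[str, List[Any]], fill_value: Optional[Any] = None
) -> List[List[Any]]:
    """Turn column lists into row lists, padding short columns with fill_value."""
    return [
        list(row) for row in zip_longest(*data.values(), fillvalue=fill_value)
    ]


rows = transpose_matrix({"id": [1, 2, 3], "text": ["x"]})
```

Each resulting row lines up with one archived tweet, which is the shape a DataFrame constructor expects.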

save_to_csv()

Saves the DataFrame to a CSV file.

Return type:

None

save_to_html()

Saves the DataFrame to an HTML file.

Return type:

None

save_to_json()

Saves the DataFrame to a JSON file.

Return type:

None

Visualize

Generates an HTML file to visualize the parsed data.

class waybacktweets.api.visualize.HTMLTweetsVisualizer(username, json_path, html_file_path=None)

Class responsible for generating an HTML file to visualize the parsed data.

Parameters:
  • username (str) – The username associated with the tweets.

  • json_path (Union[str, List[str]]) – The path of the JSON file or the JSON data itself.

  • html_file_path (str, optional) – The path where the HTML file will be saved.

static _json_loader(json_path)

Reads and loads JSON data from a specified file path or JSON string.

Parameters:

json_path (Union[str, List[str]]) – The path of the JSON file or the JSON data itself.

Return type:

List[Dict[str, Any]]

Returns:

The content of the JSON file or data.

generate()

Generates an HTML string that represents the parsed data.

Return type:

str

Returns:

The generated HTML string.

save(html_content)

Saves the generated HTML string to a file.

Parameters:

html_content (str) – The HTML string to be saved.

Return type:

None

Utils

Utility functions for handling HTTP requests and manipulating URLs.

waybacktweets.utils.utils.check_double_status(wayback_machine_url, original_tweet_url)

Checks if a Wayback Machine URL contains two occurrences of “/status/” and the original tweet URL does not contain “twitter.com”.

Parameters:
  • wayback_machine_url (str) – The Wayback Machine URL to check.

  • original_tweet_url (str) – The original tweet URL to check.

Return type:

bool

Returns:

True if the conditions are met, False otherwise.
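
The two conditions described above can be sketched with plain string operations. The function name and the example URLs are hypothetical; the real implementation may differ in detail.

```python
def check_double_status_sketch(
    wayback_machine_url: str, original_tweet_url: str
) -> bool:
    """Illustrative check: two '/status/' segments and a non-Twitter original URL."""
    return (
        wayback_machine_url.count("/status/") == 2
        and "twitter.com" not in original_tweet_url
    )


# Made-up URLs for illustration.
wayback = "https://web.archive.org/web/2020/https://twitter.com/a/status/1/status/2"
result = check_double_status_sketch(wayback, "https://example.com/a/status/2")
```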

waybacktweets.utils.utils.check_pattern_tweet(tweet_url)

Extracts the URL from a tweet URL with patterns such as:

  • Reply: /status//

  • Link: /status///

  • Twimg: /status/https://pbs

Parameters:

tweet_url (str) – The tweet URL to extract the URL from.

Return type:

str

Returns:

Only the extracted URL from a tweet.

waybacktweets.utils.utils.check_url_scheme(url)

Corrects the URL scheme if it contains more than two slashes following the scheme.

This function uses a regular expression to find ‘http:’ or ‘https:’ followed by two or more slashes. It then replaces this with the scheme followed by exactly two slashes.

Parameters:

url (str) – The URL to be corrected.

Returns:

The corrected URL.
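
A minimal sketch of the regular-expression approach described above, assuming re.sub with a captured scheme; the function name fix_url_scheme is illustrative.

```python
import re


def fix_url_scheme(url: str) -> str:
    """Collapse any run of two or more slashes after http:/https: to exactly two."""
    return re.sub(r"(https?:)/{2,}", r"\1//", url)


fixed = fix_url_scheme("https:////twitter.com/user/status/1")
```

URLs that already have exactly two slashes pass through unchanged, so the function is safe to apply unconditionally.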

waybacktweets.utils.utils.clean_tweet_url(tweet_url, username)

Cleans a tweet URL by ensuring it is associated with the correct username.

Parameters:
  • tweet_url (str) – The tweet URL to clean.

  • username (str) – The username to associate with the tweet URL.

Return type:

str

Returns:

The cleaned tweet URL.

waybacktweets.utils.utils.clean_wayback_machine_url(wayback_machine_url, archived_timestamp, username)

Cleans a Wayback Machine URL by ensuring it is associated with the correct username and timestamp.

Parameters:
  • wayback_machine_url (str) – The Wayback Machine URL to clean.

  • archived_timestamp (str) – The timestamp to associate with the Wayback Machine URL.

  • username (str) – The username to associate with the Wayback Machine URL.

Return type:

str

Returns:

The cleaned Wayback Machine URL.

waybacktweets.utils.utils.delete_tweet_pathnames(tweet_url)

Removes any pathnames from a tweet URL.

Parameters:

tweet_url (str) – The tweet URL to remove pathnames from.

Return type:

str

Returns:

The tweet URL without any pathnames.

waybacktweets.utils.utils.get_response(url, params=None)

Sends a GET request to the specified URL and returns the response.

Parameters:
  • url (str) – The URL to send the GET request to.

  • params (dict, optional) – The parameters to include in the GET request.

Return type:

Tuple[Optional[Response], Optional[str], Optional[str]]

Returns:

The response from the server.

Raises:
  • ReadTimeoutError – If a read timeout occurs.

  • ConnectionError – If a connection error occurs.

  • HTTPError – If an HTTP error occurs.

  • EmptyResponseError – If the response is empty.

waybacktweets.utils.utils.is_tweet_url(twitter_url)

Checks if the provided URL is a Twitter status URL.

This function checks if the provided URL contains “/status/” exactly once, which is a common pattern in Twitter status URLs.

Parameters:

twitter_url (str) – The URL to check.

Return type:

bool

Returns:

True if the URL is a Twitter status URL, False otherwise.
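
The "exactly once" rule can be expressed with str.count. The sketch below assumes nothing beyond the description above; the function name is illustrative.

```python
def is_tweet_url_sketch(twitter_url: str) -> bool:
    """Return True when '/status/' appears exactly once in the URL."""
    return twitter_url.count("/status/") == 1


single = is_tweet_url_sketch("https://twitter.com/user/status/123")
double = is_tweet_url_sketch("https://twitter.com/user/status/1/status/2")
```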

waybacktweets.utils.utils.semicolon_parser(string)

Replaces semicolons in a string with %3B.

Parameters:

string (str) – The string to replace semicolons in.

Return type:

str

Returns:

The string with semicolons replaced by %3B.
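
A one-line sketch of the replacement described above (the function name is illustrative); %3B is the percent-encoded form of a semicolon.

```python
def semicolon_parser_sketch(string: str) -> str:
    """Percent-encode every semicolon in the string."""
    return string.replace(";", "%3B")


encoded = semicolon_parser_sketch("a;b;c")
```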

waybacktweets.utils.utils.timestamp_parser(timestamp)

Parses a timestamp into a formatted string.

Parameters:

timestamp (str) – The timestamp string to parse.

Returns:

The parsed timestamp in the format “%Y/%m/%d %H:%M:%S”, or None if the timestamp could not be parsed.
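
The sketch below shows one way to produce the “%Y/%m/%d %H:%M:%S” output with datetime. It assumes the input is a 14-digit Wayback-style timestamp (YYYYMMDDhhmmss); the input formats the library actually accepts are not stated here, and the function name is illustrative.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(timestamp: str) -> Optional[str]:
    """Format a 14-digit timestamp (assumed input format) or return None."""
    try:
        parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    except ValueError:
        # The string does not match the assumed input format.
        return None
    return parsed.strftime("%Y/%m/%d %H:%M:%S")


formatted = parse_timestamp("20230115103045")
```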

Config

Configuration module.

Manages global configuration settings throughout the application.

waybacktweets.config.config.config = _Config(verbose=True)

Global configuration instance.

waybacktweets.config.config.verbose

Determines if verbose logging should be enabled.

Type:

bool