Pipet; Need For Speed; Poisoning AI Scrapers
A midweek visit to see #4 means I’m still “recovering” from a total of seven hours of intraday driving. So, y’all get a potpourri of resources today, since I’m way too tired to knit together a themed Drop.
TL;DR
(This is an AI-generated summary of today’s Drop using Ollama + llama 3.2 and a custom prompt.)
I can report that the VS Codium plugin I made worked super well today!
Pipet is a Golang-based command-line tool for web scraping and data extraction, operating in three primary modes: HTML parsing, JSON parsing, and client-side JavaScript evaluation. Pipet
Need For Speed includes two resources: speedtest-rs (a Rust implementation of a speed test tool) and LibreSpeed (a web-based tool for measuring internet connection speeds). speedtest-rs, LibreSpeed
Poisoning AI scrapers involves using techniques like generating garbled content deterministically and serving alternative versions to detected AI scrapers, as demonstrated by Tim McCormack’s project. Poisoning AI scrapers
Pipet
Photo by Pixabay on Pexels.com

Pipet is a neat and well-thought-out Golang-based command-line tool designed for web scraping and data extraction. It operates in three primary modes: HTML parsing, JSON parsing, and client-side JavaScript evaluation. Rather than do the webops on its own, Pipet leverages existing utilities like curl and integrates seamlessly with Unix pipes, letting us extend its built-in capabilities in ways we’re all pretty much used to.
Pipet can be used for various data extraction tasks, including (their words) “tracking shipments, monitoring concert ticket availability, observing stock price changes, and extracting any online information”.
They have examples, but let’s make another one. This Bash script wraps around pipet to get the news headlines from BBC’s nigh plaintext “On This Day” page for the current day:
#!/usr/bin/env bash

temp_file=$(mktemp)
month_name=$(date +%B)

cat <<EOF >"${temp_file}"
curl http://news.bbc.co.uk/onthisday/low/dates/stories/${month_name,,}/$(date +%-d)/default.stm
a[href^="/onthisday/low/dates/stories"] span
EOF

command pipet --json "${temp_file}" | command jq -r '.[0][0] | map(select(length > 0)) | .[]'
That Pipet file will:
grab the HTML
target all the /onthisday links
extract the text from the span elements
I feed the JSON output to jq for cleaning. Here’s a demo:
$ ./on-this-day.sh
1995: OJ Simpson verdict: 'Not guilty'
1975: London's Spaghetti House siege ends
1944: Poles surrender after Warsaw uprising
1981: IRA Maze hunger strikes at an end
1952: Tea rationing to end
1979: Anti-racists tackle South African rugby tourists
NOTE: A far more robust version of that script can be had at https://rud.is/dl/on-this-day-robust.sh.
The README is fairly extensive, with details on how to use Pipet’s headless mode and its more advanced JSON filtering mode (for when the curl responses are JSON).
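I haven’t kicked the tires on the JSON mode yet, but based on my read of the README (which says JSON responses get queried with GJSON-style paths), a Pipet file for a JSON API would look something along these lines. The endpoint and fields below are purely illustrative, so treat this as a sketch rather than a verified recipe:

#!/usr/bin/env bash
# Hedged sketch: a Pipet file whose curl response is JSON, queried with
# GJSON-style paths (per the README). Endpoint and fields are illustrative only.
temp_file=$(mktemp)

cat <<EOF >"${temp_file}"
curl https://api.github.com/repos/bjesus/pipet
description
stargazers_count
EOF

command pipet --json "${temp_file}"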
Since I have some boilerplate Go template projects for quickly creating custom scrapers, I’m not sure I’ll be using Pipet much, but it’s definitely a neat tool that others may find useful.
Need For Speed
Photo by Pixabay on Pexels.com

I came across the two resources in this section after reading this post on “Homelab uplink monitoring”.
speedtest-rs is a Rust implementation of a tool similar to the popular Python speedtest-cli, designed to measure internet connection speeds. The project was originally a learning exercise for the author to explore Rust and its ecosystem. It’s evolved a bit and is designed primarily for lower-end residential connections using “HTTP Legacy Fallback“.
Here’s a run from my pseudo-high bandwidth abode:
$ ./speedtest-rs
Retrieving speedtest.net configuration...
Retrieving speedtest.net server list...
Testing from Comcast Cable (37.141.13.125)...
Selecting best server based on latency...
Hosted by Optimum Online (White Plains, NY) [272.08 km]: 31.164 ms
Testing download speed..............
Download: 503.23 Mbit/s
Testing upload speed..............
Upload: 39.62 Mbit/s
WARNING: This tool may not be accurate for high bandwidth connections! Consider using a socket-based client alternative.
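If you’re after the homelab-uplink-monitoring angle from the post above, something like this could be dropped into cron to keep a running CSV of results. It’s hypothetical: the log path and the output parsing are my own assumptions layered on top of the output shown above, not anything speedtest-rs provides.

#!/usr/bin/env bash
# Hypothetical cron-able wrapper (not part of speedtest-rs): run the tool,
# pull the Download/Upload figures out of the output shown above, and
# append a timestamped CSV row.
log_file="${HOME}/speedtest-log.csv"

out=$(./speedtest-rs 2>/dev/null)
down=$(printf '%s\n' "${out}" | grep -oE 'Download: [0-9.]+' | awk '{print $2}')
up=$(printf '%s\n' "${out}" | grep -oE 'Upload: [0-9.]+' | awk '{print $2}')

echo "$(date -u +%FT%TZ),${down},${up}" >> "${log_file}"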
(Ookla, the company behind speedtest.net, has their own non-FOSS CLI tool that’s native and available for many platforms. It’s TCP-based and capable of handling higher bandwidths. While not open-source, it’s supported by Ookla and can be used for non-commercial purposes.)
LibreSpeed is a web-based tool that provides a simple, straightforward way to measure internet connection speeds from a clean, no-frills browser interface. It’s self-hostable (with very minimal requirements) and has its own CLI tool as well.
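Self-hosting it really is minimal. If memory serves, the project publishes a Docker image, so spinning up a private instance is roughly a one-liner; the image name and port mapping below are from memory, so double-check them against the LibreSpeed docs before relying on this.

# Hedged sketch: run a local LibreSpeed instance via the project's Docker image.
# The image name and port mapping are assumptions; verify in the LibreSpeed docs.
docker run -d --name librespeed -p 8080:80 adolfintel/speedtest
# then browse to http://localhost:8080 and run a test against your own box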
Poisoning AI Scrapers
(I had to sneak one ThursdAI post in here.)
Our AI overlords are raking in billions (but are all still losing money whilst also killing the planet because that’s how late stage capitalism “works”). While I have yet to finish deploying a network of “tarpits” designed to slow down and poison AI scrapers, Tim McCormack has gone and done it!
In “Poisoning AI scrapers“, Tim walks through a project meant to deter AI companies from using his blog content to train large language models without permission. To achieve this, he implements a system that serves garbled versions of his blog posts to detected AI scrapers.
His approach involves several components. First, for content generation, he uses a Dissociated Press algorithm implemented in Rust to generate nonsensical content that looks superficially normal. The algorithm takes the original blog post as input and produces garbled text while maintaining some structural elements.
For content storage, the system generates an alternative version of each blog post (named “swill.alt.html“) and stores this alongside the regular post content. Scraper detection and content serving is handled using Apache httpd .htaccess rules with mod_rewrite to detect AI scrapers based on User-Agent strings. When a scraper is detected, the server serves the garbled “swill” version instead of the real content.
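As a rough illustration of that last piece (these are not Tim’s actual rules; the crawler User-Agent list and the file layout are my assumptions), the mod_rewrite idea looks something like:

RewriteEngine On
# Illustrative AI-crawler User-Agent match; not Tim's actual list
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
# Only rewrite when a pre-generated "swill" file sits alongside the post
RewriteCond %{REQUEST_FILENAME}/swill.alt.html -f
RewriteRule ^(.*)$ $1/swill.alt.html [L]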
Key aspects of the system include generating garbled content deterministically using the post’s SHA256 hash as a seed, regenerating alternative content only for draft posts (so published posts keep stable “swill” versions), and excluding comments from the garbled versions to avoid associating others’ names with nonsensical content.
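The deterministic bit is the clever part: seed the generator from the post’s SHA256 and regenerating a published post always yields the same swill. Here’s a toy stand-in for that idea, a seeded word shuffle in Bash rather than Tim’s actual Rust Dissociated Press generator:

#!/usr/bin/env bash
# Toy illustration of the deterministic-seed idea (NOT the real generator):
# the same input post always produces the same garbled output because the
# shuffle is seeded from the post's SHA256.
post="$1"
seed=$(sha256sum "${post}" | cut -c1-16)

# shuf just needs a stream of bytes for --random-source; repeating the hash
# prefix makes the shuffle deterministic per-post.
tr -s '[:space:]' '\n' < "${post}" \
  | shuf --random-source=<(yes "${seed}") \
  | paste -sd' ' -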
Tim ACKs that this approach alone won’t significantly impact LLM training. However, he sees it as a fun exercise, a way to practice programming skills, and potentially an inspiration for others to implement similar measures. The post concludes by inviting discussion on alternative technical approaches and other poisoning techniques, while discouraging broader debates about AI ethics or the merits of LLMs.
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev ☮️
https://dailydrop.hrbrmstr.dev/2024/10/03/drop-539-2024-10-01-toss-up-thursday/