The semantic web standards are sorely lacking (for decades now) a killer application. Not in a theoretical universe of decentralized philosopher-computer-scientists but in the dumbed down, swipe-the-next-30sec-video, adtech oligopolized digital landscape of walled gardens. Providing better search metadata is hardly that killer app. Not in 2024.
The lack of adoption has, imho, two components.
1. bad luck: the Web got worse, a lot worse. There hasn't been a Wikipedia-like event for many decades. This was not pre-ordained. Bad stuff happens to societies when they don't pay attention. In a parallel universe where the good Web won, the semantic path would have been much more traveled and developed.
2. incompleteness of vision: if you dig to their nuclear core, semantic apps offer things like SPARQL queries and reasoners. Great, these functionalities are both unique and have definite utility, but there is a reason (pun) that the excellent Protege project [1] is not the new spreadsheet. The calculus of cognitive cost versus tangible benefit to the average user is not favorable. One thing that is missing is a set of abstractions that would help bridge that divide.
Still, if we aspire to a better Web, the semantic web direction (if not current state) is our friend. The original visionaries of the semantic web were not out of their minds, they just did not account for the complex socio-economics of digital technology adoption.
Over on lobste.rs, someone cited another article retracing the history of the Semantic Web: https://twobithistory.org/2018/05/27/semantic-web.html
An interesting read in itself, and it also points to Cory Doctorow giving seven reasons why the Semantic Web will never work: https://people.well.com/user/doctorow/metacrap.htm. They are all good reasons and are unfortunately still valid (although one of his observations towards the end of the text has turned out to be comically wrong; I'll let you read what it is).
Your comment and the two above links point to the same conclusion: again and again, Worse is Better (https://en.wikipedia.org/wiki/Worse_is_better)
Every time I read a post like this I'm inclined to post Doctorow's Metacrap piece in response. You got there ahead of me. His reasoning is still valid and continues to make sense to me. Where do you think he's "comically wrong"?
The implicit metrics of quality and pedigree he believed were superior to human judgement have since been gamified into obsolescence by bots.
I think that the jury is still out on that one. Human judgement is too often colored by human incentives. I still think there's an opportunity for mechanical assessments of quality and pedigree to excel, and exceed what humans can do; at least, at scale. But, it'll always be an arms race and I'm not convinced that bots are in it except in the sense of lying through metadata, which brings us back to the assessment of quality and pedigree - right/wrong, good/bad, relevant/garbage.
Link counting being reliable for search. After going through people's not-so-noble qualities and how they make the semantic web impossible, he declares counting links as an exception. It was to a comical degree not an exception.
Yes. There is that. Ignobility wins out again.
item 2.6 kneecapped item 3
Indeed a good read, thanks for the link!
I think his context is the narrower "Web of individuals" where many of his seven challenges are real (and ongoing).
The elephant in the digital room is the "Web of organizations", whether that is companies, the public sector, civil society etc. If you revisit his objections in that light they are less true or even relevant. E.g.,
Yes. But public companies are increasingly reporting their audited financials online via standards like iXBRL and prescribed taxonomies. Increasingly they need to report environmental impact etc. I mentioned in another comment the common EU public procurement ontologies. Think also of the millions of education and medical institutions and their online content. In institutional contexts lies do happen, but at a slightly deeper level :-)
This only raises the stakes. As somebody mentioned already, the cost of navigating random APIs is high. The reason we still talk about the semantic web despite decades of no-show is precisely the persistent need to overcome this friction.
We are who we are individually, but again this ignores the collective intelligence of groups. Besides the hordes of helpless individuals and a handful of "big techs" (i.e., the random entities that figured out digital technology ahead of others) there is a vast universe of interests. They are not stupid, but there is a learning curve. For the vast part of society the so-called digital transformation is only at its beginning.
You have a very charitable view of this whole thing and I want to believe like you. Perhaps there is a virtuous cycle to be built where infrastructure that relies on people being more honest helps change the culture to actually be more honest, which makes the infrastructure better. You don't wait for people to be nice before you create the GPL; the GPL changes mindsets towards opening up, which fosters a better culture for creating more.
It's also very important to think in macro systems and societies, as you point out, rather than at the individual level
Thanks for sharing that Doctorow post, I had not seen it before. While the specific examples are of course dated (hello AltaVista and Napster), it still rings mostly true.
One major problem RDF has is that people hate anything with namespaces. It's a "freedom is slavery" kind of thing. People will accept it grudgingly if Google says it will help their search rankings or if you absolutely have to deal with them to code Java but 80% of people will automatically avoid anything if it has namespaces. (See namespaces in XML)
Another problem is that it's always ignored the basic requirements of most applications like:
1. Getting the list of authors in a publication as references to authority records in the right order (Dublin Core makes the 1970 MARC standard look like something from the Starship Enterprise); see the sketch after this list
2. Updating a data record reliably and transactionally
3. Efficiently unioning graphs for inference so you can combine a domain database with a few database records relevant to a problem + a schema easily
4. Inference involving arithmetic (Gödel warned you about first-order logic plus arithmetic, but for boring fields like finance, business, and logistics that is the lingua franca; OWL comes across as too heavyweight yet completely deficient at the same time, and nobody wants to talk about it)
things like that. Try to build an application and you have to invent a lot of that stuff. You have the tools to do it, and it's not that hard if you understand the math inside and out, but if you don't, oh boy.
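To make point 1 concrete, here's a minimal rdflib sketch (URIs made up) of why author order is awkward: dc:creator triples are an unordered set, so ordering has to be bolted on with something like an rdf:List that every consumer then has to know to walk.

    # Minimal sketch, invented URIs: authors as an unordered set vs. an rdf:List.
    from rdflib import Graph, URIRef

    turtle = """
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix ex: <http://example.org/> .

    # Unordered: nothing in RDF says which creator comes first.
    ex:paper1 dc:creator ex:alice, ex:bob .

    # Workaround: an ordered rdf:List hanging off a custom property.
    ex:paper1 ex:creatorList ( ex:alice ex:bob ) .
    """

    g = Graph()
    g.parse(data=turtle, format="turtle")

    # Walk the rdf:List to recover the author order.
    head = g.value(URIRef("http://example.org/paper1"),
                   URIRef("http://example.org/creatorList"))
    print([str(author) for author in g.items(head)])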
If RDF got a few more features it would catch up with where JSON-based tools like N1QL (https://www.couchbase.com/products/n1ql/) were 10 years ago.
I'll give you two examples: Internet Archive. Let's Encrypt.
Hardly a good reference, Internet Archive is older than Wikipedia.
Wikipedia itself is only a little over two decades old. I don't think anyone would parse "many decades" as "two decades".
There's also OpenStreetMap, exactly two decades old and thus about three years younger than Wikipedia.
The world wide web (but not the internet) is only 3 decades old!
Not true: Wikidata, OpenAlex, Europeana, ... and many smaller projects making use of all that data, such as my project Conzept (https://conze.pt)
Let's Encrypt is very good but it's not exactly a web app, semantic-web or otherwise.
The semantic web has been, in my opinion, a category error. Semantics means meaning and computers/automated systems don't really do meaning very well and certainly don't do intention very well.
Mapping the incredible success of The Web onto automated systems hasn't worked because the defining and unique characteristic of The Web is REST and, in particular, the uniform interface of REST. This uniform interface is wasted on non-intentional beings like software (that I'm aware of):
https://intercoolerjs.org/2016/05/08/hatoeas-is-for-humans.h...
Maybe this all changes when AI takes over, but AI seems to do fine without us defining ontologies, etc.
It just hasn't worked out the way that people expected, and that's OK.
I take the other side of this trade, and have since c. 1980. I say that semantics is a delusion our brains create. It doesn't really exist. Or conversely, it is not the magical thing we think it is.
man
How are you oblivious of the performative contradiction that is that statement?
Please tell me you're not an eliminativist. There is nothing respectable about eliminativism. Self-refuting, and Procrustean in its methodology, denying observation it cannot explain or reconcile. Eliminativism is what you get when a materialist refuses or is unable to revise his worldview despite the crushing weight of contradiction and incoherence. It is obstinate ideology.
Hard agree.
I think about it as:
- Hypermedia controls have been deemphasized, leading to a ton of workarounds bolted onto REST
- REST is a perfectly suitable interface for AI Agents, especially to audit for governance
- AI is well suited to the task of mapping the web as it exists today to REST
- AI is well suited to mapping this layout ontologically
The semantic web is less interesting than what is traversable and actionable via REST, which may expose some higher level, reusable structures.
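As a rough illustration (the response shape here is invented, vaguely HAL-ish): the representation carries its own next actions, so an agent can just follow the uniform interface instead of needing a shared ontology up front.

    # Invented, HAL-ish response shape: resource state plus the links/actions
    # an agent (or human) is currently allowed to take from here.
    import json

    response = {
        "order": {"id": 42, "status": "open"},
        "_links": {
            "self":   {"href": "/orders/42"},
            "items":  {"href": "/orders/42/items"},
            "cancel": {"href": "/orders/42/cancel", "method": "POST"},
        },
    }
    print(json.dumps(response, indent=2))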
The first thing I can think of is `User` as a PKI-type structure that allows us to build things that are more actionable for agents while still allowing humans to grok what they're authorized to do.
There's another element, trusting the data.
Often that may require some web-scale data, like PageRank, but also any other authority/trust metric where you can say "this data is probably quality data".
A rather basic example: published/last-modified dates. It's well known in SEO circles, at least in the recent past, that changing them helps pages rank in Google, because Google prefers fresh content. Unless you're Google or have a non-trivial way of measuring page changes yourself, the data may be less than trustworthy.
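To make the dates concrete, this is roughly the schema.org metadata in question, and nothing about it is verified anywhere. A small sketch assuming rdflib 6+ (which bundles a JSON-LD parser) and invented article data:

    from rdflib import Graph

    jsonld = """
    {
      "@context": { "schema": "https://schema.org/" },
      "@id": "https://example.org/post",
      "@type": "schema:Article",
      "schema:headline": "A ten-year-old post",
      "schema:datePublished": "2024-06-01",
      "schema:dateModified": "2024-06-01"
    }
    """

    g = Graph()
    g.parse(data=jsonld, format="json-ld")  # nothing here verifies the dates
    for s, p, o in g:
        print(p, o)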
Not even Google seems to be making use of that capability, if they even have it in the first place. I'm regularly annoyed by results claiming to be from this year, only to find that it's a years-old article with fake metadata.
They are quite good at near-duplicate content detection, so I imagine it's within their capabilities. Whether they care about recency, maybe not, as long as the user metrics say the page is useful. Maybe the preference for recency is a fallacy anyway.
You don't see many GeoCities-style sites nowadays, even though there are many older sites with quality (and original) content. Maybe mobile friendliness plays into that, though.
Yeah, dates in Google results have become all but useless. It's just another meaningless knob for SEO spammers to abuse.
At TU Delft, I was supposed to do my PhD on the semantic web, specifically in shipping logistics. It was funded by the Port of Rotterdam 10 years ago. The idea was to theorize and build various concepts around discrete data sharing, data discovery, classification, ontology building, query optimization, automation, and similar use cases. I decided not to pursue the PhD a month into it.
I believe in the semantic web. The biggest problem is that, due to the lack of tooling and ease of use, it takes a lot of effort and time to see value in building something like that across various parties etc. You don't see the value right away.
Funny you bring up logistics and (data) ontologies. I'm a PM at a logistics software company, and I'd say the lack of proper ontologies and standardized data exchange formats is the biggest effort driver for integrating 3rd-party carrier/delivery services such as DHL, FedEx etc.
It starts with the lack of a common terminology. For tool A a "booking" might be a reservation e.g. of a dock at a warehouse. For tool B the same word means a movement of goods between two accounts.
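To make that concrete, here's a toy sketch (all URIs invented) of how a shared ontology layer could at least record that the two "bookings" are different concepts, rather than letting the word collision leak into every integration:

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDFS, SKOS

    A = Namespace("http://toolA.example/terms#")          # tool A's vocabulary
    B = Namespace("http://toolB.example/terms#")          # tool B's vocabulary
    LOG = Namespace("http://shared.example/logistics#")   # shared ontology

    g = Graph()
    g.add((A.Booking, RDFS.subClassOf, LOG.DockReservation))  # a slot at a dock
    g.add((B.Booking, RDFS.subClassOf, LOG.GoodsMovement))    # goods between accounts
    g.add((A.Booking, SKOS.related, B.Booking))  # same word, different concept

    print(g.serialize(format="turtle"))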
In terms of data integration things have gotten A LOT worse now that EDIFACT is de facto deprecated. Every carrier in the parcel business is cooking up their own API, but with insufficient means. I've come across things like Polish-language endpoint names/error messages, or country organisations of big parcel couriers using different APIs.
IMHO the EU has to step in here because integration costs are skyrocketing. They forced cellphone manufacturers to use USB-C for charging, so why can't they force carriers to use a common API?
The EU is doing its part in some domains. There is, e.g., the eProcurement ontology [1] that aims to harmonize public procurement data flows. But I suppose it helped a lot that (by EU law) everybody is obliged to submit to a central repository.
[1] https://docs.ted.europa.eu/epo-home/index.html
Good choice. The semantic web really brought me to the brink.
The community has its head in the sand about... just about everything.
Document databases and SQL are popular because of all the affordances around "records". That is, instead of deleting, inserting, and updating facts you get primitives that let you update records in a transaction, even if you don't explicitly use transactions.
It's very possible to define rules that will cut out a small piece of a graph that defines an individual "record" pertaining to some "subject" in the world even when blank nodes are in use. I've done it. You would go 3-4 years into your PhD and probably not find it in the literature, not get told about it by your prof, or your other grad students. (boy I went through the phase where I discovered most semantic web academics couldn't write hard SPARQL queries or do anything interesting with OWL)
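To illustrate the record-update point, here's roughly what you end up hand-rolling (vocabulary invented): a single SPARQL 1.1 Update that swaps out everything about one subject, which is the part a SQL row UPDATE gives you for free; the transactional guarantees are still on you.

    from rdflib import Graph

    g = Graph()
    g.parse(data="""
    @prefix ex: <http://example.org/> .
    ex:order42 ex:status "open" ; ex:total 10 .
    """, format="turtle")

    # Replace the whole "record" for ex:order42 in one update.
    # (No transactionality here; that's the part you still have to invent.)
    g.update("""
    PREFIX ex: <http://example.org/>
    DELETE { ex:order42 ?p ?o }
    INSERT { ex:order42 ex:status "shipped" ; ex:total 12 }
    WHERE  { ex:order42 ?p ?o }
    """)

    print(g.serialize(format="turtle"))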
Meanwhile people who take a bootcamp can be productive with SQL in just a few days because SQL was developed long ago to give the run-of-the-mill developer superpowers. (imagine how lost people were trying to develop airline reservation systems in the 1960s!)
A killer app is still not enough.
People can't get HTML right for basic accessibility, so something like the semantic web would be super science that people will go out of their way to intentionally ignore, whatever the profit in it, so long as they can indulge their laziness, even at the cost of raising their class-action lawsuit liability.
I see RDF as a basis to build on. If I think RDF is pretty good but needs a way to keep track of provenance or temporality or something I can probably build something augmented that does that.
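For instance (names invented), one common way to bolt provenance and temporality onto plain RDF is to put the facts in a named graph and describe that graph in the default graph:

    from rdflib import Dataset, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    EX = Namespace("http://example.org/")
    batch = URIRef("urn:import:2024-06-01")

    ds = Dataset()

    # The facts themselves live in a named graph...
    facts = ds.graph(batch)
    facts.add((EX.order42, EX.status, Literal("shipped")))

    # ...and the provenance/temporality is stated about the graph name.
    ds.add((batch, EX.importedFrom, EX.carrierFeed))
    ds.add((batch, EX.importedOn, Literal("2024-06-01", datatype=XSD.date)))

    print(ds.serialize(format="trig"))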
If it really works for my company and it is a competitive advantage I would keep quiet about it and I know of more than one company that's done exactly that. The standards process is so exhausting and you have to fight with so many systems programmers who never wrote an application that it's just suicide to go down that road.
BTW, RSS 1.0 is an RDF application that nobody knows about
https://web.resource.org/rss/1.0/spec
you can totally parse RSS 1.0 feeds with an RDF/XML parser and do SPARQL and other things with them.
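For example, something along these lines should work against any RSS 1.0 feed (the feed URL here is just a placeholder):

    from rdflib import Graph

    g = Graph()
    g.parse("https://example.org/feed.rdf", format="xml")  # any RSS 1.0 feed

    # RSS 1.0 items are ordinary RDF resources, so plain SPARQL applies.
    for item, title in g.query("""
        PREFIX rss: <http://purl.org/rss/1.0/>
        SELECT ?item ?title WHERE { ?item a rss:item ; rss:title ?title }
    """):
        print(title, "-", item)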
99% of the time you'll get an RSS 2.0 feed, which is a plain XML format. Of course you can convert, but RSS 1.0 seems, like you said, forgotten by the world.
Not only has this gotten much worse; even when you put in stopgaps for developers, such as linters or other plugins, they willfully ignore them and will actually implement code they know is detrimental to accessibility.
I think the problem with any ontology-type approach is that the problem isn't solved once you have defined the one ontology to rule them all after many years of wrangling between experts.
What you have done is spend many years generating a shared understanding of what that ontology means between the experts. Once that's done you have the much harder task of pushing that shared understanding to the rest of the world.
i.e. the problem isn't defining a tag for a cat; it's having a globally shared vision of what a cat is.
I mean, we can't even agree on what a man or a woman is.
You point out a real problem, but it does not feel like an insurmountable and terminal one. By that argument we would never have had human languages unless everybody spoke the same one. It turns out that once you have well-developed languages (and you do, because they are useful even when not universal) you can translate between them. Not perfectly, but generally well enough.
Developing such linking tools between ontologies would be worthwhile if there are multiple ontologies covering the same domain, provided they are actually used (i.e., there are large datasets for each). Alas, instead of a bottom-up, organic approach people try to solve this with top-down, formal (upper-level) ontologies [1] and Leibnizian dreams of an underlying universality [2], which only adds to the cognitive load.
[1] https://en.wikipedia.org/wiki/Formal_ontology
[2] https://en.wikipedia.org/wiki/Characteristica_universalis
In our spoken language the agents doing the parsing are human AIs (actual intelligences), able to deal with most of the finer nuances of semantics, and still making numerous errors in many contexts that lead to misunderstanding, i.e. parse errors.
There was this hand-waving promise in the semantic web movement that "if only we make everything machine-readable, then.." magic would happen. Undoubtedly unlocking numerous killer apps, if only we had these (increasingly complex) linked data standards and the related tools to define and parse 'universal meaning'.
An overreach, imho. The semantic web was always overpromising yet underdelivering. There may be new use cases in combining the semantic web with ML/LLMs, but I don't think they'll be a vNext of the web anytime soon.
Killer applications solve real problems. What is the biggest real problem on the web today? The noise flood. Can semantic web standards help with that? Maybe! Something about trust, integrity, and lineage, perhaps.
The Semantic Web doesn't help with the most basic thing: how do you get information? If I want to know when The Matrix was shot, where do I go? Today we have for-profit centralized points to get all information, because it's the only way this can be sustainable. The Semantic Web might make it more feasible by instead having lots of small interconnected agents that trust each other, much like... a Web of Trust. Except we know where the last experiment went (nowhere).
Off the top of my head...
OpenStreetMap launched in 2004. Mastodon and the associated spec-thingy were around 2016. One or two decades is not the same as many decades.
Oh, and what about asm.js? Sure, archive.org is many decades old. But suddenly I'm using it to play every retro game under the sun on my browser. And we can try out a lot of FOSS software in the browser without installing things. Didn't someone post a blog to explain X11 where the examples were running a javascript implementation of the X window system?
Seems to me the entire web-o-sphere leveled up over the past decade. I mean, it's so good in fact that I can run an LLM clientside in the browser. (Granted, it's probably trained in part on your public musing that the web is worse.)
And all this while still rendering the Berkshire Hathaway website correctly for many decades. How many times would the Gnome devs have broken it by now? How many times would Apple have forced an "iWeb" upgrade in that time?
Say what you want, but Macromedia Dreamweaver came pretty close to being "that killer app". Microsoft attempted the same with FrontPage, but abandoned it pretty quickly, as they always do.
I think that Web Browsers need to change what they are. They need to be able to understand content, correlate it, and distribute it. If a Browser sees itself not as a consuming app, but as a _contributing_ and _seeding_ app, it could influence the semantic web pretty quickly, and make it much more awesome.
Beaker Browser came pretty close to that idea (but it was abandoned, too).
Humans won't give a damn about hand-written semantic code, so you need to make the tools better that produce that code.
Search and ontologies weren't the only goals. Microformats enabled standardized data markup that lots of applications could consume and understand.
RSS and Atom were semantic web formats. They had a ton of applications built to publish and consume them, and people found the formats incredibly useful.
The idea was that if you ran into ingestible semantic content, your browser, a plugin, or another application could use that data in a specialized way. It worked because it was a standardized and portable data layer as opposed to a soup of meaningless HTML tags.
There were ideas for a distributed P2P social network built on the semantic web, standardized ways to write articles and blog posts, and much more.
If that had caught on, we might have saved ourselves a lot of trouble continually reinventing the wheel. And perhaps we would be in a world without walled gardens.
I think you're confused. The killer app is everyone following the same format so that capitalists can extract all that information and sell LLMs that no one wants in place of more deterministic search and data products.
Graph-based RAG systems look promising: https://www.ontotext.com/knowledgehub/fundamentals/what-is-g...