Show HN: I scraped 25M Shopify products to build a search engine

senecaso
35 replies
3d14h

I hope you have better luck than I did!

A few years ago, my partner and I built vendazzo.com (now defunct). It was an e-commerce search engine on products listed on Shopify shops (sound familiar? :)). At the time, we had > 100m products listed, and I don't remember how many shops we were indexing.. over 100k I think, but we had access to over a million. Overall, I think your approach is very similar to ours, but we managed to keep our costs lower. At the time, we were spending ~$550/mo, and our search times were under 300ms. We had established partnerships with a number of shops, and we had a few users, but not nearly enough. That's where the wheels came off. The site operated for over a year, but the monthly costs wore us down until we finally decided to pull the plug.

I still maintain that this is a good idea, and constantly have to fight off the urge to "try again", however, to do it properly, I think funding would be necessary, or finding some way to organically gain a lot of users.

Looking back, there are things I could have done to reduce my opex further, but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users.

DeathArrow
16 replies
3d11h

but in the end, it still wouldn't have mattered if I couldn't figure out how to acquire users

In EU there are many price comparison engines with millions or billions of products. I don't know how popular they are. Some monetize through ads, some have partnerships with stores and you can buy directly from the search results.

I generally search first on the local Amazon equivalent. If I don't like what I see, I search on a smaller store. If I still can't find the product or dislike the products or prices, I search Google. If I am still not content with the results, I will go search on comparison engines.

And I also have a browser extension called Pricy that polls the comparison engines, so once I land on a product page I know which store has the better price and what the price history was over the last year.

Probably many people have similar patterns. I expect people in US to search Amazon first, if it's not a very niche product they are after.

I think you can have a better monetization proposal if, instead of just search, you build a sales platform, so people can buy directly after searching, without hopping to various websites.

olivermuty
9 replies
3d10h

What do you consider the local Amazon variant? And which country?

pbmonster
2 replies
3d9h

Amazon has no direct presence in Switzerland, but you can order a fraction of its products from neighboring countries. Many products are not available, mainly because nobody wants to deal with customs once the product crosses the EU border.

Amazon itself never moved into Switzerland in the first place for many reasons (small market, unusual customs situation, relatively high salary for warehouse workers), and in the meantime the largest Swiss supermarket chain created an Amazon clone which became hugely popular pretty much immediately: Galaxus.ch

SushiHippie
1 replies
3d7h

If you hadn't said that it's basically the Amazon in Switzerland, I'd have thought that this is some blogspam dropshipping site...

timthelion
0 replies
3d6h

Amazon is a blogspam drop shipping site in Europe

DeathArrow
1 replies
3d7h

Emag in Romania. I hate it, they bought most of the competition, they did a lot of anticompetitive things, but it's really easy to buy from them.

amne
0 replies
3d4h

At some point, a couple of years ago when they introduced marketplace, I actually thought they were aiming for an "exit" to Amazon. They really got the service part of e-commerce nailed down. Merchant quality is and always will be an issue, but it is the same as on Amazon.

vargr616
0 replies
3d5h

bol.com in the Netherlands

julesvr
0 replies
3d9h

hagglezon.com to compare Amazon variant prices

gsa
0 replies
3d9h

The Netherlands has plenty of them. Tweakers.net is a price tracker for electronics and such (eg: computer parts, phones, laptops etc) and usually it's easier to find a shop cheaper than Amazon. I have some go to stores for my needs because their content is organised way better than Amazon. I also find some alternatives better than Amazon because they have free next day shipping, something that's not free on Amazon.

VPenkov
0 replies
3d9h

There are alternatives throughout Europe. The Balkans have Emag, Benelux has bol.com. I think in both regions Amazon is less popular. I'm sure there are other examples.

berkes
2 replies
3d9h

Unfortunately many of these "comparison" websites have a business model built on affiliate fees.

It doesn't take much imagination to predict which products show up as "best" or "cheapest".

And the fairer ones have to keep playing cat and mouse with shops lowering pricing when they detect a scraper coming by. Or employ tricks to make their shipping seem free, lowering their overall price on the comparison platform.

zo1
0 replies
3d8h

Many if not all are like that. It's like everyone wants to take advantage of the lack of perfect information in the marketplace, as opposed to actually being helpful for consumers.

Semaphor
0 replies
3d9h

It doesn't take much imagination to predict which products show up as "best" or "cheapest".

Never seen a "best" outside of amazon, which does weird shit even without any affiliate fees. And "cheapest" is not really up to the site, unless they want to go under quite quickly.

wingerlang
1 replies
3d7h

In EU there are many price comparison engines with millions or billions of products. I don't know how popular they are.

Anecdotally, I guess, I'd say extremely popular. I never search for products anywhere else.

timthelion
0 replies
3d6h

Yeah, here in Czechia I always look at https://www.heureka.cz/ first.

senecaso
0 replies
3d2h

We were intentionally limiting the number of products and shops we were indexing due to opex. We needed to keep it low enough to provide ourselves with enough runway to keep things floating for longer.

pricerunner is another site which operates in a similar space. We had plans to build out the price tracking and a number of other features, so that we would appeal more to users who had your use cases. Sadly, we weren't getting enough traction. We did have regular users from the EU, but we simply couldn't seem to get in front of enough eyeballs for it to matter. At least at first, I expect that a large amount of your traffic to a new site like this has to be driven by Google, and we failed on that front as well. I'm not an SEO expert, so there were likely many things we did wrong or didn't even do which led to this situation.

re: a sales platform, that's a pretty big challenge to take on, which would require massive investment up front. Not sure that's a viable route for most. We did have plans to address the "without hopping to various websites" problem, as we identified that as problematic for users very early on. The solution was relatively simple, but required more money to build out. We simply ran out of funds before we could get there.

bytearray
10 replies
3d14h

What strategies did you consider or implement to attract more users, and what would you do differently now to ensure better user acquisition?

senecaso
9 replies
3d14h

We had no capital, so advertising or solutions that basically involved "throwing money at the problem" were off the table for us.

We spent time posting in forums helping people find items they were looking for, and we had a few posts here on HN that generated short-lived, explosive traffic bursts. I remember those days we had posts get picked up on HN, it was always an exciting night!

We were looking at influencers and at getting our name out by getting bloggers to talk about us, but, again, without capital, our options were very limited here. I'm sure someone with more of a marketing background would have found a bunch of ways we could have generated organic user growth, but neither my business partner nor I had that skill set.

If I were to do it again, I think I would try to get someone with a marketing background involved to help gain traction. Without that, even the best product in the world will die of starvation if no one finds it.

slim
4 replies
3d13h

looks like symptoms of no market. maybe you were solving a problem already solved by amazon? most shops on shopify also use amazon

senecaso
1 replies
3d12h

Many shops do double list, this is true. However, I don't think it's a solved problem. There are many people who do not want to shop on Amazon for their own reasons. There are also people who want to shop locally, and Amazon provides no mechanism to do so (that I'm aware of). There are also many smaller shops who simply cannot afford to list on Amazon, as there are considerable fees associated with running a successful business there. It was these smaller shops who we were initially building to serve, to provide a funnel for them.

Still, there were problems with our solution that if addressed may have provided a better market fit. If we had had more runway, we would have worked to address them, but that simply wasn't in the cards.

DeathArrow
0 replies
3d10h

To me it seems like a small market. And worse, it's hard to conquer that small market since it's very fragmented. Even if you had money for advertisements, it still would have been hard.

On the plus side, though, if you had the skills to build that platform, you certainly have the skill to build a more profitable and easier to monetize platform.

berkes
0 replies
3d9h

Not in all countries though. Amazon isn't present or popular, or as omnipresent in many countries.

That's an opportunity, I guess.

DeathArrow
0 replies
3d10h

looks like symptoms of no market. maybe you were solving a problem already solved by amazon? most shops on shopify also use amazon

FAANGS get around this by creating problems that they will offer to solve.

MuffinFlavored
3 replies
3d14h

We spent time posting in forums helping people find items they were looking for,

Did you run any analytics on how much overlap there was across Shopify sites on "similar items" (Alibaba resellers/dropshippers)?

senecaso
2 replies
3d14h

We didn't, no, but we spent a lot of time sifting through our catalog, and there was a _tremendous_ amount of crap in there. We manually curated and purged shops that were obviously just dropshipping or looked like outright scams.

avereveard
1 replies
3d9h

Can't you sample ten random products, then ask an LLM to rate the shop on a scale from drop shipped to artisanal as a first approximation?

senecaso
0 replies
3d2h

I doubt it would be that easy, but, ya, using some form of automation is necessary. We devised a few rudimentary ways to filter out the chaff, and they did quite well at removing the garbage. Still, some would slip through, so it required vigilance to remove them when you happened to see them.

bruce511
3 replies
3d12h

I'm curious why you consider lack of users to be the problem. I would have described it as lack of revenue.

What plans did you have for generating revenue from the site? (Serious question - given your low costs it would seem like a tiny amount of revenue would have been enough.)

senecaso
2 replies
3d12h

Our business model revolved around referrals, so lack of users directly translated to lack of revenue. While it's true that even if we had millions of users but none of them were buying sponsored items we would have had a revenue problem, that wasn't the problem we were facing, as the few users we did have were in fact purchasing sponsored items.

DeathArrow
1 replies
3d11h

Then the problem seems to be the lack of users.

Have you tried having a YouTube channel, TikTok, Facebook, Twitter, or a blog, and explaining daily how you built the website and how your platform is going to help users?

senecaso
0 replies
3d1h

We did have channels on various sites, yes. However, it's difficult to maintain a steady stream of content there for people to consume. Not only that, but you have the same discoverability problems as you do for the main site. Also, a blog outlining how you built the site may be of limited value. At least my experience on that front was that it would generate short-lived bursts of traffic, but wouldn't generate returning users. So I think those articles were mostly appealing to technical users, and not necessarily users who were looking to do some shopping. Of course technical users do also shop, but after reading a technical article, they probably aren't looking to immediately shop, and without some other mechanism putting the site in front of them again when they needed to shop, we would miss the opportunity.

grumpyviscacha
1 replies
2d18h

Wow, it's cool to see this idea trending on HN! Full disclosure, I'm one of the co-founders at https://www.marmalade.co. Speaking from personal experience, it’s been a long road getting from the universe of all Shopify products to a curated inventory that’s easy for people to shop on. While ChatGPT isn't going to replace human curation anytime soon, the AI tailwind has made it much easier to build search and recommendation systems. On our end, we've definitely caught the semantic search bug. Watch out for it - you’ll wake up one day with a cross-modal hybrid search index on pinecone and any number of models on huggingface :). However, as you rightly point out, user growth is still the key. We're working toward launching a community aspect of the platform in the coming months as a solution.

senecaso
0 replies
2d17h

Your site looks good, and your results are fantastic! Job well done. I did hit a server error though, so obviously still some issues to work out, but overall, really well done. Moving to semantic search was one of my top priorities before we went under, but I struggled to justify the costs of it as we were operating on a shoestring budget.

Best of luck to you and your team on user acquisition!

pencildiver
0 replies
3d6h

Thanks for sharing this! If you're up for it, I'd love to talk more about your experience, especially the technical tooling. Working as fast as I can to understand the right way to approach the tech, as there are tradeoffs with performance and price. I'm at support @ searchagora .com

screye
34 replies
3d17h

What was the process for scraping 25M products ?

I have always used standard python tools like selenium, bs4 and the like. But I'm guessing none of these work at scale.

Could you talk about your process and key bottlenecks at that scale a little bit ? Also, how much did it cost ?

______________

A recommendation for how to improve search.

Your base captions will be pretty bad. You can use spot instances on a smaller GPU machine to run a dense captioning model (https://portal.vision.cognitive.azure.com/demo/dense-caption...) and generate captions for all your images.

Then for search, a simple vector store index would be a great retrieval solution here. It is better to do search using those as well.

Both are pretty cheap and can be done reliably within 20-30 lines of code each in python. 3rd party tools for these are pretty stable.
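
A minimal sketch of the retrieval step a vector store performs, assuming each product caption has already been turned into an embedding by some model. The three-dimensional vectors and the product ids here are made up for illustration; a real index would hold model-generated vectors and use an approximate nearest-neighbour structure rather than a full scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k vectors most similar to the query (brute force)."""
    scored = [(cosine(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical toy index: id -> embedding of the product caption.
toy_index = {
    "red-dress": [1.0, 0.0, 0.0],
    "blue-jeans": [0.0, 1.0, 0.0],
    "crimson-gown": [0.9, 0.1, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```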

pencildiver
30 replies
3d17h

Great suggestions, looking into this right now. First time building something like this so definitely new to some of these tools.

For scraping: Found that every Shopify store has a public JSON file available at the same route: [Base URL]/products.json. For example, the store for Wild Fox has their JSON file available here: https://www.wildfox.com/products.json.

Built a crawler in plain JavaScript to run through a list that I bought from a site called "Built With", access each store's JSON file with the product listing data, and scrape the exact data we want for Agora. I then store it in Mongo and, currently, use Mongo Atlas Search (I saw they released Vector Search but haven't looked at it). It has been a process of trial and error to pick the data fields that the front-end experience requires without drastically increasing the size of the data set. And after initially using React, I switched to NextJS to make it easier to structure the URLs of each product listing page.

Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.

A few improvements that have helped so far:

- Having 2 separate Search Indexes, one for the 'brand' and one for the 'product'. There's a second public JSON file that is available on all Shopify stores with relevant store data at [Base URL]/meta.json For example: https://wildfox.com/meta.json

- Removing the "tags" that are provided by store owners on Shopify. I believe these are placed for SEO reasons. These were 1 - 50 words / product so removing these reduced the data size we're dealing with. The tradeoff is that they can't be used to improve the search experience now.

Hope this helps. Still wrapping my head around all of this.
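
For anyone wanting to try the same thing, here is a rough sketch of the fetch-and-trim step in Python (the crawler described above is JavaScript; this is not that code). The field names in `KEPT_FIELDS`, the string-valued `price` on each variant, and the `limit`/`page` query parameters are assumptions about what products.json returns and accepts, so check them against a real store before relying on them.

```python
import json
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Fields kept per product; "tags" is deliberately left out to shrink the data set.
KEPT_FIELDS = ("id", "title", "handle", "vendor", "product_type")


def products_url(base_url: str, page: int = 1, limit: int = 250) -> str:
    """Build the products.json URL for a store's base URL."""
    parts = urlsplit(base_url if "://" in base_url else "https://" + base_url)
    # "limit" and "page" are assumed query parameters, not confirmed in the thread.
    return f"{parts.scheme}://{parts.netloc}/products.json?limit={limit}&page={page}"


def slim_product(product: dict) -> dict:
    """Keep only the fields the front end needs and drop the bulky ones."""
    slim = {key: product.get(key) for key in KEPT_FIELDS}
    variants = product.get("variants") or []
    # Prices are assumed to arrive as strings such as "19.99".
    prices = [float(v["price"]) for v in variants if v.get("price")]
    slim["min_price"] = min(prices) if prices else None
    images = product.get("images") or []
    # Store only the first image URL; the image files themselves are not saved.
    slim["image"] = images[0].get("src") if images else None
    return slim


def fetch_products(base_url: str, page: int = 1) -> list[dict]:
    """Download one page of a store's catalog and return slimmed products."""
    request = Request(products_url(base_url, page), headers={"User-Agent": "example-crawler"})
    with urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [slim_product(p) for p in payload.get("products", [])]
```

Calling `fetch_products("https://www.wildfox.com")` would hit the example store mentioned above; looping over `page` until an empty list comes back is one way to walk a whole catalog.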

mfrye0
4 replies
3d15h

If you're already on AWS, I recommend switching to postgres for now. For context, I have 3 RDS instances, each multi zone, with the biggest instance storing several billion records. My total bill for all 3 last month was $661.

Postgres has full text search, vector search, and jsonb. With jsonb you can store and index json documents like you would in Mongo.

- https://www.postgresql.org/docs/current/textsearch.html

- https://aws.amazon.com/about-aws/whats-new/2023/05/amazon-rd...

neeleshs
1 replies
3d11h

how big is the disk for the biggest instance?

mfrye0
0 replies
2d19h

Pretty small still at 500gb. It only stores hot data right now and a subset of what's important. Most of our data is in S3.

philippemnoel
0 replies
3d

You can even do Elastic-level full text search in Postgres with pg_bm25 (disclaimer: I am one of the makers of pg_bm25). Postgres truly rules, agree on the rec :)

LunaSea
0 replies
3d10h

I have trouble seeing how this is possible.

$220 per instance gets you 8 GB of RAM, which is way, way below the index size if you are indexing billions of vectors.
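
A back-of-the-envelope version of that estimate, under assumptions that are mine and not from the thread: one billion vectors, 768 dimensions each, stored as 4-byte floats, with no index overhead or compression counted.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed just to hold the raw vectors, ignoring any index overhead."""
    return num_vectors * dimensions * bytes_per_value


# Assumed workload: 1 billion vectors of 768 float32 values each.
size = raw_vector_bytes(1_000_000_000, 768)
ram = 8 * 10**9  # the 8 GB instance mentioned above, in bytes

print(f"raw vectors: {size / 1e12:.2f} TB")
print(f"that is {size // ram}x the instance's RAM")
```

Smaller dimensions or quantized vectors shrink the number considerably, and full-text or jsonb indexes over the same records are a different calculation altogether.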

tehlike
3 replies
3d15h

Disclaimer: I am building https://pricetracker.wtf

You may want to look at Hetzner, and cut your costs by about 90%.

Feel free to reach me, email in profile.

adentranter
1 replies
3d6h

hey! this is cool, I take it you are based in the US?

How long have you been working on this?

tehlike
0 replies
3d6h

On and off for a year, with more time allocated since June. Yes I am in California.

katella
0 replies
2d14h

In your footer you have a lot of links like "kitchenaid price tracker" and "best buy price tracker". Have these links helped?

ljm
3 replies
3d16h

2.2k/mo right off the bat is pretty steep, especially if you're paying that while the search response reliably takes over 10 seconds.

Why would you shovel 1.5k into MongoDB's pockets right off the bat? Especially when ElasticSearch is much better suited to what you're trying to do?

altdataseller
1 replies
3d12h

Sounds like someone drank the Mongo kool-aid. You absolutely do not need Mongo, let alone Mongo Atlas. 25 million documents of ecommerce products is measly and should fit in a single 600 GB server

ljm
0 replies
2d20h

Probably not even that - 25mil is nothing really. A normalised schema in an RDBMS would handle that without sweating.

dinobones
0 replies
3d15h

You could run this entire stack (yes, even for 25 million products) using Kubernetes in a $40/month Linode + Elasticsearch + Cloudflare free plan.

berkes
3 replies
3d9h

site called "Built With",

Do you have a link? And are they any good?

stef25
2 replies
3d5h
berkes
0 replies
2d19h

I specifically asked the author if he could add some extra info on Builtwith.

I can Google. But then I don't know if it's truly the site the author was talking about. And I certainly don't know his or her insights on that site.

Karl-Heinz
0 replies
3d3h

Berkes wanted to do good by sharing a provision with the OP, in case he/she buys something at builtwith.

We all know how to Google. :)

jabo
1 replies
3d16h

I'm biased, but I'd recommend exploring Typesense for search.

It's an open source alternative to Algolia + Pinecone, optimized for speed (since it's in-memory) and an out-of-the-box dev experience. E-commerce is also a very common use-case I see among our users.

Here's a live demo with 32M songs: https://songs-search.typesense.org/

Disclaimer: I work on Typesense.

keybits
0 replies
3d15h

I can also highly recommend TypeSense and have no affiliation. You'll save a lot of money and get much faster results.

dangoodmanUT
1 replies
3d5h

Yo fuck mongo just use RDS or some digitalocean DB. Or really just use opensearch/elasticsearch, or even typesense (don't bother with raft it's so broken) or meilisearch

jabo
0 replies
3d3h

We’ve interacted before on Twitter and GitHub, and I want to address your point about Raft in Typesense since you mention it explicitly:

I can confidently say that Raft in Typesense is NOT broken.

We run thousands of clusters on Typesense Cloud serving close to 2 Billion searches per month, reliably.

We have airlines using us, a few national retailers with 100s of physical stores in their POS systems, logistic companies for scheduling, food delivery apps, large entertainment sites, etc - collectively these are use cases where a downtime of even an hour could cause millions of dollars in loss. And we power these reliably on Typesense Cloud, using Raft.

For an n-node cluster, the Raft protocol only guarantees auto-recovery for a failure of up to (n-1)/2 nodes. Beyond that, manual intervention is needed. This is by design to prevent a split brain situation. This is not a Typesense thing, but a Raft protocol thing.
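
To make the (n-1)/2 figure concrete, here is a small worked example. It rounds down for even cluster sizes, which follows from Raft needing a strict majority of nodes to stay up; the function names are just for illustration.

```python
def tolerated_failures(nodes: int) -> int:
    """Failures a Raft cluster of this size can recover from automatically."""
    return (nodes - 1) // 2


def quorum(nodes: int) -> int:
    """Smallest majority of the cluster."""
    return nodes // 2 + 1


for n in (1, 3, 4, 5, 7):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that going from 3 to 4 nodes does not buy any extra fault tolerance, which is why odd cluster sizes are the usual choice.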

wolfgang42
0 replies
3d11h

I’ll second the comments that $2k/month is alarmingly high, especially for the performance that you seem to be getting. When I shoved ~40M webpages into a stock ElasticSearch instance running on a 2013-era server I bought for $200 (on eBay), it handled the load when I hit the HN front page just fine. Either you’re being drastically overcharged or there’s something horribly inefficient in your setup that could probably be tweaked fairly easily to bring your prices down.

slt2021
0 replies
3d15h

Managed Elasticsearch could cut your costs by at least an order of magnitude

leobg
0 replies
2d23h

I index 40M paragraphs of legal text, bm25 and vector similarity search, at < 200ms query time, on a single $80/month Hetzner server. Email in profile if you’d like to talk.
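
For readers who haven't met bm25 before, here is a minimal sketch of the scoring formula in plain Python. It is not the setup described above: it tokenizes on whitespace, scans every document, and uses the common defaults k1=1.2 and b=0.75 with the non-negative idf variant, all of which are my choices for illustration. A real deployment would use an inverted index.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
            # Longer-than-average documents are penalised through b.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


catalog = ["red cotton dress", "blue denim jeans", "red leather jacket with red lining"]
print(bm25_scores("red dress", catalog))
```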

k12sosse
0 replies
3d16h

I'm currently not storing the image files, so that reduces the cost as well.

I wonder if someone catches on and replaces all your image URLs to the fuzzy testicle egg cup[0], will that negatively impact reputation?

0: http://i.imgur.com/32R3qLv.png

hipadev23
0 replies
3d16h

You’re spending $2k/mo to run this?? Holy hell.

Oras
0 replies
3d3h

Take a look at TypeSense. Faster, better filtering, and much much cheaper if you’re going the cloud version

KomoD
0 replies
2d22h

Oh... no... $1500/mo?

DeathArrow
0 replies
3d11h

Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.

It will probably cost you just $100 to rent a server from Hetzner and do the same thing. I would also use Redis or another kind of cache to hit the DB less.

4runner
0 replies
1d19h

Sounds like you used an incorrect instance type/size on Atlas

helsinki
0 replies
3d16h

25 million products is really not much at all to scrape.

Ninjinka
0 replies
3d17h

As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.

DeathArrow
0 replies
3d10h

I have always used standard python tools like selenium, bs4 and the like

There's nothing to scrape. You just download a JSON file that the site owners kindly put at your disposal.

Scraping is a more complex process, where you have to work around rate limiting and captchas. For the tool I built I wrote tens of thousands of lines of code and I still find daily issues I have to deal with if I want to scrape a particular web page, issues I don't always have the time to solve.

misterbwong
13 replies
3d18h

What technology did you use to build the scraper and how did you get around the usual challenges (anti bot, ip banning, etc) with scraping large amounts of data?

pencildiver
11 replies
3d18h

Scraper is built in Javascript and a Mongo database. Probably not the most scalable way to do it, but I found that all Shopify stores have a public JSON file available at [Base URL]/products.json. So found a list of stores, built a crawler to go store-by-store, and standardized the data on my end.

Here's an example: https://www.wildfox.com/products.json

satvikpendem
5 replies
3d18h

How did you detect that it was a Shopify store?

8372049
1 replies
3d16h

In another comment, OP wrote:

Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to stores to US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).
satvikpendem
0 replies
3d16h

Ah, that makes more sense, I used BuiltWith before.

xp84
0 replies
3d14h

There are lots of telltale endpoints that you could just HEAD for a 200 vs 404. Or even just the products.json itself is a pretty good giveaway.

Or an even better way I’ve done in the past (to check which competitor’s platform a list of prospects is using in bulk) is just to use the DNS — a Shopify shop will be CNAMEd to a certain Shopify hostname.
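
A sketch of the "products.json is a giveaway" check, operating on an already-downloaded and parsed response so it runs without network access. The set of keys in `EXPECTED_KEYS` is an assumption about what a Shopify product entry contains; adjust it to whatever you actually see in the feed.

```python
# Keys assumed to be present on every product in a Shopify products.json feed.
EXPECTED_KEYS = {"id", "title", "handle", "variants"}


def looks_like_shopify_catalog(payload: object) -> bool:
    """Return True if a parsed /products.json response has the expected shape."""
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    # An empty product list still counts: the store may simply have no products.
    return all(isinstance(p, dict) and EXPECTED_KEYS.issubset(p.keys()) for p in products)
```

A response that is HTML, a 404 page, or JSON with a different top-level structure fails the check, which makes it a cheap first filter before any deeper crawl.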

stef25
0 replies
3d4h

Looked in the source of a random Shopify store, there are 200+ occurrences of "shopify", that's a clue :)

capableweb
0 replies
3d18h

Not OP but:

"Has the site a /products.json file?" is a good first check :) And if it does, "Does that format match with the format a Shopify store?" is another good followup question.

thomasfromcdnjs
0 replies
3d18h

ooo that is a hot tip!

qdequelen
0 replies
3d2h

Did you only get the schema.json?

fermisea
0 replies
3d18h

Oh nice, you deserve great things in life for this comment!

bomewish
0 replies
3d10h

What’s the trade off using js for this? Would it have been much faster to use go or something?

awill88
0 replies
3d18h

Excellent work

cldellow
0 replies
3d18h

(not the OP, but I have some experience with Shopify)

Shopify stores publish their product catalog at /products.json. From personal experience, you can hammer it pretty hard without being rate limited.

A challenge is that the pricing info in that endpoint is based on the stock Shopify catalog fields, and can be misleading depending on the specific theme customizations that the merchant uses.

Asparagirl
13 replies
3d18h

Cool! But how did you get the initial dataset of 643,000+ Shopify stores (data as per your “About” page) in the first place, to then scrape the products from their /products.json feeds? Or did you just try a huge list of domain names at random?

xnx
9 replies
3d18h

https://www.shopify.com/robots.txt lists a lot of sitemap files, which tend to be a good starting point.

prayze
4 replies
3d18h

Did this suddenly get changed? Nothing but "# ,: # ,' | # / : # --' / # \/ />/ # /" is shown now.

xnx
1 replies
3d17h

Weird. I think it did change. Google cache shows a 2229 line file: https://webcache.googleusercontent.com/search?q=cache%3Ahttp...

capableweb
0 replies
3d17h

Seems it might be looking at the referrer. Loading https://www.shopify.com/robots.txt from clicking the link shows the weird line while opening it in a private browser window shows the right one.

wizzwizz4
0 replies
3d17h

It's just your browser's HTML parser. Line 6:

  #                         / <//_\

This is being interpreted as a malformed HTML closing tag, which (according to the HTML5 parsing algorithm published by WHATWG) gets treated as a comment. The file doesn't contain any > past this point. This leaves the uncommented contents from lines 1–6:

  #                               ,:
  #                             ,' |
  #                            /   :
  #                         --'   /
  #                         \/ />/
  #                         /

Or, with whitespace collapsed:

  # ,: # ,' | # / : # --' / # \/ />/ # /

Which should be exactly what you observe.

Ref: https://html.spec.whatwg.org/multipage/parsing.html https://developer.mozilla.org/en-US/docs/Web/CSS/white-space...
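
The whitespace-collapsing step is easy to reproduce. This sketch uses Python's `str.split()` as a stand-in for the browser's behaviour; for this input (spaces and newlines only) the two give the same result, though CSS white-space handling has more rules than this.

```python
# The six lines that survive before the parser starts treating the rest as a comment.
VISIBLE_LINES = r"""
  #                               ,:
  #                             ,' |
  #                            /   :
  #                         --'   /
  #                         \/ />/
  #                         /
"""


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim the ends."""
    return " ".join(text.split())


print(collapse_whitespace(VISIBLE_LINES))
```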

calebegg
0 replies
3d17h

For some reason, "view source" gets the right list. Maybe a referer issue like someone else said.

calebegg
2 replies
3d17h

It seems sort of questionable to use the list of things to not scrape as a starting point for scraping.... I mean, I get it's not actually enforced.

xnx
0 replies
3d2h

Since ~2009 many crawlers recognize "Sitemap:" directives in robots.txt to link to sitemaps: https://en.wikipedia.org/wiki/Robots.txt#Sitemap

das_keyboard
0 replies
3d7h

Not really sure why all the answers here are flagged, but you may be mistaken.

The robots.txt does not exclusively list what not to scrape.

It provides information on which parts are allowed and which are not (disallowed).

It also provides sitemaps for crawlers as a starting point with more information (eg. which sites are available and how often are they updated, etc.)
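
Python's standard library can read both kinds of information. A small sketch using `urllib.robotparser` on a made-up robots.txt (the `shop.example` host and the `example-crawler` user agent are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Sitemap: https://shop.example/sitemap.xml
"""


def read_robots(text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


robots = read_robots(ROBOTS_TXT)
print(robots.site_maps())  # sitemap URLs listed in the file (Python 3.8+)
print(robots.can_fetch("example-crawler", "https://shop.example/products.json"))
print(robots.can_fetch("example-crawler", "https://shop.example/admin/orders"))
```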

KomoD
0 replies
2d22h

Looks like it's just Shopify's own pages and not anything related to actual stores.

pencildiver
1 replies
3d18h

Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to stores to US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).

russum
0 replies
2d3h

and between $100k - $1m in revenue

Does "Built With" provide that data? How accurate do you think it might be?

patatero
0 replies
3d16h

Shopify shops always have /collections, /products, and /pages in their URL. If you have a regular Shopify site, you're not allowed to change them. I don't know if Shopify Plus clients can change them.

Shopify sites also have shop-name.com/products.json which has URLs that point to cdn.shopify.com

xnx
7 replies
3d18h

Great project. If you continue to crawl the data, be sure to save it so you can detect price changes a la camelcamelcamel.
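
A minimal sketch of what that could look like: keep each crawl as a snapshot of product id to price, then diff consecutive snapshots. Prices are in integer cents to avoid float comparison problems; the ids and numbers are invented.

```python
def price_changes(previous: dict[str, int], current: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map product id -> (old price, new price) for products whose price moved."""
    return {
        product_id: (previous[product_id], price)
        for product_id, price in current.items()
        if product_id in previous and previous[product_id] != price
    }


# Hypothetical snapshots from two consecutive crawls, prices in cents.
yesterday = {"tee": 1999, "mug": 1200, "cap": 1500}
today = {"tee": 1799, "mug": 1200, "hat": 2500}
print(price_changes(yesterday, today))
```

Products that appear in only one snapshot are ignored here; they are new listings or removals rather than price changes.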

secabeen
4 replies
3d17h

For all of Amazon's faults, the fact that they tolerate CCC does drive a lot of my online purchases there. CCC used to track other sites, and was eventually blocked on all of them. If more sites want my business, showing their pricing history (either from internal data, or by letting someone build the DB) would go a long way.

moneywoes
1 replies
3d17h

is camel camel whitelisted by amazon? or can any scraper work

tehlike
0 replies
3d16h

Amazon associates program doesn't normally allow price trackers except with written approval

tehlike
0 replies
3d16h

Amazon doesn't allow price alerting/tracking on their affiliate program anymore, you need explicit written consent.

I am the owner of https://pricetracker.wtf and got the boot today.

nocoiner
0 replies
3d14h

Is it somehow known that CCC hasn’t been co-opted by Amazon? Frankly I figured Amazon would have bought them out a decade ago, but maybe the CCC founders have a stronger ethical compass than I do.

pencildiver
1 replies
3d18h

Great call! I am doing back-ups on Mongo and this is a good use-case for tracking changes. Also trying to figure out how to detect if a product is sold out or not being sold anymore.

Minor49er
0 replies
3d18h

I worked on a competitor to CamelCamelCamel years ago. We had this exact issue since people would often click through to a page where the price or availability were different from what we were showing

Ultimately, we ended up adding an interstitial page between the product listing on our site and the page on the seller's site

This interstitial checked to see if we checked the price in the last couple of minutes, and if not, it would run a quick scrape of the page to ensure that we had the most up to date information

I can't remember exactly what the messaging or behavior was when there was a difference. I think there was a message that was displayed if the prices were different. Or if the product was actually out of stock, it would pull the user back into our site with a toast explaining that the product was no longer available

Anything less aggressive than this resulted in more customers experiencing price/availability errors or simply leaving the site, and anything more aggressive resulted in angry site owners who were losing bandwidth to our bots

Also trying to figure out how to detect if a product is sold out or not being sold anymore

In these cases, either the page will say as much (eg: "Product Unavailable"), have some kind of stock or status code hidden beneath the UI to show that it's not available, or the target page will simply vanish from the web. However, none of these are guarantees. A site could say that a product has been discontinued, but the item could come back later, or under a different SKU, or whatever else
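The interstitial check described above could look roughly like this. It is only a sketch: the two-minute window, the field names, and the `scrape` callback are assumptions for illustration, not what the original system used.

```python
from datetime import datetime, timedelta, timezone

# Assumed freshness window; the comment only says "the last couple of minutes".
MAX_AGE = timedelta(minutes=2)


def needs_refresh(last_checked, now=None, max_age=MAX_AGE):
    """Return True if the listing was never checked or the check is too old."""
    if last_checked is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_checked > max_age


def resolve_click(listing, scrape, now=None):
    """Decide what the interstitial does when a user clicks through.

    `listing` is a dict with 'price', 'available' and 'last_checked';
    `scrape` is a caller-supplied function returning (price, available).
    """
    now = now or datetime.now(timezone.utc)
    changed = False
    if needs_refresh(listing["last_checked"], now):
        price, available = scrape()
        changed = price != listing["price"]
        listing.update(price=price, available=available, last_checked=now)
    if not listing["available"]:
        return "back_to_site"  # e.g. show a toast: product no longer available
    return "show_price_notice" if changed else "redirect"
```

Only re-scraping on click, and only when the stored data is stale, is what keeps the bot traffic to seller sites bounded.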

cmcconomy
7 replies
3d18h

That's funny, I made a domain-specific version of this for Canadian coffee deals.

https://beangrid.mcconomy.org/

pencildiver
3 replies
3d17h

Super cool project (especially as a coffee lover myself)!

cmcconomy
2 replies
3d17h

The fun part was figuring out how I was going to put the site up without hosting ;)

ska
1 replies
3d17h

github?

cmcconomy
0 replies
3d17h

Yes - I have a daily cron-based scrape & commit job which updates the table data source CSV, along with github hosting for the static components.

boringg
2 replies
3d16h

Which coffee seems to hit the best in Canada (your take)? I find the espresso in Canada hasn't been as good as the coffee brands in the US but I'm open to possibilities.

Also like the project!

wcarss
0 replies
3d15h

personal somewhat-pedestrian list: Pilot, Detour, Reunion, Propeller, Phil and Sebastian

cmcconomy
0 replies
3d14h

I enjoy Café St-Henri's Godshot when I can get it at a discount. Anything from Stereo is great, and I've enjoyed Monogram and De Mello. If I was buying at regular price, I would often get Social Coffee.

Of course I put this together because Black Friday is when I load up on (relatively) cheap coffee and chuck it in the freezer, so this time of year I always branch out and try new places and new offerings from familiar places. I built this list mostly from a Reddit compilation I found, and I've been slowly updating the source URL list as I learn of new Canadian roasters that happen to be Shopify customers.

konschubert
5 replies
3d18h

Hey, I have a Shopify store that sells e-paper calendars / smart screens. I tried to search for it but I could not find it. What should I do so your crawler can find me?

https://shop.invisible-computers.com

shubham_sinha
1 replies
3d13h

Hi, you could drop an email to onboard@peppyhop.com and we will be happy to onboard you. Please include your target geography, e.g. whether you would like to target the Indian market or the US market.

crakhamster01
0 replies
3d12h

always be closing lol

pencildiver
1 replies
3d12h

You’re live on Agora:

https://www.searchagora.com/products/invisible-calendar-6266...

Thinking that we should have a page where store owners can submit their URL to be crawled.

konschubert
0 replies
3d

Cool, thanks!

pencildiver
0 replies
3d18h

Super cool product! I'm currently using a list of Shopify stores, so it's still limited (i.e. wanted to start with a relatively small list to focus on the search experience). I'll submit your URL to the crawler now. If you want to reach out to support @ searchagora.com , I'd love to get your feedback as a Shopify store owner.

ashvardanian
5 replies
3d14h

Cool project!

As you scale, you may benefit from these two projects I maintain, which Big Tech also uses :)

https://github.com/unum-cloud/usearch - for faster search

https://github.com/unum-cloud/uform - for cheaper multi-lingual multi-modal embeddings

Feel free to reach out with feedback and feature requests!

pencildiver
1 replies
3d5h

Can't believe I missed this. Taking a look at both repos now. The further I get into this space, the less I feel like I know. Appreciate you sharing, will reach out!

ashvardanian
0 replies
3d2h

Don’t worry, our solutions aren’t well known. They are either used by enthusiasts, or the Big Tech, and the latter don’t like mentioning that :)

dangoodmanUT
1 replies
3d5h

I've been following your work for a while, was really excited to play with UDisk but I guess that got dropped in favor of AI solutions?

ashvardanian
0 replies
3d2h

Due to constrained resources we had to prioritize the smaller and simpler projects - USearch, UForm, UCall, and on the personal side - StringZilla, and SimSIMD. That’s a lot for a team without revenue and VC funding.

I am still actively thinking about open-sourcing UDisk. Thanks for keeping tabs on us :)

gajus
0 replies
3d13h

This is cool

ttt3ts
4 replies
3d17h

Built the same thing a while back while collecting a lead list for sales. Not bothered to keep data updated but was a fun thing to build in a couple days. (disclaimer mobile experience is meh cause it was a fun project)

https://zensear.ch

How did you find list of all Shopify stores? I ended up just checking every .com, .net, etc as I didn't find an easy way to figure it out directly from shopify.

slimebot80
2 replies
3d12h

Nice. But can I ask what motivated you? I don't see any affiliate details in the links - do you monetise at all?

ttt3ts
0 replies
3d4h

Wanted to play with typesense.

The tech behind building something like this isn't hard. Marketing and traffic is. No point in monetizing with no users.

alvarome
0 replies
3d3h

It looks like from the "become a merchant" you can directly boost your products which I assume will prioritize them and hence drive more traffic to your Shopify store

8372049
0 replies
3d16h

In another comment, OP writes:

Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to US stores only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).
taimurayaz
4 replies
3d18h
pencildiver
0 replies
3d18h

Yeah absolutely. I hadn't heard of Shop until today but the value proposition is definitely similar. In the next week, I'll add other e-commerce platforms like BigCommerce, WooCommerce, support for custom built sites, etc. to really differentiate the user experience.

gnabgib
0 replies
3d18h

But seems to have filters (lots of liquor stores use Shopify) - shop.app shows only candy and swag[0], while searchagora shows ~130k results for the actual product [1]

[0]: https://shop.app/search/results?query=Baileys+Irish+Cream+Li... [1]: https://www.searchagora.com/search?query=Baileys%20Irish%20C...

gardenhedge
0 replies
3d17h

Oh cool. That works a lot better than OP's

dinkleberg
0 replies
3d18h

Was about to share the same link, it seems like competing against Shopify would prove quite the challenge.

The real way to differentiate IMO is with a targeted UX for different niches rather than the one search engine to satisfy all queries.

callmeed
4 replies
3d11h

I built this a couple years ago (now defunct) for the same reason :) The public JSON endpoints on shopify stores make it pretty easy to get the data. You mentioned using Mongo but it sounds expensive. I honestly think you could do this with just elastic or even postgres full text search and save money.

Here's a pro tip + feature you should implement: Shopify has a semi-hidden hack where you can link directly to checkout of a product if you know the variant ID. You could add a BUY NOW button to your site without forcing the user to navigate the original site or checkout flow. Example: https://hapaboardshop.com/cart/42165521907955 (it also supports quantities and coupon codes)

A word of caution: more products isn't necessarily better. I definitely found there to be a long tail of really bad Shopify stores and products. IMO it's better to curate or audit the stores you index; otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.
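The checkout-link tip above can be sketched as a small URL builder. The bare `/cart/<variant ID>` form comes from the example link; the `:quantity` suffix and the `discount` query parameter are my assumption about how quantities and coupon codes are expressed, so verify them against a real store before relying on this. `cart_permalink` and the shop domain are hypothetical.

```python
from urllib.parse import urlencode


def cart_permalink(shop_domain, variant_id, quantity=1, discount_code=None):
    """Build a direct-to-checkout link for one product variant.

    Assumed format: https://<shop>/cart/<variant_id>:<quantity>?discount=<code>
    """
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    url = f"https://{shop_domain}/cart/{variant_id}:{quantity}"
    if discount_code:
        # urlencode escapes the coupon code safely for use in a query string.
        url += "?" + urlencode({"discount": discount_code})
    return url


print(cart_permalink("example-shop.com", 42165521907955, quantity=2, discount_code="SAVE10"))
```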

senecaso
1 replies
3d1h

I didn't know about the link to checkout. That's a slightly nicer user experience for sure. Still, it's confusing for users who want to do more shopping at the same time. I had users who clicked on a number of items, clicked "add to cart" in each one (all different shops), and then couldn't figure out how to checkout on the main site afterwards! Obviously people were looking for a more complete one-stop-shopping experience than I was providing at the time.

callmeed
0 replies
2d21h

I mean a single checkout from multiple shopify stores isn't really possible (at least by 3rd parties)

My hypothesis is that, if you could drive traffic to your site and offer a fast checkout experience, there's probably multiple ways to monetize that. Driving the traffic is the hard part.

pencildiver
0 replies
23h53m

Thanks for the heads up! I spent some time trying to get the cart route to work. Doesn't seem to be supported anymore (link you sent leads to a 404 page). Tried it with every combination of Product ID, Variant ID, etc. Let me know if you have any ideas on how to get this to work. It would be a great feature to add to Agora.

And I agree on quality over quantity. Writing a script to remove all stores that are shut down, products that are sold out, and listings matching a few other characteristics. Heavily focusing on the search algorithm and data quality now.

DeathArrow
0 replies
3d11h

otherwise you risk your site being littered with kitschy t-shirts or drop-shipping garbage.

You mean like Amazon?

asdadsdad
4 replies
3d16h

Cool project. You might have noticed, but there's a non-trivial amount of fraud on Shopify (fake shops, info stealers, etc). Might be interesting to look at that dataset and explore a bit =) it's quite fascinating

pencildiver
2 replies
3d16h

I've definitely noticed that already. Any advice on how to spot that?

Another challenge is that there are products sold on the original site and third-party marketplaces, both of which could sell on Shopify. So need to find a way to automatically detect the type of store.

asdadsdad
1 replies
3d4h

you might check this for inspiration (https://seguranca-informatica.pt/shopping-trap-the-online-st...).

I used to have a huge IOC collection, but now stopped tracking them.

things like HTML markup, pricing patterns, IPs might inform on specific clusters of fraudsters.

I don't think Shopify cares tbh

pencildiver
0 replies
3d4h

Super helpful, thank you.

smcin
0 replies
3d16h

What are the telltales, for spotting those?

And how aggressively does Shopify verify/police them?

TekMol
4 replies
3d10h

The Terms page goes to "Jaggi Enterprises", "A Modern Investment Fund. We buy, build, and invest in software companies with recurring revenue.".

So maybe this is not really something a guy built for his wife, but some anonymous startup that googled "Which terms rank best on Hacker News" and then wrote the "I did ... my wife .." story?

ltbarcly3
3 replies
3d10h

Jaggi is a fake it until you make it fake portfolio. Most of the companies it runs are just lorem ipsum fake sites. I think it is likely true that this is a solo dev.

TekMol
1 replies
3d9h

You mean the site is not owned by Jaggi?

Then why would the terms and privacy links go to Jaggi?

pencildiver
0 replies
3d6h

OP here. Yup, I am in the process of starting a holding company LLC for my software products and small investments. Just went ahead and deleted 2 from the Investments page that are not launched yet but still in-development (just had landing pages up for those). Wasn't planning on releasing the Jaggi site yet, as I'm still wrapping my head around the holding company structure / it's new to me.

Agora has been a side project of mine. TBH in retrospect, I wish I would have given this post more thought as the servers / search performance wasn't prepared for any significant traffic. So definitely didn't game HN.

SwedishExpat
0 replies
3d10h
RagnarD
4 replies
3d11h

Is this really within the TOS of Shopify?

jesterson
3 replies
3d10h

Does it matter? Or you can't do anything not explicitly allowed by law?

Shopify is a company that has been spotted in so many shenanigans that I would personally welcome anything that undermines its business.

RagnarD
2 replies
3d6h

It matters when you get sued.

jesterson
1 replies
3d5h

Living in fear is pathetic

RagnarD
0 replies
2d11h

Ignoring reality is stupid.

quickthrower2
3 replies
3d18h

I like it.

I need to be able to filter the search by whether a store will deliver to my country.

It desperately needs some indication that your action is being processed, like a spinner, when you search.

pencildiver
2 replies
3d17h

Absolutely. Working on a "ships to" filter and enhancing the 'price' filter.

Also fixing the loading experience as we speak. Wasn't expecting this level of traffic so didn't account for slow server speed with the front-end experience.

stef25
0 replies
3d4h

Or even "ships from" / "located in".

I'm in Europe and don't want to deal with customs hassles or delays from shipping. Etsy and Reverb both have this option which I never fail to use.

quickthrower2
0 replies
3d16h

Thanks! It is reasonably fast but slow enough that a cue is needed so I know the input triggered.

pitched
3 replies
3d15h

Shopify has tried a few times to build a tool like this but hasn’t ever managed to get any traction. I think that missing any curation at all could be what eventually kills it. Their current attempt is https://shop.app and a query for red shoes is mostly red shoes.

hackideiomat
1 replies
3d10h

a query for red shoes is mostly red shoes

well I get mostly black shoes lol

Edit: ah no, they just use half a page for shoe shops first with black shoes as logo??

dangoodmanUT
0 replies
3d5h

ads baby

senecaso
0 replies
3d14h

Ya, curation is sadly required in the Shopify ecosystem. There are millions of shops, there is a tonne of garbage. It's also difficult (but not impossible) to properly classify items so that you can better target results for a given query. One of the first problems that anyone attempting this will run into is the amount of mature content available on Shopify shops. Innocent queries turn up many NSFW images that may offend some users, so you have to be able to get on top of that one pretty quick.

I remember in one case, I found what appeared to be an escort service listing "models" on Shopify. It was super creepy. I needed to get in front of that one pretty quick as well, as it was turning up in results.

glohbalrob
3 replies
3d18h

wow! Nice work. I've been trying to build an index of shopify stores. Did you search for all domains pointing to shopify's name servers?

selcuka
2 replies
3d18h

I don't think that would work as many people also use Cloudflare etc.

You may try using BuiltWith which is a paid service:

https://trends.builtwith.com/websitelist/Shopify

ttt3ts
1 replies
3d16h

It works fine. Just issue a HEAD request when you are unsure and rotate proxies a lot. Takes a bit of infra but definitely possible.

selcuka
0 replies
3d3h

I mean simply querying nameservers won't work.

usrme
2 replies
3d11h

Maybe I'm clearly ignorant, but how does this differ from Klarna (https://www.klarna.com)?

tmikaeld
0 replies
3d11h

Do you mean pricerunner?

There are similar price comparison sites, but they don't index every store available.

You basically have to submit your listing.

spiderfarmer
0 replies
3d11h

Is there a similarity at all? One is a search engine, the other a leeching "buy now pay later" scheme as a service.

muratsu
2 replies
3d18h

Agora also doesn't return red shoes for the search query "red shoes". Seems like you haven't fully solved the problem yet :)

From a technical perspective, crawling 25M products is impressive but the search itself doesn't provide much value to me. I already use large e-commerce sites (Amazon, Walmart, ...) and targeted ones (Nordstrom, SSENSE, ...). Sure I may not be searching through all the Shopify and Wix stores, but I need to know why that's valuable to me to begin with. Perhaps understanding the value prop of SMBs and educating me about it would be a better positioning for Agora than simply being a search engine.

pencildiver
1 replies
3d18h

Definitely have not solved the problem yet! The search algorithm prioritizes the brand called "Red Wing Shoe" so still figuring out ways to show real 'red shoes'. Have been thinking about passing the images through a detection tool and tag them to enhance the search experience.

Re: Value Proposition. Absolutely, I think focusing on the SMB-angle and 'local shopping' will help direct users better. I'll definitely take this into account.

paulddraper
0 replies
3d18h

Best of luck on your marriage

krauses
2 replies
3d16h

What's your revenue model? I see you expanded on the details of your $1.5K monthly cost, but I'm failing to see how you make money. Affiliate fees?

pencildiver
1 replies
3d15h

Right now, charging Shopify store owners $99 / product / month to give them a 'verified' tag and boost their product in search results. Currently not making money on affiliate fees.

I wanted to first prove that people would actually use this / find value in it. Fortunately a few merchants have reached out already via email to talk through the business model so this will likely evolve as we learn more.

jaipilot747
0 replies
3d15h

What are you verifying?

dns_snek
2 replies
3d17h

I'm sorry, but I have to question whether this heartfelt story about looking out for your wife is in any way real.

The website certainly doesn't look like a side project, it has a fully fledged system for merchants to advertise on Agora for a fee, an affiliate system offering $50 commissions to onboard merchants and the ToS and Privacy policy link to a website with the following mission statement:

We buy, build, and invest in software companies with recurring revenue and product-led growth.
pencildiver
1 replies
3d17h

OP here. Yup, haven't launched the holding company yet but idea is to have an LLC for all my software projects. Still a work-in-progress in both thinking and execution. Agora specifically came from a personal need and is obviously still an MVP project.

Spun up a Merchant Page and Affiliate Program page in a few hours on Webflow using a template. There is a merchant dashboard built but the 'affiliate program' is a test.

ruune
0 replies
3d17h

How is "Jaggi Enterprises LLC" involved? That is where, for example, the terms link leads.

difradev
2 replies
2d8h

Amazing job! I have one question: how did you find the price of every product? I mean, every product page has a different id or class that identifies a price. Do you use a regex?

pencildiver
1 replies
2d8h

Thanks! Actually a lot easier than you'd expect. Not touching anything on the front-end of the Shopify store.

Every Shopify store has a public JSON file at [Base URL]/products.json with 'price' as a field. Example here: https://wildfox.com/products.json

One thing I messed up on originally was not pulling the 'currency' field which is actually in a different public JSON file called 'meta.json'. Example here: https://wildfox.com/meta.json

Separately, this was the primary reason to only start with US stores: to make sure the currency shows up correctly and to purposely limit the initial audience to keep loading times reasonable. Working on adding all Shopify stores in the world now (a list of about 5 million active stores from what I have found).
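For anyone curious, combining the two public files might look like the sketch below. It works on already-downloaded JSON so it runs offline. The sample payloads are made up, and the assumption that `price` sits on each entry of a product's `variants` list (rather than on the product itself) should be checked against a real `products.json`. `endpoint_urls` and `extract_prices` are hypothetical helper names.

```python
import json
from urllib.parse import urljoin


def endpoint_urls(base_url):
    """Return the (products.json, meta.json) URLs for a store's base URL."""
    base = base_url.rstrip("/") + "/"
    return urljoin(base, "products.json"), urljoin(base, "meta.json")


def extract_prices(products_payload, meta_payload):
    """Flatten products into rows of title, price and store currency."""
    currency = meta_payload.get("currency")  # lives in meta.json, not products.json
    rows = []
    for product in products_payload.get("products", []):
        for variant in product.get("variants", []):  # assumed nesting
            rows.append({
                "title": product.get("title"),
                "price": variant.get("price"),
                "currency": currency,
            })
    return rows


# Made-up payloads standing in for the downloaded files.
sample_products = json.loads(
    '{"products": [{"title": "Example Tee", "variants": [{"price": "24.00"}]}]}'
)
sample_meta = json.loads('{"currency": "USD"}')

print(endpoint_urls("https://wildfox.com"))
print(extract_prices(sample_products, sample_meta))
```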

difradev
0 replies
2d8h

Clear! Thanks!! :)

1vuio0pswjnm7
2 replies
3d14h

"There's about 25 million products on Agora right now."

How many stores are represented in the index?

senecaso
0 replies
3d14h

https://www.searchagora.com/about

Seems he is indexing nearly 650k shops.

alvarome
0 replies
3d3h

If you check the "about" section you can see how many merchants were added, over 640k at the moment

yoru-sulfur
1 replies
3d14h

For those unaware, Shopify already has platform wide search. You can use https://shop.app/ (or the app), and it also has some chatbot thing that can offer suggestions

senecaso
0 replies
3d14h

Yes, this has been available for a few years now. Initially, they only indexed a very small number of shops, so it was less useful. Based on a few queries, it seems like they are still using some form of text-based search with rank boosting. Seems like they still aren't searching their entire base of shops, but they have increased the number of shops for sure, and they seem to be continuing to invest in the product, which is nice. It seems more useful now than it did the last time I checked!

twothamendment
1 replies
3d16h

Searching is slow (kinda expected that right now), but after clicking a product and then hitting back, I have to wait for the search again.

Not at computer so I didn't check the headers, but maybe allow the client to cache the response for a short time so it doesn't need to load search results again.
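One way to do that is a short-lived `Cache-Control` header on the search response. A tiny sketch, where the 60-second lifetime and the `search_cache_headers` name are assumptions rather than anything Agora does:

```python
def search_cache_headers(max_age_seconds=60):
    """Headers telling the browser it may reuse a search response briefly.

    'private' keeps shared caches (CDNs, proxies) from storing the
    response; 'max-age' is the lifetime in seconds.
    """
    if max_age_seconds < 0:
        raise ValueError("max_age_seconds must not be negative")
    return {"Cache-Control": f"private, max-age={max_age_seconds}"}


print(search_cache_headers())
```

With a header like this, hitting back within a minute can be served from the browser cache instead of triggering the slow search again.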

pencildiver
0 replies
3d16h

Just upgraded the storage and put in a few fixes so it's working a bit faster now. Working on caching some responses locally as we speak. Great idea.

treesciencebot
1 replies
3d18h

This is amazing for finding cute collectibles from my favorite TV show that I would otherwise not have noticed among random t-shirts and other "slap the picture and call it co-branded" products! I'm not super sure how long it is going to be around, but I think I'm gonna keep playing with it for a while.

pencildiver
0 replies
3d18h

Really happy to hear this. I'll do my best to keep this around :)

system2
1 replies
3d16h

How are you planning to monetize this? You mentioned you are spending around $2K just to run it. Is there a commission strategy or ads? Or populate with your products at one point so you sell your own thing?

alvarome
0 replies
3d3h

It looks like from the "become a merchant" you can directly boost your products which I assume will prioritize them and hence drive more traffic to your Shopify store

shadowbanned4
1 replies
3d16h

This isn't worth the cost or effort. Shopify already has an internal tool with this functionality that they are planning to publicize.

6510
0 replies
3d15h

There is no need to limit it to that, most shops have some kind of product feed.

sanketgoyal11
1 replies
3d6h

How did you find the list of shopify stores and names?

pencildiver
0 replies
3d6h

Found a list on a site called "Built With". Mentioned this in another comment but I think it's meant for building sales outreach email lists.

Once you have the store URL, you can get general store information at [Base URL]/meta.json

Here's an example: https://wildfox.com/meta.json

rocauc
1 replies
3d15h

Really neat. I tried your search for red shoes, and I found some, er, unexpected imagery on page 1.

One thing you could do is add semantic search so when a user searches "red shoes," the index returns images that look like red shoes even if the metadata doesn't say anything about color or item types. To do this, I'd use a model like CLIP. Here's an example of using CLIP and Supabase to do semantic image search: https://blog.roboflow.com/how-to-use-semantic-search-supabas...
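The ranking step of that idea is simple once embeddings exist. In the sketch below the three-dimensional vectors are toy values made up for illustration; in practice the query vector and image vectors would come from a model like CLIP, which is not called here. `rank_by_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, image_vecs, top_k=3):
    """Return product IDs ordered by similarity of their image embedding to the query."""
    scored = [(cosine_similarity(query_vec, vec), pid) for pid, vec in image_vecs.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]


# Toy embeddings: pretend the first dimension means "looks like a red shoe".
query = [1.0, 0.0, 0.2]
images = {
    "red-sneaker": [0.9, 0.1, 0.3],
    "brown-boot": [0.1, 0.9, 0.2],
    "bearing": [0.0, 0.2, 0.9],
}
print(rank_by_similarity(query, images))
```

At 25M products a brute-force loop like this would be too slow, so a vector index would replace it, but the scoring idea stays the same.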

pencildiver
0 replies
3d14h

Awesome, thanks for the suggestion / link! Actually left another comment about potentially doing semantic image search to improve results so wrapping my head around it now.

noduerme
1 replies
3d16h

This is great - just a couple UI things bugging me. 1. When clicking "Open" on a product, the user should be able to open that in a separate tab. Currently that's not possible; I'm sure because it's being delivered in a single page (can't check now because you're getting hugged to death by HN).

2. When the server's slow, as it just was, there should be some kind of waiter / loader to immediately show the user that the "Open" click was sent on a product. Otherwise people will keep clicking it (or worse, clicking other products) and there's no indication that it's loading.

3. Once a product is open, it's not clear how to get back from it. I see the "X" in the corner, but doing that seems to take me back to a blank search page, not to my search results. The back button also doesn't take me back to the search results...

pencildiver
0 replies
3d16h

Thanks for sharing this! Definitely wasn't expecting this level of traffic so didn't account for some front-end loading experiences. Implementing these now.

For 3, thinking to let the back button work the same as the "x", that way a user can return to where they are in a search result regardless of what they click on.

moneywoes
1 replies
3d17h

where did you find a list of shopify stores to scrape

8372049
0 replies
3d16h

Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed down the focus to US stores only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / Storage costs reasonable).
mandeepj
1 replies
3d18h

Do you plan to add filters: price etc?

I was about to add 'reviews' as well to the above list but decided not to as they are not always trustworthy. Now AI is so advanced that it can be used to detect fake reviews and exclude them from sampling.

pencildiver
0 replies
3d18h

Yes. There's a very basic price range filter right now. Working on adding a ships to, location, and a few others. Open to any ideas that would help in the shopping experience.

There are 'reviews' now, and I made the decision to only let authenticated users leave reviews so they are more trustworthy (i.e. thinking is that adding more friction will lead to higher quality reviews).

jross225
1 replies
3d4h

heh, I used to work on the data team at Shopify. I built something similar to search internal dbs for secret santa gifts based on some weird criteria. Scraping might have a large margin of error because a lot of products tend to be ephemeral.

Neat project though!

pencildiver
0 replies
3d4h

Agreed on the large margin of error. Working on a bot to store and convert the images to webp to improve performance. Having the bot do a check for any images that don't exist and removing those listings. Will likely also need to triangulate this with a 404 check. Recently added an option for users to mark a product as "sold out" on the search results which will help as well.

Unrelated, but what was the "weird criteria" for the secret santa exchange? Half joking but also helps with figuring out filters :)

joshuamcginnis
1 replies
3d18h

I love your approach; you found a problem and developed a solution for it. And then you got the courage to share with the larger technical community. Good on you.

There are obviously some rough edges (multiple duplicate products, issues with product links linking to empty pages, and no results for broad terms), but don't let that stop you. I'm certain they can all be fixed.

Keep going! At the least, you'll come out of this with an excellent project in your portfolio.

pencildiver
0 replies
3d18h

Thank you, that means a lot. It has definitely been a whirlwind of emotions since posting on HN but glad I did. It's definitely an MVP so going to work fast to improve it.

jonnycoder
1 replies
3d17h

Awesome! It would be good to listen to the enter key when typing in a search query. Your privacy and terms links point to what appears to be the saas code framework you used (just a guess). I was looking for your contact/email so I can ask you some questions.

pencildiver
0 replies
3d17h

Enter key should work, but the loading speed is very slow right now. Fixing that right now :)

You can reach out at support @ searchagora.com. Would love to talk!

jasonlbaptiste
1 replies
3d14h

You should def give Algolia and Typesense a try. You can get 10k in free Algolia credits for the first year too via Secret (startup deals site).

pencildiver
0 replies
3d14h

Will do. Thanks for the heads up!

hipadev23
1 replies
3d18h
owlninja
0 replies
3d14h

I see some Red Wing shoe products first, then some bearings, some shoes, then someone in lingerie with red heels, and then more red shoes. I know Google is the horse to beat right now but they showed me nothing but red shoes to buy plus a few ancillary results for such a generic search - and if I was a good husband I would add a few words to narrow my search.

ganesha727
1 replies
3d9h

Idea! Shopify has a ton of resellers that sell junk from China. If you figure out how to avoid them, your life would be 10x easier.

berkes
0 replies
3d9h

I was wondering about a tech solution for this lately.

What I now do when I shop, is that I compare the images (and descriptions) of listings on Amazon, bol.com or Shopify with listings on Ali and Temu.

If they're exactly the same, run. If they differ slightly, look closer (and most likely run). I guess automating this could make for a solution to at least detect cheap resellers.
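A rough sketch of automating the description half of that comparison with Python's standard `difflib`. The 0.8 threshold, the labels, and `classify_listing` are arbitrary assumptions for illustration; comparing images would need a separate technique such as perceptual hashing, which is not shown.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a copy."""
    return re.sub(r"\s+", " ", text.strip().lower())


def classify_listing(store_description, marketplace_description, threshold=0.8):
    """Label a store listing by how closely it matches a marketplace listing."""
    a, b = normalize(store_description), normalize(marketplace_description)
    if a == b:
        return "identical"  # exactly the same: run
    ratio = SequenceMatcher(None, a, b).ratio()
    # Slight differences deserve a closer look; anything else is treated as distinct.
    return "near-duplicate" if ratio >= threshold else "distinct"


print(classify_listing(
    "Wireless earbuds with charging case, 24h battery",
    "Wireless earbuds with charging case, 30h battery",
))
```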

freefruit
1 replies
3d14h

Could you make it so that I can easily open a product in a new tab? I like to compare lots of products at the same time.

pencildiver
0 replies
3d14h

Absolutely. Currently on the search results page, you can click on the title / price area to open the product URL in a new tab. Open to any suggestions as well. So the product URL is accessible from the search results page or if you click on the product image to open the product listing and click on 'visit product'. Let me know if that makes sense and if you have any suggestions to make it better!

dangoodmanUT
1 replies
3d5h

Super cool!!!

pencildiver
0 replies
3d4h

Thanks! Laughed audibly at your comment in the other thread about Mongo haha

Looking into all options right now. A tradeoff between current stability, price, and performance.

bluepnume
1 replies
3d16h

Amazing! Does it have an api?

pencildiver
0 replies
3d15h

Not yet! Definitely something to consider as I upgrade the architecture. Would it be helpful to have API access to all products on Agora for your own app?

b2bsaas00
1 replies
3d18h

Basically it’s Amazon

quickthrower2
0 replies
3d18h

Antizon

alvarome
1 replies
3d3h

I'm a Shopify store owner myself. I saw there is a $99 per month fee to get your product verified. How would this compete in terms of CPC with a traditional channel such as Google Ads or Meta ads?

minastirith
0 replies
3d3h

Not the OP, but assuming an average of $3 per click in the US across those channels, I'd say this pricing is way more cost-effective given the amount of searches the site is getting.

Redster
1 replies
3d17h

Have Swedish family. Searched dala because family wants traditional Christmas ornaments. Sure enough, there were several results that were 10x cheaper than what I could find on the first page of Big Search Company. Great job!

pencildiver
0 replies
3d16h

Amazing, glad you were able to find it. I also just learned about what a "Dala" is :)

IcyHordr
1 replies
2d3h

So cool, good luck in the marriage, you made a very cool thing!!

pencildiver
0 replies
2d3h

Thank you! Tied the knot back in May, so both marriage and Agora are new to me. Open to advice on either.

EvanAnderson
1 replies
3d15h

Aside: The ending of the 1948 "The Red Shoes" was funny to me, but I think I was a little loopy after slogging thru it. I don't know if I recommend it or not.

pencildiver
0 replies
3d15h

I definitely need to watch it now.

virtuosarmo
0 replies
3d4h

I believe Shopify built their own app / website where you can search for products exclusively from Shopify merchants. https://shop.app/

thih9
0 replies
3d18h

When I search for “op-1”, a partial match like “Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op” gets ranked higher than “teenage engineering op-1”. I would expect the opposite.
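One common way to get the expected ordering is to add a bonus when the query tokens appear next to each other, in order, in the title. The snippet below is only a sketch of that idea, not how Agora ranks results; `tokenize`, `score`, and the size of the bonus are all made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so "op-1" becomes ["op", "1"]."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, title):
    """Token-overlap score plus a bonus for an exact phrase match (hypothetical weights)."""
    q = tokenize(query)
    t = tokenize(title)
    if not q:
        return 0.0
    # Base score: fraction of query tokens that occur anywhere in the title.
    base = sum(1 for tok in q if tok in t) / len(q)
    # Phrase bonus: the query tokens occur contiguously and in order.
    n = len(q)
    phrase = any(t[i:i + n] == q for i in range(len(t) - n + 1))
    return base + (1.0 if phrase else 0.0)

titles = [
    "Frontier Co-op Turkey Rub, Organic 1 lb. -- Frontier Co-op",
    "teenage engineering op-1",
]
ranked = sorted(titles, key=lambda s: score("op-1", s), reverse=True)
print(ranked[0])
```

Both titles contain "op" and "1", so a pure token-overlap score ties them; only the second has the two tokens adjacent, which is what the bonus rewards.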

shubham_sinha
0 replies
3d13h

I am building a similar thing at https://peppyhop.com. Currently it’s restricted to India. We onboard stores that have a Shopify or WooCommerce backend. We have also developed a custom ML model to better classify products.

sails
0 replies
3d11h

Clicking an item could show you similar items before it takes you to the item (or there could be some other way to browse similar items).

quaxar
0 replies
3d17h

Great site. Having built a search engine that needed to handle product data on a similar scale, it's not an easy thing to manage.

Some observations:

- Don't use infinite scrolling. It's an outdated UI practice that leads to a bad user experience, and it makes the footer entirely unviewable.

- Clicking on a product card image does not reliably open the product. I have to click on it a few times at random (Chrome, Brave).

- Clicking on the product card image and clicking on the title lead to different actions. This is a bit unexpected; there should be some hint of the difference.

- The product page pop-up resets the search list when closed. This messes up my search navigation and breaks the flow of browsing.

qdequelen
0 replies
3d2h

Hey, I'm the CEO of Meilisearch. If your issue is performance, I would love for you to give Meilisearch a try. You'll be able to create an "as you type" experience with our engine, which responds in less than 50ms!

pencildiver
0 replies
2d4h

HN— Not sure if anyone will see this but I wanted to thank you all for the support. Although I haven't slept much since going live, it has been amazing getting early feedback from the community.

Agora is still in the MVP stage but getting better by the day. Just pushed a big update: a fix for an image-shifting bug, a blur effect on loading, Redis for caching, brand pages, architecture fixes, and several other things. Currently working on improving the relevancy algorithm, adding all ~5 million Shopify stores, and then adding WooCommerce stores over the next few days.

If you have any suggestions or ideas, reach out to me at support @ searchagora .com :)

nox100
0 replies
3d9h

I have no clue how to implement a search but maybe some words are more important than others.

I searched for "mens dress shirts button long sleeve" and after about 6 results it was all women's clothing.
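The "some words are more important than others" intuition is roughly what inverse document frequency (IDF) weighting does: terms that appear in few products count for more than terms that appear in most of them. Here is a minimal sketch with a made-up four-item catalog; the `idf_weights` and `score` helpers and the smoothing formula are illustrative choices, not anything Agora is known to use.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def idf_weights(docs):
    """Smoothed IDF: terms that occur in fewer documents get larger weights."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(tokenize(d)))  # count each term once per document
    return {term: math.log((n + 1) / (count + 1)) + 1.0 for term, count in df.items()}

def score(query, doc, idf):
    """Sum the IDF weights of the query terms present in the document."""
    doc_terms = set(tokenize(doc))
    return sum(idf.get(t, 0.0) for t in tokenize(query) if t in doc_terms)

# Hypothetical catalog, invented for this example.
catalog = [
    "mens dress shirt long sleeve",
    "womens dress shirt long sleeve button",
    "womens long sleeve button blouse",
    "womens dress long sleeve",
]
idf = idf_weights(catalog)
query = "mens dress shirt button long sleeve"
ranked = sorted(catalog, key=lambda d: score(query, d, idf), reverse=True)
print(ranked[0])
```

In this toy catalog the first two items each match five of the six query words, so counting matches alone would tie them. With IDF, "mens" (one product) outweighs "button" (two products), so the men's shirt comes first.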

moneywoes
0 replies
3d17h

How did you avoid IP-based blocking? Rotating proxies?

minastirith
0 replies
3d3h

Love it! Some improvements are needed on search, but it is an amazing MVP. I'll use this for my late Christmas shopping.

kacyjames
0 replies
3d2h

Any Unicode input (Japanese or Greek text for example) currently causes a 500 error.

joshdance
0 replies
1d21h

Amazing. Why doesn't Shopify build this natively?

jillesvangurp
0 replies
3d9h

There are a few conferences dedicated to ecommerce search. Mices is pretty good. I did not go there this year but I know some of the people behind it. Good community and lots of stuff happening.

Two points here.

- 25 million is really not a lot for most search engines. Something like Elasticsearch can easily handle that if you set it up properly, and there are plenty of equally capable solutions. I have worked with logging clusters that process log entries in those numbers on a daily basis. A modestly sized cluster goes a long way for that. Bare metal is cheaper than cloud for this, but a couple of simple servers with decent CPUs, memory, and SSDs should go a long way here. Start worrying once you hit a few hundred GB of storage used. Anything below that is easy to deal with.

- The key challenge with this volume is not performance but search quality. Building a competitive search engine is hard. You might have thousands of potential matches out of millions for any given query and your job is to pick the best 3, 5, 10 (whatever fits on your screen) ones. This is hard.

So, what makes for a good answer is the key question to answer. All the naive solutions for this problem put you at the bottom of the market in terms of competitiveness. If you can't do better, you are just another low quality search engine not quite solving the problem. The bar is high these days for a good search engine and most of the better ecommerce companies have highly skilled search teams working on this.

ganesha727
0 replies
3d9h

Gg

ctocoder
0 replies
3d1h

how did you get a list of the 25 million stores to crawl?

connectingu
0 replies
3d2h

Incredible. Would love to connect with you. Where can I find you LOL

connectingu
0 replies
3d2h

Incredible. Where can I connect with you? Want to pick your brain & swap some thoughts :)

codetrotter
0 replies
3d10h

On the page where you show details about the product, I would like it to include the same product from other Shopify stores, found by doing an image similarity comparison.

And then highlight how the price compares.

For example, here are some pretty crazy red shoes. But they are too expensive for me. Would be interesting to see if this is the only store selling these shoes, or if someone else has the same shoes much cheaper.

https://www.searchagora.com/products/vasco-4-47fb0f87-5b89-4...
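One cheap way to approximate the image similarity comparison suggested above is a perceptual "average hash": shrink each image to a tiny grayscale grid, mark each pixel as above or below the mean, and compare the resulting bit strings by Hamming distance. The sketch below assumes the images have already been downscaled to small grayscale grids (here 2x2 for brevity; 8x8 is typical), and the pixel values are invented for illustration.

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values (0-255), already downscaled."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the image's mean, else 0.
    return [1 if p > mean else 0 for p in flat]

def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical thumbnails: the same shoe photo at two exposures, and an unrelated image.
shoe_a = [[10, 200], [220, 30]]
shoe_b = [[20, 190], [210, 40]]
other = [[200, 10], [30, 220]]

print(hamming(average_hash(shoe_a), average_hash(shoe_b)))
print(hamming(average_hash(shoe_a), average_hash(other)))
```

A small distance suggests two listings use the same photo, at which point their prices can be shown side by side. This kind of hash tolerates resizing and mild brightness changes, but not different photos of the same product.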

bsbechtel
0 replies
3d17h

I searched for 'pão francês' and my store was the #1 result. I think you're doing it right! :)

bomewish
0 replies
3d11h

Why not Manticore as the backend? Much better perf than ES, less memory-intensive, SQL syntax. Just fantastic all round!!

Wajid2502
0 replies
2d20h

Great idea

Canada
0 replies
3d11h

Worked well for me, great job. I searched for something I've been looking for and found some interesting options I haven't seen before.